diff --git a/.github/ISSUE_TEMPLATE/---feature-request-.md b/.github/ISSUE_TEMPLATE/---feature-request-.md new file mode 100644 index 0000000000000000000000000000000000000000..57708855dce4fcc81e719c59082a8a42415eba47 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/---feature-request-.md @@ -0,0 +1,27 @@ +--- +name: 建议(Feature request) +about: 您可以提出您的建议。 You could use this template for reporting a suggestion  issue. + +--- + +欢迎您对PaddlePaddle提出建议,非常感谢您对PaddlePaddle的贡献! +在留下您的建议时,辛苦您同步提供如下信息: +- 版本、环境信息 +1)PaddlePaddle版本:请提供您的PaddlePaddle版本号,例如1.1 +2)CPU/GPU:您是否使用GPU进行训练,如是,请提供您的CUDA和cuDNN版本号 +3)系统环境:请您描述系统类型、版本,例如Mac OS 10.14 +- 复现信息:如为报错,请给出复现环境、复现步骤 +- 建议描述:请您详细描述,您认为需优化的功能 + +Thank you for contributing to PaddlePaddle. +Before submitting the issue, you could search issue in the github in case that there was a similar issue submitted or resolved before. +Please make sure that this is a feature request. +**System information** +-PaddlePaddle version (eg.1.1)or CommitID +-CPU: including CPUMKL/OpenBlas/MKLDNN version +-GPU: including CUDA/CUDNN version +-OS Platform (eg.Mac OS 10.14) +**To Reproduce** +Steps to reproduce the behavior +**Describe the feature and the current behavior/state.** +**Any Other info.** diff --git a/.github/ISSUE_TEMPLATE/---inference-issue-.md b/.github/ISSUE_TEMPLATE/---inference-issue-.md new file mode 100644 index 0000000000000000000000000000000000000000..37bdc8889e2722dda964ba82c2ac36cef5e60110 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/---inference-issue-.md @@ -0,0 +1,40 @@ +--- +name: 预测(Inference Issue) +about: 您可以提问预测中报错、应用等问题。 You could use this template for reporting an inference issue. + +--- + +为使您的问题得到快速解决,在建立Issue前,请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】 + +如果您没有查询到相似问题,为快速解决您的提问,建立issue时请提供如下细节信息: +- 标题:简洁、精准描述您的问题,例如“最新预测库的API文档在哪儿 ” +- 版本、环境信息: +    1)PaddlePaddle版本:请提供您的PaddlePaddle版本号(如1.1)或CommitID +    2)CPU:预测若用CPU,请提供CPU型号,MKL/OpenBlas/MKLDNN/等数学库使用情况 +    3)GPU:预测若用GPU,请提供GPU型号、CUDA和CUDNN版本号 +    4)系统环境:请您描述系统类型、版本(如Mac OS 10.14),Python版本 +-预测信息 +    1)C++预测:请您提供预测库安装包的版本信息,及其中的version.txt文件 +    2)CMake包含路径的完整命令 +    3)API信息(如调用请提供) +    4)预测库来源:官网下载/特殊环境(如BCLOUD编译) +- 复现信息:如为报错,请给出复现环境、复现步骤 +- 问题描述:请详细描述您的问题,同步贴出报错信息、日志/代码关键片段 + +Thank you for contributing to PaddlePaddle. +Before submitting the issue, you could search issue in the github in case that th +If there is no solution,please make sure that this is an inference issue including the following details : +**System information** +-PaddlePaddle version (eg.1.1)or CommitID +-CPU: including CPUMKL/OpenBlas/MKLDNN version +-GPU: including CUDA/CUDNN version +-OS Platform (eg.Mac OS 10.14) +-Python version +-Cmake orders +-C++version.txt +-API information +**To Reproduce** +Steps to reproduce the behavior +**Describe your current behavior** +**Code to reproduce the issue** +**Other info / logs** diff --git a/.github/ISSUE_TEMPLATE/---installation-issue-.md b/.github/ISSUE_TEMPLATE/---installation-issue-.md new file mode 100644 index 0000000000000000000000000000000000000000..ce4ba589324673baa4aa39760bcdcd66ecfdd36b --- /dev/null +++ b/.github/ISSUE_TEMPLATE/---installation-issue-.md @@ -0,0 +1,40 @@ +--- +name: 安装(Installation Issue) +about: 您可以提问安装、编译出现报错等问题。 You could use this template for reporting an installation +  issue. + +--- + +为使您的问题得到快速解决,在建立Issue前,请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】 + +建立issue时,为快速解决问题,请您根据使用情况给出如下信息: +- 标题:请包含关键词“安装错误”/“编译错误”,例如“Mac编译错误” +- 版本、环境信息: +    1)PaddlePaddle版本:请提供您的PaddlePaddle版本号(如1.1)或CommitID +    2)CPU:请提供CPU型号,MKL/OpenBlas/MKLDNN/等数学库的使用情况 +    3)GPU:请提供GPU型号,CUDA和CUDNN版本号 +    4)系统环境:请说明系统类型、版本(如Mac OS 10.14)、Python版本 +- 安装方式信息: +1)pip安装/docker安装 +2)本地编译:请提供cmake命令,编译命令 +3)docker编译:请提供docker镜像,编译命令            +  特殊环境请注明:如离线安装等 +- 复现信息:如为报错,请给出复现环境、复现步骤 +- 问题描述:请详细描述您的问题,同步贴出报错信息、日志/代码关键片段 + +Thank you for contributing to PaddlePaddle. +Before submitting the issue, you could search issue in Github in case that there was a similar issue submitted or resolved before. +If there is no solution,please make sure that this is an installation issue including the following details: +**System information** +-PaddlePaddle version (eg.1.1)or CommitID +-CPU: including CPUMKL/OpenBlas/MKLDNN version +-GPU: including CUDA/CUDNN version +-OS Platform (eg. Mac OS 10.14) +-Python version +- Install method: pip install/install with docker/build from source(without docker)/build within docker +- Other special cases that you think may be related to this problem, eg. offline install, special internet condition   +**To Reproduce** +Steps to reproduce the behavior +**Describe your current behavior** +**Code to reproduce the issue** +**Other info / logs** diff --git a/.github/ISSUE_TEMPLATE/---model-issue-.md b/.github/ISSUE_TEMPLATE/---model-issue-.md new file mode 100644 index 0000000000000000000000000000000000000000..7cb52f37b90262d2704fe32d8093fac19ed74b5f --- /dev/null +++ b/.github/ISSUE_TEMPLATE/---model-issue-.md @@ -0,0 +1,36 @@ +--- +name: 模型(Model Issue) +about: 您可以提问模型、算法、数据集方向的使用报错等问题。You could use this template for reporting a model/ + algorithm/dataset  issue. + +--- + +为使您的问题得到快速解决,在建立Issue前,请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】 + +建立issue时,为快速解决问题,请您根据使用情况给出如下信息: +- 标题:简洁、精准描述您的问题,例如“ssd 模型前置lstm报错  ” +- 版本、环境信息: +    1)PaddlePaddle版本:请提供PaddlePaddle版本号,例如1.1或CommitID +    2)CPU:请提供CPU型号,MKL/OpenBlas/MKLDNN/等数学库的使用情况 +    3)GPU:请提供GPU型号,CUDA和CUDNN版本号 +    4)系统环境:请说明系统类型、版本(例如Mac OS 10.14),Python版本 +- 模型信息 +    1)模型名称 2)使用数据集名称 3)使用算法名称 4)模型链接 +- 复现信息:如为报错,请给出复现环境、复现步骤 +- 问题描述:请详细描述您的问题,同步贴出报错信息、日志/代码关键片段 + +Thank you for contributing to PaddlePaddle. +Before submitting the issue, you could search issue in the github.Probably there was a similar issue submitted or resolved before. +If there is no solution,please make sure that this is a issue of models including the following details: +**System information** +-PaddlePaddle version (eg.1.1)or CommitID +-CPU: including CPUMKL/OpenBlas/MKLDNN version +-GPU: including CUDA/CUDNN version +-OS Platform (eg.Mac OS 10.14) +-Python version +-Name of Models&Dataset/details of operator +**To Reproduce** +Steps to reproduce the behavior +**Describe your current behavior** +**Code to reproduce the issue** +**Other info / logs** diff --git a/.github/ISSUE_TEMPLATE/---others-.md b/.github/ISSUE_TEMPLATE/---others-.md new file mode 100644 index 0000000000000000000000000000000000000000..6a291153e43f51e43646a1c860ec253361b137d6 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/---others-.md @@ -0,0 +1,33 @@ +--- +name: 其他(Others) +about: 如上述分类未包含您的问题,可在此提出。 You could use this template for reporting other issues + +--- + +为使您的问题得到快速解决,在建立Issues前,请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】 + +如果您没有查询到相似问题,为快速解决您的提问,建立issue时请提供如下细节信息: +- 标题:简洁、精准概括您的问题 +- 版本、环境信息: +    1)PaddlePaddle版本:请提供您的PaddlePaddle版本号,例如1.1或CommitID +    2)CPU/GPU:如果您使用GPU训练,请提供GPU驱动版本、CUDA和cuDNN版本号 +    3)系统环境:请您描述系统类型、版本,例如Mac OS 10.14 +    4)Python版本号 +    5)显存信息 +- 复现信息:如为报错,请给出复现环境、复现步骤 +- 问题描述:请详细描述您的问题,同步贴出报错信息、日志/代码关键片段 + +Thank you for contributing to PaddlePaddle. +Before submitting the issue, you could search issue in the github in case that there was a similar issue submitted or resolved before. +If there is no solution,please provide us with the following details : +**System information** +-PaddlePaddle version (eg.1.1)or CommitID +-CPU: including CPUMKL/OpenBlas/MKLDNN version +-GPU: including CUDA/cuDNN version +-OS Platform and Distribution(eg.Mac OS 10.14) +-Python version +**To Reproduce** +Steps to reproduce the behavior +**Describe your current behavior** +**Code to reproduce the issue** +**Other info / logs** diff --git a/.github/ISSUE_TEMPLATE/---training-issue-.md b/.github/ISSUE_TEMPLATE/---training-issue-.md new file mode 100644 index 0000000000000000000000000000000000000000..29e8383d9779229328c7c12d04ba6173fd0c8ba1 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/---training-issue-.md @@ -0,0 +1,38 @@ +--- +name: 训练(Training issue) +about: 您可以提问训练中报错、应用、出core等问题。 You could use this template for reporting an training +  issue. + +--- + +为使您的问题得到快速解决,在建立Issues前,请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】 + +如果您没有查询到相似问题,为快速解决您的提问,建立issue时请提供如下细节信息: +- 标题:简洁、精准概括您的问题,例如“Insufficient Memory xxx" ” +- 版本、环境信息: +    1)PaddlePaddle版本:请提供您的PaddlePaddle版本号,例如1.1或CommitID +    2)CPU:预测若用CPU,请提供CPU型号,MKL/OpenBlas/MKLDNN/等数学库使用情况 +    3)GPU:预测若用GPU,请提供GPU型号、CUDA和CUDNN版本号 +    4)系统环境:请您描述系统类型、版本,例如Mac OS 10.14,Python版本 +- 训练信息 +    1)单机/多机,单卡/多卡 +    2)显存信息 +    3)Operator信息 +- 复现信息:如为报错,请给出复现环境、复现步骤 +- 问题描述:请详细描述您的问题,同步贴出报错信息、日志、可复现的代码片段 + +Thank you for contributing to PaddlePaddle. +Before submitting the issue, you could search issue in the github in case that there was a similar issue submitted or resolved before. +If there is no solution,please make sure that this is a training issue including the following details: +**System information** +-PaddlePaddle version (eg.1.1)or CommitID +-CPU: including CPUMKL/OpenBlas/MKLDNN version +-GPU: including CUDA/CUDNN version +-OS Platform (eg.Mac OS 10.14) +-Other imformation: Distriuted training/informantion of operator/ +Graphics card storage +**To Reproduce** +Steps to reproduce the behavior +**Describe your current behavior** +**Code to reproduce the issue** +**Other info / logs** diff --git a/.gitignore b/.gitignore index 90138f996cf9cacc3c1cbff0cf2600eefca3f305..369fa1cb919c82caec326d1429c8a2eba3b928d6 100644 --- a/.gitignore +++ b/.gitignore @@ -1,9 +1,11 @@ +python/paddle/fluid/tests/unittests/reader_reset_test.recordio paddle/operators/check_t.save paddle/operators/check_tensor.ls paddle/operators/tensor.save python/paddle/v2/fluid/tests/book/image_classification_resnet.inference.model/ python/paddle/v2/fluid/tests/book/image_classification_vgg.inference.model/ python/paddle/v2/fluid/tests/book/label_semantic_roles.inference.model/ +paddle/fluid/operators/distributed/send_recv.proto *.DS_Store *.vs build/ @@ -28,3 +30,5 @@ third_party/ build_* # clion workspace. cmake-build-* +paddle/fluid/operators/distributed/send_recv.proto +model_test diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index e718b32cb6c48d11e73600509a17db107f438708..d8112837dc9627bc2e501940b8e97c89e97c45ff 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -42,12 +42,6 @@ repos: entry: bash ./tools/codestyle/pylint_pre_commit.hook language: system files: \.(py)$ -- repo: https://github.com/PaddlePaddle/pre-commit-golang - sha: 8337620115c25ff8333f1b1a493bd031049bd7c0 - hooks: - - id: go-fmt - types: - - go - repo: local hooks: - id: copyright_checker diff --git a/.travis.yml b/.travis.yml index 8c2d9f143b3102c142ec2d5d193b82936d04fba8..87de895ddadfafb644a49ef6887b1321cf29ef53 100644 --- a/.travis.yml +++ b/.travis.yml @@ -4,7 +4,6 @@ cache: - $HOME/.ccache - $HOME/.cache/pip - $TRAVIS_BUILD_DIR/build/third_party - - $TRAVIS_BUILD_DIR/build_android/third_party sudo: required dist: trusty services: @@ -13,7 +12,6 @@ os: - linux env: - JOB=check_style - - JOB=build_android addons: ssh_known_hosts: 13.229.163.131 before_install: diff --git a/AUTHORS.md b/AUTHORS.md index 41b7193677a0208ba2fa82b72862292572dcb6ef..da91933f4697f3b7c08feab20d703e22397b3757 100644 --- a/AUTHORS.md +++ b/AUTHORS.md @@ -25,6 +25,7 @@ | kexinzhao | Ke-Xin Zhao | | kuke | Yi-Bing Liu | | lcy-seso | Ying Cao | +| cjld | Dun Liang | | lipeng-unisound | Peng Li | | liuyuan | Yuan Liu | | livc | Zhao Li | @@ -42,7 +43,10 @@ | QiJune | Jun Qi | | qingqing01 | Qing-Qing Dang | | reyoung | Yang Yu | +| Sand3r- | Michal Gallus | +| sfraczek | Sylwester Fraczek | | Superjom | Chun-Wei Yan | +| tensor-tang | Jian Tang | | tianbingsz | Tian-Bing Xu | | tpatejko | Tomasz Patejko | | typhoonzero | Yi Wu | @@ -51,6 +55,7 @@ | wangyang59 | Yang Wang | | wangzhen-nlp | Zhen Wang | | wen-bo-yang | Wen-Bo Yang | +| wojtuss | Wojciech Uss | | wwhu | Wei-Wei Hu | | xinghai-sun | Xing-Hai Sun | | Xreki | Yi-Qun Liu | diff --git a/CMakeLists.txt b/CMakeLists.txt index 6aa2e1715b92d73aa4e5e97d5e52ffac51451d80..26d94384a9150735aa8341fd8a18cb039895ff91 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -24,13 +24,24 @@ message(STATUS "CXX compiler: ${CMAKE_CXX_COMPILER}, version: " "${CMAKE_CXX_COMPILER_ID} ${CMAKE_CXX_COMPILER_VERSION}") message(STATUS "C compiler: ${CMAKE_C_COMPILER}, version: " "${CMAKE_C_COMPILER_ID} ${CMAKE_C_COMPILER_VERSION}") +message(STATUS "AR tools: ${CMAKE_AR}") + if(WIN32) + set(CMAKE_SUPPRESS_REGENERATION ON) set(CMAKE_STATIC_LIBRARY_PREFIX lib) + add_definitions("/DGOOGLE_GLOG_DLL_DECL=") + set(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} /bigobj /MTd") + set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} /bigobj /MT") + set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} /bigobj /MTd") + set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} /bigobj /MT") + add_compile_options(/wd4068 /wd4129 /wd4244 /wd4267 /wd4297 /wd4530 /wd4577 /wd4819 /wd4838) + set(PADDLE_LINK_FLAGS "/IGNORE:4006 /IGNORE:4098 /IGNORE:4217 /IGNORE:4221") + set(CMAKE_STATIC_LINKER_FLAGS "${CMAKE_STATIC_LINKER_FLAGS} ${PADDLE_LINK_FLAGS}") + set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} ${PADDLE_LINK_FLAGS}") + set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} ${PADDLE_LINK_FLAGS}") endif(WIN32) -if(NOT CMAKE_CROSSCOMPILING) - find_package(CUDA QUIET) -endif(NOT CMAKE_CROSSCOMPILING) +find_package(CUDA QUIET) find_package(Git REQUIRED) find_package(Threads REQUIRED) @@ -41,35 +52,27 @@ option(WITH_GPU "Compile PaddlePaddle with NVIDIA GPU" ${CUDA_F option(WITH_AMD_GPU "Compile PaddlePaddle with AMD GPU" OFF) option(WITH_AVX "Compile PaddlePaddle with AVX intrinsics" ${AVX_FOUND}) option(WITH_MKL "Compile PaddlePaddle with MKL support." ${AVX_FOUND}) +option(WITH_NGRAPH "Compile PaddlePaddle with nGraph support." OFF) option(WITH_DSO "Compile PaddlePaddle with dynamic linked CUDA" ON) option(WITH_TESTING "Compile PaddlePaddle with unit testing" OFF) -option(WITH_SWIG_PY "Compile PaddlePaddle with inference api" ON) option(WITH_PYTHON "Compile PaddlePaddle with python interpreter" ON) -option(WITH_DOUBLE "Compile PaddlePaddle with double precision" OFF) -option(WITH_RDMA "Compile PaddlePaddle with RDMA support" OFF) -option(WITH_TIMER "Compile PaddlePaddle with stats timer" OFF) -option(WITH_PROFILER "Compile PaddlePaddle with GPU profiler" OFF) -option(WITH_DOC "Compile PaddlePaddle with documentation" OFF) +option(WITH_PROFILER "Compile PaddlePaddle with GPU profiler and gperftools" OFF) +option(WITH_JEMALLOC "Compile PaddlePaddle with jemalloc" OFF) option(WITH_COVERAGE "Compile PaddlePaddle with code coverage" OFF) option(COVERALLS_UPLOAD "Package code coverage data to coveralls" OFF) -option(ON_TRAVIS "Exclude special unit test on Travis CI" OFF) -option(WITH_C_API "Compile PaddlePaddle with C-API(Prediction)" OFF) -option(WITH_FLUID_ONLY "Compile PaddlePaddle fluid only" OFF) -option(WITH_GOLANG "Compile PaddlePaddle with GOLANG" OFF) -option(GLIDE_INSTALL "Download and install go dependencies " ON) -option(USE_NNPACK "Compile PaddlePaddle with NNPACK library" OFF) option(WITH_DISTRIBUTE "Compile with distributed support" OFF) -option(USE_EIGEN_FOR_BLAS "Use matrix multiplication in Eigen" OFF) -option(EIGEN_USE_THREADS "Compile with multi-threaded Eigen" OFF) -option(WITH_ARM_FP16 "Use half precision support on armv8.2-a cpu" OFF) -option(WITH_FAST_BUNDLE_TEST "Bundle tests that can be run in a single process together to reduce launch overhead" OFF) +option(WITH_PSLIB "Compile with pslib support" OFF) option(WITH_CONTRIB "Compile the third-party contributation" OFF) option(REPLACE_ENFORCE_GLOG "Replace PADDLE_ENFORCE with glog/CHECK for better debug." OFF) +# TODO(Superjomn) Remove WITH_ANAKIN option if not needed latter. option(WITH_ANAKIN "Compile with Anakin library" OFF) +option(ANAKIN_BUILD_FAT_BIN "Build anakin cuda fat-bin lib for all device plantform, ignored when WITH_ANAKIN=OFF" OFF) +option(ANAKIN_BUILD_CROSS_PLANTFORM "Build anakin lib for any nvidia device plantform. ignored when WITH_ANAKIN=OFF" ON) option(WITH_GRPC "Use grpc as the default rpc framework" ${WITH_DISTRIBUTE}) option(WITH_BRPC_RDMA "Use brpc rdma as the rpc protocal" OFF) -option(WITH_INFERENCE "Compile fluid inference library" ON) -option(WITH_INFERENCE_API_TEST "Test fluid inference high-level api interface" OFF) +option(ON_INFER "Turn on inference optimization." OFF) +option(WITH_INFERENCE_API_TEST "Test fluid inference C++ high-level api interface" OFF) +option(WITH_HIGH_LEVEL_API_TEST "Test fluid python high-level api interface" OFF) option(WITH_SYSTEM_BLAS "Use system blas library" OFF) option(PY_VERSION "Compile PaddlePaddle with python3 support" ${PY_VERSION}) option(WITH_FAST_MATH "Make use of fast math library, might affect the precision to some extent" ON) @@ -87,38 +90,14 @@ if(NOT CMAKE_BUILD_TYPE) FORCE) endif() -if(ANDROID OR IOS) - if(ANDROID) - if(${CMAKE_SYSTEM_VERSION} VERSION_LESS "16") - message(FATAL_ERROR "Unsupport standalone toolchains with Android API level lower than 16") - endif() - endif() - - set(WITH_GPU OFF CACHE STRING - "Disable GPU when cross-compiling for Android and iOS" FORCE) - set(WITH_AVX OFF CACHE STRING - "Disable AVX when cross-compiling for Android and iOS" FORCE) - set(WITH_PYTHON OFF CACHE STRING - "Disable PYTHON when cross-compiling for Android and iOS" FORCE) - set(WITH_RDMA OFF CACHE STRING - "Disable RDMA when cross-compiling for Android and iOS" FORCE) +if (APPLE) set(WITH_MKL OFF CACHE STRING - "Disable MKL when cross-compiling for Android and iOS" FORCE) - set(WITH_GOLANG OFF CACHE STRING - "Disable golang when cross-compiling for Android and iOS" FORCE) - - # Compile PaddlePaddle mobile inference library - if (NOT WITH_C_API) - set(WITH_C_API ON CACHE STRING - "Always compile the C_API when cross-compiling for Android and iOS" FORCE) - endif() - set(MOBILE_INFERENCE ON) - add_definitions(-DPADDLE_MOBILE_INFERENCE) + "Disable MKL for building on mac" FORCE) endif() -if (APPLE OR WIN32) - set(WITH_MKL OFF CACHE STRING - "Disable MKL for building on mac and windows" FORCE) +if (WIN32) + set(WITH_DISTRIBUTE OFF CACHE STRING + "Disable DISTRIBUTE when compiling for Windows" FORCE) endif() set(THIRD_PARTY_PATH "${CMAKE_BINARY_DIR}/third_party" CACHE STRING @@ -130,21 +109,7 @@ set(FLUID_INSTALL_DIR "${CMAKE_BINARY_DIR}/fluid_install_dir" CACHE STRING set(FLUID_INFERENCE_INSTALL_DIR "${CMAKE_BINARY_DIR}/fluid_inference_install_dir" CACHE STRING "A path setting fluid inference shared and static libraries") -if (WITH_C_API AND WITH_PYTHON) - message(WARNING "It is suggest not embedded a python interpreter in Paddle " - "when using C-API. It will give an unpredictable behavior when using a " - "different Python interpreter from compiling.") -endif() - -if (WITH_C_API) - set(WITH_FLUID_ONLY OFF CACHE STRING "Disable install fluid when compile the C_API" FORCE) -endif() - -if(MOBILE_INFERENCE) - set(THIRD_PARTY_BUILD_TYPE MinSizeRel) -else() - set(THIRD_PARTY_BUILD_TYPE Release) -endif() +set(THIRD_PARTY_BUILD_TYPE Release) set(WITH_MKLML ${WITH_MKL}) if (NOT DEFINED WITH_MKLDNN) @@ -172,22 +137,31 @@ include(external/protobuf) # download, build, install protobuf include(external/python) # download, build, install python include(external/openblas) # download, build, install openblas include(external/mkldnn) # download, build, install mkldnn -include(external/swig) # download, build, install swig +include(external/ngraph) # download, build, install nGraph include(external/boost) # download boost -include(external/any) # download libn::any include(external/eigen) # download eigen3 include(external/pybind11) # download pybind11 include(external/cares) include(external/cub) - -if (NOT WIN32) -# there is no official support of snappystream, warpctc, nccl, cupti in windows +include(external/rocprim) +include(external/xxhash) # download xxhash +include(external/dlpack) include(external/snappy) # download snappy include(external/snappystream) # download snappystream include(external/warpctc) # download, build, install warpctc + +if (NOT WIN32) +# there is no official support of nccl, cupti in windows include(cupti) +include(external/gzstream) endif (NOT WIN32) +if(WITH_PSLIB) + include(external/libmct) + include(external/pslib_brpc) + include(external/pslib) +endif(WITH_PSLIB) + if(WITH_DISTRIBUTE) if(WITH_GRPC) include(external/grpc) @@ -218,86 +192,62 @@ include(configure) # add paddle env configuration if(WITH_GPU) include(cuda) include(tensorrt) + include(anakin_subgraph) +endif() + +if(WITH_GPU AND NOT WIN32) + message(STATUS "add dgc lib.") + include(external/dgc) endif() + if(WITH_MKL OR WITH_MKLML) include(external/anakin) elseif() set(WITH_ANAKIN OFF CACHE STRING "Anakin is used in MKL only now." FORCE) endif() +if (WITH_PROFILER) + find_package(Gperftools REQUIRED) + include_directories(${GPERFTOOLS_INCLUDE_DIR}) + add_definitions(-DWITH_GPERFTOOLS) +endif() + +if (WITH_JEMALLOC) + find_package(JeMalloc REQUIRED) + include_directories(${JEMALLOC_INCLUDE_DIR}) + add_definitions(-DPADDLE_WITH_JEMALLOC) +endif() + include(generic) # simplify cmake module include(package) # set paddle packages include(ccache) # set ccache for compilation include(util) # set unittest and link libs -include(rdma) # set rdma libraries include(version) # set PADDLE_VERSION include(coveralls) # set code coverage include(inference_lib) # add paddle fluid inference libraries include_directories("${PADDLE_SOURCE_DIR}") -include_directories("${PADDLE_SOURCE_DIR}/paddle/legacy/cuda/include") -include_directories("${CMAKE_CURRENT_BINARY_DIR}/proto") -include_directories("${CMAKE_CURRENT_BINARY_DIR}/go/pserver/client/c") - -set(EXTERNAL_LIBS - gflags - glog - ${CBLAS_LIBRARIES} - protobuf - zlib - ${PYTHON_LIBRARIES} -) if(WITH_AMD_GPU) find_package(HIP) include(hip) endif(WITH_AMD_GPU) -if(WITH_MKLML) - list(APPEND EXTERNAL_LIBS ${MKLML_IOMP_LIB}) -endif() - -if(WITH_LIBXSMM) - list(APPEND EXTERNAL_LIBS ${LIBXSMM_LIBS}) -endif() - -if(WITH_MKLDNN) - list(APPEND EXTERNAL_LIBS ${MKLDNN_LIB}) -endif() - -if(USE_NNPACK) - include(external/nnpack) - list(APPEND EXTERNAL_LIBS ${NNPACK_LIBS}) -endif(USE_NNPACK) - -add_subdirectory(proto) - -if(NOT MOBILE_INFERENCE AND NOT WITH_FLUID_ONLY) - # "add_subdirectory(go)" should be placed after the following loine, - # because it depends on paddle/optimizer. - add_subdirectory(paddle/legacy/optimizer) -endif() - -# "add_subdirectory(paddle)" and "add_subdirectory(python)" should be -# placed after this block, because they depends on it. -if(WITH_GOLANG) - enable_language(Go) - add_subdirectory(go) -endif(WITH_GOLANG) - set(PADDLE_PYTHON_BUILD_DIR "${CMAKE_CURRENT_BINARY_DIR}/python/build") set(CMAKE_CXX_FLAGS_RELWITHDEBINFO "-O3 -g -DNDEBUG") set(CMAKE_C_FLAGS_RELWITHDEBINFO "-O3 -g -DNDEBUG") +if (ON_INFER) + message(STATUS "On inference mode, will take place some specific optimization.") + add_definitions(-DPADDLE_ON_INFERENCE) +else() + #TODO(luotao), combine this warning with `make inference_lib_dist` command. + message(WARNING "On inference mode, will take place some specific optimization. Turn on the ON_INFER flag when building inference_lib only.") +endif() + add_subdirectory(paddle) if(WITH_PYTHON) add_subdirectory(python) endif() - -if(WITH_DOC) - find_package(Sphinx REQUIRED) - find_python_module(recommonmark REQUIRED) - add_subdirectory(doc) -endif() diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index b878f37a5b8e807e5aa346e0074a741f2f8b6cc5..62b26b99bcbeddc91ed1bd0702b0d6aec2e674bf 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -128,7 +128,7 @@ Please install pre-commit, which automatically reformat the changes to C/C++ and Please remember to add related unit tests. -- For C/C++ code, please follow [`google-test` Primer](https://github.com/google/googletest/blob/master/googletest/docs/Primer.md). +- For C/C++ code, please follow [`google-test` Primer](https://github.com/google/googletest/blob/master/googletest/docs/primer.md) . - For Python code, please use [Python's standard `unittest` package](http://pythontesting.net/framework/unittest/unittest-introduction/). @@ -156,7 +156,7 @@ python \ This will enable VLOG messages generated by `buddy_allocator.{h,cc}` and in the verbose range of 0 to 3, so you will see above example VLOG message, which is in level 3. This suggests that we output overall messages in lower verbose levels, so they display with higher probability. When coding C++, please follow the verbose level convention as follows: -- verbose level 1: [framework](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/framework) -- verbose level 3: [operators](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/operators) -- verbose level 5: [memory](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/memory), [platform](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/platform) -- verbose level 7: [math](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/legacy/math) +- verbose level 1: [framework](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/framework) +- verbose level 3: [operators](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/operators) +- verbose level 5: [memory](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/memory), [platform](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/platform) +- verbose level 7: [math](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/operators/math/) diff --git a/Dockerfile b/Dockerfile index 738bba9bc2e1ab19709722fe04f1490b1b13bd4f..c248ac119caa1f493e4866b02551eb900d3bf391 100644 --- a/Dockerfile +++ b/Dockerfile @@ -11,17 +11,38 @@ RUN /bin/bash -c 'if [[ -n ${UBUNTU_MIRROR} ]]; then sed -i 's#http://archive.ub # ENV variables ARG WITH_GPU ARG WITH_AVX -ARG WITH_DOC ENV WOBOQ OFF ENV WITH_GPU=${WITH_GPU:-ON} ENV WITH_AVX=${WITH_AVX:-ON} -ENV WITH_DOC=${WITH_DOC:-OFF} ENV HOME /root # Add bash enhancements COPY ./paddle/scripts/docker/root/ /root/ +# Prepare packages for Python +RUN apt-get update && \ + apt-get install -y make build-essential libssl-dev zlib1g-dev libbz2-dev \ + libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev \ + xz-utils tk-dev libffi-dev liblzma-dev + +# Install Python3.6 +RUN mkdir -p /root/python_build/ && wget -q https://www.sqlite.org/2018/sqlite-autoconf-3250300.tar.gz && \ + tar -zxf sqlite-autoconf-3250300.tar.gz && cd sqlite-autoconf-3250300 && \ + ./configure -prefix=/usr/local && make -j8 && make install && cd ../ && rm sqlite-autoconf-3250300.tar.gz && \ + wget -q https://www.python.org/ftp/python/3.6.0/Python-3.6.0.tgz && \ + tar -xzf Python-3.6.0.tgz && cd Python-3.6.0 && \ + CFLAGS="-Wformat" ./configure --prefix=/usr/local/ --enable-shared > /dev/null && \ + make -j8 > /dev/null && make altinstall > /dev/null + +# Install Python3.7 +RUN wget -q https://www.python.org/ftp/python/3.7.0/Python-3.7.0.tgz && \ + tar -xzf Python-3.7.0.tgz && cd Python-3.7.0 && \ + CFLAGS="-Wformat" ./configure --prefix=/usr/local/ --enable-shared > /dev/null && \ + make -j8 > /dev/null && make altinstall > /dev/null + +RUN rm -r /root/python_build + RUN apt-get update && \ apt-get install -y --allow-downgrades patchelf \ python3 python3-dev python3-pip \ @@ -54,8 +75,9 @@ RUN curl -s -q https://glide.sh/get | sh # and its size is only one-third of the official one. # 2. Manually add ~IPluginFactory() in IPluginFactory class of NvInfer.h, otherwise, it couldn't work in paddle. # See https://github.com/PaddlePaddle/Paddle/issues/10129 for details. -RUN wget -qO- http://paddlepaddledeps.cdn.bcebos.com/TensorRT-4.0.0.3.Ubuntu-16.04.4.x86_64-gnu.cuda-8.0.cudnn7.0.tar.gz | \ - tar -xz -C /usr/local && \ + +RUN wget -q https://paddlepaddledeps.cdn.bcebos.com/TensorRT-4.0.1.6-ubuntu14.04.x86_64-gnu.cuda.8.0.cudnn7.0.tar.gz --no-check-certificate && \ + tar -zxf TensorRT-4.0.1.6-ubuntu14.04.x86_64-gnu.cuda.8.0.cudnn7.0.tar.gz -C /usr/local && \ cp -rf /usr/local/TensorRT/include /usr && \ cp -rf /usr/local/TensorRT/lib /usr @@ -71,34 +93,52 @@ RUN localedef -i en_US -f UTF-8 en_US.UTF-8 # specify sphinx version as 1.5.6 and remove -U option for [pip install -U # sphinx-rtd-theme] since -U option will cause sphinx being updated to newest # version(1.7.1 for now), which causes building documentation failed. -RUN pip3 install -U wheel && \ - pip3 install -U docopt PyYAML sphinx==1.5.6 && \ - pip3 install sphinx-rtd-theme==0.1.9 recommonmark && \ +RUN pip3 --no-cache-dir install -U wheel && \ + pip3 --no-cache-dir install -U docopt PyYAML sphinx==1.5.6 && \ + pip3 --no-cache-dir install sphinx-rtd-theme==0.1.9 recommonmark && \ + pip3.6 --no-cache-dir install -U wheel && \ + pip3.6 --no-cache-dir install -U docopt PyYAML sphinx==1.5.6 && \ + pip3.6 --no-cache-dir install sphinx-rtd-theme==0.1.9 recommonmark && \ + pip3.7 --no-cache-dir install -U wheel && \ + pip3.7 --no-cache-dir install -U docopt PyYAML sphinx==1.5.6 && \ + pip3.7 --no-cache-dir install sphinx-rtd-theme==0.1.9 recommonmark && \ easy_install -U pip && \ - pip install -U wheel && \ - pip install -U docopt PyYAML sphinx==1.5.6 && \ - pip install sphinx-rtd-theme==0.1.9 recommonmark - -RUN pip3 install pre-commit 'ipython==5.3.0' && \ - pip3 install 'ipykernel==4.6.0' 'jupyter==1.0.0' && \ - pip3 install opencv-python && \ - pip install pre-commit 'ipython==5.3.0' && \ - pip install 'ipykernel==4.6.0' 'jupyter==1.0.0' && \ - pip install opencv-python + pip --no-cache-dir install -U pip setuptools wheel && \ + pip --no-cache-dir install -U docopt PyYAML sphinx==1.5.6 && \ + pip --no-cache-dir install sphinx-rtd-theme==0.1.9 recommonmark + +RUN pip3 --no-cache-dir install 'pre-commit==1.10.4' 'ipython==5.3.0' && \ + pip3 --no-cache-dir install 'ipykernel==4.6.0' 'jupyter==1.0.0' && \ + pip3 --no-cache-dir install opencv-python && \ + pip3.6 --no-cache-dir install 'pre-commit==1.10.4' 'ipython==5.3.0' && \ + pip3.6 --no-cache-dir install 'ipykernel==4.6.0' 'jupyter==1.0.0' && \ + pip3.6 --no-cache-dir install opencv-python && \ + pip3.7 --no-cache-dir install 'pre-commit==1.10.4' 'ipython==5.3.0' && \ + pip3.7 --no-cache-dir install 'ipykernel==4.6.0' 'jupyter==1.0.0' && \ + pip3.7 --no-cache-dir install opencv-python && \ + pip --no-cache-dir install 'pre-commit==1.10.4' 'ipython==5.3.0' && \ + pip --no-cache-dir install 'ipykernel==4.6.0' 'jupyter==1.0.0' && \ + pip --no-cache-dir install opencv-python #For docstring checker -RUN pip3 install pylint pytest astroid isort -RUN pip install pylint pytest astroid isort LinkChecker +RUN pip3 --no-cache-dir install pylint pytest astroid isort +RUN pip3.6 --no-cache-dir install pylint pytest astroid isort +RUN pip3.7 --no-cache-dir install pylint pytest astroid isort +RUN pip --no-cache-dir install pylint pytest astroid isort LinkChecker COPY ./python/requirements.txt /root/ -RUN pip3 install -r /root/requirements.txt -RUN pip install -r /root/requirements.txt +RUN pip3 --no-cache-dir install -r /root/requirements.txt +RUN pip3.6 --no-cache-dir install -r /root/requirements.txt +RUN pip3.7 --no-cache-dir install -r /root/requirements.txt +RUN pip --no-cache-dir install -r /root/requirements.txt # To fix https://github.com/PaddlePaddle/Paddle/issues/1954, we use # the solution in https://urllib3.readthedocs.io/en/latest/user-guide.html#ssl-py2 -RUN apt-get install -y libssl-dev libffi-dev -RUN pip3 install certifi urllib3[secure] -RUN pip install certifi urllib3[secure] +RUN apt-get install -y libssl-dev libffi-dev && apt-get clean -y +RUN pip3 --no-cache-dir install certifi urllib3[secure] +RUN pip3.6 --no-cache-dir install certifi urllib3[secure] +RUN pip3.7 --no-cache-dir install certifi urllib3[secure] +RUN pip --no-cache-dir install certifi urllib3[secure] # Install woboq_codebrowser to /woboq @@ -108,6 +148,14 @@ RUN git clone https://github.com/woboq/woboq_codebrowser /woboq && \ -DCMAKE_BUILD_TYPE=Release . \ make) +# ar mishandles 4GB files +# https://sourceware.org/bugzilla/show_bug.cgi?id=14625 +# remove them when apt-get support 2.27 and higher version +RUN wget -q https://launchpad.net/ubuntu/+archive/primary/+sourcefiles/binutils/2.27-9ubuntu1/binutils_2.27.orig.tar.gz && \ + tar -xzf binutils_2.27.orig.tar.gz && \ + cd binutils-2.27 && \ + ./configure && make -j && make install && cd .. && rm -rf binutils-2.27 binutils_2.27.orig.tar.gz + # Configure OpenSSH server. c.f. https://docs.docker.com/engine/examples/running_ssh_service RUN mkdir /var/run/sshd RUN echo 'root:root' | chpasswd diff --git a/Dockerfile.android b/Dockerfile.android deleted file mode 100644 index 48db2efea21a648657e3f490c95429b9a29ede52..0000000000000000000000000000000000000000 --- a/Dockerfile.android +++ /dev/null @@ -1,42 +0,0 @@ -FROM ubuntu:16.04 -MAINTAINER PaddlePaddle Authors - -ARG UBUNTU_MIRROR -RUN /bin/bash -c 'if [[ -n ${UBUNTU_MIRROR} ]]; then sed -i 's#http://archive.ubuntu.com/ubuntu#${UBUNTU_MIRROR}#g' /etc/apt/sources.list; fi' - -# ENV variables -ARG ANDROID_ABI -ARG ANDROID_API - -ENV ANDROID_ABI=${ANDROID_ABI:-"armeabi-v7a"} -ENV ANDROID_API=${ANDROID_API:-21} - -ENV HOME=/root \ - ANDROID_NDK_HOME=/opt/android-ndk-linux \ - ANDROID_TOOLCHAINS_DIR=/opt/toolchains - -RUN apt-get update && \ - apt-get install -y \ - git python-dev python-pip python-numpy \ - wget curl tar unzip gcc g++ locales clang-format-3.8 swig cmake && \ - apt-get clean -y - -# git credential to skip password typing -RUN git config --global credential.helper store - -# Fix locales to en_US.UTF-8 -RUN localedef -i en_US -f UTF-8 en_US.UTF-8 - -RUN pip install --upgrade pip==9.0.3 && \ - pip install -U 'protobuf==3.1.0' && \ - pip install -U wheel sphinx && \ - pip install pre-commit - -# Android NDK -RUN mkdir -p ${ANDROID_TOOLCHAINS_DIR} && \ - mkdir -p /opt/android-ndk-tmp && \ - cd /opt/android-ndk-tmp && \ - wget -q https://dl.google.com/android/repository/android-ndk-r14b-linux-x86_64.zip && \ - unzip -q android-ndk-r14b-linux-x86_64.zip && \ - mv android-ndk-r14b ${ANDROID_NDK_HOME} && \ - rm -rf /opt/android-ndk-tmp diff --git a/README.md b/README.md index de924fc5fc756f2486ff825b0bccd999402b2997..5c428e9900762a208eebbfd053ce98663f803345 100644 --- a/README.md +++ b/README.md @@ -1,13 +1,13 @@ # PaddlePaddle +English | [简体中文](./README_cn.md) [![Build Status](https://travis-ci.org/PaddlePaddle/Paddle.svg?branch=develop)](https://travis-ci.org/PaddlePaddle/Paddle) -[![Documentation Status](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](http://www.paddlepaddle.org/docs/develop/documentation/en/getstarted/index_en.html) -[![Documentation Status](https://img.shields.io/badge/中文文档-最新-brightgreen.svg)](http://www.paddlepaddle.org/docs/develop/documentation/zh/getstarted/index_cn.html) +[![Documentation Status](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](http://paddlepaddle.org/documentation/docs/en/1.3/beginners_guide/index_en.html) +[![Documentation Status](https://img.shields.io/badge/中文文档-最新-brightgreen.svg)](http://paddlepaddle.org/documentation/docs/zh/1.3/beginners_guide/index.html) [![Release](https://img.shields.io/github/release/PaddlePaddle/Paddle.svg)](https://github.com/PaddlePaddle/Paddle/releases) [![License](https://img.shields.io/badge/license-Apache%202-blue.svg)](LICENSE) - Welcome to the PaddlePaddle GitHub. PaddlePaddle (PArallel Distributed Deep LEarning) is an easy-to-use, @@ -18,8 +18,7 @@ learning to many products at Baidu. Our vision is to enable deep learning for everyone via PaddlePaddle. Please refer to our [release announcement](https://github.com/PaddlePaddle/Paddle/releases) to track the latest feature of PaddlePaddle. - -### Latest PaddlePaddle Release: [Fluid 1.0.0](https://github.com/PaddlePaddle/Paddle/tree/release/1.0.0) +### Latest PaddlePaddle Release: [Fluid 1.3.0](https://github.com/PaddlePaddle/Paddle/tree/release/1.3) ### Install Latest Stable Release: ``` # Linux CPU @@ -27,9 +26,9 @@ pip install paddlepaddle # Linux GPU cuda9cudnn7 pip install paddlepaddle-gpu # Linux GPU cuda8cudnn7 -pip install paddlepaddle-gpu==0.15.0.post87 +pip install paddlepaddle-gpu==1.3.0.post87 # Linux GPU cuda8cudnn5 -pip install paddlepaddle-gpu==0.15.0.post85 +pip install paddlepaddle-gpu==1.3.0.post85 # For installation on other platform, refer to http://paddlepaddle.org/ ``` @@ -76,30 +75,29 @@ pip install paddlepaddle-gpu==0.15.0.post85 ## Installation -It is recommended to read [this doc](http://paddlepaddle.org/documentation/docs/zh/1.0/beginners_guide/index.html) on our website. +It is recommended to read [this doc](http://paddlepaddle.org/documentation/docs/en/1.3/beginners_guide/index_en.html) on our website. ## Documentation -We provide [English](http://paddlepaddle.org/documentation/docs/en/1.0.0/getstarted/index_en.html) and -[Chinese](http://paddlepaddle.org/documentation/docs/zh/1.0/beginners_guide/index.html) documentation. +We provide [English](http://paddlepaddle.org/documentation/docs/en/1.3/beginners_guide/index_en.html) and +[Chinese](http://paddlepaddle.org/documentation/docs/zh/1.3/beginners_guide/index.html) documentation. - [Deep Learning 101](https://github.com/PaddlePaddle/book) You might want to start from this online interactive book that can run in a Jupyter Notebook. -- [Distributed Training](http://paddlepaddle.org/documentation/docs/zh/1.0/user_guides/howto/training/cluster_howto.html) +- [Distributed Training](http://paddlepaddle.org/documentation/docs/en/1.3/user_guides/howto/training/multi_node_en.html) You can run distributed training jobs on MPI clusters. -- [Python API](http://paddlepaddle.org/documentation/api/zh/1.0/fluid.html) +- [Python API](http://paddlepaddle.org/documentation/docs/en/1.3/api/index_en.html) Our new API enables much shorter programs. -- [How to Contribute](http://paddlepaddle.org/documentation/docs/zh/1.0/advanced_usage/development/contribute_to_paddle.html) +- [How to Contribute](http://paddlepaddle.org/documentation/docs/en/1.3/advanced_usage/development/contribute_to_paddle/index_en.html) We appreciate your contributions! - ## Ask Questions You are welcome to submit questions and bug reports as [Github Issues](https://github.com/PaddlePaddle/Paddle/issues). diff --git a/README_cn.md b/README_cn.md new file mode 100644 index 0000000000000000000000000000000000000000..b7b0e75e5524cc483a8c203a382e7f339f91694f --- /dev/null +++ b/README_cn.md @@ -0,0 +1,88 @@ +# PaddlePaddle + +[English](./README.md) | 简体中文 + +[![Build Status](https://travis-ci.org/PaddlePaddle/Paddle.svg?branch=develop)](https://travis-ci.org/PaddlePaddle/Paddle) +[![Documentation Status](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](http://paddlepaddle.org/documentation/docs/en/1.3/beginners_guide/index_en.html) +[![Documentation Status](https://img.shields.io/badge/中文文档-最新-brightgreen.svg)](http://paddlepaddle.org/documentation/docs/zh/1.3/beginners_guide/index.html) +[![Release](https://img.shields.io/github/release/PaddlePaddle/Paddle.svg)](https://github.com/PaddlePaddle/Paddle/releases) +[![License](https://img.shields.io/badge/license-Apache%202-blue.svg)](LICENSE) + +欢迎来到 PaddlePaddle GitHub + +PaddlePaddle (PArallel Distributed Deep LEarning) 是一个简单易用、高效灵活、可扩展的深度学习平台,最初由百度科学家和工程师共同开发,目的是将深度学习技术应用到百度的众多产品中。 + +我们的愿景是让每个人都能通过PaddlePaddle接触深度学习 + +跟进PaddlePaddle最新特性请参考我们的[版本说明](https://github.com/PaddlePaddle/Paddle/releases) + +### PaddlePaddle最新版本: [Fluid 1.3.0](https://github.com/PaddlePaddle/Paddle/tree/release/1.3) +### 安装最新稳定版本: +``` +# Linux CPU +pip install paddlepaddle +# Linux GPU cuda9cudnn7 +pip install paddlepaddle-gpu +# Linux GPU cuda8cudnn7 +pip install paddlepaddle-gpu==1.3.0.post87 +# Linux GPU cuda8cudnn5 +pip install paddlepaddle-gpu==1.3.0.post85 + +# 其他平台上的安装指引请参考 http://paddlepaddle.org/ +``` + +## 特性 + +- **灵活性** + + PaddlePaddle支持丰富的神经网络架构和优化算法。易于配置复杂模型,例如带有注意力机制或复杂记忆连接的神经网络机器翻译模型。 + +- **高效性** + + 为了高效使用异步计算资源,PaddlePaddle对框架的不同层进行优化,包括计算、存储、架构和通信。下面是一些样例: + + - 通过SSE/AVX 内置函数、BLAS库(例如MKL、OpenBLAS、cuBLAS)或定制的CPU/GPU内核优化数学操作。 + - 通过MKL-DNN库优化CNN网络 + - 高度优化循环网络,无需执行 `padding` 操作即可处理 **变长** 序列 + - 针对高维稀疏数据模型,优化了局部和分布式训练。 + + +- **稳定性** + + 有了 PaddlePaddle,使得利用各种CPU/GPU和机器来加速训练变得简单。PaddlePaddle 通过优化通信可以实现巨大吞吐量和快速执行。 + +- **与产品相连** + + 另外,PaddlePaddle 的设计也易于部署。在百度,PaddlePaddle 已经部署到含有巨大用户量的产品和服务上,包括广告点击率(CTR)预测、大规模图像分类、光学字符识别(OCR)、搜索排序,计算机病毒检测、推荐系统等等。PaddlePaddle广泛应用于百度产品中,产生了非常重要的影响。我们希望您也能探索 PaddlePaddle 的能力,为您的产品创造新的影响力和效果。 + +## 安装 + +推荐阅读官网上的[安装说明](http://paddlepaddle.org/documentation/docs/zh/1.3/beginners_guide/install/index_cn.html) + +## 文档 + +我们提供[英文](http://paddlepaddle.org/documentation/docs/en/1.3/beginners_guide/index_en.html)和 +[中文](http://paddlepaddle.org/documentation/docs/zh/1.3/beginners_guide/index.html) 文档 + +- [深度学习101](https://github.com/PaddlePaddle/book) + + 或许您想从这个在线交互式书籍开始,可以在Jupyter Notebook中运行 + +- [分布式训练](http://paddlepaddle.org/documentation/docs/zh/1.3/user_guides/howto/training/multi_node.html) + + 可以在MPI集群上运行分布式训练任务 + +- [Python API](http://paddlepaddle.org/documentation/docs/zh/1.3/api_cn/index_cn.html) + + 新的API支持代码更少更简洁的程序 + +- [贡献方式](http://paddlepaddle.org/documentation/docs/zh/1.3/advanced_usage/development/contribute_to_paddle/index_cn.html) + + 欢迎您的贡献! + +## 答疑 + +欢迎您将问题和bug报告以[Github Issues](https://github.com/PaddlePaddle/Paddle/issues)的形式提交 + +## 版权和许可证 +PaddlePaddle由[Apache-2.0 license](LICENSE)提供 diff --git a/benchmark/IntelOptimizedPaddle.md b/benchmark/IntelOptimizedPaddle.md deleted file mode 100644 index 8b7dc5b7db800896eb4de2054ab5e584aed93999..0000000000000000000000000000000000000000 --- a/benchmark/IntelOptimizedPaddle.md +++ /dev/null @@ -1,112 +0,0 @@ -# Benchmark - -Machine: - -- Server: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, 2 Sockets, 20 Cores per socket -- Laptop: TBD - -System: CentOS release 6.3 (Final), Docker 1.12.1. - -PaddlePaddle: -- paddlepaddle/paddle:0.11.0 (for MKLML and MKL-DNN) - - MKL-DNN tag v0.11 - - MKLML 2018.0.1.20171007 -- paddlepaddle/paddle:0.11.0-openblas (for OpenBLAS) - - OpenBLAS v0.2.20 - -On each machine, we will test and compare the performance of training on single node using MKL-DNN / MKLML / OpenBLAS respectively. - -## Benchmark Model - -### Server - -#### Training -Test on batch size 64, 128, 256 on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz -Pay attetion that the speed below includes forward, backward and parameter update time. So we can not directly compare the data with the benchmark of caffe `time` [command](https://github.com/PaddlePaddle/Paddle/blob/develop/benchmark/caffe/image/run.sh#L9), which only contain forward and backward. The updating time of parameter would become very heavy when the weight size are large, especially on alexnet. - -Input image size - 3 * 224 * 224, Time: images/second - -- VGG-19 - -| BatchSize | 64 | 128 | 256 | -|--------------|-------| -----| --------| -| OpenBLAS | 7.80 | 9.00 | 10.80 | -| MKLML | 12.12 | 13.70 | 16.18 | -| MKL-DNN | 28.46 | 29.83 | 30.44 | - - - - - ResNet-50 - -| BatchSize | 64 | 128 | 256 | -|--------------|-------| ------| -------| -| OpenBLAS | 25.22 | 25.68 | 27.12 | -| MKLML | 32.52 | 31.89 | 33.12 | -| MKL-DNN | 81.69 | 82.35 | 84.08 | - - - - - GoogLeNet - -| BatchSize | 64 | 128 | 256 | -|--------------|-------| ------| -------| -| OpenBLAS | 89.52 | 96.97 | 108.25 | -| MKLML | 128.46| 137.89| 158.63 | -| MKL-DNN     | 250.46| 264.83| 269.50 | - - - -- AlexNet - -| BatchSize | 64 | 128 | 256 | -|--------------|--------| ------ | -------| -| OpenBLAS | 45.62 | 72.79 | 107.22 | -| MKLML | 66.37 | 105.60 | 144.04 | -| MKL-DNN | 399.00 | 498.94 | 626.53 | - - - -#### Inference -Test on batch size 1, 2, 4, 8, 16 on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz -- VGG-19 - -| BatchSize | 1 | 2 | 4 | 8 | 16 | -|-----------|-------|-------|-------|-------|-------| -| OpenBLAS | 1.10 | 1.96 | 3.62 | 3.63 | 2.25 | -| MKLML | 5.58 | 9.80 | 15.15 | 21.21 | 28.67 | -| MKL-DNN | 75.07 | 88.64 | 82.58 | 92.29 | 96.75 | - - - -- ResNet-50 - -| BatchSize | 1 | 2 | 4 | 8 | 16 | -|-----------|-------|--------|--------|--------|--------| -| OpenBLAS | 3.31 | 6.72 | 11.59 | 13.17 | 9.27 | -| MKLML | 6.33 | 12.02 | 22.88 | 40.53 | 63.09 | -| MKL-DNN | 107.83| 148.84 | 177.78 | 189.35 | 217.69 | - - - -- GoogLeNet - -| BatchSize | 1 | 2 | 4 | 8 | 16 | -|-----------|--------|--------|--------|--------|--------| -| OpenBLAS | 12.06 | 23.56 | 34.48 | 36.45 | 23.12 | -| MKLML | 22.74 | 41.56 | 81.22 | 133.47 | 210.53 | -| MKL-DNN | 175.10 | 272.92 | 450.70 | 512.00 | 600.94 | - - - -- AlexNet - -| BatchSize | 1 | 2 | 4 | 8 | 16 | -|-----------|--------|--------|--------|--------|--------| -| OpenBLAS | 3.53 | 6.23 | 15.04 | 26.06 | 31.62 | -| MKLML | 21.32 | 36.55 | 73.06 | 131.15 | 192.77 | -| MKL-DNN | 442.91 | 656.41 | 719.10 | 847.68 | 850.51 | - - - -### Laptop -TBD diff --git a/benchmark/README.md b/benchmark/README.md deleted file mode 100644 index 367013f0457f9bbb9ae1335ea63dce181316d444..0000000000000000000000000000000000000000 --- a/benchmark/README.md +++ /dev/null @@ -1,168 +0,0 @@ -# Benchmark - -Machine: - -- CPU: 12-core Intel(R) Xeon(R) CPU E5-2620 v2 @2.10GHz -- GPU: Tesla K40m -- cuDNN: v5.1 -- system: Docker 1.12.1, all platforms are tested in docker environment. - -Platforms: - -- PaddlePaddle: paddledev/paddle:gpu-devel-v0.9.0a0 -- Tensorflow: gcr.io/tensorflow/tensorflow:0.11.0rc0-gpu -- Caffe: kaixhin/cuda-caffe - -Several convolutional neural networks and recurrent neural networks are used to test. - -## Image - -### Benchmark Model - -AlexNet, GoogleNet and a small network used in Caffe. - -- [AlexNet](https://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet): but the group size is one. - -- [GoogleNet](https://github.com/BVLC/caffe/tree/master/models/bvlc_googlenet): but remove loss1 and loss2 when testing benchmark. - -- [SmallNet](https://github.com/BVLC/caffe/blob/master/examples/cifar10/cifar10\_quick\_train\_test.prototxt) - - -### Single-GPU - -- AlexNet: input - 3 * 227 * 227, Time: ms/batch - -| BatchSize | 64 | 128 | 256 | 512 | -|--------------|-----| -----| ------| -----| -| PaddlePaddle | 195 | 334 | 602 | 1629 | -| TensorFlow | 223 | 364 | 645 | 1235 | -| Caffe | 324 | 627 | 1232 | 2513 | - -**Notation** - -All platforms use cuDNN-v5.1. We see that caffe is slower in this experiment, because its workspace limit size of cuDNN-conv interface is 8 * 1024 * 1024, which is smaller in PaddlePaddle and TensorFlow. Note that Caffe will be faster if increasing the workspace limit size. - -- GoogletNet: input - 3 * 224 * 224, Time: ms/batch - - -| BatchSize | 64 | 128 | 256 | -|--------------|-------| -------| --------| -| PaddlePaddle | 613 | 1149 | 2348 | -| TensorFlow | 644 | 1176 | 2219 | -| Caffe | 694 | 1364 | out of memory | - -- SmallNet: input - 3 * 32 * 32, Time ms/batch - -| BatchSize | 64 | 128 | 256 | 512 | -|--------------|--------| -------- | --------|---------| -| PaddlePaddle | 10.463 | 18.184 | 33.113 | 63.039 | -| TensorFlow | 9 | 15 | 28 | 59 | -| Caffe | 9.373 | 16.6606 | 31.4797 | 59.719 | - -**Notation** - -All the single-GPU experiments in caffe use `caffe time` to calculate elapsed time, which does not include parameter updating time. However, both PaddlePaddle and TensorFlow experiments contain the parameter updating time. As compared with the total time, this part is relatively little on single machine, we can ignore it. - -In Tensorflow, they implement algorithm searching method instead of using the algorithm searching interface in cuDNN. - -### Multi-GPU: 4 GPUs - -- AlexNet, ms / batch - -| total-BatchSize | 128 * 4 | 256 * 4 | -|------------------|----------| -----------| -| PaddlePaddle | 347 | 622 | -| TensorFlow | 377 | 675 | -| Caffe | 1229 | 2435 | - -For example, if `total-BatchSize = 128 * 4`, the speedup ratio is calculated by - -``` - time_at_1gpu_batch_128 * 4 / time_at_4gpu_total_batch_512 -= (334 * 4)/347 -= 3.85 -``` - - - - -- GoogleNet, ms / batch - -| total-BatchSize | 128 * 4 | 256 * 4 | -|-------------------|--------------| ----------- | -| PaddlePaddle | 1178 | 2367 | -| TensorFlow | 1210 | 2292 | -| Caffe | 2007 | out of memory | - - - - -## RNN -We use lstm network for text classfication to test benchmark. - -### Dataset -- [IMDB](http://www.iro.umontreal.ca/~lisa/deep/data/imdb.pkl) -- Sequence length is 100. In fact, PaddlePaddle supports training with variable-length sequence, but TensorFlow needs to pad. Thus, we also pad sequence length to 100 in PaddlePaddle in order to compare. -- Dictionary size=30000 -- Peephole connection is used in `lstmemory` by default in PaddlePaddle. It is also configured in TensorFlow. - -### Single-GPU - -#### LSTM in Text Classification - -Testing `2 lstm layer + fc` network with different hidden size and batch size. - -- Batch size = 64, ms / batch - -| hidden_size | 256 | 512 | 1280 | -|--------------|-------| -------| --------| -| PaddlePaddle | 83 | 184 | 641 | -| TensorFlow | 175 | 280 | 818 | - -- Batch size = 128, ms / batch - -| hidden_size | 256 | 512 | 1280 | -|--------------|------- | -------| --------| -| PaddlePaddle | 110 | 261 | 1007 | -| TensorFlow | 181 | 361 | 1237 | - - -- Batch size = 256, ms / batch - -| hidden_size | 256 | 512 | 1280 | -|--------------|-------| -------| --------| -| PaddlePaddle | 170 | 414 | 1655 | -| TensorFlow | 238 | 536 | 1905 | - - - -#### Seq2Seq - -The benchmark of sequence-to-sequence network will be added later. - - -### Multi GPU: 4 GPUs - -#### LSTM in Text Classification - -- hidden_size = 256, ms / batch - -| batch_size | 256 | 512 | -|--------------| -------| --------| -| PaddlePaddle | 90 | 118 | -| TensorFlow | 226 | 118 | - - -- hidden_size = 512, ms / batch - -| batch_size | 256 | 512 | -|--------------| -------| --------| -| PaddlePaddle | 189 | 268 | -| TensorFlow | 297 | 383 | - - - - -#### Seq2Seq - -The benchmark of sequence-to-sequence network will be added later. diff --git a/benchmark/fluid/Dockerfile b/benchmark/fluid/Dockerfile index 2e1e0d376899fd664866621263db62258e7c3869..81ea870050fe5db4a60fee40221991e38de6bd2e 100644 --- a/benchmark/fluid/Dockerfile +++ b/benchmark/fluid/Dockerfile @@ -15,9 +15,6 @@ RUN ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.7 /usr/lib/libcudnn.so && ln -s RUN pip install -U pip RUN pip install -U kubernetes paddlepaddle -RUN sh -c 'echo "import paddle.v2 as paddle\npaddle.dataset.cifar.train10()\npaddle.dataset.flowers.fetch()" | python' -RUN sh -c 'echo "import paddle.v2 as paddle\npaddle.dataset.mnist.train()\npaddle.dataset.mnist.test()\npaddle.dataset.imdb.fetch()" | python' -RUN sh -c 'echo "import paddle.v2 as paddle\npaddle.dataset.imikolov.fetch()" | python' RUN pip uninstall -y paddlepaddle && mkdir /workspace ADD https://raw.githubusercontent.com/PaddlePaddle/cloud/develop/docker/paddle_k8s /usr/bin diff --git a/benchmark/fluid/args.py b/benchmark/fluid/args.py index 9540900b112f54594bbfdbc8d7cd3b6e1f5269dd..ff616ddbb2cb1cb7f348d6d164815823b08b7629 100644 --- a/benchmark/fluid/args.py +++ b/benchmark/fluid/args.py @@ -142,5 +142,10 @@ def parse_args(): choices=['reduce', 'all_reduce'], default='all_reduce', help='Specify the reduce strategy, can be reduce, all_reduce') + parser.add_argument( + '--fuse_broadcast_op', + action='store_true', + help='If set, would fuse multiple broadcast operators into one fused_broadcast operator.' + ) args = parser.parse_args() return args diff --git a/benchmark/paddle/image/check_env.sh b/benchmark/fluid/check_env.sh similarity index 100% rename from benchmark/paddle/image/check_env.sh rename to benchmark/fluid/check_env.sh diff --git a/benchmark/fluid/fluid_benchmark.py b/benchmark/fluid/fluid_benchmark.py index ddd9fe809853a830ca676cc98f1819f683866def..df159a334e86d62e175bce3b363b74ec78c1fd64 100644 --- a/benchmark/fluid/fluid_benchmark.py +++ b/benchmark/fluid/fluid_benchmark.py @@ -81,9 +81,11 @@ def dist_transpile(trainer_id, args, train_prog, startup_prog): # the role, should be either PSERVER or TRAINER training_role = os.getenv("PADDLE_TRAINING_ROLE") - config = distribute_transpiler.DistributeTranspilerConfig() + config = fluid.DistributeTranspilerConfig() config.slice_var_up = not args.no_split_var + config.min_block_size = 1048576 t = distribute_transpiler.DistributeTranspiler(config=config) + t.transpile( trainer_id, # NOTE: *MUST* use train_prog, for we are using with guard to @@ -240,7 +242,6 @@ def train_parallel(train_args, test_args, args, train_prog, test_prog, if args.use_fake_data or args.use_reader_op: try: - fetch_ret = exe.run(fetch_list) except fluid.core.EOFException as eof: break diff --git a/benchmark/paddle/image/alexnet.py b/benchmark/paddle/image/alexnet.py deleted file mode 100644 index 9efc3f0494e4a817a7357f29e684f621bce1921e..0000000000000000000000000000000000000000 --- a/benchmark/paddle/image/alexnet.py +++ /dev/null @@ -1,93 +0,0 @@ -# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from paddle.trainer_config_helpers import * - -height = 227 -width = 227 -num_class = 1000 -batch_size = get_config_arg('batch_size', int, 128) -gp = get_config_arg('layer_num', int, 1) -is_infer = get_config_arg("is_infer", bool, False) -num_samples = get_config_arg('num_samples', int, 2560) - -args = { - 'height': height, - 'width': width, - 'color': True, - 'num_class': num_class, - 'is_infer': is_infer, - 'num_samples': num_samples -} -define_py_data_sources2( - "train.list" if not is_infer else None, - "test.list" if is_infer else None, - module="provider", - obj="process", - args=args) - -settings( - batch_size=batch_size, - learning_rate=0.01 / batch_size, - learning_method=MomentumOptimizer(0.9), - regularization=L2Regularization(0.0005 * batch_size)) - -# conv1 -net = data_layer('data', size=height * width * 3) -net = img_conv_layer( - input=net, - filter_size=11, - num_channels=3, - num_filters=96, - stride=4, - padding=1) -net = img_cmrnorm_layer(input=net, size=5, scale=0.0001, power=0.75) -net = img_pool_layer(input=net, pool_size=3, stride=2) - -# conv2 -net = img_conv_layer( - input=net, filter_size=5, num_filters=256, stride=1, padding=2, groups=gp) -net = img_cmrnorm_layer(input=net, size=5, scale=0.0001, power=0.75) -net = img_pool_layer(input=net, pool_size=3, stride=2) - -# conv3 -net = img_conv_layer( - input=net, filter_size=3, num_filters=384, stride=1, padding=1) -# conv4 -net = img_conv_layer( - input=net, filter_size=3, num_filters=384, stride=1, padding=1, groups=gp) - -# conv5 -net = img_conv_layer( - input=net, filter_size=3, num_filters=256, stride=1, padding=1, groups=gp) -net = img_pool_layer(input=net, pool_size=3, stride=2) - -net = fc_layer( - input=net, - size=4096, - act=ReluActivation(), - layer_attr=ExtraAttr(drop_rate=0.5)) -net = fc_layer( - input=net, - size=4096, - act=ReluActivation(), - layer_attr=ExtraAttr(drop_rate=0.5)) -net = fc_layer(input=net, size=1000, act=SoftmaxActivation()) - -if is_infer: - outputs(net) -else: - lab = data_layer('label', num_class) - loss = cross_entropy(input=net, label=lab) - outputs(loss) diff --git a/benchmark/paddle/image/googlenet.py b/benchmark/paddle/image/googlenet.py deleted file mode 100644 index 2a850ccb7f2c75b467554181fc5f4aa8f2b97a09..0000000000000000000000000000000000000000 --- a/benchmark/paddle/image/googlenet.py +++ /dev/null @@ -1,245 +0,0 @@ -#!/usr/bin/env python -from paddle.trainer_config_helpers import * - -height = 224 -width = 224 -num_class = 1000 -batch_size = get_config_arg('batch_size', int, 128) -use_gpu = get_config_arg('use_gpu', bool, True) -is_infer = get_config_arg("is_infer", bool, False) -num_samples = get_config_arg('num_samples', int, 2560) - -args = { - 'height': height, - 'width': width, - 'color': True, - 'num_class': num_class, - 'is_infer': is_infer, - 'num_samples': num_samples -} -define_py_data_sources2( - "train.list" if not is_infer else None, - "test.list" if is_infer else None, - module="provider", - obj="process", - args=args) - -settings( - batch_size=batch_size, - learning_rate=0.01 / batch_size, - learning_method=MomentumOptimizer(0.9), - regularization=L2Regularization(0.0005 * batch_size)) - -conv_projection = conv_projection if use_gpu else img_conv_layer - -def inception2(name, input, channels, \ - filter1, - filter3R, filter3, - filter5R, filter5, - proj): - - conv1 = name + '_1' - conv3r = name + '_3r' - conv3 = name + '_3' - conv5r = name + '_5r' - conv5 = name + '_5' - maxpool = name + '_max' - convproj = name + '_proj' - - cov1 = img_conv_layer( - name=conv1, - input=input, - filter_size=1, - num_channels=channels, - num_filters=filter1, - stride=1, - padding=0) - - cov3r = img_conv_layer( - name=conv3r, - input=input, - filter_size=1, - num_channels=channels, - num_filters=filter3R, - stride=1, - padding=0) - cov3 = img_conv_layer( - name=conv3, - input=cov3r, - filter_size=3, - num_filters=filter3, - stride=1, - padding=1) - - cov5r = img_conv_layer( - name=conv5r, - input=input, - filter_size=1, - num_channels=channels, - num_filters=filter5R, - stride=1, - padding=0) - cov5 = img_conv_layer( - name=conv5, - input=cov5r, - filter_size=5, - num_filters=filter5, - stride=1, - padding=2) - - pool1 = img_pool_layer( - name=maxpool, - input=input, - pool_size=3, - num_channels=channels, - stride=1, - padding=1) - covprj = img_conv_layer( - name=convproj, - input=pool1, - filter_size=1, - num_filters=proj, - stride=1, - padding=0) - - cat = concat_layer(name=name, input=[cov1, cov3, cov5, covprj]) - return cat - -def inception(name, input, channels, \ - filter1, - filter3R, filter3, - filter5R, filter5, - proj): - - cov1 = conv_projection( - input=input, - filter_size=1, - num_channels=channels, - num_filters=filter1, - stride=1, - padding=0) - - cov3r = img_conv_layer( - name=name + '_3r', - input=input, - filter_size=1, - num_channels=channels, - num_filters=filter3R, - stride=1, - padding=0) - cov3 = conv_projection( - input=cov3r, filter_size=3, num_filters=filter3, stride=1, padding=1) - - cov5r = img_conv_layer( - name=name + '_5r', - input=input, - filter_size=1, - num_channels=channels, - num_filters=filter5R, - stride=1, - padding=0) - cov5 = conv_projection( - input=cov5r, filter_size=5, num_filters=filter5, stride=1, padding=2) - - pool1 = img_pool_layer( - name=name + '_max', - input=input, - pool_size=3, - num_channels=channels, - stride=1, - padding=1) - covprj = conv_projection( - input=pool1, filter_size=1, num_filters=proj, stride=1, padding=0) - - cat = concat_layer( - name=name, - input=[cov1, cov3, cov5, covprj], - bias_attr=True if use_gpu else False, - act=ReluActivation()) - return cat - - -data = data_layer(name="input", size=3 * height * width) - -# stage 1 -conv1 = img_conv_layer( - name="conv1", - input=data, - filter_size=7, - num_channels=3, - num_filters=64, - stride=2, - padding=3) -pool1 = img_pool_layer( - name="pool1", input=conv1, pool_size=3, num_channels=64, stride=2) - -# stage 2 -conv2_1 = img_conv_layer( - name="conv2_1", - input=pool1, - filter_size=1, - num_filters=64, - stride=1, - padding=0) -conv2_2 = img_conv_layer( - name="conv2_2", - input=conv2_1, - filter_size=3, - num_filters=192, - stride=1, - padding=1) -pool2 = img_pool_layer( - name="pool2", input=conv2_2, pool_size=3, num_channels=192, stride=2) - -# stage 3 -ince3a = inception("ince3a", pool2, 192, 64, 96, 128, 16, 32, 32) -ince3b = inception("ince3b", ince3a, 256, 128, 128, 192, 32, 96, 64) -pool3 = img_pool_layer( - name="pool3", input=ince3b, num_channels=480, pool_size=3, stride=2) - -# stage 4 -ince4a = inception("ince4a", pool3, 480, 192, 96, 208, 16, 48, 64) -ince4b = inception("ince4b", ince4a, 512, 160, 112, 224, 24, 64, 64) -ince4c = inception("ince4c", ince4b, 512, 128, 128, 256, 24, 64, 64) -ince4d = inception("ince4d", ince4c, 512, 112, 144, 288, 32, 64, 64) -ince4e = inception("ince4e", ince4d, 528, 256, 160, 320, 32, 128, 128) -pool4 = img_pool_layer( - name="pool4", input=ince4e, num_channels=832, pool_size=3, stride=2) - -# stage 5 -ince5a = inception("ince5a", pool4, 832, 256, 160, 320, 32, 128, 128) -ince5b = inception("ince5b", ince5a, 832, 384, 192, 384, 48, 128, 128) -pool5 = img_pool_layer( - name="pool5", - input=ince5b, - num_channels=1024, - pool_size=7, - stride=7, - pool_type=AvgPooling()) - -# We remove loss1 and loss2 for all system when testing benchmark -# output 1 -# pool_o1 = img_pool_layer(name="pool_o1", input=ince4a, num_channels=512, pool_size=5, stride=3, pool_type=AvgPooling()) -# conv_o1 = img_conv_layer(name="conv_o1", input=pool_o1, filter_size=1, num_filters=128, stride=1, padding=0) -# fc_o1 = fc_layer(name="fc_o1", input=conv_o1, size=1024, layer_attr=ExtraAttr(drop_rate=0.7), act=ReluActivation()) -# out1 = fc_layer(name="output1", input=fc_o1, size=1000, act=SoftmaxActivation()) -# loss1 = cross_entropy(name='loss1', input=out1, label=lab, coeff=0.3) - -# output 2 -#pool_o2 = img_pool_layer(name="pool_o2", input=ince4d, num_channels=528, pool_size=5, stride=3, pool_type=AvgPooling()) -#conv_o2 = img_conv_layer(name="conv_o2", input=pool_o2, filter_size=1, num_filters=128, stride=1, padding=0) -#fc_o2 = fc_layer(name="fc_o2", input=conv_o2, size=1024, layer_attr=ExtraAttr(drop_rate=0.7), act=ReluActivation()) -#out2 = fc_layer(name="output2", input=fc_o2, size=1000, act=SoftmaxActivation()) -#loss2 = cross_entropy(name='loss2', input=out2, label=lab, coeff=0.3) - -# output 3 -dropout = dropout_layer(name="dropout", input=pool5, dropout_rate=0.4) -out3 = fc_layer( - name="output3", input=dropout, size=1000, act=SoftmaxActivation()) - -if is_infer: - outputs(out3) -else: - lab = data_layer(name="label", size=num_class) - loss3 = cross_entropy(name='loss3', input=out3, label=lab) - outputs(loss3) diff --git a/benchmark/paddle/image/plotlog.py b/benchmark/paddle/image/plotlog.py deleted file mode 100644 index 8679d4f272d1b7aaf8d5a397f07698a6b70e4fcd..0000000000000000000000000000000000000000 --- a/benchmark/paddle/image/plotlog.py +++ /dev/null @@ -1,114 +0,0 @@ -# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import sys -import argparse -import matplotlib.pyplot as plt - - -def parse_args(): - parser = argparse.ArgumentParser('Parse Log') - parser.add_argument( - '--file_path', '-f', type=str, help='the path of the log file') - parser.add_argument( - '--sample_rate', - '-s', - type=float, - default=1.0, - help='the rate to take samples from log') - parser.add_argument( - '--log_period', '-p', type=int, default=1, help='the period of log') - - args = parser.parse_args() - return args - - -def parse_file(file_name): - loss = [] - error = [] - with open(file_name) as f: - for i, line in enumerate(f): - line = line.strip() - if not line.startswith('pass'): - continue - line_split = line.split(' ') - if len(line_split) != 5: - continue - - loss_str = line_split[2][:-1] - cur_loss = float(loss_str.split('=')[-1]) - loss.append(cur_loss) - - err_str = line_split[3][:-1] - cur_err = float(err_str.split('=')[-1]) - error.append(cur_err) - - accuracy = [1.0 - err for err in error] - - return loss, accuracy - - -def sample(metric, sample_rate): - interval = int(1.0 / sample_rate) - if interval > len(metric): - return metric[:1] - - num = len(metric) / interval - idx = [interval * i for i in range(num)] - metric_sample = [metric[id] for id in idx] - return metric_sample - - -def plot_metric(metric, - batch_id, - graph_title, - line_style='b-', - line_label='y', - line_num=1): - plt.figure() - plt.title(graph_title) - if line_num == 1: - plt.plot(batch_id, metric, line_style, label=line_label) - else: - for i in range(line_num): - plt.plot(batch_id, metric[i], line_style[i], label=line_label[i]) - plt.xlabel('batch') - plt.ylabel(graph_title) - plt.legend() - plt.savefig(graph_title + '.jpg') - plt.close() - - -def main(): - args = parse_args() - assert args.sample_rate > 0. and args.sample_rate <= 1.0, "The sample rate should in the range (0, 1]." - - loss, accuracy = parse_file(args.file_path) - batch = [args.log_period * i for i in range(len(loss))] - - batch_sample = sample(batch, args.sample_rate) - loss_sample = sample(loss, args.sample_rate) - accuracy_sample = sample(accuracy, args.sample_rate) - - plot_metric(loss_sample, batch_sample, 'loss', line_label='loss') - plot_metric( - accuracy_sample, - batch_sample, - 'accuracy', - line_style='g-', - line_label='accuracy') - - -if __name__ == '__main__': - main() diff --git a/benchmark/paddle/image/provider.py b/benchmark/paddle/image/provider.py deleted file mode 100644 index 6ad817ccefab3e44a8f962e907ba2110a6ed4a45..0000000000000000000000000000000000000000 --- a/benchmark/paddle/image/provider.py +++ /dev/null @@ -1,47 +0,0 @@ -# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import io, os -import random -import numpy as np -from paddle.trainer.PyDataProvider2 import * - - -def initHook(settings, height, width, color, num_class, **kwargs): - settings.height = height - settings.width = width - settings.color = color - settings.num_class = num_class - if settings.color: - settings.data_size = settings.height * settings.width * 3 - else: - settings.data_size = settings.height * settings.width - settings.is_infer = kwargs.get('is_infer', False) - settings.num_samples = kwargs.get('num_samples', 2560) - if settings.is_infer: - settings.slots = [dense_vector(settings.data_size)] - else: - settings.slots = [dense_vector(settings.data_size), integer_value(1)] - - -@provider( - init_hook=initHook, min_pool_size=-1, cache=CacheType.CACHE_PASS_IN_MEM) -def process(settings, file_list): - for i in xrange(settings.num_samples): - img = np.random.rand(1, settings.data_size).reshape(-1, 1).flatten() - if settings.is_infer: - yield img.astype('float32') - else: - lab = random.randint(0, settings.num_class - 1) - yield img.astype('float32'), int(lab) diff --git a/benchmark/paddle/image/resnet.py b/benchmark/paddle/image/resnet.py deleted file mode 100644 index 2846e4763f1cda4602f03af5ec649d57ee6cf0d8..0000000000000000000000000000000000000000 --- a/benchmark/paddle/image/resnet.py +++ /dev/null @@ -1,230 +0,0 @@ -#!/usr/bin/env python -from paddle.trainer_config_helpers import * - -height = 224 -width = 224 -num_class = 1000 -batch_size = get_config_arg('batch_size', int, 64) -layer_num = get_config_arg("layer_num", int, 50) -is_infer = get_config_arg("is_infer", bool, False) -num_samples = get_config_arg('num_samples', int, 2560) - -args = { - 'height': height, - 'width': width, - 'color': True, - 'num_class': num_class, - 'is_infer': is_infer, - 'num_samples': num_samples -} -define_py_data_sources2( - "train.list" if not is_infer else None, - "test.list" if is_infer else None, - module="provider", - obj="process", - args=args) - -settings( - batch_size=batch_size, - learning_rate=0.01 / batch_size, - learning_method=MomentumOptimizer(0.9), - regularization=L2Regularization(0.0005 * batch_size)) - - -#######################Network Configuration ############# -def conv_bn_layer(name, - input, - filter_size, - num_filters, - stride, - padding, - channels=None, - active_type=ReluActivation()): - """ - A wrapper for conv layer with batch normalization layers. - Note: - conv layer has no activation. - """ - - tmp = img_conv_layer( - name=name + "_conv", - input=input, - filter_size=filter_size, - num_channels=channels, - num_filters=num_filters, - stride=stride, - padding=padding, - act=LinearActivation(), - bias_attr=False) - return batch_norm_layer( - name=name + "_bn", - input=tmp, - act=active_type, - use_global_stats=is_infer) - - -def bottleneck_block(name, input, num_filters1, num_filters2): - """ - A wrapper for bottlenect building block in ResNet. - Last conv_bn_layer has no activation. - Addto layer has activation of relu. - """ - last_name = conv_bn_layer( - name=name + '_branch2a', - input=input, - filter_size=1, - num_filters=num_filters1, - stride=1, - padding=0) - last_name = conv_bn_layer( - name=name + '_branch2b', - input=last_name, - filter_size=3, - num_filters=num_filters1, - stride=1, - padding=1) - last_name = conv_bn_layer( - name=name + '_branch2c', - input=last_name, - filter_size=1, - num_filters=num_filters2, - stride=1, - padding=0, - active_type=LinearActivation()) - - return addto_layer( - name=name + "_addto", input=[input, last_name], act=ReluActivation()) - - -def mid_projection(name, input, num_filters1, num_filters2, stride=2): - """ - A wrapper for middile projection in ResNet. - projection shortcuts are used for increasing dimensions, - and other shortcuts are identity - branch1: projection shortcuts are used for increasing - dimensions, has no activation. - branch2x: bottleneck building block, shortcuts are identity. - """ - # stride = 2 - branch1 = conv_bn_layer( - name=name + '_branch1', - input=input, - filter_size=1, - num_filters=num_filters2, - stride=stride, - padding=0, - active_type=LinearActivation()) - - last_name = conv_bn_layer( - name=name + '_branch2a', - input=input, - filter_size=1, - num_filters=num_filters1, - stride=stride, - padding=0) - last_name = conv_bn_layer( - name=name + '_branch2b', - input=last_name, - filter_size=3, - num_filters=num_filters1, - stride=1, - padding=1) - - last_name = conv_bn_layer( - name=name + '_branch2c', - input=last_name, - filter_size=1, - num_filters=num_filters2, - stride=1, - padding=0, - active_type=LinearActivation()) - - return addto_layer( - name=name + "_addto", input=[branch1, last_name], act=ReluActivation()) - - -img = data_layer(name='image', size=height * width * 3) - - -def deep_res_net(res2_num=3, res3_num=4, res4_num=6, res5_num=3): - """ - A wrapper for 50,101,152 layers of ResNet. - res2_num: number of blocks stacked in conv2_x - res3_num: number of blocks stacked in conv3_x - res4_num: number of blocks stacked in conv4_x - res5_num: number of blocks stacked in conv5_x - """ - # For ImageNet - # conv1: 112x112 - tmp = conv_bn_layer( - "conv1", - input=img, - filter_size=7, - channels=3, - num_filters=64, - stride=2, - padding=3) - tmp = img_pool_layer(name="pool1", input=tmp, pool_size=3, stride=2) - - # conv2_x: 56x56 - tmp = mid_projection( - name="res2_1", input=tmp, num_filters1=64, num_filters2=256, stride=1) - for i in xrange(2, res2_num + 1, 1): - tmp = bottleneck_block( - name="res2_" + str(i), input=tmp, num_filters1=64, num_filters2=256) - - # conv3_x: 28x28 - tmp = mid_projection( - name="res3_1", input=tmp, num_filters1=128, num_filters2=512) - for i in xrange(2, res3_num + 1, 1): - tmp = bottleneck_block( - name="res3_" + str(i), - input=tmp, - num_filters1=128, - num_filters2=512) - - # conv4_x: 14x14 - tmp = mid_projection( - name="res4_1", input=tmp, num_filters1=256, num_filters2=1024) - for i in xrange(2, res4_num + 1, 1): - tmp = bottleneck_block( - name="res4_" + str(i), - input=tmp, - num_filters1=256, - num_filters2=1024) - - # conv5_x: 7x7 - tmp = mid_projection( - name="res5_1", input=tmp, num_filters1=512, num_filters2=2048) - for i in xrange(2, res5_num + 1, 1): - tmp = bottleneck_block( - name="res5_" + str(i), - input=tmp, - num_filters1=512, - num_filters2=2048) - - tmp = img_pool_layer( - name='avgpool', - input=tmp, - pool_size=7, - stride=1, - pool_type=AvgPooling()) - - return fc_layer(input=tmp, size=num_class, act=SoftmaxActivation()) - - -if layer_num == 50: - resnet = deep_res_net(3, 4, 6, 3) -elif layer_num == 101: - resnet = deep_res_net(3, 4, 23, 3) -elif layer_num == 152: - resnet = deep_res_net(3, 8, 36, 3) -else: - print("Wrong layer number.") - -if is_infer: - outputs(resnet) -else: - lbl = data_layer(name="label", size=num_class) - loss = cross_entropy(name='loss', input=resnet, label=lbl) - outputs(loss) diff --git a/benchmark/paddle/image/run.sh b/benchmark/paddle/image/run.sh deleted file mode 100755 index 5b58a8d773aab795e5439b0f0e5d81bec66b5f56..0000000000000000000000000000000000000000 --- a/benchmark/paddle/image/run.sh +++ /dev/null @@ -1,53 +0,0 @@ -#!/bin/bash - -set -e - -function train() { - cfg=$1 - thread=$2 - bz=$3 - args="batch_size=$3" - prefix=$4 - paddle train --job=time \ - --config=$cfg \ - --use_gpu=True \ - --trainer_count=$thread \ - --log_period=10 \ - --test_period=100 \ - --config_args=$args \ - > logs/$prefix-${thread}gpu-$bz.log 2>&1 -} - -if [ ! -d "train.list" ]; then - echo " " > train.list -fi -if [ ! -d "logs" ]; then - mkdir logs -fi - -#========single-gpu=========# -# alexnet -train alexnet.py 1 64 alexnet -train alexnet.py 1 128 alexnet -train alexnet.py 1 256 alexnet -train alexnet.py 1 512 alexnet - -# googlenet -train googlenet.py 1 64 googlenet -train googlenet.py 1 128 googlenet -train googlenet.py 1 256 googlenet - -# smallnet -train smallnet_mnist_cifar.py 1 64 smallnet -train smallnet_mnist_cifar.py 1 128 smallnet -train smallnet_mnist_cifar.py 1 256 smallnet -train smallnet_mnist_cifar.py 1 512 smallnet - - -############################ -#========multi-gpus=========# -train alexnet.py 4 512 alexnet -train alexnet.py 4 1024 alexnet - -train googlenet.py 4 512 googlenet -train googlenet.py 4 1024 googlenet diff --git a/benchmark/paddle/image/run_mkl_infer.sh b/benchmark/paddle/image/run_mkl_infer.sh deleted file mode 100755 index 0fad5e04cc992a3ec97591d3833957bb7517a8f3..0000000000000000000000000000000000000000 --- a/benchmark/paddle/image/run_mkl_infer.sh +++ /dev/null @@ -1,89 +0,0 @@ -#!/bin/bash - -set -e - -function clock_to_seconds() { - hours=`echo $1 | awk -F ':' '{print $1}'` - mins=`echo $1 | awk -F ':' '{print $2}'` - secs=`echo $1 | awk -F ':' '{print $3}'` - echo `awk 'BEGIN{printf "%.2f",('$secs' + '$mins' * 60 + '$hours' * 3600)}'` -} - -function infer() { - unset OMP_NUM_THREADS MKL_NUM_THREADS OMP_DYNAMIC KMP_AFFINITY - topology=$1 - layer_num=$2 - bs=$3 - use_mkldnn=$4 - if [ $4 == "True" ]; then - thread=1 - log="logs/infer-${topology}-${layer_num}-mkldnn-${bs}.log" - elif [ $4 == "False" ]; then - thread=`nproc` - if [ $thread -gt $bs ]; then - thread=$bs - fi - log="logs/infer-${topology}-${layer_num}-${thread}mklml-${bs}.log" - else - echo "Wrong input $4, use True or False." - exit 0 - fi - - models_in="models/${topology}-${layer_num}/pass-00000/" - if [ ! -d $models_in ]; then - echo "Training model ${topology}_${layer_num}" - paddle train --job=train \ - --config="${topology}.py" \ - --use_mkldnn=True \ - --use_gpu=False \ - --trainer_count=1 \ - --num_passes=1 \ - --save_dir="models/${topology}-${layer_num}" \ - --config_args="batch_size=128,layer_num=${layer_num},num_samples=256" \ - > /dev/null 2>&1 - echo "Done" - fi - log_period=$((256 / bs)) - paddle train --job=test \ - --config="${topology}.py" \ - --use_mkldnn=$use_mkldnn \ - --use_gpu=False \ - --trainer_count=$thread \ - --log_period=$log_period \ - --config_args="batch_size=${bs},layer_num=${layer_num},is_infer=True" \ - --init_model_path=$models_in \ - 2>&1 | tee ${log} - - # calculate the last 5 logs period time of 1280 samples, - # the time before are burning time. - start=`tail ${log} -n 7 | head -n 1 | awk -F ' ' '{print $2}' | xargs` - end=`tail ${log} -n 2 | head -n 1 | awk -F ' ' '{print $2}' | xargs` - start_sec=`clock_to_seconds $start` - end_sec=`clock_to_seconds $end` - fps=`awk 'BEGIN{printf "%.2f",(1280 / ('$end_sec' - '$start_sec'))}'` - echo "Last 1280 samples start: ${start}(${start_sec} sec), end: ${end}(${end_sec} sec;" >> ${log} - echo "FPS: $fps images/sec" 2>&1 | tee -a ${log} -} - -if [ ! -f "train.list" ]; then - echo " " > train.list -fi -if [ ! -f "test.list" ]; then - echo " " > test.list -fi -if [ ! -d "logs" ]; then - mkdir logs -fi -if [ ! -d "models" ]; then - mkdir -p models -fi - -# inference benchmark -for use_mkldnn in True False; do - for batchsize in 1 2 4 8 16; do - infer vgg 19 $batchsize $use_mkldnn - infer resnet 50 $batchsize $use_mkldnn - infer googlenet v1 $batchsize $use_mkldnn - infer alexnet 2 $batchsize $use_mkldnn - done -done diff --git a/benchmark/paddle/image/run_mkl_train.sh b/benchmark/paddle/image/run_mkl_train.sh deleted file mode 100755 index 1583bf134a276a08aa2f8e84dc63adbb205a83d6..0000000000000000000000000000000000000000 --- a/benchmark/paddle/image/run_mkl_train.sh +++ /dev/null @@ -1,54 +0,0 @@ -#!/bin/bash - -set -e - -function train() { - unset OMP_NUM_THREADS MKL_NUM_THREADS OMP_DYNAMIC KMP_AFFINITY - topology=$1 - layer_num=$2 - bs=$3 - use_mkldnn=$4 - if [ $4 == "True" ]; then - thread=1 - log="logs/train-${topology}-${layer_num}-mkldnn-${bs}.log" - elif [ $4 == "False" ]; then - thread=`nproc` - # each trainer_count use only 1 core to avoid conflict - log="logs/train-${topology}-${layer_num}-${thread}mklml-${bs}.log" - else - echo "Wrong input $4, use True or False." - exit 0 - fi - args="batch_size=${bs},layer_num=${layer_num}" - config="${topology}.py" - paddle train --job=time \ - --config=$config \ - --use_mkldnn=$use_mkldnn \ - --use_gpu=False \ - --trainer_count=$thread \ - --log_period=10 \ - --test_period=100 \ - --config_args=$args \ - 2>&1 | tee ${log} - - avg_time=`tail ${log} -n 1 | awk -F ' ' '{print $8}' | sed 's/avg=//'` - fps=`awk 'BEGIN{printf "%.2f",('$bs' / '$avg_time' * 1000)}'` - echo "FPS: $fps images/sec" 2>&1 | tee -a ${log} -} - -if [ ! -f "train.list" ]; then - echo " " > train.list -fi -if [ ! -d "logs" ]; then - mkdir logs -fi - -# training benchmark -for use_mkldnn in True False; do - for batchsize in 64 128 256; do - train vgg 19 $batchsize $use_mkldnn - train resnet 50 $batchsize $use_mkldnn - train googlenet v1 $batchsize $use_mkldnn - train alexnet 2 $batchsize $use_mkldnn - done -done diff --git a/benchmark/paddle/image/run_openblas_infer.sh b/benchmark/paddle/image/run_openblas_infer.sh deleted file mode 100755 index 987381cabc2e793886099212660723c122b73bb0..0000000000000000000000000000000000000000 --- a/benchmark/paddle/image/run_openblas_infer.sh +++ /dev/null @@ -1,71 +0,0 @@ -#!/bin/bash - -set -e - -function clock_to_seconds() { - hours=`echo $1 | awk -F ':' '{print $1}'` - mins=`echo $1 | awk -F ':' '{print $2}'` - secs=`echo $1 | awk -F ':' '{print $3}'` - echo `awk 'BEGIN{printf "%.2f",('$secs' + '$mins' * 60 + '$hours' * 3600)}'` -} - -function infer() { - export OPENBLAS_MAIN_FREE=1 - topology=$1 - layer_num=$2 - bs=$3 - trainers=`nproc` - if [ $trainers -gt $bs ]; then - trainers=$bs - fi - log="logs/infer-${topology}-${layer_num}-${trainers}openblas-${bs}.log" - threads=$((`nproc` / trainers)) - if [ $threads -eq 0 ]; then - threads=1 - fi - export OPENBLAS_NUM_THREADS=$threads - - models_in="models/${topology}-${layer_num}/pass-00000/" - if [ ! -d $models_in ]; then - echo "./run_mkl_infer.sh to save the model first" - exit 0 - fi - log_period=$((32 / bs)) - paddle train --job=test \ - --config="${topology}.py" \ - --use_mkldnn=False \ - --use_gpu=False \ - --trainer_count=$trainers \ - --log_period=$log_period \ - --config_args="batch_size=${bs},layer_num=${layer_num},is_infer=True,num_samples=256" \ - --init_model_path=$models_in \ - 2>&1 | tee ${log} - - # calculate the last 5 logs period time of 160(=32*5) samples, - # the time before are burning time. - start=`tail ${log} -n 7 | head -n 1 | awk -F ' ' '{print $2}' | xargs` - end=`tail ${log} -n 2 | head -n 1 | awk -F ' ' '{print $2}' | xargs` - start_sec=`clock_to_seconds $start` - end_sec=`clock_to_seconds $end` - fps=`awk 'BEGIN{printf "%.2f",(160 / ('$end_sec' - '$start_sec'))}'` - echo "Last 160 samples start: ${start}(${start_sec} sec), end: ${end}(${end_sec} sec;" >> ${log} - echo "FPS: $fps images/sec" 2>&1 | tee -a ${log} -} - -if [ ! -f "train.list" ]; then - echo " " > train.list -fi -if [ ! -f "test.list" ]; then - echo " " > test.list -fi -if [ ! -d "logs" ]; then - mkdir logs -fi - -# inference benchmark -for batchsize in 1 2 4 8 16; do - infer vgg 19 $batchsize - infer resnet 50 $batchsize - infer googlenet v1 $batchsize - infer alexnet 2 $batchsize -done diff --git a/benchmark/paddle/image/run_openblas_train.sh b/benchmark/paddle/image/run_openblas_train.sh deleted file mode 100755 index cc64e1d09da02087b1737190a0b75dc7758600a6..0000000000000000000000000000000000000000 --- a/benchmark/paddle/image/run_openblas_train.sh +++ /dev/null @@ -1,43 +0,0 @@ -#!/bin/bash - -set -e - -function train() { - export OPENBLAS_NUM_THREADS=1 - topology=$1 - layer_num=$2 - bs=$3 - thread=`nproc` - # each trainer_count use only 1 core to avoid conflict - log="logs/train-${topology}-${layer_num}-${thread}openblas-${bs}.log" - args="batch_size=${bs},layer_num=${layer_num}" - config="${topology}.py" - paddle train --job=time \ - --config=$config \ - --use_mkldnn=False \ - --use_gpu=False \ - --trainer_count=$thread \ - --log_period=3 \ - --test_period=30 \ - --config_args=$args \ - 2>&1 | tee ${log} - - avg_time=`tail ${log} -n 1 | awk -F ' ' '{print $8}' | sed 's/avg=//'` - fps=`awk 'BEGIN{printf "%.2f",('$bs' / '$avg_time' * 1000)}'` - echo "FPS: $fps images/sec" 2>&1 | tee -a ${log} -} - -if [ ! -f "train.list" ]; then - echo " " > train.list -fi -if [ ! -d "logs" ]; then - mkdir logs -fi - -# training benchmark -for batchsize in 64 128 256; do - train vgg 19 $batchsize - train resnet 50 $batchsize - train googlenet v1 $batchsize - train alexnet 2 $batchsize -done diff --git a/benchmark/paddle/image/smallnet_mnist_cifar.py b/benchmark/paddle/image/smallnet_mnist_cifar.py deleted file mode 100644 index 58879c454f37991405d83bbb593bb5d1e977ff53..0000000000000000000000000000000000000000 --- a/benchmark/paddle/image/smallnet_mnist_cifar.py +++ /dev/null @@ -1,49 +0,0 @@ -#!/usr/bin/env python - -from paddle.trainer_config_helpers import * - -height = 32 -width = 32 -num_class = 10 - -batch_size = get_config_arg('batch_size', int, 128) - -args = {'height': height, 'width': width, 'color': True, 'num_class': num_class} -define_py_data_sources2( - "train.list", None, module="provider", obj="process", args=args) - -settings( - batch_size=batch_size, - learning_rate=0.01 / batch_size, - learning_method=MomentumOptimizer(0.9), - regularization=L2Regularization(0.0005 * batch_size)) - -# conv1 -net = data_layer('data', size=height * width * 3) -net = img_conv_layer( - input=net, - filter_size=5, - num_channels=3, - num_filters=32, - stride=1, - padding=2) -net = img_pool_layer(input=net, pool_size=3, stride=2, padding=1) - -# conv2 -net = img_conv_layer( - input=net, filter_size=5, num_filters=32, stride=1, padding=2) -net = img_pool_layer( - input=net, pool_size=3, stride=2, padding=1, pool_type=AvgPooling()) - -# conv3 -net = img_conv_layer( - input=net, filter_size=3, num_filters=64, stride=1, padding=1) -net = img_pool_layer( - input=net, pool_size=3, stride=2, padding=1, pool_type=AvgPooling()) - -net = fc_layer(input=net, size=64, act=ReluActivation()) -net = fc_layer(input=net, size=10, act=SoftmaxActivation()) - -lab = data_layer('label', num_class) -loss = classification_cost(input=net, label=lab) -outputs(loss) diff --git a/benchmark/paddle/image/vgg.py b/benchmark/paddle/image/vgg.py deleted file mode 100644 index ca0a6798fb8c35b68cf84d263855955eb93ba0b0..0000000000000000000000000000000000000000 --- a/benchmark/paddle/image/vgg.py +++ /dev/null @@ -1,119 +0,0 @@ -#!/usr/bin/env python -from paddle.trainer_config_helpers import * - -height = 224 -width = 224 -num_class = 1000 -batch_size = get_config_arg('batch_size', int, 64) -layer_num = get_config_arg('layer_num', int, 19) -is_infer = get_config_arg("is_infer", bool, False) -num_samples = get_config_arg('num_samples', int, 2560) - -args = { - 'height': height, - 'width': width, - 'color': True, - 'num_class': num_class, - 'is_infer': is_infer, - 'num_samples': num_samples -} -define_py_data_sources2( - "train.list" if not is_infer else None, - "test.list" if is_infer else None, - module="provider", - obj="process", - args=args) - -settings( - batch_size=batch_size, - learning_rate=0.001 / batch_size, - learning_method=MomentumOptimizer(0.9), - regularization=L2Regularization(0.0005 * batch_size)) - -img = data_layer(name='image', size=height * width * 3) - - -def vgg_network(vgg_num=3): - tmp = img_conv_group( - input=img, - num_channels=3, - conv_padding=1, - conv_num_filter=[64, 64], - conv_filter_size=3, - conv_act=ReluActivation(), - pool_size=2, - pool_stride=2, - pool_type=MaxPooling()) - - tmp = img_conv_group( - input=tmp, - conv_num_filter=[128, 128], - conv_padding=1, - conv_filter_size=3, - conv_act=ReluActivation(), - pool_stride=2, - pool_type=MaxPooling(), - pool_size=2) - - channels = [] - for i in range(vgg_num): - channels.append(256) - tmp = img_conv_group( - input=tmp, - conv_num_filter=channels, - conv_padding=1, - conv_filter_size=3, - conv_act=ReluActivation(), - pool_stride=2, - pool_type=MaxPooling(), - pool_size=2) - channels = [] - for i in range(vgg_num): - channels.append(512) - tmp = img_conv_group( - input=tmp, - conv_num_filter=channels, - conv_padding=1, - conv_filter_size=3, - conv_act=ReluActivation(), - pool_stride=2, - pool_type=MaxPooling(), - pool_size=2) - tmp = img_conv_group( - input=tmp, - conv_num_filter=channels, - conv_padding=1, - conv_filter_size=3, - conv_act=ReluActivation(), - pool_stride=2, - pool_type=MaxPooling(), - pool_size=2) - - tmp = fc_layer( - input=tmp, - size=4096, - act=ReluActivation(), - layer_attr=ExtraAttr(drop_rate=0.5)) - - tmp = fc_layer( - input=tmp, - size=4096, - act=ReluActivation(), - layer_attr=ExtraAttr(drop_rate=0.5)) - - return fc_layer(input=tmp, size=num_class, act=SoftmaxActivation()) - - -if layer_num == 16: - vgg = vgg_network(3) -elif layer_num == 19: - vgg = vgg_network(4) -else: - print("Wrong layer number.") - -if is_infer: - outputs(vgg) -else: - lab = data_layer('label', num_class) - loss = cross_entropy(input=vgg, label=lab) - outputs(loss) diff --git a/benchmark/paddle/rnn/imdb.py b/benchmark/paddle/rnn/imdb.py deleted file mode 100755 index 2a67f9b0cf52484d9d44fe9db0b1e57cdd20fd43..0000000000000000000000000000000000000000 --- a/benchmark/paddle/rnn/imdb.py +++ /dev/null @@ -1,60 +0,0 @@ -# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from __future__ import print_function -import six.moves.cPickle as pickle -import gzip -import os -import numpy - - -def get_dataset_file(dataset, default_dataset, origin): - data_dir, data_file = os.path.split(dataset) - if (not os.path.isfile(dataset)) and data_file == default_dataset: - from six.moves import urllib - print('Downloading data from %s' % origin) - urllib.request.urlretrieve(origin, dataset) - - return dataset - - -def create_data(path="imdb.pkl"): - - if (not os.path.isfile('imdb.train.pkl')): - path = get_dataset_file( - path, "imdb.pkl", - "http://www.iro.umontreal.ca/~lisa/deep/data/imdb.pkl") - - if path.endswith(".gz"): - f = gzip.open(path, 'rb') - else: - f = open(path, 'rb') - - train_set = pickle.load(f) - test_set = pickle.load(f) - f.close() - - pickle.dump(train_set, open('imdb.train.pkl', 'wb')) - pickle.dump(test_set, open('imdb.test.pkl', 'wb')) - - if (not os.path.isfile('train.list')): - file('train.list', 'w').write('imdb.train.pkl\n') - - -def main(): - create_data('imdb.pkl') - - -if __name__ == "__main__": - main() diff --git a/benchmark/paddle/rnn/provider.py b/benchmark/paddle/rnn/provider.py deleted file mode 100644 index 23cc0c44a98d0ae7f586d1a376a603198f2c6144..0000000000000000000000000000000000000000 --- a/benchmark/paddle/rnn/provider.py +++ /dev/null @@ -1,86 +0,0 @@ -# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import io, os -import random -import numpy as np -import six.moves.cPickle as pickle -from paddle.trainer.PyDataProvider2 import * - - -def remove_unk(x, n_words): - return [[1 if w >= n_words else w for w in sen] for sen in x] - - -# ============================================================== -# tensorflow uses fixed length, but PaddlePaddle can process -# variable-length. Padding is used in benchmark in order to -# compare with other platform. -# ============================================================== -def pad_sequences(sequences, - maxlen=None, - dtype='int32', - padding='post', - truncating='post', - value=0.): - lengths = [len(s) for s in sequences] - - nb_samples = len(sequences) - if maxlen is None: - maxlen = np.max(lengths) - - x = (np.ones((nb_samples, maxlen)) * value).astype(dtype) - for idx, s in enumerate(sequences): - if len(s) == 0: - continue # empty list was found - if truncating == 'pre': - trunc = s[-maxlen:] - elif truncating == 'post': - trunc = s[:maxlen] - else: - raise ValueError("Truncating type '%s' not understood" % padding) - - if padding == 'post': - x[idx, :len(trunc)] = trunc - elif padding == 'pre': - x[idx, -len(trunc):] = trunc - else: - raise ValueError("Padding type '%s' not understood" % padding) - return x - - -def initHook(settings, vocab_size, pad_seq, maxlen, **kwargs): - settings.vocab_size = vocab_size - settings.pad_seq = pad_seq - settings.maxlen = maxlen - settings.input_types = [ - integer_value_sequence(vocab_size), integer_value(2) - ] - - -@provider( - init_hook=initHook, min_pool_size=-1, cache=CacheType.CACHE_PASS_IN_MEM) -def process(settings, file): - f = open(file, 'rb') - train_set = pickle.load(f) - f.close() - x, y = train_set - - # remove unk, namely remove the words out of dictionary - x = remove_unk(x, settings.vocab_size) - if settings.pad_seq: - x = pad_sequences(x, maxlen=settings.maxlen, value=0.) - - for i in range(len(y)): - yield map(int, x[i]), int(y[i]) diff --git a/benchmark/paddle/rnn/rnn.py b/benchmark/paddle/rnn/rnn.py deleted file mode 100755 index 83eb3e565473f7e7e91cddeaa3cd2aafb7e3df2c..0000000000000000000000000000000000000000 --- a/benchmark/paddle/rnn/rnn.py +++ /dev/null @@ -1,38 +0,0 @@ -#!/usr/bin/env python - -from paddle.trainer_config_helpers import * -import imdb - -num_class = 2 -vocab_size = 30000 -fixedlen = 100 -batch_size = get_config_arg('batch_size', int, 128) -lstm_num = get_config_arg('lstm_num', int, 1) -hidden_size = get_config_arg('hidden_size', int, 128) -# whether to pad sequence into fixed length -pad_seq = get_config_arg('pad_seq', bool, True) -imdb.create_data('imdb.pkl') - -args = {'vocab_size': vocab_size, 'pad_seq': pad_seq, 'maxlen': fixedlen} -define_py_data_sources2( - "train.list", None, module="provider", obj="process", args=args) - -settings( - batch_size=batch_size, - learning_rate=2e-3, - learning_method=AdamOptimizer(), - regularization=L2Regularization(8e-4), - gradient_clipping_threshold=25) - -net = data_layer('data', size=vocab_size) -net = embedding_layer(input=net, size=128) - -for i in xrange(lstm_num): - net = simple_lstm(input=net, size=hidden_size) - -net = last_seq(input=net) -net = fc_layer(input=net, size=2, act=SoftmaxActivation()) - -lab = data_layer('label', num_class) -loss = classification_cost(input=net, label=lab) -outputs(loss) diff --git a/benchmark/paddle/rnn/run.sh b/benchmark/paddle/rnn/run.sh deleted file mode 100755 index f99a562b3f88a98560f4bf7aee98ceee9daefe67..0000000000000000000000000000000000000000 --- a/benchmark/paddle/rnn/run.sh +++ /dev/null @@ -1,52 +0,0 @@ -#!/bin/bash - -set -e - -function train() { - cfg=$1 - thread=$2 - args="lstm_num=${3},seq_pad=${4},hidden_size=${5},batch_size=${6}" - paddle train --job=time \ - --config=$cfg \ - --use_gpu=1 \ - --trainer_count=$thread \ - --log_period=10 \ - --test_period=100 \ - --num_passes=1 \ - --feed_data=1 \ - --config_args=$args \ - >logs/rnn-pad${4}-${thread}gpu-lstm${3}-batch${6}-hid${5}.log 2>&1 -} - -if [ ! -d "logs" ]; then - mkdir logs -fi - -## padding, single gpu -#-----config--gpu--lstm_num--padding--hidden_size--batch_size -## lstm_num=2, batch_size=64 -train rnn.py 1 2 1 256 64 -train rnn.py 1 2 1 512 64 -train rnn.py 1 2 1 1280 64 - -## lstm_num=2, batch_size=128 -train rnn.py 1 2 1 256 128 -train rnn.py 1 2 1 512 128 -train rnn.py 1 2 1 1280 128 - -## lstm_num=4, batch_size=256 -train rnn.py 1 2 1 256 256 -train rnn.py 1 2 1 512 256 -train rnn.py 1 2 1 1280 256 - - -#==================multi gpus=====================# -# hidden_size=256, lstm_num=2, different batch size -train rnn.py 4 2 1 256 128 -train rnn.py 4 2 1 256 256 -train rnn.py 4 2 1 256 512 - -# hidden_size=512, lstm_num=4, different batch size -train rnn.py 4 2 1 512 128 -train rnn.py 4 2 1 512 256 -train rnn.py 4 2 1 512 512 diff --git a/benchmark/tensorflow/machine_translation.py b/benchmark/tensorflow/machine_translation.py index 8f77dce98353af53803246be8dc61063836b7867..7837669edc7a206c03e5b9fa2989bf45b35f0605 100644 --- a/benchmark/tensorflow/machine_translation.py +++ b/benchmark/tensorflow/machine_translation.py @@ -35,8 +35,6 @@ import os import argparse import time -import paddle.v2 as paddle - parser = argparse.ArgumentParser(description=__doc__) parser.add_argument( "--embedding_dim", diff --git a/benchmark/tensorflow/mnist.py b/benchmark/tensorflow/mnist.py index 7140eed6eaff49b5c65f9ccb2e38f113a4cdbdbf..03d533fecfededddd3956ba83ea600456782cfc9 100644 --- a/benchmark/tensorflow/mnist.py +++ b/benchmark/tensorflow/mnist.py @@ -21,7 +21,6 @@ import time import numpy as np import tensorflow as tf -import paddle.v2 as paddle DTYPE = tf.float32 diff --git a/benchmark/tensorflow/resnet.py b/benchmark/tensorflow/resnet.py index c432fa8d59571e128b9ff9e3ffa1949b792ef3a4..fdb044195766b847e16a0cc33424a999c1d9166e 100644 --- a/benchmark/tensorflow/resnet.py +++ b/benchmark/tensorflow/resnet.py @@ -27,7 +27,6 @@ import argparse import time import numpy as np -import paddle.v2 as paddle import tensorflow as tf DTYPE = tf.float32 diff --git a/benchmark/tensorflow/stacked_dynamic_lstm.py b/benchmark/tensorflow/stacked_dynamic_lstm.py index 5285033005044d907d0b7e91eb66ee7281c4f27a..1f532dc2fa082ea0f6b1da560e1a57b96d2ef1bb 100644 --- a/benchmark/tensorflow/stacked_dynamic_lstm.py +++ b/benchmark/tensorflow/stacked_dynamic_lstm.py @@ -21,8 +21,6 @@ import argparse import time import tensorflow as tf -import paddle.v2 as paddle - def parse_args(): parser = argparse.ArgumentParser("LSTM model benchmark.") diff --git a/benchmark/tensorflow/vgg.py b/benchmark/tensorflow/vgg.py index fba5ec71a46b3ac8b2e1244424c39fd5192e5458..d32c835bd7a7dafaafe0970fb6b422db3c866370 100644 --- a/benchmark/tensorflow/vgg.py +++ b/benchmark/tensorflow/vgg.py @@ -13,7 +13,6 @@ # limitations under the License. """VGG16 benchmark in TensorFlow""" import tensorflow as tf -import paddle.v2 as paddle import numpy as np import argparse import time diff --git a/cmake/FindGperftools.cmake b/cmake/FindGperftools.cmake new file mode 100644 index 0000000000000000000000000000000000000000..928f573a4fb82391859e334d50e6c8ed0e26aae2 --- /dev/null +++ b/cmake/FindGperftools.cmake @@ -0,0 +1,63 @@ +# Tries to find Gperftools. +# +# Usage of this module as follows: +# +# find_package(Gperftools) +# +# Variables used by this module, they can change the default behaviour and need +# to be set before calling find_package: +# +# Gperftools_ROOT_DIR Set this variable to the root installation of +# Gperftools if the module has problems finding +# the proper installation path. +# +# Variables defined by this module: +# +# GPERFTOOLS_FOUND System has Gperftools libs/headers +# GPERFTOOLS_LIBRARIES The Gperftools libraries (tcmalloc & profiler) +# GPERFTOOLS_INCLUDE_DIR The location of Gperftools headers + +find_library(GPERFTOOLS_TCMALLOC + NAMES tcmalloc + HINTS ${Gperftools_ROOT_DIR}/lib) + +find_library(GPERFTOOLS_PROFILER + NAMES profiler + HINTS ${Gperftools_ROOT_DIR}/lib) + +find_library(GPERFTOOLS_TCMALLOC_AND_PROFILER + NAMES tcmalloc_and_profiler + HINTS ${Gperftools_ROOT_DIR}/lib) + +find_path(GPERFTOOLS_INCLUDE_DIR + NAMES gperftools/heap-profiler.h + HINTS ${Gperftools_ROOT_DIR}/include) + +set(GPERFTOOLS_LIBRARIES ${GPERFTOOLS_TCMALLOC_AND_PROFILER}) + +include(FindPackageHandleStandardArgs) +find_package_handle_standard_args( + Gperftools + DEFAULT_MSG + GPERFTOOLS_LIBRARIES + GPERFTOOLS_INCLUDE_DIR) + +mark_as_advanced( + Gperftools_ROOT_DIR + GPERFTOOLS_TCMALLOC + GPERFTOOLS_PROFILER + GPERFTOOLS_TCMALLOC_AND_PROFILER + GPERFTOOLS_LIBRARIES + GPERFTOOLS_INCLUDE_DIR) + +# create IMPORTED targets +if (Gperftools_FOUND AND NOT TARGET gperftools::tcmalloc) + add_library(gperftools::tcmalloc UNKNOWN IMPORTED) + set_target_properties(gperftools::tcmalloc PROPERTIES + IMPORTED_LOCATION ${GPERFTOOLS_TCMALLOC} + INTERFACE_INCLUDE_DIRECTORIES "${GPERFTOOLS_INCLUDE_DIR}") + add_library(gperftools::profiler UNKNOWN IMPORTED) + set_target_properties(gperftools::profiler PROPERTIES + IMPORTED_LOCATION ${GPERFTOOLS_PROFILER} + INTERFACE_INCLUDE_DIRECTORIES "${GPERFTOOLS_INCLUDE_DIR}") +endif() diff --git a/cmake/FindJeMalloc.cmake b/cmake/FindJeMalloc.cmake new file mode 100644 index 0000000000000000000000000000000000000000..b95287160ba610b2dfa93ba15e7c7c8214d80ac1 --- /dev/null +++ b/cmake/FindJeMalloc.cmake @@ -0,0 +1,28 @@ +# - Find JeMalloc library +# Find the native JeMalloc includes and library +# +# JEMALLOC_INCLUDE_DIR - where to find jemalloc.h, etc. +# JEMALLOC_LIBRARIES - List of libraries when using jemalloc. +# JEMALLOC_FOUND - True if jemalloc found. + +find_path(JEMALLOC_INCLUDE_DIR + NAMES jemalloc/jemalloc.h + HINTS ${JEMALLOC_ROOT_DIR}/include) + +find_library(JEMALLOC_LIBRARIES + NAMES jemalloc + HINTS ${JEMALLOC_ROOT_DIR}/lib) + +include(FindPackageHandleStandardArgs) +find_package_handle_standard_args(jemalloc DEFAULT_MSG JEMALLOC_LIBRARIES JEMALLOC_INCLUDE_DIR) + +mark_as_advanced( + JEMALLOC_LIBRARIES + JEMALLOC_INCLUDE_DIR) + +if (JEMALLOC_FOUND) + add_library(jemalloc::jemalloc UNKNOWN IMPORTED) + set_target_properties(jemalloc::jemalloc PROPERTIES + IMPORTED_LOCATION ${JEMALLOC_LIBRARIES} + INTERFACE_INCLUDE_DIRECTORIES "${JEMALLOC_INCLUDE_DIR}") +endif() diff --git a/cmake/FindSphinx.cmake b/cmake/FindSphinx.cmake deleted file mode 100644 index f74cd4ff8c9c2c52319b18ac37264167b3718eae..0000000000000000000000000000000000000000 --- a/cmake/FindSphinx.cmake +++ /dev/null @@ -1,147 +0,0 @@ -# - This module looks for Sphinx -# Find the Sphinx documentation generator -# -# This modules defines -# SPHINX_EXECUTABLE -# SPHINX_FOUND - -find_program(SPHINX_EXECUTABLE - NAMES sphinx-build - PATHS - /usr/bin - /usr/local/bin - /opt/local/bin - DOC "Sphinx documentation generator" -) - -if( NOT SPHINX_EXECUTABLE ) - set(_Python_VERSIONS - 2.7 2.6 2.5 2.4 2.3 2.2 2.1 2.0 1.6 1.5 - ) - - foreach( _version ${_Python_VERSIONS} ) - set( _sphinx_NAMES sphinx-build-${_version} ) - - find_program( SPHINX_EXECUTABLE - NAMES ${_sphinx_NAMES} - PATHS - /usr/bin - /usr/local/bin - /opt/loca/bin - DOC "Sphinx documentation generator" - ) - endforeach() -endif() - -include(FindPackageHandleStandardArgs) - -find_package_handle_standard_args(Sphinx DEFAULT_MSG - SPHINX_EXECUTABLE -) - - -option( SPHINX_HTML_OUTPUT "Build a single HTML with the whole content." ON ) -option( SPHINX_DIRHTML_OUTPUT "Build HTML pages, but with a single directory per document." OFF ) -option( SPHINX_HTMLHELP_OUTPUT "Build HTML pages with additional information for building a documentation collection in htmlhelp." OFF ) -option( SPHINX_QTHELP_OUTPUT "Build HTML pages with additional information for building a documentation collection in qthelp." OFF ) -option( SPHINX_DEVHELP_OUTPUT "Build HTML pages with additional information for building a documentation collection in devhelp." OFF ) -option( SPHINX_EPUB_OUTPUT "Build HTML pages with additional information for building a documentation collection in epub." OFF ) -option( SPHINX_LATEX_OUTPUT "Build LaTeX sources that can be compiled to a PDF document using pdflatex." OFF ) -option( SPHINX_MAN_OUTPUT "Build manual pages in groff format for UNIX systems." OFF ) -option( SPHINX_TEXT_OUTPUT "Build plain text files." OFF ) - - -mark_as_advanced( - SPHINX_EXECUTABLE - SPHINX_HTML_OUTPUT - SPHINX_DIRHTML_OUTPUT - SPHINX_HTMLHELP_OUTPUT - SPHINX_QTHELP_OUTPUT - SPHINX_DEVHELP_OUTPUT - SPHINX_EPUB_OUTPUT - SPHINX_LATEX_OUTPUT - SPHINX_MAN_OUTPUT - SPHINX_TEXT_OUTPUT -) - -function( Sphinx_add_target target_name builder conf cache source destination ) - add_custom_target( ${target_name} ALL - COMMAND ${SPHINX_EXECUTABLE} -b ${builder} - -d ${cache} - -c ${conf} - ${source} - ${destination} - COMMENT "Generating sphinx documentation: ${builder}" - COMMAND cd ${destination} && ln -sf ./index_*.html index.html - ) - - set_property( - DIRECTORY APPEND PROPERTY - ADDITIONAL_MAKE_CLEAN_FILES - ${destination} - ) -endfunction() - -# Target dependencies can be optionally listed at the end. -function( Sphinx_add_targets target_base_name conf source base_destination ) - - set( _dependencies ) - - foreach( arg IN LISTS ARGN ) - set( _dependencies ${_dependencies} ${arg} ) - endforeach() - - if( ${SPHINX_HTML_OUTPUT} ) - Sphinx_add_target( ${target_base_name}_html html ${conf} ${source} ${base_destination}/html ) - - add_dependencies( ${target_base_name}_html ${_dependencies} ) - endif() - - if( ${SPHINX_DIRHTML_OUTPUT} ) - Sphinx_add_target( ${target_base_name}_dirhtml dirhtml ${conf} ${source} ${base_destination}/dirhtml ) - - add_dependencies( ${target_base_name}_dirhtml ${_dependencies} ) - endif() - - if( ${SPHINX_QTHELP_OUTPUT} ) - Sphinx_add_target( ${target_base_name}_qthelp qthelp ${conf} ${source} ${base_destination}/qthelp ) - - add_dependencies( ${target_base_name}_qthelp ${_dependencies} ) - endif() - - if( ${SPHINX_DEVHELP_OUTPUT} ) - Sphinx_add_target( ${target_base_name}_devhelp devhelp ${conf} ${source} ${base_destination}/devhelp ) - - add_dependencies( ${target_base_name}_devhelp ${_dependencies} ) - endif() - - if( ${SPHINX_EPUB_OUTPUT} ) - Sphinx_add_target( ${target_base_name}_epub epub ${conf} ${source} ${base_destination}/epub ) - - add_dependencies( ${target_base_name}_epub ${_dependencies} ) - endif() - - if( ${SPHINX_LATEX_OUTPUT} ) - Sphinx_add_target( ${target_base_name}_latex latex ${conf} ${source} ${base_destination}/latex ) - - add_dependencies( ${target_base_name}_latex ${_dependencies} ) - endif() - - if( ${SPHINX_MAN_OUTPUT} ) - Sphinx_add_target( ${target_base_name}_man man ${conf} ${source} ${base_destination}/man ) - - add_dependencies( ${target_base_name}_man ${_dependencies} ) - endif() - - if( ${SPHINX_TEXT_OUTPUT} ) - Sphinx_add_target( ${target_base_name}_text text ${conf} ${source} ${base_destination}/text ) - - add_dependencies( ${target_base_name}_text ${_dependencies} ) - endif() - - if( ${BUILD_TESTING} ) - sphinx_add_target( ${target_base_name}_linkcheck linkcheck ${conf} ${source} ${base_destination}/linkcheck ) - - add_dependencies( ${target_base_name}_linkcheck ${_dependencies} ) - endif() -endfunction() diff --git a/cmake/anakin_subgraph.cmake b/cmake/anakin_subgraph.cmake new file mode 100644 index 0000000000000000000000000000000000000000..4a7d32a63553df31e0928e7b30249ff3e809cba1 --- /dev/null +++ b/cmake/anakin_subgraph.cmake @@ -0,0 +1,32 @@ +if(NOT WITH_GPU) + return() +endif() + +set(ANAKIN_ROOT "/usr" CACHE PATH "ANAKIN ROOT") +find_path(ANAKIN_INCLUDE_DIR anakin_config.h + PATHS ${ANAKIN_ROOT} ${ANAKIN_ROOT}/include + $ENV{ANAKIN_ROOT} $ENV{ANAKIN_ROOT}/include + NO_DEFAULT_PATH +) + +find_library(ANAKIN_LIBRARY NAMES libanakin_saber_common.so libanakin.so + PATHS ${ANAKIN_ROOT} + $ENV{ANAKIN_ROOT} $ENV{ANAKIN_ROOT}/lib + NO_DEFAULT_PATH + DOC "Path to ANAKIN library.") + +if(ANAKIN_INCLUDE_DIR AND ANAKIN_LIBRARY) + if(WITH_DSO) + set(ANAKIN_FOUND ON) + endif(WITH_DSO) +else() + set(ANAKIN_FOUND OFF) +endif() + +if(ANAKIN_FOUND) + message(STATUS "Current ANAKIN header is ${ANAKIN_INCLUDE_DIR}/anakin_config.h. ") + include_directories(${ANAKIN_ROOT}/include) + include_directories(${ANAKIN_ROOT}/include/saber) + link_directories(${ANAKIN_ROOT}) + add_definitions(-DPADDLE_WITH_ANAKIN) +endif() diff --git a/cmake/cblas.cmake b/cmake/cblas.cmake index 24de8d9d7ced5f8111cc5d65f761b7506bde048e..52ac31d1d125afb89fb0ae783fba94ab9a0c5a1a 100644 --- a/cmake/cblas.cmake +++ b/cmake/cblas.cmake @@ -64,24 +64,18 @@ endif() ## Then find the reference-cblas. www.netlib.org/blas/ set(REFERENCE_CBLAS_ROOT $ENV{REFERENCE_CBLAS_ROOT} CACHE PATH "Folder contains reference-cblas") -if(NOT CMAKE_CROSSCOMPILING) - set(REFERENCE_CBLAS_INCLUDE_SEARCH_PATHS - ${REFERENCE_CBLAS_ROOT}/include - /usr/include - /usr/include/cblas - ) +set(REFERENCE_CBLAS_INCLUDE_SEARCH_PATHS + ${REFERENCE_CBLAS_ROOT}/include + /usr/include + /usr/include/cblas +) - set(REFERENCE_CBLAS_LIB_SEARCH_PATHS - ${REFERENCE_CBLAS_ROOT}/lib - /usr/lib - /usr/lib/blas/reference/ - /usr/lib/reference/ - ) -else() - # Disable the finding of reference cblas under host's system path - set(REFERENCE_CBLAS_INCLUDE_SEARCH_PATHS ${REFERENCE_CBLAS_ROOT}/include) - set(REFERENCE_CBLAS_LIB_SEARCH_PATHS ${REFERENCE_CBLAS_ROOT}/lib) -endif() +set(REFERENCE_CBLAS_LIB_SEARCH_PATHS + ${REFERENCE_CBLAS_ROOT}/lib + /usr/lib + /usr/lib/blas/reference/ + /usr/lib/reference/ +) if(WITH_SYSTEM_BLAS) find_path(REFERENCE_CBLAS_INCLUDE_DIR NAMES cblas.h PATHS @@ -98,10 +92,3 @@ if(WITH_SYSTEM_BLAS) message(STATUS "Found reference-cblas (include: ${CBLAS_INC_DIR}, library: ${CBLAS_LIBRARIES})") endif() endif() - -if(IOS_USE_VECLIB_FOR_BLAS AND VECLIB_FOUND) - set(CBLAS_FOUND ON) - set(CBLAS_PROVIDER vecLib) - set(CBLAS_INC_DIR ${VECLIB_INC_DIR}) - add_definitions(-DPADDLE_USE_VECLIB) -endif() diff --git a/cmake/configure.cmake b/cmake/configure.cmake index e9852f00b1835adec31373f58ac538f9685251e0..93d74bb0a8f726ad31685cbfc7831b5441cd5108 100644 --- a/cmake/configure.cmake +++ b/cmake/configure.cmake @@ -20,45 +20,18 @@ if(WITH_DSO) add_definitions(-DPADDLE_USE_DSO) endif(WITH_DSO) -if(WITH_DOUBLE) - add_definitions(-DPADDLE_TYPE_DOUBLE) -endif(WITH_DOUBLE) - -if(WITH_ARM_FP16) - add_definitions(-DPADDLE_ARM_FP16) - add_definitions("-march=armv8.2-a+fp16+simd") -endif(WITH_ARM_FP16) - if(WITH_TESTING) add_definitions(-DPADDLE_WITH_TESTING) endif(WITH_TESTING) -if(NOT WITH_TIMER) - add_definitions(-DPADDLE_DISABLE_TIMER) -endif(NOT WITH_TIMER) - -if(USE_EIGEN_FOR_BLAS) - add_definitions(-DPADDLE_USE_EIGEN_FOR_BLAS) -endif(USE_EIGEN_FOR_BLAS) - -if(EIGEN_USE_THREADS) - add_definitions(-DEIGEN_USE_THREADS) -endif(EIGEN_USE_THREADS) - if(NOT WITH_PROFILER) add_definitions(-DPADDLE_DISABLE_PROFILER) endif(NOT WITH_PROFILER) -if(NOT CMAKE_CROSSCOMPILING) - if(WITH_AVX AND AVX512F_FOUND) - set(SIMD_FLAG ${AVX512F_FLAG}) - elseif(WITH_AVX AND AVX2_FOUND) - set(SIMD_FLAG ${AVX2_FLAG}) - elseif(WITH_AVX AND AVX_FOUND) - set(SIMD_FLAG ${AVX_FLAG}) - elseif(SSE3_FOUND) - set(SIMD_FLAG ${SSE3_FLAG}) - endif() +if(WITH_AVX AND AVX_FOUND) + set(SIMD_FLAG ${AVX_FLAG}) +elseif(SSE3_FOUND) + set(SIMD_FLAG ${SSE3_FLAG}) endif() if(WIN32) @@ -84,12 +57,13 @@ if(WIN32) endif(NOT MSVC) endif(WIN32) -if(NOT WITH_GOLANG) - add_definitions(-DPADDLE_WITHOUT_GOLANG) -endif(NOT WITH_GOLANG) +if(WITH_PSLIB) + add_definitions(-DPADDLE_WITH_PSLIB) +endif() if(WITH_GPU) add_definitions(-DPADDLE_WITH_CUDA) + add_definitions(-DEIGEN_USE_GPU) FIND_PACKAGE(CUDA REQUIRED) @@ -133,6 +107,7 @@ if(WITH_GPU) message(WARNING "Anakin needs CUDNN >= 7.0 to compile. Force WITH_ANAKIN=OFF") set(WITH_ANAKIN OFF CACHE STRING "Anakin is valid only when CUDNN >= 7.0." FORCE) endif() + add_definitions(-DWITH_ANAKIN) endif() if(WITH_ANAKIN) # NOTICE(minqiyang): the end slash is important because $CUDNN_INCLUDE_DIR @@ -152,7 +127,12 @@ endif() if (WITH_MKLML AND MKLML_IOMP_LIB) message(STATUS "Enable Intel OpenMP with ${MKLML_IOMP_LIB}") - set(OPENMP_FLAGS "-fopenmp") + if(WIN32) + # openmp not support well for now on windows + set(OPENMP_FLAGS "") + else(WIN32) + set(OPENMP_FLAGS "-fopenmp") + endif(WIN32) set(CMAKE_C_CREATE_SHARED_LIBRARY_FORBIDDEN_FLAGS ${OPENMP_FLAGS}) set(CMAKE_CXX_CREATE_SHARED_LIBRARY_FORBIDDEN_FLAGS ${OPENMP_FLAGS}) set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OPENMP_FLAGS}") @@ -166,55 +146,6 @@ if(WITH_DISTRIBUTE) add_definitions(-DPADDLE_WITH_DISTRIBUTE) endif() -if(WITH_GOLANG) - # we need to symlink Paddle directory into GOPATH. If we - # don't do it and we have code that depends on Paddle, go - # get ./... will download a new Paddle repo from Github, - # without the changes in our current Paddle repo that we - # want to build. - set(GOPATH "${CMAKE_CURRENT_BINARY_DIR}/go") - file(MAKE_DIRECTORY ${GOPATH}) - set(PADDLE_IN_GOPATH "${GOPATH}/src/github.com/PaddlePaddle/Paddle") - file(MAKE_DIRECTORY "${PADDLE_IN_GOPATH}") - set(PADDLE_GO_PATH "${CMAKE_SOURCE_DIR}/go") - - add_custom_target(go_path) - add_custom_command(TARGET go_path - # Symlink Paddle directory into GOPATH - COMMAND mkdir -p ${PADDLE_IN_GOPATH} - COMMAND rm -rf ${PADDLE_IN_GOPATH} - COMMAND ln -sf ${CMAKE_SOURCE_DIR} ${PADDLE_IN_GOPATH} - # Automatically get all dependencies specified in the source code - # We can't run `go get -d ./...` for every target, because - # multiple `go get` can not run concurrently, but make need to be - # able to run with multiple jobs. - WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR} - ) - - if (GLIDE_INSTALL) - if(EXISTS $ENV{GOPATH}/bin/glide) - set(GLIDE "$ENV{GOPATH}/bin/glide") - else() - message(FATAL_ERROR "no glide executeble found: $ENV{GOPATH}/bin/glide") - endif() - - # this command will only run when the file it depends is missing - # or has changed, or the output is missing. - add_custom_command(OUTPUT ${CMAKE_BINARY_DIR}/glide - COMMAND env GOPATH=${GOPATH} ${GLIDE} install - COMMAND touch ${CMAKE_BINARY_DIR}/glide - DEPENDS ${PADDLE_SOURCE_DIR}/go/glide.lock - WORKING_DIRECTORY "${PADDLE_IN_GOPATH}/go" - ) - - # depends on the custom command which outputs - # ${CMAKE_BINARY_DIR}/glide, the custom command does not need to - # run every time this target is built. - add_custom_target(go_vendor DEPENDS ${CMAKE_BINARY_DIR}/glide go_path) - endif() - -endif(WITH_GOLANG) - if(WITH_GRPC) add_definitions(-DPADDLE_WITH_GRPC) endif(WITH_GRPC) @@ -222,3 +153,7 @@ endif(WITH_GRPC) if(WITH_BRPC_RDMA) add_definitions(-DPADDLE_WITH_BRPC_RDMA) endif(WITH_BRPC_RDMA) + +if(ON_INFER) + add_definitions(-DPADDLE_ON_INFERENCE) +endif(ON_INFER) diff --git a/cmake/cross_compiling/android.cmake b/cmake/cross_compiling/android.cmake deleted file mode 100644 index 4cf2be3bdf07ed018c57cd6bc305a3eda9c9a23d..0000000000000000000000000000000000000000 --- a/cmake/cross_compiling/android.cmake +++ /dev/null @@ -1,236 +0,0 @@ -# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# This is a toolchain file for cross-compiling for Android, and the -# configuration refers to the open-source resposity: -# https://github.com/taka-no-me/android-cmake -# Most of the variables are compatible with that used in -# https://developer.android.com/ndk/guides/cmake.html -# The supported variables are listed belows: -# -# ANDROID_STANDALONE_TOOLCHAIN -# ANDROID_TOOLCHAIN -# ANDROID_ABI -# ANDROID_NATIVE_API_LEVEL -# ANDROID_ARM_MODE -# ANDROID_ARM_NEON -# -# For CMake >= 3.7.0, all the settings will be delivered to CMake system -# variables to let CMake do the cross-compiling configurations itself. -# More detail of cross-compiling settings -# https://cmake.org/cmake/help/v3.7/manual/cmake-toolchains.7.html - -IF(NOT ANDROID) - return() -ENDIF() - -# check the exist of android standalone toolchain -IF(NOT DEFINED ANDROID_STANDALONE_TOOLCHAIN) - SET(ANDROID_STANDALONE_TOOLCHAIN $ENV{ANDROID_STANDALONE_TOOLCHAIN} - CACHE PATH "Folder holds the standalone toolchain of Android NDK") -ENDIF() -IF(NOT ANDROID_STANDALONE_TOOLCHAIN) - MESSAGE(WARNING "It is recommended to set ANDROID_STANDALONE_TOOLCHAIN to " - "use a standalone toolchain.\n" - "To cross-compile for Android, you need to:\n" - "1. Download an Android NDK from" - " https://developer.android.com/ndk/downloads/index.html\n" - "2. Setup a standalone toolchain" - "https://developer.android.google.cn/ndk/guides/standalone_toolchain.html?hl=zh-cn\n") -ENDIF() - -IF(NOT DEFINED CMAKE_SYSTEM_VERSION AND ANDROID_NATIVE_API_LEVEL) - IF(ANDROID_NATIVE_API_LEVEL MATCHES "^android-[0-9]+$") - STRING(REPLACE "android-" "" CMAKE_SYSTEM_VERSION "${CMAKE_MATCH_0}") - ELSEIF(ANDROID_NATIVE_API_LEVEL MATCHES "^[0-9]+$") - SET(CMAKE_SYSTEM_VERSION ${ANDROID_NATIVE_API_LEVEL}) - ENDIF() -ENDIF() - -IF(NOT DEFINED ANDROID_TOOLCHAIN) - SET(ANDROID_TOOLCHAIN clang) -ENDIF() - -IF(NOT DEFINED ANDROID_ABI) - SET(ANDROID_ABI "armeabi-v7a") -ENDIF() - -IF(NOT DEFINED ANDROID_ARM_MODE) - SET(ANDROID_ARM_MODE ON) -ENDIF() -IF(ANDROID_ARM_MODE) - SET(ANDROID_ARM_MODE_NAME "arm") -ELSE(ANDROID_ARM_MODE) - SET(ANDROID_ARM_MODE_NAME "thumb") -ENDIF(ANDROID_ARM_MODE) - -IF(NOT DEFINED ANDROID_ARM_NEON) - SET(ANDROID_ARM_NEON ON) -ENDIF() - -IF("${CMAKE_VERSION}" VERSION_LESS "3.7.0") - IF("${CMAKE_VERSION}" VERSION_LESS "3.1.0") - SET(CMAKE_SYSTEM_NAME "Linux") - ENDIF() - MESSAGE(WARNING "It is recommended to use CMake >= 3.7.0 (current version: " - "${CMAKE_VERSION}), when cross-compiling for Android.") - - IF(ANDROID_STANDALONE_TOOLCHAIN) - # Use standalone toolchain - SET(CMAKE_SYSROOT "${ANDROID_STANDALONE_TOOLCHAIN}/sysroot") - - IF(NOT CMAKE_SYSTEM_VERSION) - SET(ANDROID_STANDALONE_TOOLCHAIN_API "") - SET(ANDROID_API_LEVEL_H_REGEX "^[\t ]*#[\t ]*define[\t ]+__ANDROID_API__[\t ]+([0-9]+)") - FILE(STRINGS "${ANDROID_STANDALONE_TOOLCHAIN}/sysroot/usr/include/android/api-level.h" - ANDROID_API_LEVEL_H_CONTENT REGEX "${ANDROID_API_LEVEL_H_REGEX}") - IF(ANDROID_API_LEVEL_H_CONTENT MATCHES "${ANDROID_API_LEVEL_H_REGEX}") - SET(ANDROID_STANDALONE_TOOLCHAIN_API "${CMAKE_MATCH_1}") - ENDIF() - SET(CMAKE_SYSTEM_VERSION ${ANDROID_STANDALONE_TOOLCHAIN_API}) - ENDIF() - - # Toolchain - SET(ANDROID_TOOLCHAIN_ROOT ${ANDROID_STANDALONE_TOOLCHAIN}) - ELSE(ANDROID_NDK) - # TODO: use android ndk - ENDIF() - - IF(ANDROID_ABI MATCHES "^armeabi(-v7a)?$") - SET(ANDROID_TOOLCHAIN_NAME arm-linux-androideabi) - IF(ANDROID_ABI STREQUAL "armeabi") - SET(CMAKE_SYSTEM_PROCESSOR armv5te) - SET(ANDROID_CLANG_TRIPLE armv5te-none-linux-androideabi) - ELSEIF(ANDROID_ABI STREQUAL "armeabi-v7a") - SET(CMAKE_SYSTEM_PROCESSOR armv7-a) - SET(ANDROID_CLANG_TRIPLE armv7-none-linux-androideabi) - ENDIF() - ELSEIF(ANDROID_ABI STREQUAL "arm64-v8a") - SET(ANDROID_TOOLCHAIN_NAME aarch64-linux-android) - SET(CMAKE_SYSTEM_PROCESSOR aarch64) - SET(ANDROID_CLANG_TRIPLE aarch64-none-linux-android) - ELSE() - MESSAGE(FATAL_ERROR "Invalid Android ABI: ${ANDROID_ABI}.") - ENDIF() - SET(ANDROID_TOOLCHAIN_PREFIX "${ANDROID_TOOLCHAIN_ROOT}/bin/${ANDROID_TOOLCHAIN_NAME}-") - - IF(ANDROID_TOOLCHAIN STREQUAL clang) - SET(ANDROID_C_COMPILER_NAME clang) - SET(ANDROID_CXX_COMPILER_NAME clang++) - SET(CMAKE_C_COMPILER_TARGET ${ANDROID_CLANG_TRIPLE}) - SET(CMAKE_CXX_COMPILER_TARGET ${ANDROID_CLANG_TRIPLE}) - ELSEIF(ANDROID_TOOLCHAIN STREQUAL gcc) - SET(ANDROID_C_COMPILER_NAME gcc) - SET(ANDROID_CXX_COMPILER_NAME g++) - ELSE() - MESSAGE(FATAL_ERROR "Invalid Android toolchain: ${ANDROID_TOOLCHAIN}") - ENDIF() - - # C compiler - IF(NOT CMAKE_C_COMPILER) - SET(ANDROID_C_COMPILER "${ANDROID_TOOLCHAIN_PREFIX}${ANDROID_C_COMPILER_NAME}") - ELSE() - GET_FILENAME_COMPONENT(ANDROID_C_COMPILER ${CMAKE_C_COMPILER} PROGRAM) - ENDIF() - IF(NOT EXISTS ${ANDROID_C_COMPILER}) - MESSAGE(FATAL_ERROR "Cannot find C compiler: ${ANDROID_C_COMPILER}") - ENDIF() - - # CXX compiler - IF(NOT CMAKE_CXX_COMPILER) - SET(ANDROID_CXX_COMPILER "${ANDROID_TOOLCHAIN_PREFIX}${ANDROID_CXX_COMPILER_NAME}") - ELSE() - GET_FILENAME_COMPONENT(ANDROID_CXX_COMPILER ${CMAKE_CXX_COMPILER} PROGRAM) - ENDIF() - IF(NOT EXISTS ${ANDROID_CXX_COMPILER}) - MESSAGE(FATAL_ERROR "Cannot find CXX compiler: ${ANDROID_CXX_COMPILER}") - ENDIF() - - SET(CMAKE_C_COMPILER ${ANDROID_C_COMPILER} CACHE PATH "C compiler" FORCE) - SET(CMAKE_CXX_COMPILER ${ANDROID_CXX_COMPILER} CACHE PATH "CXX compiler" FORCE) - - # Toolchain and ABI specific flags. - SET(ANDROID_COMPILER_FLAGS "-ffunction-sections -fdata-sections") - SET(ANDROID_LINKER_FLAGS "-Wl,--gc-sections") - - IF(ANDROID_ABI STREQUAL "armeabi") - LIST(APPEND ANDROID_COMPILER_FLAGS - -march=armv5te - -mtune=xscale - -msoft-float) - ELSEIF(ANDROID_ABI STREQUAL "armeabi-v7a") - LIST(APPEND ANDROID_COMPILER_FLAGS - -march=armv7-a - -mfloat-abi=softfp) - IF(ANDROID_ARM_NEON) - LIST(APPEND ANDROID_COMPILER_FLAGS -mfpu=neon) - ELSE() - LIST(APPEND ANDROID_COMPILER_FLAGS -mfpu=vfpv3-d16) - ENDIF() - LIST(APPEND ANDROID_LINKER_FLAGS -Wl,--fix-cortex-a8) - ELSEIF(ANDROID_ABI STREQUAL "arm64-v8a") - LIST(APPEND ANDROID_COMPILER_FLAGS -march=armv8-a) - ENDIF() - - IF(ANDROID_ABI MATCHES "^armeabi(-v7a)?$") - IF(ANDROID_ARM_MODE) - LIST(APPEND ANDROID_COMPILER_FLAGS -marm) - ELSE() - LIST(APPEND ANDROID_COMPILER_FLAGS -mthumb) - ENDIF() - IF(ANDROID_TOOLCHAIN STREQUAL clang) - # Disable integrated-as for better compatibility. - LIST(APPEND ANDROID_COMPILER_FLAGS -fno-integrated-as) - ENDIF() - ENDIF() - - IF(ANDROID_TOOLCHAIN STREQUAL clang) - # CMake automatically forwards all compiler flags to the linker, - # and clang doesn't like having -Wa flags being used for linking. - # To prevent CMake from doing this would require meddling with - # the CMAKE__COMPILE_OBJECT rules, which would get quite messy. - LIST(APPEND ANDROID_LINKER_FLAGS -Qunused-arguments) - ENDIF() - - STRING(REPLACE ";" " " ANDROID_COMPILER_FLAGS "${ANDROID_COMPILER_FLAGS}") - STRING(REPLACE ";" " " ANDROID_LINKER_FLAGS "${ANDROID_LINKER_FLAGS}") - - SET(CMAKE_C_FLAGS "${ANDROID_COMPILER_FLAGS} ${CMAKE_C_FLAGS}" - CACHE STRING "C flags") - SET(CMAKE_CXX_FLAGS "${ANDROID_COMPILER_FLAGS} ${CMAKE_CXX_FLAGS}" - CACHE STRING "CXX flags") - SET(CMAKE_SHARED_LINKER_FLAGS "${ANDROID_LINKER_FLAGS} ${CMAKE_SHARED_LINKER_FLAGS}" - CACHE STRING "shared linker flags") - - SET(CMAKE_POSITION_INDEPENDENT_CODE TRUE) - SET(CMAKE_EXE_LINKER_FLAGS "-pie -fPIE ${ANDROID_LINKER_FLAGS} ${CMAKE_EXE_LINKER_FLAGS}" - CACHE STRING "executable linker flags") - - MESSAGE(STATUS "Android: Targeting API '${CMAKE_SYSTEM_VERSION}' " - "with architecture '${ANDROID_ARM_MODE_NAME}', " - "ABI '${ANDROID_ABI}', and processor '${CMAKE_SYSTEM_PROCESSOR}'") - MESSAGE(STATUS "System CMAKE_C_FLAGS: " ${CMAKE_C_FLAGS}) - MESSAGE(STATUS "System CMAKE_CXX_FLAGS: " ${CMAKE_CXX_FLAGS}) -ELSE() - IF(ANDROID_STANDALONE_TOOLCHAIN) - SET(CMAKE_ANDROID_STANDALONE_TOOLCHAIN ${ANDROID_STANDALONE_TOOLCHAIN}) - ENDIF() - SET(CMAKE_ANDROID_ARCH_ABI ${ANDROID_ABI}) - IF(ANDROID_ABI MATCHES "^armeabi(-v7a)?$") - SET(CMAKE_ANDROID_ARM_MODE ${ANDROID_ARM_MODE}) - IF(ANDROID_ABI STREQUAL "armeabi-v7a") - SET(CMAKE_ANDROID_ARM_NEON ${ANDROID_ARM_NEON}) - ENDIF() - ENDIF() -ENDIF() diff --git a/cmake/cross_compiling/host.cmake b/cmake/cross_compiling/host.cmake deleted file mode 100644 index f9c6b12136f488a9a6ab77b1ba673b6be75391b5..0000000000000000000000000000000000000000 --- a/cmake/cross_compiling/host.cmake +++ /dev/null @@ -1,49 +0,0 @@ -# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# find host C compiler -IF(HOST_C_COMPILER) - SET(HOST_C_COMPILER_NAME ${HOST_C_COMPILER}) -ELSEIF(NOT $ENV{CC} STREQUAL "") - SET(HOST_C_COMPILER_NAME $ENV{CC}) -ELSE() - SET(HOST_C_COMPILER_NAME cc) -ENDIF() - -GET_FILENAME_COMPONENT(HOST_C_COMPILER_PATH ${HOST_C_COMPILER_NAME} PROGRAM) -IF(NOT HOST_C_COMPILER_PATH OR NOT EXISTS ${HOST_C_COMPILER_PATH}) - MESSAGE(FATAL_ERROR "Cannot find host C compiler, set host C compiler:\n" - "\tcmake .. -DHOST_C_COMPILER=...") -ENDIF() - -# find host CXX compiler -IF(HOST_CXX_COMPILER) - SET(HOST_CXX_COMPILER_NAME ${HOST_CXX_COMPILER}) -ELSEIF(NOT $ENV{CXX} STREQUAL "") - SET(HOST_CXX_COMPILER_NAME $ENV{CXX}) -ELSE() - SET(HOST_CXX_COMPILER_NAME c++) -ENDIF() - -GET_FILENAME_COMPONENT(HOST_CXX_COMPILER_PATH ${HOST_CXX_COMPILER_NAME} PROGRAM) -IF(NOT HOST_CXX_COMPILER_PATH OR NOT EXISTS ${HOST_CXX_COMPILER_PATH}) - MESSAGE(FATAL_ERROR "Cannot find host CXX compiler, set host CXX compiler:\n" - "\tcmake .. -DHOST_CXX_COMPILER=...") -ENDIF() - -SET(HOST_C_COMPILER ${HOST_C_COMPILER_PATH} CACHE PATH "Host C compiler") -SET(HOST_CXX_COMPILER ${HOST_CXX_COMPILER_PATH} CACHE PATH "Host CXX compiler") - -MESSAGE(STATUS "Found host C compiler: " ${HOST_C_COMPILER}) -MESSAGE(STATUS "Found host CXX compiler: " ${HOST_CXX_COMPILER}) diff --git a/cmake/cross_compiling/ios.cmake b/cmake/cross_compiling/ios.cmake deleted file mode 100644 index 10d389ec8ed57ac2b15dd925ef99c8aff4807b05..0000000000000000000000000000000000000000 --- a/cmake/cross_compiling/ios.cmake +++ /dev/null @@ -1,347 +0,0 @@ -# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# This is a toolchain file for cross-compiling for iOS, and the -# configuration largely refers to public toolchain file: -# https://raw.githubusercontent.com/leetal/ios-cmake/master/ios.toolchain.cmake -# and -# https://github.com/cristeab/ios-cmake -# -# Supports options: -# IOS_PLATFORM = OS (default) or SIMULATOR -# This decides if SDKS will be selected from the iPhoneOS.platform or iPhoneSimulator.platform folders -# OS - the default, used to build for iPhone and iPad physical devices, which have an arm arch. -# SIMULATOR - used to build for the Simulator platforms, which have an x86 arch. -# IOS_ARCH -# The archectures wanted to support, such "arm64", "armv7;arm64" -# IOS_DEPLOYMENT_TARGET -# The minimum iOS deployment version, such as "7.0" -# IOS_ENABLE_BITCODE = ON (default) or OFF -# IOS_USE_VECLIB_FOR_BLAS = OFF (default) or ON -# IOS_DEVELOPER_ROOT = automatic(default) or /path/to/platform/Developer folder -# By default this location is automatcially chosen based on the IOS_PLATFORM value above. -# If set manually, it will override the default location and force the user of a particular Developer Platform -# IOS_SDK_ROOT = automatic(default) or /path/to/platform/Developer/SDKs/SDK folder -# By default this location is automatcially chosen based on the IOS_DEVELOPER_ROOT value. -# In this case it will always be the most up-to-date SDK found in the IOS_DEVELOPER_ROOT path. -# If set manually, this will force the use of a specific SDK version - -# Macros: -# set_xcode_property (TARGET XCODE_PROPERTY XCODE_VALUE) -# A convenience macro for setting xcode specific properties on targets -# example: set_xcode_property (myioslib IPHONEOS_DEPLOYMENT_TARGET "3.1") -# find_host_package (PROGRAM ARGS) -# A macro used to find executable programs on the host system, not within the iOS environment. -# Thanks to the android-cmake project for providing the command - -if(NOT IOS) - return() -endif() - -set(CMAKE_SYSTEM_NAME Darwin) - -# Get the Xcode version being used. -execute_process(COMMAND xcodebuild -version - OUTPUT_VARIABLE XCODE_VERSION - RESULT_VARIABLE XCODE_VERSION_RESULT - ERROR_QUIET OUTPUT_STRIP_TRAILING_WHITESPACE) -if(NOT ${XCODE_VERSION_RESULT}) - string(REGEX MATCH "Xcode [0-9\\.]+" XCODE_VERSION "${XCODE_VERSION}") - string(REGEX REPLACE "Xcode ([0-9\\.]+)" "\\1" XCODE_VERSION "${XCODE_VERSION}") - message(STATUS "Building with Xcode version: ${XCODE_VERSION}") -else() - message(FATAL_ERROR "Cannot execute xcodebuild, please check whether xcode is installed.") -endif() - -# Required as of cmake 2.8.10 -set(CMAKE_OSX_DEPLOYMENT_TARGET "" CACHE STRING "Force unset of the deployment target for iOS" FORCE) - -# Setup iOS platform unless specified manually with IOS_PLATFORM -if(NOT DEFINED IOS_PLATFORM) - set(IOS_PLATFORM "OS") -endif() -set(IOS_PLATFORM ${IOS_PLATFORM} CACHE STRING "Type of iOS Platform") - -# Set the architecture for iOS -if(NOT DEFINED IOS_ARCH) - if(IOS_PLATFORM STREQUAL "OS") - set(IOS_ARCH "armv7;armv7s;arm64") - elseif(IOS_PLATFORM STREQUAL "SIMULATOR") - set(IOS_ARCH "i386;x86_64") - endif() -endif() -set(CMAKE_OSX_ARCHITECTURES ${IOS_ARCH} CACHE string "Build architecture for iOS") - -# Specify minimum iOS deployment version -if(NOT DEFINED IOS_DEPLOYMENT_TARGET) - set(IOS_DEPLOYMENT_TARGET "7.0") -endif() -set(IOS_DEPLOYMENT_TARGET ${IOS_DEPLOYMENT_TARGET} CACHE STRING "Minimum iOS version") - -# Whether to enable bitcode -if(NOT DEFINED IOS_ENABLE_BITCODE) - set(IOS_ENABLE_BITCODE ON) -endif() -set(IOS_ENABLE_BITCODE ${IOS_ENABLE_BITCODE} CACHE BOOL "Whether to enable bitcode") - -if(NOT DEFINED IOS_USE_VECLIB_FOR_BLAS) - set(IOS_USE_VECLIB_FOR_BLAS OFF) -endif() -set(IOS_USE_VECLIB_FOR_BLAS ${IOS_UES_VECLIB_FOR_BLAS} CACHE BOOL "Whether to use veclib") - -# Check the platform selection and setup for developer root -if(${IOS_PLATFORM} STREQUAL "OS") - set(IOS_PLATFORM_LOCATION "iPhoneOS.platform") - set(XCODE_IOS_PLATFORM iphoneos) - - # This causes the installers to properly locate the output libraries - set(CMAKE_XCODE_EFFECTIVE_PLATFORMS "-iphoneos") -elseif(${IOS_PLATFORM} STREQUAL "SIMULATOR") - set(IOS_PLATFORM_LOCATION "iPhoneSimulator.platform") - set(XCODE_IOS_PLATFORM iphonesimulator) - - # This causes the installers to properly locate the output libraries - set(CMAKE_XCODE_EFFECTIVE_PLATFORMS "-iphonesimulator") -elseif(${IOS_PLATFORM} STREQUAL "WATCHOS") - set(IOS_PLATFORM_LOCATION "WatchOS.platform") - set(XCODE_IOS_PLATFORM watchos) - - # This causes the installers to properly locate the output libraries - set(CMAKE_XCODE_EFFECTIVE_PLATFORMS "-watchos") -else(${IOS_PLATFORM} STREQUAL "OS") - message(FATAL_ERROR "Unsupported IOS_PLATFORM value selected. Please set to\n" - "\t OS, SIMULATOR, or WATCHOS.") -endif() - -# Check iOS developer toolchain -if(NOT DEFINED IOS_DEVELOPER_ROOT) - # Setup iOS developer location - execute_process(COMMAND xcode-select -print-path - OUTPUT_VARIABLE XCODE_DEVELOPER_DIR - RESULT_VARIABLE XCODE_DEVELOPER_DIR_RESULT - ERROR_QUIET OUTPUT_STRIP_TRAILING_WHITESPACE) - # Xcode 4.3 changed the installation location, choose the most recent one available - if(${XCODE_VERSION} VERSION_LESS "4.3.0") - set(IOS_DEVELOPER_ROOT "/Developer/Platforms/${IOS_PLATFORM_LOCATION}/Developer") - else() - set(IOS_DEVELOPER_ROOT "${XCODE_DEVELOPER_DIR}/Platforms/${IOS_PLATFORM_LOCATION}/Developer") - endif() -endif() -if(EXISTS ${IOS_DEVELOPER_ROOT}) - set(IOS_DEVELOPER_ROOT ${IOS_DEVELOPER_ROOT} CACHE PATH "Location of iOS Platform") -else() - message(FATAL_ERROR "Invalid IOS_DEVELOPER_ROOT: ${IOS_DEVELOPER_ROOT} does not exist.") -endif() - -# Check iOS SDK -if(NOT DEFINED IOS_SDK_ROOT) - # Find and use the most recent iOS sdk - file(GLOB IOS_SDK_LISTS "${IOS_DEVELOPER_ROOT}/SDKs/*") - if(IOS_SDK_LISTS) - list(SORT IOS_SDK_LISTS) - list(REVERSE IOS_SDK_LISTS) - list(GET IOS_SDK_LISTS 0 IOS_SDK_ROOT) - else(IOS_SDK_LISTS) - message(FATAL_ERROR "No iOS SDK's found in default search path ${IOS_DEVELOPER_ROOT}." - " Please manually set IOS_SDK_ROOT or install the iOS SDK.") - endif(IOS_SDK_LISTS) -endif() -if(EXISTS ${IOS_SDK_ROOT}) - set(IOS_SDK_ROOT ${IOS_SDK_ROOT} CACHE PATH "Location of the selected iOS SDK") - message(STATUS "iOS toolchain: ${IOS_SDK_ROOT}") -else() - message(FATAL_ERROR "Invalid IOS_SDK_ROOT: ${IOS_SDK_ROOT} does not exist.") -endif() - -# Set the sysroot default to the most recent SDK -set(CMAKE_OSX_SYSROOT ${IOS_SDK_ROOT} CACHE PATH "Sysroot used for iOS support") - -# Get version of iOS SDK -execute_process(COMMAND xcodebuild -sdk ${CMAKE_OSX_SYSROOT} -version SDKVersion - OUTPUT_VARIABLE IOS_SDK_VERSION - RESULT_VARIABLE IOS_SDK_VERSION_RESULT - ERROR_QUIET OUTPUT_STRIP_TRAILING_WHITESPACE) -if(${IOS_SDK_VERSION_RESULT}) - string(REGEX MATCH "(([0-9]+)\\.)+([0-9]+)" IOS_SDK_VERSION "${IOS_SDK_ROOT}") -endif() -if(NOT IOS_SDK_VERSION) - message(WARNING "Cannot get SDK's version.") - set(IOS_SDK_VERSION 1) -endif() -set(CMAKE_SYSTEM_VERSION ${IOS_SDK_VERSION}) - -# Find the C & C++ compilers for the specified SDK. -if(NOT CMAKE_C_COMPILER) - # Default to use clang - execute_process(COMMAND xcrun -sdk ${CMAKE_OSX_SYSROOT} -find clang - OUTPUT_VARIABLE IOS_C_COMPILER - RESULT_VARIABLE IOS_C_COMPILER_RESULT - ERROR_QUIET OUTPUT_STRIP_TRAILING_WHITESPACE) - if(${IOS_C_COMPILER_RESULT}) - get_filename_component(IOS_C_COMPILER clang PROGRAM) - endif() -else(NOT CMAKE_C_COMPILER) - # User can set it in cmake command - get_filename_component(IOS_C_COMPILER ${CMAKE_C_COMPILER} PROGRAM) -endif(NOT CMAKE_C_COMPILER) -if(NOT EXISTS ${IOS_C_COMPILER}) - message(FATAL_ERROR "Cannot find C compiler: ${IOS_C_COMPILER}") -endif() - -if(NOT CMAKE_CXX_COMPILER) - # Default to use clang++ - execute_process(COMMAND xcrun -sdk ${CMAKE_OSX_SYSROOT} -find clang++ - OUTPUT_VARIABLE IOS_CXX_COMPILER - RESULT_VARIABLE IOS_CXX_COMPILER_RESULT - ERROR_QUIET OUTPUT_STRIP_TRAILING_WHITESPACE) - if(${IOS_CXX_COMPILER_RESULT}) - get_filename_component(IOS_CXX_COMPILER clang++ PROGRAM) - endif() -else(NOT CMAKE_CXX_COMPILER) - # User can set it in cmake command - get_filename_component(IOS_CXX_COMPILER ${CMAKE_CXX_COMPILER} PROGRAM) -endif(NOT CMAKE_CXX_COMPILER) -if(NOT EXISTS ${IOS_CXX_COMPILER}) - message(FATAL_ERROR "Cannot find CXX compiler: ${IOS_CXX_COMPILER}") -endif() - -set(CMAKE_C_COMPILER ${IOS_C_COMPILER} CACHE PATH "C compiler" FORCE) -set(CMAKE_CXX_COMPILER ${IOS_CXX_COMPILER} CACHE PATH "CXX compiler" FORCE) - -set(CMAKE_C_OSX_COMPATIBILITY_VERSION_FLAG "-compatibility_version ") -set(CMAKE_C_OSX_CURRENT_VERSION_FLAG "-current_version ") -set(CMAKE_CXX_OSX_COMPATIBILITY_VERSION_FLAG "${CMAKE_C_OSX_COMPATIBILITY_VERSION_FLAG}") -set(CMAKE_CXX_OSX_CURRENT_VERSION_FLAG "${CMAKE_C_OSX_CURRENT_VERSION_FLAG}") - -# Set iOS specific C/C++ flags -if(IOS_PLATFORM STREQUAL "OS") - if(XCODE_VERSION VERSION_LESS "7.0") - set(XCODE_IOS_PLATFORM_VERSION_FLAGS "-mios-version-min=${IOS_DEPLOYMENT_TARGET}") - else() - # Xcode 7.0+ uses flags we can build directly from XCODE_IOS_PLATFORM. - set(XCODE_IOS_PLATFORM_VERSION_FLAGS "-m${XCODE_IOS_PLATFORM}-version-min=${IOS_DEPLOYMENT_TARGET}") - endif() -else() - set(XCODE_IOS_FLATFORM_VERSION_FLAGS "-mios-simulator-version-min=${IOS_DEPLOYMENT_TARGET}") -endif() - -if(IOS_ENABLE_BITCODE) - set(XCODE_IOS_BITCODE_FLAGS "${IOS_COMPILER_FLAGS} -fembed-bitcode") -else() - set(XCODE_IOS_BITCODE_FLAGS "") -endif() - -set(IOS_COMPILER_FLAGS "${XCODE_IOS_PLATFORM_VERSION_FLAGS} ${XCODE_IOS_BITCODE_FLAGS}") - -# Hidden visibilty is required for cxx on iOS -set(CMAKE_C_FLAGS "${IOS_COMPILER_FLAGS} ${CMAKE_C_FLAGS}" CACHE STRING "C flags") -set(CMAKE_CXX_FLAGS "${IOS_COMPILER_FLAGS} -fvisibility=hidden -fvisibility-inlines-hidden ${CMAKE_CXX_FLAGS}" CACHE STRING "CXX flags") - -set(IOS_LINK_FLAGS "${XCODE_IOS_PLATFORM_VERSION_FLAGS} -Wl,-search_paths_first") - -if(IOS_USE_VECLIB_FOR_BLAS) - # Find vecLib for iOS - set(VECLIB_SEARCH_DIRS - ${IOS_SDK_ROOT}/System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks - ${IOS_SDK_ROOT}/System/Library/Frameworks/Accelerate.framework/Frameworks - ) - find_path(VECLIB_INC_DIR vecLib.h PATHS ${VECLIB_SEARCH_DIRS}/vecLib.framework/Headers) - - include(FindPackageHandleStandardArgs) - find_package_handle_standard_args(vecLib DEFAULT_MSG VECLIB_INC_DIR) - - if(VECLIB_FOUND) - if(VECLIB_INC_DIR MATCHES "^/System/Library/Frameworks/vecLib.framework.*") - set(IOS_LINK_FLAGS ${IOS_LINK_FLAGS} -lcblas "-framework vecLib") - message(STATUS "Found standalone vecLib.framework") - else() - set(IOS_LINK_FLAGS ${IOS_LINK_FLAGS} -lcblas "-framework Accelerate") - message(STATUS "Found vecLib as part of Accelerate.framework") - endif() - - endif() -endif() - -set(CMAKE_C_LINK_FLAGS "${IOS_LINK_FLAGS} ${CMAKE_C_LINK_FLAGS}") -set(CMAKE_CXX_LINK_FLAGS "${IOS_LINK_FLAGS} ${CMAKE_CXX_LINK_FLAGS}") - -set(CMAKE_PLATFORM_HAS_INSTALLNAME 1) -if(NOT IOS_ENABLE_BITCODE) - set(CMAKE_SHARED_LIBRARY_CREATE_C_FLAGS "-dynamiclib -headerpad_max_install_names") - set(CMAKE_SHARED_MODULE_CREATE_C_FLAGS "-bundle -headerpad_max_install_names") -else() - set(CMAKE_SHARED_LIBRARY_CREATE_C_FLAGS "-dynamiclib") - set(CMAKE_SHARED_MODULE_CREATE_C_FLAGS "-bundle") -endif() -set(CMAKE_SHARED_MODULE_LOADER_C_FLAG "-Wl,-bundle_loader,") -set(CMAKE_SHARED_MODULE_LOADER_CXX_FLAG "-Wl,-bundle_loader,") -set(CMAKE_FIND_LIBRARY_SUFFIXES ".dylib" ".so" ".a") - -# hack: if a new cmake (which uses CMAKE_INSTALL_NAME_TOOL) runs on an old build tree -# (where install_name_tool was hardcoded) and where CMAKE_INSTALL_NAME_TOOL isn't in the cache -# and still cmake didn't fail in CMakeFindBinUtils.cmake (because it isn't rerun) -# hardcode CMAKE_INSTALL_NAME_TOOL here to install_name_tool, so it behaves as it did before, Alex -if(NOT DEFINED CMAKE_INSTALL_NAME_TOOL) - find_program(CMAKE_INSTALL_NAME_TOOL install_name_tool) -endif() - -# Set the find root to the iOS developer roots and to user defined paths -set(CMAKE_FIND_ROOT_PATH ${IOS_DEVELOPER_ROOT} ${IOS_SDK_ROOT} ${CMAKE_PREFIX_PATH} - CACHE string "iOS find search path root") - -# default to searching for frameworks first -set(CMAKE_FIND_FRAMEWORK FIRST) - -# set up the default search directories for frameworks -set(CMAKE_SYSTEM_FRAMEWORK_PATH - ${IOS_SDK_ROOT}/System/Library/Frameworks - ${IOS_SDK_ROOT}/System/Library/PrivateFrameworks - ${IOS_SDK_ROOT}/Developer/Library/Frameworks - ) - -# only search the iOS sdks, not the remainder of the host filesystem -set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER) -set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY) -set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY) - -message(STATUS "iOS: Targeting iOS '${CMAKE_SYSTEM_VERSION}', " - "building for '${IOS_PLATFORM}' platform, with architecture '${CMAKE_OSX_ARCHITECTURES}'") -message(STATUS "System CMAKE_C_FLAGS: ${CMAKE_C_FLAGS}") -message(STATUS "System CMAKE_CXX_FLAGS: ${CMAKE_CXX_FLAGS}") - -# Used in ExternalProject command -string(REPLACE ";" "\\$" EXTERNAL_IOS_ARCHITECTURES "${CMAKE_OSX_ARCHITECTURES}") -set(EXTERNAL_OPTIONAL_ARGS - -DCMAKE_OSX_SYSROOT=${CMAKE_OSX_SYSROOT} - -DCMAKE_OSX_ARCHITECTURES=${EXTERNAL_IOS_ARCHITECTURES}) - -# This little macro lets you set any XCode specific property -macro(set_xcode_property TARGET XCODE_PROPERTY XCODE_VALUE) - set_property (TARGET ${TARGET} PROPERTY XCODE_ATTRIBUTE_${XCODE_PROPERTY} ${XCODE_VALUE}) -endmacro(set_xcode_property) - -# This macro lets you find executable programs on the host system -macro(find_host_package) - set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER) - set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY NEVER) - set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE NEVER) - set(IOS FALSE) - - find_package(${ARGN}) - - set(IOS TRUE) - set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM ONLY) - set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY) - set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY) -endmacro(find_host_package) diff --git a/cmake/cross_compiling/raspberry_pi.cmake b/cmake/cross_compiling/raspberry_pi.cmake deleted file mode 100644 index 0425b2ae158b265fd6f8423b05190a8002f03f20..0000000000000000000000000000000000000000 --- a/cmake/cross_compiling/raspberry_pi.cmake +++ /dev/null @@ -1,84 +0,0 @@ -# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# This is a toolchain file for cross-compiling for Raspberry Pi. -# The supported variables are listed belows: -# -# RPI_TOOLCHAIN -# RPI_ARM_NEON -# -# Also you can set CMAKE_C/CXX_COMPILER yourself, through cmake arguments. - -IF(NOT RPI) - return() -ENDIF() - -SET(CMAKE_SYSTEM_NAME Linux) -SET(CMAKE_SYSTEM_VERSION 1) -SET(CMAKE_SYSTEM_PROCESSOR arm) - -# check the exist of raspberry pi toolchain -IF(NOT DEFINED RPI_TOOLCHAIN) - SET(RPI_TOOLCHAIN $ENV{RPI_TOOLCHAIN} - CACHE PATH "Folder holds the toolchain of Raspberr Pi") -ENDIF() -IF(NOT RPI_TOOLCHAIN) - MESSAGE(WARNING "It is recommended to set RPI_TOOLCHAIN to use toolchain.\n" - "To cross-compile for Raspberry Pi, you need to download the tools using:\n" - " git clone https://github.com/raspberrypi/tools\n") -ENDIF() - -IF(NOT DEFINED RPI_ARM_NEON) - SET(RPI_ARM_NEON ON) -ENDIF() - -IF(RPI_TOOLCHAIN) - SET(RPI_TOOLCHAIN_ROOT ${RPI_TOOLCHAIN}) - IF(RPI_TOOLCHAIN_ROOT MATCHES "gcc-linaro-arm-linux-gnueabihf-raspbian(-x64)?$") - # gcc-linaro-arm-linux-gnueabihf-raspbian - # gcc-linaro-arm-linux-gnueabihf-raspbian-x64 - SET(RPI_TOOLCHAIN_NAME arm-linux-gnueabihf) - ENDIF() - SET(RPI_TOOLCHAIN_PREFIX "${RPI_TOOLCHAIN_ROOT}/bin/${RPI_TOOLCHAIN_NAME}-") -ENDIF() - -# C compiler -IF(NOT CMAKE_C_COMPILER) - SET(RPI_C_COMPILER "${RPI_TOOLCHAIN_PREFIX}gcc") -ELSE() - GET_FILENAME_COMPONENT(RPI_C_COMPILER ${CMAKE_C_COMPILER} PROGRAM) -ENDIF() -IF(NOT EXISTS ${RPI_C_COMPILER}) - MESSAGE(FATAL_ERROR "Cannot find C compiler: ${RPI_C_COMPILER}") -ENDIF() - -# CXX compiler -IF(NOT CMAKE_CXX_COMPILER) - SET(RPI_CXX_COMPILER "${RPI_TOOLCHAIN_PREFIX}g++") -ELSE() - GET_FILENAME_COMPONENT(RPI_CXX_COMPILER ${CMAKE_CXX_COMPILER} PROGRAM) -ENDIF() -IF(NOT EXISTS ${RPI_CXX_COMPILER}) - MESSAGE(FATAL_ERROR "Cannot find CXX compiler: ${RPI_CXX_COMPILER}") -ENDIF() - -SET(CMAKE_C_COMPILER ${RPI_C_COMPILER} CACHE PATH "C compiler" FORCE) -SET(CMAKE_CXX_COMPILER ${RPI_CXX_COMPILER} CACHE PATH "CXX compiler" FORCE) - -IF(RPI_ARM_NEON) - SET(RPI_C_FLAGS "${RPI_C_FLAGS} -mfpu=neon") -ENDIF() - -SET(CMAKE_C_FLAGS "${RPI_C_FLAGS} ${CMAKE_C_FLAGS}" CACHE STRING "C flags") -SET(CMAKE_CXX_FLAGS "${RPI_C_FLAGS} ${CMAKE_CXX_FLAGS}" CACHE STRING "CXX flags") diff --git a/cmake/cuda.cmake b/cmake/cuda.cmake index f507bb41a1103c093e9569176ee868cfaac6bf7b..735846db1db04e3884d72ec62d911d9a0efec147 100644 --- a/cmake/cuda.cmake +++ b/cmake/cuda.cmake @@ -5,6 +5,8 @@ endif() set(paddle_known_gpu_archs "30 35 50 52 60 61 70") set(paddle_known_gpu_archs7 "30 35 50 52") set(paddle_known_gpu_archs8 "30 35 50 52 60 61") +set(paddle_known_gpu_archs9 "30 35 50 52 60 61 70") +set(paddle_known_gpu_archs10 "30 35 50 52 60 61 70 75") ###################################################################################### # A function for automatic detection of GPUs installed (if autodetection is enabled) @@ -59,11 +61,9 @@ endfunction() # select_nvcc_arch_flags(out_variable) function(select_nvcc_arch_flags out_variable) # List of arch names - set(archs_names "Kepler" "Maxwell" "Pascal" "All" "Manual") + set(archs_names "Kepler" "Maxwell" "Pascal" "Volta" "Turing" "All" "Manual") set(archs_name_default "All") - if(NOT CMAKE_CROSSCOMPILING) - list(APPEND archs_names "Auto") - endif() + list(APPEND archs_names "Auto") # set CUDA_ARCH_NAME strings (so it will be seen as dropbox in CMake-Gui) set(CUDA_ARCH_NAME ${archs_name_default} CACHE STRING "Select target NVIDIA GPU achitecture.") @@ -93,6 +93,8 @@ function(select_nvcc_arch_flags out_variable) set(cuda_arch_bin "60 61") elseif(${CUDA_ARCH_NAME} STREQUAL "Volta") set(cuda_arch_bin "70") + elseif(${CUDA_ARCH_NAME} STREQUAL "Turing") + set(cuda_arch_bin "75") elseif(${CUDA_ARCH_NAME} STREQUAL "All") set(cuda_arch_bin ${paddle_known_gpu_archs}) elseif(${CUDA_ARCH_NAME} STREQUAL "Auto") @@ -139,10 +141,12 @@ endfunction() message(STATUS "CUDA detected: " ${CUDA_VERSION}) if (${CUDA_VERSION} LESS 7.0) set(paddle_known_gpu_archs ${paddle_known_gpu_archs}) + add_definitions("-DPADDLE_CUDA_BINVER=\"60\"") elseif (${CUDA_VERSION} LESS 8.0) # CUDA 7.x set(paddle_known_gpu_archs ${paddle_known_gpu_archs7}) list(APPEND CUDA_NVCC_FLAGS "-D_MWAITXINTRIN_H_INCLUDED") list(APPEND CUDA_NVCC_FLAGS "-D__STRICT_ANSI__") + add_definitions("-DPADDLE_CUDA_BINVER=\"70\"") elseif (${CUDA_VERSION} LESS 9.0) # CUDA 8.x set(paddle_known_gpu_archs ${paddle_known_gpu_archs8}) list(APPEND CUDA_NVCC_FLAGS "-D_MWAITXINTRIN_H_INCLUDED") @@ -150,13 +154,24 @@ elseif (${CUDA_VERSION} LESS 9.0) # CUDA 8.x # CUDA 8 may complain that sm_20 is no longer supported. Suppress the # warning for now. list(APPEND CUDA_NVCC_FLAGS "-Wno-deprecated-gpu-targets") + add_definitions("-DPADDLE_CUDA_BINVER=\"80\"") +elseif (${CUDA_VERSION} LESS 10.0) # CUDA 9.x + set(paddle_known_gpu_archs ${paddle_known_gpu_archs9}) + list(APPEND CUDA_NVCC_FLAGS "-D_MWAITXINTRIN_H_INCLUDED") + list(APPEND CUDA_NVCC_FLAGS "-D__STRICT_ANSI__") + add_definitions("-DPADDLE_CUDA_BINVER=\"90\"") +elseif (${CUDA_VERSION} LESS 11.0) # CUDA 10.x + set(paddle_known_gpu_archs ${paddle_known_gpu_archs10}) + list(APPEND CUDA_NVCC_FLAGS "-D_MWAITXINTRIN_H_INCLUDED") + list(APPEND CUDA_NVCC_FLAGS "-D__STRICT_ANSI__") + add_definitions("-DPADDLE_CUDA_BINVER=\"100\"") endif() include_directories(${CUDA_INCLUDE_DIRS}) -list(APPEND EXTERNAL_LIBS ${CUDA_LIBRARIES} ${CUDA_rt_LIBRARY}) if(NOT WITH_DSO) - # TODO(panyx0718): CUPTI only allows DSO? - list(APPEND EXTERNAL_LIBS ${CUDNN_LIBRARY} ${CUPTI_LIBRARY} ${CUDA_CUBLAS_LIBRARIES} ${CUDA_curand_LIBRARY} ${NCCL_LIBRARY}) + if(WIN32) + set_property(GLOBAL PROPERTY CUDA_MODULES ${CUDNN_LIBRARY} ${CUDA_CUBLAS_LIBRARIES} ${CUDA_curand_LIBRARY}) + endif(WIN32) endif(NOT WITH_DSO) # setting nvcc arch flags @@ -185,21 +200,27 @@ list(APPEND CUDA_NVCC_FLAGS "-w") list(APPEND CUDA_NVCC_FLAGS "--expt-relaxed-constexpr") if (NOT WIN32) -if(CMAKE_BUILD_TYPE STREQUAL "Debug") - list(APPEND CUDA_NVCC_FLAGS ${CMAKE_CXX_FLAGS_DEBUG}) -elseif(CMAKE_BUILD_TYPE STREQUAL "Release") - list(APPEND CUDA_NVCC_FLAGS ${CMAKE_CXX_FLAGS_RELEASE}) -elseif(CMAKE_BUILD_TYPE STREQUAL "RelWithDebInfo") - list(APPEND CUDA_NVCC_FLAGS ${CMAKE_CXX_FLAGS_RELWITHDEBINFO}) -elseif(CMAKE_BUILD_TYPE STREQUAL "MinSizeRel") - # nvcc 9 does not support -Os. Use Release flags instead - list(APPEND CUDA_NVCC_FLAGS ${CMAKE_CXX_FLAGS_RELEASE}) -endif() + if(CMAKE_BUILD_TYPE STREQUAL "Debug") + list(APPEND CUDA_NVCC_FLAGS ${CMAKE_CXX_FLAGS_DEBUG}) + elseif(CMAKE_BUILD_TYPE STREQUAL "Release") + list(APPEND CUDA_NVCC_FLAGS ${CMAKE_CXX_FLAGS_RELEASE}) + elseif(CMAKE_BUILD_TYPE STREQUAL "RelWithDebInfo") + list(APPEND CUDA_NVCC_FLAGS ${CMAKE_CXX_FLAGS_RELWITHDEBINFO}) + elseif(CMAKE_BUILD_TYPE STREQUAL "MinSizeRel") + # nvcc 9 does not support -Os. Use Release flags instead + list(APPEND CUDA_NVCC_FLAGS ${CMAKE_CXX_FLAGS_RELEASE}) + endif() else(NOT WIN32) -if(CMAKE_BUILD_TYPE STREQUAL "Release") - list(APPEND CUDA_NVCC_FLAGS "-O3 -DNDEBUG") -else() - message(FATAL "Windows only support Release build now. Please set visual studio build type to Release, x64 build.") + list(APPEND CUDA_NVCC_FLAGS "-Xcompiler \"/wd 4244 /wd 4267 /wd 4819\"") + list(APPEND CUDA_NVCC_FLAGS "--compiler-options;/bigobj") + if(CMAKE_BUILD_TYPE STREQUAL "Debug") + list(APPEND CUDA_NVCC_FLAGS "-g -G") + # match the cl's _ITERATOR_DEBUG_LEVEL + list(APPEND CUDA_NVCC_FLAGS "-D_DEBUG") + elseif(CMAKE_BUILD_TYPE STREQUAL "Release") + list(APPEND CUDA_NVCC_FLAGS "-O3 -DNDEBUG") + else() + message(FATAL "Windows only support Release or Debug build now. Please set visual studio build type to Release/Debug, x64 build.") endif() endif(NOT WIN32) diff --git a/cmake/cudnn.cmake b/cmake/cudnn.cmake index cd51533926de7bb132ab7bfab1686d664a331410..fff1980637d029b8a392c166734d3c3b84fed867 100644 --- a/cmake/cudnn.cmake +++ b/cmake/cudnn.cmake @@ -2,7 +2,12 @@ if(NOT WITH_GPU) return() endif() -set(CUDNN_ROOT "/usr" CACHE PATH "CUDNN ROOT") +if(WIN32) + set(CUDNN_ROOT ${CUDA_TOOLKIT_ROOT_DIR}) +else(WIN32) + set(CUDNN_ROOT "/usr" CACHE PATH "CUDNN ROOT") +endif(WIN32) + find_path(CUDNN_INCLUDE_DIR cudnn.h PATHS ${CUDNN_ROOT} ${CUDNN_ROOT}/include $ENV{CUDNN_ROOT} $ENV{CUDNN_ROOT}/include ${CUDA_TOOLKIT_INCLUDE} @@ -39,9 +44,9 @@ if(WIN32) set(CUDNN_LIB_NAME "cudnn.lib" "cudnn64_7.dll") endif(WIN32) -if(Apple) +if(APPLE) set(CUDNN_LIB_NAME "libcudnn.dylib" "libcudnn.so") -endif(Apple) +endif(APPLE) find_library(CUDNN_LIBRARY NAMES ${CUDNN_LIB_NAME} # libcudnn_static.a PATHS ${CUDNN_CHECK_LIBRARY_DIRS} ${CUDNN_INCLUDE_DIR} ${__libpath_hist} @@ -84,6 +89,7 @@ if(CUDNN_FOUND) if(NOT CUDNN_MAJOR_VERSION) set(CUDNN_VERSION "???") else() + add_definitions("-DPADDLE_CUDNN_BINVER=\"${CUDNN_MAJOR_VERSION}\"") math(EXPR CUDNN_VERSION "${CUDNN_MAJOR_VERSION} * 1000 + ${CUDNN_MINOR_VERSION} * 100 + ${CUDNN_PATCHLEVEL_VERSION}") diff --git a/cmake/external/anakin.cmake b/cmake/external/anakin.cmake index 84354c446e2f54fa13b90fa37221eed90968b251..77f4b34537577c7c5782675e7af19c73bc9f8e32 100644 --- a/cmake/external/anakin.cmake +++ b/cmake/external/anakin.cmake @@ -58,19 +58,19 @@ ExternalProject_Add( -DPROTOBUF_ROOT=${THIRD_PARTY_PATH}/install/protobuf -DMKLML_ROOT=${THIRD_PARTY_PATH}/install/mklml -DENABLE_OP_TIMER=${ANAKIN_ENABLE_OP_TIMER} + -DBUILD_FAT_BIN=${ANAKIN_BUILD_FAT_BIN} + -DBUILD_CROSS_PLANTFORM=${ANAKIN_BUILD_CROSS_PLANTFORM} ${EXTERNAL_OPTIONAL_ARGS} CMAKE_CACHE_ARGS -DCMAKE_INSTALL_PREFIX:PATH=${ANAKIN_INSTALL_DIR} ) message(STATUS "Anakin for inference is enabled") message(STATUS "Anakin is set INCLUDE:${ANAKIN_INCLUDE} LIBRARY:${ANAKIN_LIBRARY}") - +add_dependencies(extern_anakin protobuf mklml) add_library(anakin_shared SHARED IMPORTED GLOBAL) set_property(TARGET anakin_shared PROPERTY IMPORTED_LOCATION ${ANAKIN_SHARED_LIB}) -add_dependencies(anakin_shared extern_anakin protobuf mklml) +add_dependencies(anakin_shared extern_anakin) add_library(anakin_saber SHARED IMPORTED GLOBAL) set_property(TARGET anakin_saber PROPERTY IMPORTED_LOCATION ${ANAKIN_SABER_LIB}) -add_dependencies(anakin_saber extern_anakin protobuf mklml) - -list(APPEND external_project_dependencies anakin_shared anakin_saber) +add_dependencies(anakin_saber extern_anakin) diff --git a/cmake/external/any.cmake b/cmake/external/any.cmake deleted file mode 100644 index 85cce80b70a1fcf57015ac7a264e4950616b2717..0000000000000000000000000000000000000000 --- a/cmake/external/any.cmake +++ /dev/null @@ -1,31 +0,0 @@ -INCLUDE(ExternalProject) - -SET(ANY_SOURCE_DIR ${THIRD_PARTY_PATH}/any) - -INCLUDE_DIRECTORIES(${ANY_SOURCE_DIR}/src/extern_lib_any) - -ExternalProject_Add( - extern_lib_any - ${EXTERNAL_PROJECT_LOG_ARGS} - GIT_REPOSITORY "https://github.com/PaddlePaddle/any.git" - GIT_TAG "15595d8324be9e8a9a80d9ae442fdd12bd66df5d" - PREFIX ${ANY_SOURCE_DIR} - UPDATE_COMMAND "" - CONFIGURE_COMMAND "" - BUILD_COMMAND "" - INSTALL_COMMAND "" - TEST_COMMAND "" -) - -if (${CMAKE_VERSION} VERSION_LESS "3.3.0") - set(dummyfile ${CMAKE_CURRENT_BINARY_DIR}/lib_any_dummy.c) - file(WRITE ${dummyfile} "const char * dummy_any = \"${dummyfile}\";") - add_library(lib_any STATIC ${dummyfile}) -else() - add_library(lib_any INTERFACE) -endif() - -add_dependencies(lib_any extern_lib_any) - -add_definitions(-DANY_IMPL_ANY_CAST_MOVEABLE) -LIST(APPEND external_project_dependencies lib_any) diff --git a/cmake/external/boost.cmake b/cmake/external/boost.cmake index ada61de8eb15ae10288ac54f588e9adf84acee37..ba8b5fc6c838b221fcfb559f1f01051fc09072a4 100644 --- a/cmake/external/boost.cmake +++ b/cmake/external/boost.cmake @@ -23,39 +23,30 @@ set(BOOST_PROJECT "extern_boost") # checked that the devtools package of CentOS 6 installs boost 1.41.0. # So we use 1.41.0 here. set(BOOST_VER "1.41.0") -if((NOT DEFINED BOOST_TAR) OR (NOT DEFINED BOOST_URL)) - message(STATUS "use pre defined download url") - set(BOOST_TAR "boost_1_41_0" CACHE STRING "" FORCE) - set(BOOST_URL "http://paddlepaddledeps.cdn.bcebos.com/${BOOST_TAR}.tar.gz" CACHE STRING "" FORCE) -endif() -IF (WIN32) - MESSAGE(WARNING, "In windows, boost can not be downloaded automaticlly, please build it manually and put it at " ${THIRD_PARTY_PATH}install/boost) -else() - MESSAGE(STATUS "BOOST_TAR: ${BOOST_TAR}, BOOST_URL: ${BOOST_URL}") -ENDIF(WIN32) +set(BOOST_TAR "boost_1_41_0" CACHE STRING "" FORCE) +set(BOOST_URL "http://paddlepaddledeps.bj.bcebos.com/${BOOST_TAR}.tar.gz" CACHE STRING "" FORCE) + +MESSAGE(STATUS "BOOST_TAR: ${BOOST_TAR}, BOOST_URL: ${BOOST_URL}") set(BOOST_SOURCES_DIR ${THIRD_PARTY_PATH}/boost) set(BOOST_DOWNLOAD_DIR "${BOOST_SOURCES_DIR}/src/${BOOST_PROJECT}") -set(BOOST_INCLUDE_DIR "${BOOST_DOWNLOAD_DIR}/${BOOST_TAR}" CACHE PATH "boost include directory." FORCE) -set_directory_properties(PROPERTIES CLEAN_NO_CUSTOM 1) +set(BOOST_INCLUDE_DIR "${BOOST_DOWNLOAD_DIR}" CACHE PATH "boost include directory." FORCE) +set_directory_properties(PROPERTIES CLEAN_NO_CUSTOM 1) include_directories(${BOOST_INCLUDE_DIR}) -if (NOT WIN32) ExternalProject_Add( ${BOOST_PROJECT} ${EXTERNAL_PROJECT_LOG_ARGS} DOWNLOAD_DIR ${BOOST_DOWNLOAD_DIR} - DOWNLOAD_COMMAND wget --no-check-certificate ${BOOST_URL} -c -q -O ${BOOST_TAR}.tar.gz - && tar zxf ${BOOST_TAR}.tar.gz + URL ${BOOST_URL} DOWNLOAD_NO_PROGRESS 1 PREFIX ${BOOST_SOURCES_DIR} CONFIGURE_COMMAND "" BUILD_COMMAND "" INSTALL_COMMAND "" UPDATE_COMMAND "" -) -endif(NOT WIN32) + ) if (${CMAKE_VERSION} VERSION_LESS "3.3.0" OR NOT WIN32) set(dummyfile ${CMAKE_CURRENT_BINARY_DIR}/boost_dummy.c) @@ -66,5 +57,4 @@ else() endif() add_dependencies(boost ${BOOST_PROJECT}) -list(APPEND external_project_dependencies boost) set(Boost_INCLUDE_DIR ${BOOST_INCLUDE_DIR}) diff --git a/cmake/external/brpc.cmake b/cmake/external/brpc.cmake index 30b227b6452abf44171a1a4e04569e66b16e67a4..989d1dbd4cf593e779b94f7bb5eda613f000859c 100644 --- a/cmake/external/brpc.cmake +++ b/cmake/external/brpc.cmake @@ -14,14 +14,16 @@ INCLUDE(ExternalProject) -find_library(SSL_LIBRARY NAMES ssl) +find_package(OpenSSL REQUIRED) + +message(STATUS "ssl:" ${OPENSSL_SSL_LIBRARY}) +message(STATUS "crypto:" ${OPENSSL_CRYPTO_LIBRARY}) + ADD_LIBRARY(ssl SHARED IMPORTED GLOBAL) -SET_PROPERTY(TARGET ssl PROPERTY IMPORTED_LOCATION ${SSL_LIBRARY}) +SET_PROPERTY(TARGET ssl PROPERTY IMPORTED_LOCATION ${OPENSSL_SSL_LIBRARY}) -find_library(CRYPTO_LIBRARY NAMES crypto) ADD_LIBRARY(crypto SHARED IMPORTED GLOBAL) -SET_PROPERTY(TARGET crypto PROPERTY IMPORTED_LOCATION ${CRYPTO_LIBRARY}) - +SET_PROPERTY(TARGET crypto PROPERTY IMPORTED_LOCATION ${OPENSSL_CRYPTO_LIBRARY}) SET(BRPC_SOURCES_DIR ${THIRD_PARTY_PATH}/brpc) SET(BRPC_INSTALL_DIR ${THIRD_PARTY_PATH}/install/brpc) @@ -31,14 +33,15 @@ SET(BRPC_LIBRARIES "${BRPC_INSTALL_DIR}/lib/libbrpc.a" CACHE FILEPATH "brpc libr INCLUDE_DIRECTORIES(${BRPC_INCLUDE_DIR}) # Reference https://stackoverflow.com/questions/45414507/pass-a-list-of-prefix-paths-to-externalproject-add-in-cmake-args -set(prefix_path "${THIRD_PARTY_PATH}/install/gflags|${THIRD_PARTY_PATH}/install/leveldb|${THIRD_PARTY_PATH}/install/snappy|${THIRD_PARTY_PATH}/install/gtest|${THIRD_PARTY_PATH}/install/protobuf|${THIRD_PARTY_PATH}/install/zlib") +set(prefix_path "${THIRD_PARTY_PATH}/install/gflags|${THIRD_PARTY_PATH}/install/leveldb|${THIRD_PARTY_PATH}/install/snappy|${THIRD_PARTY_PATH}/install/gtest|${THIRD_PARTY_PATH}/install/protobuf|${THIRD_PARTY_PATH}/install/zlib|${THIRD_PARTY_PATH}/install/glog") # If minimal .a is need, you can set WITH_DEBUG_SYMBOLS=OFF ExternalProject_Add( extern_brpc ${EXTERNAL_PROJECT_LOG_ARGS} + # TODO(gongwb): change to de newst repo when they changed. GIT_REPOSITORY "https://github.com/gongweibao/brpc" - GIT_TAG "7dc04defad1fd4173aae170c3fcbde131b65155a" + GIT_TAG "e9b67ec1b7458f2af5fae76451afe1e27e01b4b4" PREFIX ${BRPC_SOURCES_DIR} UPDATE_COMMAND "" CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} @@ -50,7 +53,7 @@ ExternalProject_Add( -DCMAKE_POSITION_INDEPENDENT_CODE=ON -DCMAKE_BUILD_TYPE=${THIRD_PARTY_BUILD_TYPE} -DCMAKE_PREFIX_PATH=${prefix_path} - -DBRPC_WITH_GLOG=ON + -DWITH_GLOG=ON -DIOBUF_WITH_HUGE_BLOCK=ON -DBRPC_WITH_RDMA=${WITH_BRPC_RDMA} ${EXTERNAL_OPTIONAL_ARGS} @@ -65,5 +68,4 @@ ADD_LIBRARY(brpc STATIC IMPORTED GLOBAL) SET_PROPERTY(TARGET brpc PROPERTY IMPORTED_LOCATION ${BRPC_LIBRARIES}) ADD_DEPENDENCIES(brpc extern_brpc) - -LIST(APPEND external_project_dependencies brpc) +add_definitions(-DBRPC_WITH_GLOG) diff --git a/cmake/external/cares.cmake b/cmake/external/cares.cmake index a743b572a6c3f6f152d85909500e9dbb35c72a01..52507a6ae4aabe300cf8bf88d0946c45a2c0e79c 100644 --- a/cmake/external/cares.cmake +++ b/cmake/external/cares.cmake @@ -13,7 +13,7 @@ # limitations under the License. # -IF(MOBILE_INFERENCE OR NOT WITH_DISTRIBUTE) +IF(NOT WITH_DISTRIBUTE) return() ENDIF() diff --git a/cmake/external/cub.cmake b/cmake/external/cub.cmake index c94849cf4b96746e6c507db2a6310c2f305dacf5..41ad8207743201fbddd1d678fc5122afe68207ae 100644 --- a/cmake/external/cub.cmake +++ b/cmake/external/cub.cmake @@ -31,5 +31,3 @@ else() endif() add_dependencies(cub extern_cub) - -LIST(APPEND externl_project_dependencies cub) diff --git a/cmake/external/dgc.cmake b/cmake/external/dgc.cmake new file mode 100644 index 0000000000000000000000000000000000000000..a58b8c68d7716a901db1907af64c4a344a24cfc6 --- /dev/null +++ b/cmake/external/dgc.cmake @@ -0,0 +1,42 @@ +# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +INCLUDE(ExternalProject) + +SET(DGC_SOURCES_DIR "${THIRD_PARTY_PATH}/dgc") +SET(DGC_INSTALL_DIR "${THIRD_PARTY_PATH}/install/dgc") +SET(DGC_INCLUDE_DIR "${DGC_INSTALL_DIR}/include" CACHE PATH "dgc include directory." FORCE) +SET(DGC_LIBRARIES "${DGC_INSTALL_DIR}/lib/libdgc.a" CACHE FILEPATH "dgc library." FORCE) +INCLUDE_DIRECTORIES(${DGC_INCLUDE_DIR}) + +ExternalProject_Add( + extern_dgc + ${EXTERNAL_PROJECT_LOG_ARGS} + GIT_REPOSITORY "https://github.com/PaddlePaddle/Fleet" + GIT_TAG "2d04dc3800cdd0601f1b65d547dabcc60b0cf9dc" + SOURCE_DIR "${DGC_SOURCES_DIR}" + CONFIGURE_COMMAND "" + BUILD_COMMAND cd collective && make -j + INSTALL_COMMAND mkdir -p ${DGC_INSTALL_DIR}/lib/ ${DGC_INCLUDE_DIR}/dgc + && cp ${DGC_SOURCES_DIR}/collective/build/lib/libdgc.a ${DGC_LIBRARIES} + && cp ${DGC_SOURCES_DIR}/collective/build/include/dgc.h ${DGC_INCLUDE_DIR}/dgc/ + BUILD_IN_SOURCE 1 +) + +ADD_LIBRARY(dgc STATIC IMPORTED GLOBAL) +SET_PROPERTY(TARGET dgc PROPERTY IMPORTED_LOCATION ${DGC_LIBRARIES}) +ADD_DEPENDENCIES(dgc extern_dgc) + +LIST(APPEND external_project_dependencies dgc) + diff --git a/cmake/external/dlpack.cmake b/cmake/external/dlpack.cmake new file mode 100644 index 0000000000000000000000000000000000000000..63dd16b28e40a0c2d5310bec011c721285049952 --- /dev/null +++ b/cmake/external/dlpack.cmake @@ -0,0 +1,29 @@ +include(ExternalProject) + +set(DLPACK_SOURCE_DIR ${THIRD_PARTY_PATH}/dlpack) +set(DLPACK_INCLUDE_DIR ${DLPACK_SOURCE_DIR}/src/extern_dlpack/include) + +include_directories(${DLPACK_INCLUDE_DIR}) + +ExternalProject_Add( + extern_dlpack + ${EXTERNAL_PROJECT_LOG_ARGS} + GIT_REPOSITORY "https://github.com/dmlc/dlpack.git" + GIT_TAG "v0.2" + PREFIX ${DLPACK_SOURCE_DIR} + UPDATE_COMMAND "" + CONFIGURE_COMMAND "" + BUILD_COMMAND "" + INSTALL_COMMAND "" + TEST_COMMAND "" +) + +if(${CMAKE_VERSION} VERSION_LESS "3.3.0") + set(dummyfile ${CMAKE_CURRENT_BINARY_DIR}/dlpack_dummy.c) + file(WRITE ${dummyfile} "const char *dummy = \"${dummyfile}\";") + add_library(dlpack STATIC ${dummyfile}) +else() + add_library(dlpack INTERFACE) +endif() + +add_dependencies(dlpack extern_dlpack) diff --git a/cmake/external/eigen.cmake b/cmake/external/eigen.cmake index 573ad5e5f06a93f38f24c6a8af3b45767e93a1a4..72441160f89d2c188d35fc6b08b5f0b6d746a1ad 100644 --- a/cmake/external/eigen.cmake +++ b/cmake/external/eigen.cmake @@ -17,7 +17,7 @@ if(WITH_AMD_GPU) extern_eigen3 ${EXTERNAL_PROJECT_LOG_ARGS} GIT_REPOSITORY "https://github.com/sabreshao/hipeigen.git" - GIT_TAG 0cba03ff9f8f9f70bbd92ac5857b031aa8fed6f9 + GIT_TAG 7cb2b6e5a4b4a1efe658abb215cd866c6fb2275e PREFIX ${EIGEN_SOURCE_DIR} UPDATE_COMMAND "" CONFIGURE_COMMAND "" @@ -52,5 +52,3 @@ else() endif() add_dependencies(eigen3 extern_eigen3) - -LIST(APPEND external_project_dependencies eigen3) diff --git a/cmake/external/gflags.cmake b/cmake/external/gflags.cmake index cf58cc39762351f8b37d073bcd218d249285bf52..911920ed6212b87aa25ba9a1faf7696fbcb22587 100644 --- a/cmake/external/gflags.cmake +++ b/cmake/external/gflags.cmake @@ -35,7 +35,12 @@ ExternalProject_Add( CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} + -DBUILD_STATIC_LIBS=ON -DCMAKE_INSTALL_PREFIX=${GFLAGS_INSTALL_DIR} -DCMAKE_POSITION_INDEPENDENT_CODE=ON -DBUILD_TESTING=OFF @@ -48,22 +53,19 @@ ExternalProject_Add( IF(WIN32) IF(NOT EXISTS "${GFLAGS_INSTALL_DIR}/lib/libgflags.lib") add_custom_command(TARGET extern_gflags POST_BUILD - COMMAND cmake -E rename ${GFLAGS_INSTALL_DIR}/lib/gflags_static.lib ${GFLAGS_INSTALL_DIR}/lib/libgflags.lib - ) + COMMAND cmake -E copy ${GFLAGS_INSTALL_DIR}/lib/gflags_static.lib ${GFLAGS_INSTALL_DIR}/lib/libgflags.lib + ) ENDIF() ENDIF(WIN32) ADD_LIBRARY(gflags STATIC IMPORTED GLOBAL) SET_PROPERTY(TARGET gflags PROPERTY IMPORTED_LOCATION ${GFLAGS_LIBRARIES}) ADD_DEPENDENCIES(gflags extern_gflags) -LIST(APPEND external_project_dependencies gflags) - -IF(WITH_C_API) - INSTALL(DIRECTORY ${GFLAGS_INCLUDE_DIR} DESTINATION third_party/gflags) - IF(ANDROID) - INSTALL(FILES ${GFLAGS_LIBRARIES} DESTINATION third_party/gflags/lib/${ANDROID_ABI}) - ELSE() - INSTALL(FILES ${GFLAGS_LIBRARIES} DESTINATION third_party/gflags/lib) - ENDIF() -ENDIF() - +# On Windows (including MinGW), the Shlwapi library is used by gflags if available. +if (WIN32) + include(CheckIncludeFileCXX) + check_include_file_cxx("shlwapi.h" HAVE_SHLWAPI) + if (HAVE_SHLWAPI) + set_property(GLOBAL PROPERTY OS_DEPENDENCY_MODULES shlwapi.lib) + endif(HAVE_SHLWAPI) +endif (WIN32) diff --git a/cmake/external/glog.cmake b/cmake/external/glog.cmake index 25ef2970ac52f12f961c9c6d3a589fec4c80983f..7fa17ce6b7b106c47c486729d0136748c73176a7 100644 --- a/cmake/external/glog.cmake +++ b/cmake/external/glog.cmake @@ -20,20 +20,16 @@ SET(GLOG_INCLUDE_DIR "${GLOG_INSTALL_DIR}/include" CACHE PATH "glog include dire IF(WIN32) SET(GLOG_LIBRARIES "${GLOG_INSTALL_DIR}/lib/libglog.lib" CACHE FILEPATH "glog library." FORCE) + SET(GLOG_CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /wd4267 /wd4530") ELSE(WIN32) SET(GLOG_LIBRARIES "${GLOG_INSTALL_DIR}/lib/libglog.a" CACHE FILEPATH "glog library." FORCE) + SET(GLOG_CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS}) ENDIF(WIN32) INCLUDE_DIRECTORIES(${GLOG_INCLUDE_DIR}) -IF(ANDROID AND ${CMAKE_SYSTEM_VERSION} VERSION_LESS "21") - # Using the unofficial glog for Android API < 21 - SET(GLOG_REPOSITORY "https://github.com/Xreki/glog.git") - SET(GLOG_TAG "8a547150548b284382ccb6582408e9140ff2bea8") -ELSE() - SET(GLOG_REPOSITORY "https://github.com/google/glog.git") - SET(GLOG_TAG "v0.3.5") -ENDIF() +SET(GLOG_REPOSITORY "https://github.com/google/glog.git") +SET(GLOG_TAG "v0.3.5") ExternalProject_Add( extern_glog @@ -45,8 +41,12 @@ ExternalProject_Add( UPDATE_COMMAND "" CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} - -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS=${GLOG_CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} -DCMAKE_INSTALL_PREFIX=${GLOG_INSTALL_DIR} -DCMAKE_INSTALL_LIBDIR=${GLOG_INSTALL_DIR}/lib -DCMAKE_POSITION_INDEPENDENT_CODE=ON @@ -63,7 +63,7 @@ ExternalProject_Add( IF(WIN32) IF(NOT EXISTS "${GLOG_INSTALL_DIR}/lib/libglog.lib") add_custom_command(TARGET extern_glog POST_BUILD - COMMAND cmake -E rename ${GLOG_INSTALL_DIR}/lib/glog.lib ${GLOG_INSTALL_DIR}/lib/libglog.lib + COMMAND cmake -E copy ${GLOG_INSTALL_DIR}/lib/glog.lib ${GLOG_INSTALL_DIR}/lib/libglog.lib ) ENDIF() ENDIF(WIN32) @@ -72,14 +72,3 @@ ADD_LIBRARY(glog STATIC IMPORTED GLOBAL) SET_PROPERTY(TARGET glog PROPERTY IMPORTED_LOCATION ${GLOG_LIBRARIES}) ADD_DEPENDENCIES(glog extern_glog gflags) LINK_LIBRARIES(glog gflags) - -LIST(APPEND external_project_dependencies glog) - -IF(WITH_C_API) - INSTALL(DIRECTORY ${GLOG_INCLUDE_DIR} DESTINATION third_party/glog) - IF(ANDROID) - INSTALL(FILES ${GLOG_LIBRARIES} DESTINATION third_party/glog/lib/${ANDROID_ABI}) - ELSE() - INSTALL(FILES ${GLOG_LIBRARIES} DESTINATION third_party/glog/lib) - ENDIF() -ENDIF() diff --git a/cmake/external/grpc.cmake b/cmake/external/grpc.cmake index fd9835d023c67b76579913f2ec56c2444fea8c15..d96da470b3cbbd8092dbf80ec5f500af9afa2ce4 100644 --- a/cmake/external/grpc.cmake +++ b/cmake/external/grpc.cmake @@ -13,7 +13,7 @@ # limitations under the License. # -IF(MOBILE_INFERENCE OR NOT WITH_DISTRIBUTE) +IF(NOT WITH_DISTRIBUTE) return() ENDIF() @@ -44,7 +44,7 @@ ExternalProject_Add( # 3. keep only zlib, cares, protobuf, boringssl under "third_party", # checkout and clean other dirs under third_party # 4. remove .git, and package the directory. - URL "http://paddlepaddledeps.cdn.bcebos.com/grpc-v1.10.x.tar.gz" + URL "http://paddlepaddledeps.bj.bcebos.com/grpc-v1.10.x.tar.gz" URL_MD5 "1f268a2aff6759839dccd256adcc91cf" PREFIX ${GRPC_SOURCES_DIR} UPDATE_COMMAND "" diff --git a/cmake/external/gtest.cmake b/cmake/external/gtest.cmake index d335298742c73bf1fe44e6a778ab3c142711063d..e459526583bd5ee3c89807657f3c30376e57d971 100644 --- a/cmake/external/gtest.cmake +++ b/cmake/external/gtest.cmake @@ -12,8 +12,12 @@ # See the License for the specific language governing permissions and # limitations under the License. -IF(WITH_TESTING) - ENABLE_TESTING() +#FIXME:(gongwb) Move brpc's gtest dependency. +IF(WITH_TESTING OR (WITH_DISTRIBUTE AND NOT WITH_GRPC)) + IF(WITH_TESTING) + ENABLE_TESTING() + ENDIF(WITH_TESTING) + INCLUDE(ExternalProject) SET(GTEST_SOURCES_DIR ${THIRD_PARTY_PATH}/gtest) @@ -50,7 +54,11 @@ IF(WITH_TESTING) CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} -DCMAKE_INSTALL_PREFIX=${GTEST_INSTALL_DIR} -DCMAKE_POSITION_INDEPENDENT_CODE=ON -DBUILD_GMOCK=ON @@ -71,5 +79,4 @@ IF(WITH_TESTING) SET_PROPERTY(TARGET gtest_main PROPERTY IMPORTED_LOCATION ${GTEST_MAIN_LIBRARIES}) ADD_DEPENDENCIES(gtest_main extern_gtest) - LIST(APPEND external_project_dependencies gtest gtest_main) -ENDIF(WITH_TESTING) +ENDIF(WITH_TESTING OR (WITH_DISTRIBUTE AND NOT WITH_GRPC)) diff --git a/cmake/external/gzstream.cmake b/cmake/external/gzstream.cmake new file mode 100644 index 0000000000000000000000000000000000000000..af7a8bfda6f7db12049203e7c9d54885884d8cf2 --- /dev/null +++ b/cmake/external/gzstream.cmake @@ -0,0 +1,44 @@ +# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +include (ExternalProject) + +# NOTE: gzstream is needed when linking with ctr reader. + +SET(GZSTREAM_SOURCES_DIR ${THIRD_PARTY_PATH}/gzstream) +SET(GZSTREAM_INSTALL_DIR ${THIRD_PARTY_PATH}/install/gzstream) +SET(GZSTREAM_INCLUDE_DIR "${GZSTREAM_INSTALL_DIR}/include/" CACHE PATH "gzstream include directory." FORCE) + +ExternalProject_Add( + extern_gzstream + DEPENDS zlib + GIT_REPOSITORY "https://github.com/jacquesqiao/gzstream.git" + GIT_TAG "" + PREFIX ${GZSTREAM_SOURCES_DIR} + UPDATE_COMMAND "" + CONFIGURE_COMMAND "" + BUILD_IN_SOURCE 1 + BUILD_COMMAND make EXTERN_CPPFLAGS="-I${THIRD_PARTY_PATH}/install/zlib/include" EXTERM_LDFLAGS="-L${THIRD_PARTY_PATH}/install/zlib/lib" -j8 + INSTALL_COMMAND mkdir -p ${GZSTREAM_INSTALL_DIR}/lib/ && mkdir -p ${GZSTREAM_INSTALL_DIR}/include/ + && cp ${GZSTREAM_SOURCES_DIR}/src/extern_gzstream/libgzstream.a ${GZSTREAM_INSTALL_DIR}/lib + && cp -r ${GZSTREAM_SOURCES_DIR}/src/extern_gzstream/gzstream.h ${GZSTREAM_INSTALL_DIR}/include +) + +ADD_LIBRARY(gzstream STATIC IMPORTED GLOBAL) +SET_PROPERTY(TARGET gzstream PROPERTY IMPORTED_LOCATION + "${GZSTREAM_INSTALL_DIR}/lib/libgzstream.a") + +include_directories(${GZSTREAM_INCLUDE_DIR}) +ADD_DEPENDENCIES(gzstream extern_gzstream zlib) diff --git a/cmake/external/leveldb.cmake b/cmake/external/leveldb.cmake index fb5091731da02b497a14f119e944905eee4979d5..ac0febd076e659927a6a882ff487c61ac130437a 100644 --- a/cmake/external/leveldb.cmake +++ b/cmake/external/leveldb.cmake @@ -24,8 +24,8 @@ ExternalProject_Add( extern_leveldb ${EXTERNAL_PROJECT_LOG_ARGS} PREFIX ${LEVELDB_SOURCES_DIR} - URL "https://github.com/google/leveldb/archive/v1.18.tar.gz" - URL_MD5 "73770de34a2a5ab34498d2e05b2b7fa0" + GIT_REPOSITORY "https://github.com/google/leveldb" + GIT_TAG v1.18 CONFIGURE_COMMAND "" BUILD_COMMAND CXXFLAGS=-fPIC make -j ${NUM_OF_PROCESSOR} libleveldb.a INSTALL_COMMAND mkdir -p ${LEVELDB_INSTALL_DIR}/lib/ @@ -39,6 +39,3 @@ ADD_DEPENDENCIES(extern_leveldb snappy) ADD_LIBRARY(leveldb STATIC IMPORTED GLOBAL) SET_PROPERTY(TARGET leveldb PROPERTY IMPORTED_LOCATION ${LEVELDB_LIBRARIES}) ADD_DEPENDENCIES(leveldb extern_leveldb) - -LIST(APPEND external_project_dependencies leveldb) - diff --git a/cmake/external/libmct.cmake b/cmake/external/libmct.cmake new file mode 100644 index 0000000000000000000000000000000000000000..b944f2945b7874ca76bf1a19e0a363f564851a62 --- /dev/null +++ b/cmake/external/libmct.cmake @@ -0,0 +1,75 @@ +# Copyright (c) 2017 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +IF(NOT ${WITH_LIBMCT}) + return() +ENDIF(NOT ${WITH_LIBMCT}) + +IF(WIN32 OR APPLE) + MESSAGE(WARNING + "Windows or Mac is not supported with LIBMCT in Paddle yet." + "Force WITH_LIBMCT=OFF") + SET(WITH_LIBMCT OFF CACHE STRING "Disable LIBMCT package in Windows and MacOS" FORCE) + return() +ENDIF() + +INCLUDE(ExternalProject) + +SET(LIBMCT_PROJECT "extern_libmct") +IF((NOT DEFINED LIBMCT_VER) OR (NOT DEFINED LIBMCT_URL)) + MESSAGE(STATUS "use pre defined download url") + SET(LIBMCT_VER "0.1.0" CACHE STRING "" FORCE) + SET(LIBMCT_NAME "libmct" CACHE STRING "" FORCE) + SET(LIBMCT_URL "https://raw.githubusercontent.com/PaddlePaddle/Fleet/release/${LIBMCT_VER}/${LIBMCT_NAME}.tar.gz" CACHE STRING "" FORCE) +ENDIF() +MESSAGE(STATUS "LIBMCT_NAME: ${LIBMCT_NAME}, LIBMCT_URL: ${LIBMCT_URL}") +SET(LIBMCT_SOURCE_DIR "${THIRD_PARTY_PATH}/libmct") +SET(LIBMCT_DOWNLOAD_DIR "${LIBMCT_SOURCE_DIR}/src/${LIBMCT_PROJECT}") +SET(LIBMCT_DST_DIR "libmct") +SET(LIBMCT_INSTALL_ROOT "${THIRD_PARTY_PATH}/install") +SET(LIBMCT_INSTALL_DIR ${LIBMCT_INSTALL_ROOT}/${LIBMCT_DST_DIR}) +SET(LIBMCT_ROOT ${LIBMCT_INSTALL_DIR}) +SET(LIBMCT_INC_DIR ${LIBMCT_ROOT}/include) +SET(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_RPATH}" "${LIBMCT_ROOT}/lib") + +INCLUDE_DIRECTORIES(${LIBMCT_INC_DIR}) + +FILE(WRITE ${LIBMCT_DOWNLOAD_DIR}/CMakeLists.txt + "PROJECT(LIBMCT)\n" + "cmake_minimum_required(VERSION 3.0)\n" + "install(DIRECTORY ${LIBMCT_NAME}/include ${LIBMCT_NAME}/lib \n" + " DESTINATION ${LIBMCT_DST_DIR})\n") + +ExternalProject_Add( + ${LIBMCT_PROJECT} + ${EXTERNAL_PROJECT_LOG_ARGS} + PREFIX ${LIBMCT_SOURCE_DIR} + DOWNLOAD_DIR ${LIBMCT_DOWNLOAD_DIR} + DOWNLOAD_COMMAND wget --no-check-certificate ${LIBMCT_URL} -c -q -O ${LIBMCT_NAME}.tar.gz + && tar zxvf ${LIBMCT_NAME}.tar.gz + DOWNLOAD_NO_PROGRESS 1 + UPDATE_COMMAND "" + CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=${LIBMCT_INSTALL_ROOT} + CMAKE_CACHE_ARGS -DCMAKE_INSTALL_PREFIX:PATH=${LIBMCT_INSTALL_ROOT} +) + +if (${CMAKE_VERSION} VERSION_LESS "3.3.0" OR NOT WIN32) + set(dummyfile ${CMAKE_CURRENT_BINARY_DIR}/boost_dummy.c) + file(WRITE ${dummyfile} "const char *dummy = \"${dummyfile}\";") + add_library(libmct STATIC ${dummyfile}) +else() + add_library(libmct INTERFACE) +endif() + +ADD_DEPENDENCIES(libmct ${LIBMCT_PROJECT}) diff --git a/cmake/external/libxsmm.cmake b/cmake/external/libxsmm.cmake index 530f7ebe2813fb2f00c6b5b4d1f7b2f04fe650b0..69cdba7c5921f14a87172d95791332e364045b26 100644 --- a/cmake/external/libxsmm.cmake +++ b/cmake/external/libxsmm.cmake @@ -19,8 +19,8 @@ IF(NOT WITH_LIBXSMM) return() ENDIF() -IF(WIN32 OR APPLE OR ANDROID OR IOS) - MESSAGE(WARNING "Windows, Mac or Mobile are not supported with libxsmm in Paddle yet.") +IF(WIN32 OR APPLE) + MESSAGE(WARNING "Windows, Mac are not supported with libxsmm in Paddle yet.") SET(WITH_LIBXSMM OFF CACHE STRING "Disable LIBXSMM" FORCE) return() ENDIF() @@ -53,5 +53,3 @@ MESSAGE(STATUS "Libxsmm library: ${LIBXSMM_LIBS}") include_directories(${LIBXSMM_INCLUDE_DIR}) ADD_DEFINITIONS(-DPADDLE_WITH_LIBXSMM) ADD_DEPENDENCIES(libxsmm extern_libxsmm) -LIST(APPEND external_project_dependencies libxsmm) - diff --git a/cmake/external/mkldnn.cmake b/cmake/external/mkldnn.cmake index baf253df2755657b01b67c410f63b7d8422d4df3..b1e437a9007072c82ab375bf5ed79fc7d6c80c47 100644 --- a/cmake/external/mkldnn.cmake +++ b/cmake/external/mkldnn.cmake @@ -23,21 +23,27 @@ SET(MKLDNN_SOURCES_DIR ${THIRD_PARTY_PATH}/mkldnn) SET(MKLDNN_INSTALL_DIR ${THIRD_PARTY_PATH}/install/mkldnn) SET(MKLDNN_INC_DIR "${MKLDNN_INSTALL_DIR}/include" CACHE PATH "mkldnn include directory." FORCE) -IF(WIN32 OR APPLE) +IF(APPLE) MESSAGE(WARNING - "Windows or Mac is not supported with MKLDNN in Paddle yet." + "Mac is not supported with MKLDNN in Paddle yet." "Force WITH_MKLDNN=OFF") - SET(WITH_MKLDNN OFF CACHE STRING "Disable MKLDNN in Windows and MacOS" FORCE) + SET(WITH_MKLDNN OFF CACHE STRING "Disable MKLDNN in MacOS" FORCE) return() ENDIF() -SET(MKLDNN_LIB "${MKLDNN_INSTALL_DIR}/lib/libmkldnn.so" CACHE FILEPATH "mkldnn library." FORCE) -MESSAGE(STATUS "Set ${MKLDNN_INSTALL_DIR}/lib to runtime path") +# Introduce variables: +# * CMAKE_INSTALL_LIBDIR +INCLUDE(GNUInstallDirs) +SET(LIBDIR "lib") +if(CMAKE_INSTALL_LIBDIR MATCHES ".*lib64$") + SET(LIBDIR "lib64") +endif() + +MESSAGE(STATUS "Set ${MKLDNN_INSTALL_DIR}/l${LIBDIR} to runtime path") SET(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE) -SET(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_RPATH}" "${MKLDNN_INSTALL_DIR}/lib") +SET(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_RPATH}" "${MKLDNN_INSTALL_DIR}/${LIBDIR}") INCLUDE_DIRECTORIES(${MKLDNN_INC_DIR}) # For MKLDNN code to include internal headers. -INCLUDE_DIRECTORIES(${THIRD_PARTY_PATH}/install) # For Paddle code to include mkldnn.h IF(${CBLAS_PROVIDER} STREQUAL "MKLML") SET(MKLDNN_DEPENDS ${MKLML_PROJECT}) @@ -45,22 +51,34 @@ IF(${CBLAS_PROVIDER} STREQUAL "MKLML") ELSE() MESSAGE(FATAL_ERROR "Should enable MKLML when build MKLDNN") ENDIF() -SET(MKLDNN_FLAG "-Wno-error=strict-overflow -Wno-error=unused-result") -SET(MKLDNN_FLAG "${MKLDNN_FLAG} -Wno-unused-result -Wno-unused-value") -SET(MKLDNN_CFLAG "${CMAKE_C_FLAGS} ${MKLDNN_FLAG}") -SET(MKLDNN_CXXFLAG "${CMAKE_CXX_FLAGS} ${MKLDNN_FLAG}") + +IF(NOT WIN32) + SET(MKLDNN_FLAG "-Wno-error=strict-overflow -Wno-error=unused-result -Wno-error=array-bounds") + SET(MKLDNN_FLAG "${MKLDNN_FLAG} -Wno-unused-result -Wno-unused-value") + SET(MKLDNN_CFLAG "${CMAKE_C_FLAGS} ${MKLDNN_FLAG}") + SET(MKLDNN_CXXFLAG "${CMAKE_CXX_FLAGS} ${MKLDNN_FLAG}") +ELSE() + SET(MKLDNN_CXXFLAG "${CMAKE_CXX_FLAGS} /EHsc") +ENDIF(NOT WIN32) + ExternalProject_Add( ${MKLDNN_PROJECT} ${EXTERNAL_PROJECT_LOG_ARGS} DEPENDS ${MKLDNN_DEPENDS} - GIT_REPOSITORY "https://github.com/01org/mkl-dnn.git" - GIT_TAG "64e03a1939e0d526aa8e9f2e3f7dc0ad8d372944" + GIT_REPOSITORY "https://github.com/intel/mkl-dnn.git" + GIT_TAG "863ff6e7042cec7d2e29897fe9f0872e0888b0fc" PREFIX ${MKLDNN_SOURCES_DIR} UPDATE_COMMAND "" CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} CMAKE_ARGS -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} + CMAKE_ARGS -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + CMAKE_ARGS -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} + CMAKE_ARGS -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + CMAKE_ARGS -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + CMAKE_ARGS -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=${MKLDNN_INSTALL_DIR} CMAKE_ARGS -DCMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} + CMAKE_ARGS -DCMAKE_POSITION_INDEPENDENT_CODE=ON CMAKE_ARGS -DMKLROOT=${MKLML_ROOT} CMAKE_ARGS -DCMAKE_C_FLAGS=${MKLDNN_CFLAG} CMAKE_ARGS -DCMAKE_CXX_FLAGS=${MKLDNN_CXXFLAG} @@ -68,13 +86,17 @@ ExternalProject_Add( CMAKE_CACHE_ARGS -DCMAKE_INSTALL_PREFIX:PATH=${MKLDNN_INSTALL_DIR} -DMKLROOT:PATH=${MKLML_ROOT} ) +if(WIN32) + SET(MKLDNN_LIB "${MKLDNN_INSTALL_DIR}/${LIBDIR}/mkldnn.lib" CACHE FILEPATH "mkldnn library." FORCE) +else(WIN32) + SET(MKLDNN_LIB "${MKLDNN_INSTALL_DIR}/${LIBDIR}/libmkldnn.so" CACHE FILEPATH "mkldnn library." FORCE) +endif(WIN32) ADD_LIBRARY(shared_mkldnn SHARED IMPORTED GLOBAL) SET_PROPERTY(TARGET shared_mkldnn PROPERTY IMPORTED_LOCATION ${MKLDNN_LIB}) ADD_DEPENDENCIES(shared_mkldnn ${MKLDNN_PROJECT}) MESSAGE(STATUS "MKLDNN library: ${MKLDNN_LIB}") add_definitions(-DPADDLE_WITH_MKLDNN) -LIST(APPEND external_project_dependencies shared_mkldnn) # generate a static dummy target to track mkldnn dependencies # for cc_library(xxx SRCS xxx.c DEPS mkldnn) @@ -86,13 +108,13 @@ ADD_DEPENDENCIES(mkldnn ${MKLDNN_PROJECT}) # copy the real so.0 lib to install dir # it can be directly contained in wheel or capi -SET(MKLDNN_SHARED_LIB ${MKLDNN_INSTALL_DIR}/libmkldnn.so.0) -ADD_CUSTOM_COMMAND(OUTPUT ${MKLDNN_SHARED_LIB} - COMMAND cp ${MKLDNN_LIB} ${MKLDNN_SHARED_LIB} - DEPENDS mkldnn) +if(WIN32) + SET(MKLDNN_SHARED_LIB ${MKLDNN_INSTALL_DIR}/bin/mkldnn.dll) +else(WIN32) + SET(MKLDNN_SHARED_LIB ${MKLDNN_INSTALL_DIR}/libmkldnn.so.0) + ADD_CUSTOM_COMMAND(OUTPUT ${MKLDNN_SHARED_LIB} + COMMAND ${CMAKE_COMMAND} -E copy ${MKLDNN_LIB} ${MKLDNN_SHARED_LIB} + DEPENDS mkldnn shared_mkldnn) +endif(WIN32) ADD_CUSTOM_TARGET(mkldnn_shared_lib ALL DEPENDS ${MKLDNN_SHARED_LIB}) - -IF(WITH_C_API) - INSTALL(FILES ${MKLDNN_SHARED_LIB} DESTINATION lib) -ENDIF() - +ADD_DEPENDENCIES(mkldnn_shared_lib ${MKLDNN_PROJECT} mkldnn) diff --git a/cmake/external/mklml.cmake b/cmake/external/mklml.cmake index dc5427acd45f5da90317e7a3dc25f5453e2a7a00..142fce816de4f06aa0a36b91e3e4ecb962a8dc2a 100644 --- a/cmake/external/mklml.cmake +++ b/cmake/external/mklml.cmake @@ -16,61 +16,62 @@ IF(NOT ${WITH_MKLML}) return() ENDIF(NOT ${WITH_MKLML}) -IF(WIN32 OR APPLE) - MESSAGE(WARNING - "Windows or Mac is not supported with MKLML in Paddle yet." - "Force WITH_MKLML=OFF") - SET(WITH_MKLML OFF CACHE STRING "Disable MKLML package in Windows and MacOS" FORCE) +IF(APPLE) + MESSAGE(WARNING "Mac is not supported with MKLML in Paddle yet. Force WITH_MKLML=OFF.") + SET(WITH_MKLML OFF CACHE STRING "Disable MKLML package in MacOS" FORCE) return() ENDIF() INCLUDE(ExternalProject) - -SET(MKLML_PROJECT "extern_mklml") -IF((NOT DEFINED MKLML_VER) OR (NOT DEFINED MKLML_URL)) - MESSAGE(STATUS "use pre defined download url") - SET(MKLML_VER "mklml_lnx_2019.0.20180710" CACHE STRING "" FORCE) - SET(MKLML_URL "http://paddlepaddledeps.cdn.bcebos.com/${MKLML_VER}.tgz" CACHE STRING "" FORCE) -ENDIF() -MESSAGE(STATUS "MKLML_VER: ${MKLML_VER}, MKLML_URL: ${MKLML_URL}") -SET(MKLML_SOURCE_DIR "${THIRD_PARTY_PATH}/mklml") -SET(MKLML_DOWNLOAD_DIR "${MKLML_SOURCE_DIR}/src/${MKLML_PROJECT}") SET(MKLML_DST_DIR "mklml") SET(MKLML_INSTALL_ROOT "${THIRD_PARTY_PATH}/install") SET(MKLML_INSTALL_DIR ${MKLML_INSTALL_ROOT}/${MKLML_DST_DIR}) SET(MKLML_ROOT ${MKLML_INSTALL_DIR}) SET(MKLML_INC_DIR ${MKLML_ROOT}/include) SET(MKLML_LIB_DIR ${MKLML_ROOT}/lib) -SET(MKLML_LIB ${MKLML_LIB_DIR}/libmklml_intel.so) -SET(MKLML_IOMP_LIB ${MKLML_LIB_DIR}/libiomp5.so) SET(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_RPATH}" "${MKLML_ROOT}/lib") -INCLUDE_DIRECTORIES(${MKLML_INC_DIR}) +SET(TIME_VERSION "2019.0.1.20181227") +IF(WIN32) + SET(MKLML_VER "mklml_win_${TIME_VERSION}" CACHE STRING "" FORCE) + SET(MKLML_URL "https://paddlepaddledeps.bj.bcebos.com/${MKLML_VER}.zip" CACHE STRING "" FORCE) + SET(MKLML_LIB ${MKLML_LIB_DIR}/mklml.lib) + SET(MKLML_IOMP_LIB ${MKLML_LIB_DIR}/libiomp5md.lib) + SET(MKLML_SHARED_LIB ${MKLML_LIB_DIR}/mklml.dll) + SET(MKLML_SHARED_IOMP_LIB ${MKLML_LIB_DIR}/libiomp5md.dll) +ELSE() + #TODO(intel-huying): + # Now enable Erf function in mklml library temporarily, it will be updated as offical version later. + SET(MKLML_VER "Glibc225_vsErf_mklml_lnx_${TIME_VERSION}" CACHE STRING "" FORCE) + SET(MKLML_URL "http://paddlepaddledeps.bj.bcebos.com/${MKLML_VER}.tgz" CACHE STRING "" FORCE) + SET(MKLML_LIB ${MKLML_LIB_DIR}/libmklml_intel.so) + SET(MKLML_IOMP_LIB ${MKLML_LIB_DIR}/libiomp5.so) + SET(MKLML_SHARED_LIB ${MKLML_LIB_DIR}/libmklml_intel.so) + SET(MKLML_SHARED_IOMP_LIB ${MKLML_LIB_DIR}/libiomp5.so) +ENDIF() -FILE(WRITE ${MKLML_DOWNLOAD_DIR}/CMakeLists.txt - "PROJECT(MKLML)\n" - "cmake_minimum_required(VERSION 3.0)\n" - "install(DIRECTORY ${MKLML_VER}/include ${MKLML_VER}/lib \n" - " DESTINATION ${MKLML_DST_DIR})\n") +SET(MKLML_PROJECT "extern_mklml") +MESSAGE(STATUS "MKLML_VER: ${MKLML_VER}, MKLML_URL: ${MKLML_URL}") +SET(MKLML_SOURCE_DIR "${THIRD_PARTY_PATH}/mklml") +SET(MKLML_DOWNLOAD_DIR "${MKLML_SOURCE_DIR}/src/${MKLML_PROJECT}") ExternalProject_Add( ${MKLML_PROJECT} ${EXTERNAL_PROJECT_LOG_ARGS} - PREFIX ${MKLML_SOURCE_DIR} + PREFIX ${MKLML_SOURCE_DIR} + URL ${MKLML_URL} DOWNLOAD_DIR ${MKLML_DOWNLOAD_DIR} - DOWNLOAD_COMMAND wget --no-check-certificate ${MKLML_URL} -c -q -O ${MKLML_VER}.tgz - && tar zxf ${MKLML_VER}.tgz DOWNLOAD_NO_PROGRESS 1 - UPDATE_COMMAND "" - CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=${MKLML_INSTALL_ROOT} - CMAKE_CACHE_ARGS -DCMAKE_INSTALL_PREFIX:PATH=${MKLML_INSTALL_ROOT} + CONFIGURE_COMMAND "" + BUILD_COMMAND "" + UPDATE_COMMAND "" + INSTALL_COMMAND + ${CMAKE_COMMAND} -E copy_directory ${MKLML_DOWNLOAD_DIR}/include ${MKLML_INC_DIR} && + ${CMAKE_COMMAND} -E copy_directory ${MKLML_DOWNLOAD_DIR}/lib ${MKLML_LIB_DIR} ) +INCLUDE_DIRECTORIES(${MKLML_INC_DIR}) + ADD_LIBRARY(mklml SHARED IMPORTED GLOBAL) SET_PROPERTY(TARGET mklml PROPERTY IMPORTED_LOCATION ${MKLML_LIB}) ADD_DEPENDENCIES(mklml ${MKLML_PROJECT}) -LIST(APPEND external_project_dependencies mklml) - -IF(WITH_C_API) - INSTALL(FILES ${MKLML_LIB} ${MKLML_IOMP_LIB} DESTINATION lib) -ENDIF() diff --git a/cmake/external/ngraph.cmake b/cmake/external/ngraph.cmake new file mode 100644 index 0000000000000000000000000000000000000000..23998b497e7a796b5487a287163f98a28e8d63d7 --- /dev/null +++ b/cmake/external/ngraph.cmake @@ -0,0 +1,84 @@ +# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +add_library(ngraph INTERFACE) + +IF(WIN32 OR APPLE) + MESSAGE(WARNING + "Windows or Mac is not supported with nGraph in Paddle yet." + "Force WITH_NGRAPH=OFF") + SET(WITH_NGRAPH OFF CACHE STRING "Disable nGraph in Windows and MacOS" FORCE) +ENDIF() + +IF(${WITH_NGRAPH} AND NOT ${WITH_MKLDNN}) + MESSAGE(WARNING + "nGraph needs mkl-dnn to be enabled." + "Force WITH_NGRAPH=OFF") + SET(WITH_NGRAPH OFF CACHE STRING "Disable nGraph if mkl-dnn is disabled" FORCE) +ENDIF() + +IF(NOT ${WITH_NGRAPH}) + return() +ENDIF() + +INCLUDE(GNUInstallDirs) + +INCLUDE(ExternalProject) + +SET(NGRAPH_PROJECT "extern_ngraph") +SET(NGRAPH_GIT_TAG "a444f7a959b7d87f2c117c9b57a4c387759e481e") +SET(NGRAPH_SOURCES_DIR ${THIRD_PARTY_PATH}/ngraph) +SET(NGRAPH_INSTALL_DIR ${THIRD_PARTY_PATH}/install/ngraph) +SET(NGRAPH_INC_DIR ${NGRAPH_INSTALL_DIR}/include) +SET(NGRAPH_LIB_DIR ${NGRAPH_INSTALL_DIR}/${CMAKE_INSTALL_LIBDIR}) +SET(NGRAPH_SHARED_LIB_NAME libngraph.so) +SET(NGRAPH_CPU_LIB_NAME libcpu_backend.so) +if(CMAKE_BUILD_TYPE STREQUAL "Debug") + SET(NGRAPH_TBB_LIB_NAME libtbb_debug.so.2) +else() + SET(NGRAPH_TBB_LIB_NAME libtbb.so.2) +endif() +SET(NGRAPH_GIT_REPO "https://github.com/NervanaSystems/ngraph.git") +SET(NGRAPH_SHARED_LIB ${NGRAPH_LIB_DIR}/${NGRAPH_SHARED_LIB_NAME}) +SET(NGRAPH_CPU_LIB ${NGRAPH_LIB_DIR}/${NGRAPH_CPU_LIB_NAME}) +SET(NGRAPH_TBB_LIB ${NGRAPH_LIB_DIR}/${NGRAPH_TBB_LIB_NAME}) + +ExternalProject_Add( + ${NGRAPH_PROJECT} + ${EXTERNAL_PROJECT_LOG_ARGS} + DEPENDS ${MKLDNN_PROJECT} ${MKLML_PROJECT} + GIT_REPOSITORY ${NGRAPH_GIT_REPO} + GIT_TAG ${NGRAPH_GIT_TAG} + PREFIX ${NGRAPH_SOURCES_DIR} + UPDATE_COMMAND "" + CMAKE_GENERATOR ${CMAKE_GENERATOR} + CMAKE_GENERATOR_PLATFORM ${CMAKE_GENERATOR_PLATFORM} + CMAKE_GENERATOR_TOOLSET ${CMAKE_GENERATOR_TOOLSET} + CMAKE_ARGS -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} + CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} + CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=${NGRAPH_INSTALL_DIR} + CMAKE_ARGS -DNGRAPH_UNIT_TEST_ENABLE=FALSE + CMAKE_ARGS -DNGRAPH_TOOLS_ENABLE=FALSE + CMAKE_ARGS -DNGRAPH_INTERPRETER_ENABLE=FALSE + CMAKE_ARGS -DNGRAPH_DEX_ONLY=TRUE + CMAKE_ARGS -DCMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} + CMAKE_ARGS -DMKLDNN_INCLUDE_DIR=${MKLDNN_INC_DIR} + CMAKE_ARGS -DMKLDNN_LIB_DIR=${MKLDNN_INSTALL_DIR}/${CMAKE_INSTALL_LIBDIR} + CMAKE_ARGS -DMKLML_LIB_DIR=${MKLML_INSTALL_DIR}/lib +) + +add_dependencies(ngraph ${NGRAPH_PROJECT}) +target_compile_definitions(ngraph INTERFACE -DPADDLE_WITH_NGRAPH) +target_include_directories(ngraph INTERFACE ${NGRAPH_INC_DIR}) +target_link_libraries(ngraph INTERFACE ${NGRAPH_SHARED_LIB}) diff --git a/cmake/external/nnpack.cmake b/cmake/external/nnpack.cmake deleted file mode 100644 index d42bcb0f329041462bd8b568052fbb8226d18e4e..0000000000000000000000000000000000000000 --- a/cmake/external/nnpack.cmake +++ /dev/null @@ -1,30 +0,0 @@ -# Find the NNPACK library -# NNPACK_ROOT - where to find NNPACK include and library. -# - -set(NNPACK_FOUND OFF) -set(NNPACK_ROOT $ENV{NNPACK_ROOT} CACHE PATH "Folder contains NNPACK") -find_path(NNPACK_INC_DIR nnpack.h PATHS ${NNPACK_ROOT}/include) -find_library(NNPACK_LIB NAMES nnpack PATHS ${NNPACK_ROOT}/lib) -find_library(PTHREADPOOL_LIB NAMES pthreadpool PATHS ${NNPACK_ROOT}/lib) -find_library(NNPACK_UKERNELS_LIB NAMES nnpack_ukernels PATHS ${NNPACK_ROOT}/lib) -find_library(NNPACK_CPUFEATURES_LIB NAMES cpufeatures PATHS ${NNPACK_ROOT}/lib) - -if(NNPACK_INC_DIR AND NNPACK_LIB AND PTHREADPOOL_LIB) - set(NNPACK_FOUND ON) - INCLUDE_DIRECTORIES(${NNPACK_INC_DIR}) - - set(NNPACK_LIBS) - list(APPEND NNPACK_LIBS ${NNPACK_LIB} ${PTHREADPOOL_LIB}) - if (NNPACK_UKERNELS_LIB) - list(APPEND NNPACK_LIBS ${NNPACK_UKERNELS_LIB}) - endif() - if (NNPACK_CPUFEATURES_LIB) - list(APPEND NNPACK_LIBS ${NNPACK_CPUFEATURES_LIB}) - endif() - if(NOT ANDROID) - list(APPEND NNPACK_LIBS "rt") - endif() -else() - message(FATAL_ERROR "Cannot find NNPACK in (${NNPACK_ROOT})") -endif() diff --git a/cmake/external/openblas.cmake b/cmake/external/openblas.cmake index 755dbd610c40c2d9b85d3017b6f000a869b0f39a..d8a4a0be6f5aaa3a1a4977bbc68348743f2fa742 100644 --- a/cmake/external/openblas.cmake +++ b/cmake/external/openblas.cmake @@ -11,18 +11,9 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. - -IF(USE_EIGEN_FOR_BLAS) - return() -ENDIF(USE_EIGEN_FOR_BLAS) - INCLUDE(cblas) -# IF(WIN32 AND NOT ${CBLAS_FOUND}) - - IF(NOT ${CBLAS_FOUND}) - INCLUDE(ExternalProject) SET(CBLAS_SOURCES_DIR ${THIRD_PARTY_PATH}/openblas) @@ -34,6 +25,7 @@ IF(NOT ${CBLAS_FOUND}) CACHE FILEPATH "openblas library." FORCE) ADD_DEFINITIONS(-DPADDLE_USE_OPENBLAS) + IF (WIN32) SET(CBLAS_FOUND true) MESSAGE(WARNING, "In windows, openblas only support msvc build, please build it manually and put it at " ${CBLAS_INSTALL_DIR}) @@ -43,38 +35,12 @@ IF(NOT ${CBLAS_FOUND}) SET(OPENBLAS_CC "${CMAKE_C_COMPILER} -Wno-unused-but-set-variable -Wno-unused-variable") SET(OPENBLAS_COMMIT "v0.2.20") - IF(CMAKE_CROSSCOMPILING) - SET(OPTIONAL_ARGS HOSTCC=${HOST_C_COMPILER}) - GET_FILENAME_COMPONENT(CROSS_SUFFIX ${CMAKE_C_COMPILER} DIRECTORY) - SET(CROSS_SUFFIX ${CROSS_SUFFIX}/) - IF(ANDROID) - IF(ANDROID_ABI MATCHES "^armeabi(-v7a)?$") - # use softfp - SET(OPTIONAL_ARGS ${OPTIONAL_ARGS} TARGET=ARMV7 ARM_SOFTFP_ABI=1 USE_THREAD=0) - ELSEIF(ANDROID_ABI STREQUAL "arm64-v8a") - SET(OPTIONAL_ARGS ${OPTIONAL_ARGS} TARGET=ARMV8 BINARY=64 USE_THREAD=0) - ENDIF() - ELSEIF(IOS) - IF(CMAKE_OSX_ARCHITECTURES MATCHES "arm64") - SET(OPENBLAS_CC "${OPENBLAS_CC} ${CMAKE_C_FLAGS} -isysroot ${CMAKE_OSX_SYSROOT}") - SET(OPENBLAS_CC "${OPENBLAS_CC} -arch arm64") - SET(OPTIONAL_ARGS ${OPTIONAL_ARGS} TARGET=ARMV8 BINARY=64 USE_THREAD=0 CROSS_SUFFIX=${CROSS_SUFFIX}) - ELSE() - MESSAGE(FATAL_ERROR "OpenBLAS only support arm64 architectures on iOS. " - "You can set IOS_USE_VECLIB_FOR_BLAS=ON or USE_EIGEN_FOR_BLAS=ON to use other blas library instead.") - ENDIF() - ELSEIF(RPI) - # use hardfp - SET(OPTIONAL_ARGS ${OPTIONAL_ARGS} TARGET=ARMV7 USE_THREAD=0) - ENDIF() - ELSE() - IF(APPLE) - SET(OPENBLAS_CC "${CMAKE_C_COMPILER} -isysroot ${CMAKE_OSX_SYSROOT}") - ENDIF() - SET(OPTIONAL_ARGS "") - IF(CMAKE_SYSTEM_PROCESSOR MATCHES "^x86(_64)?$") - SET(OPTIONAL_ARGS DYNAMIC_ARCH=1 NUM_THREADS=64) - ENDIF() + IF(APPLE) + SET(OPENBLAS_CC "${CMAKE_C_COMPILER} -isysroot ${CMAKE_OSX_SYSROOT}") + ENDIF() + SET(OPTIONAL_ARGS "") + IF(CMAKE_SYSTEM_PROCESSOR MATCHES "^x86(_64)?$") + SET(OPTIONAL_ARGS DYNAMIC_ARCH=1 NUM_THREADS=64) ENDIF() SET(COMMON_ARGS CC=${OPENBLAS_CC} NO_SHARED=1 NO_LAPACK=1 libs) @@ -95,25 +61,6 @@ IF(NOT ${CBLAS_FOUND}) ELSE() ENDIF(NOT WIN32) SET(CBLAS_PROVIDER openblas) - IF(WITH_C_API) - INSTALL(DIRECTORY ${CBLAS_INC_DIR} DESTINATION third_party/openblas) - # Because libopenblas.a is a symbolic link of another library, thus need to - # install the whole directory. - IF(ANDROID) - SET(TMP_INSTALL_DIR third_party/openblas/lib/${ANDROID_ABI}) - ELSE() - SET(TMP_INSTALL_DIR third_party/openblas/lib) - ENDIF() - INSTALL(CODE "execute_process( - COMMAND ${CMAKE_COMMAND} -E copy_directory ${CBLAS_INSTALL_DIR}/lib - ${CMAKE_INSTALL_PREFIX}/${TMP_INSTALL_DIR} - )" - ) - INSTALL(CODE "MESSAGE(STATUS \"Installing: \" - \"${CBLAS_INSTALL_DIR}/lib -> ${CMAKE_INSTALL_PREFIX}/${TMP_INSTALL_DIR}\" - )" - ) - ENDIF() ENDIF(NOT ${CBLAS_FOUND}) MESSAGE(STATUS "BLAS library: ${CBLAS_LIBRARIES}") @@ -139,7 +86,6 @@ ENDIF() IF(NOT ${CBLAS_FOUND}) ADD_DEPENDENCIES(cblas extern_openblas) - LIST(APPEND external_project_dependencies cblas) ELSE() IF("${CBLAS_PROVIDER}" STREQUAL "MKLML") ADD_DEPENDENCIES(cblas mklml) diff --git a/cmake/external/protobuf.cmake b/cmake/external/protobuf.cmake index 550b0dada8e90c1e2b33705fd53c065672113b45..09eb437aede4364f8aa285d5296f21cd8460fca1 100644 --- a/cmake/external/protobuf.cmake +++ b/cmake/external/protobuf.cmake @@ -30,66 +30,61 @@ UNSET_VAR(PROTOBUF_LITE_LIBRARY) UNSET_VAR(PROTOBUF_LIBRARY) UNSET_VAR(PROTOBUF_INCLUDE_DIR) UNSET_VAR(Protobuf_PROTOC_EXECUTABLE) +function(protobuf_generate_python SRCS) + # shameless copy from https://github.com/Kitware/CMake/blob/master/Modules/FindProtobuf.cmake + if(NOT ARGN) + message(SEND_ERROR "Error: PROTOBUF_GENERATE_PYTHON() called without any proto files") + return() + endif() -if(NOT COMMAND protobuf_generate_python) # before cmake 3.4, protobuf_genrerate_python is not defined. - function(protobuf_generate_python SRCS) - # shameless copy from https://github.com/Kitware/CMake/blob/master/Modules/FindProtobuf.cmake - if(NOT ARGN) - message(SEND_ERROR "Error: PROTOBUF_GENERATE_PYTHON() called without any proto files") - return() - endif() - - if(PROTOBUF_GENERATE_CPP_APPEND_PATH) - # Create an include path for each file specified - foreach(FIL ${ARGN}) - get_filename_component(ABS_FIL ${FIL} ABSOLUTE) - get_filename_component(ABS_PATH ${ABS_FIL} PATH) - list(FIND _protobuf_include_path ${ABS_PATH} _contains_already) - if(${_contains_already} EQUAL -1) - list(APPEND _protobuf_include_path -I ${ABS_PATH}) - endif() - endforeach() - else() - set(_protobuf_include_path -I ${CMAKE_CURRENT_SOURCE_DIR}) - endif() - - if(DEFINED PROTOBUF_IMPORT_DIRS AND NOT DEFINED Protobuf_IMPORT_DIRS) - set(Protobuf_IMPORT_DIRS "${PROTOBUF_IMPORT_DIRS}") - endif() - - if(DEFINED Protobuf_IMPORT_DIRS) - foreach(DIR ${Protobuf_IMPORT_DIRS}) - get_filename_component(ABS_PATH ${DIR} ABSOLUTE) - list(FIND _protobuf_include_path ${ABS_PATH} _contains_already) - if(${_contains_already} EQUAL -1) - list(APPEND _protobuf_include_path -I ${ABS_PATH}) - endif() - endforeach() - endif() - - set(${SRCS}) + if(PROTOBUF_GENERATE_CPP_APPEND_PATH) + # Create an include path for each file specified foreach(FIL ${ARGN}) get_filename_component(ABS_FIL ${FIL} ABSOLUTE) - get_filename_component(FIL_WE ${FIL} NAME_WE) - if(NOT PROTOBUF_GENERATE_CPP_APPEND_PATH) - get_filename_component(FIL_DIR ${FIL} DIRECTORY) - if(FIL_DIR) - set(FIL_WE "${FIL_DIR}/${FIL_WE}") - endif() + get_filename_component(ABS_PATH ${ABS_FIL} PATH) + list(FIND _protobuf_include_path ${ABS_PATH} _contains_already) + if(${_contains_already} EQUAL -1) + list(APPEND _protobuf_include_path -I ${ABS_PATH}) endif() + endforeach() + else() + set(_protobuf_include_path -I ${CMAKE_CURRENT_SOURCE_DIR}) + endif() + if(DEFINED PROTOBUF_IMPORT_DIRS AND NOT DEFINED Protobuf_IMPORT_DIRS) + set(Protobuf_IMPORT_DIRS "${PROTOBUF_IMPORT_DIRS}") + endif() - list(APPEND ${SRCS} "${CMAKE_CURRENT_BINARY_DIR}/${FIL_WE}_pb2.py") - add_custom_command( - OUTPUT "${CMAKE_CURRENT_BINARY_DIR}/${FIL_WE}_pb2.py" - COMMAND ${Protobuf_PROTOC_EXECUTABLE} --python_out ${CMAKE_CURRENT_BINARY_DIR} ${_protobuf_include_path} ${ABS_FIL} - DEPENDS ${ABS_FIL} ${Protobuf_PROTOC_EXECUTABLE} - COMMENT "Running Python protocol buffer compiler on ${FIL}" - VERBATIM ) + if(DEFINED Protobuf_IMPORT_DIRS) + foreach(DIR ${Protobuf_IMPORT_DIRS}) + get_filename_component(ABS_PATH ${DIR} ABSOLUTE) + list(FIND _protobuf_include_path ${ABS_PATH} _contains_already) + if(${_contains_already} EQUAL -1) + list(APPEND _protobuf_include_path -I ${ABS_PATH}) + endif() endforeach() + endif() - set(${SRCS} ${${SRCS}} PARENT_SCOPE) - endfunction() -endif() + set(${SRCS}) + foreach(FIL ${ARGN}) + get_filename_component(ABS_FIL ${FIL} ABSOLUTE) + get_filename_component(FIL_WE ${FIL} NAME_WE) + if(NOT PROTOBUF_GENERATE_CPP_APPEND_PATH) + get_filename_component(FIL_DIR ${FIL} DIRECTORY) + if(FIL_DIR) + set(FIL_WE "${FIL_DIR}/${FIL_WE}") + endif() + endif() + list(APPEND ${SRCS} "${CMAKE_CURRENT_BINARY_DIR}/${FIL_WE}_pb2.py") + add_custom_command( + OUTPUT "${CMAKE_CURRENT_BINARY_DIR}/${FIL_WE}_pb2.py" + COMMAND ${PROTOBUF_PROTOC_EXECUTABLE} --python_out ${CMAKE_CURRENT_BINARY_DIR} ${_protobuf_include_path} ${ABS_FIL} + DEPENDS ${ABS_FIL} ${PROTOBUF_PROTOC_EXECUTABLE} + COMMENT "Running Python protocol buffer compiler on ${FIL}" + VERBATIM ) + endforeach() + + set(${SRCS} ${${SRCS}} PARENT_SCOPE) +endfunction() # Print and set the protobuf library information, # finish this cmake process and exit from this file. @@ -126,6 +121,7 @@ macro(PROMPT_PROTOBUF_LIB) # FIND_Protobuf.cmake uses `Protobuf_PROTOC_EXECUTABLE`. # make `protobuf_generate_cpp` happy. SET(Protobuf_PROTOC_EXECUTABLE ${PROTOBUF_PROTOC_EXECUTABLE}) + FOREACH(dep ${protobuf_DEPS}) ADD_DEPENDENCIES(protobuf ${dep}) ADD_DEPENDENCIES(protobuf_lite ${dep}) @@ -133,7 +129,6 @@ macro(PROMPT_PROTOBUF_LIB) ADD_DEPENDENCIES(protoc ${dep}) ENDFOREACH() - LIST(APPEND external_project_dependencies protobuf) RETURN() endmacro() macro(SET_PROTOBUF_VERSION) @@ -144,7 +139,6 @@ endmacro() set(PROTOBUF_ROOT "" CACHE PATH "Folder contains protobuf") IF (WIN32) SET(PROTOBUF_ROOT ${THIRD_PARTY_PATH}/install/protobuf) - MESSAGE(WARNING, "In windows, protobuf only support msvc build, please build it manually and put it at " ${PROTOBUF_ROOT}) ENDIF(WIN32) if (NOT "${PROTOBUF_ROOT}" STREQUAL "") @@ -192,25 +186,23 @@ FUNCTION(build_protobuf TARGET_NAME BUILD_FOR_HOST) SET(OPTIONAL_ARGS "-DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER}" "-DCMAKE_C_COMPILER=${CMAKE_C_COMPILER}" - "-DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS}" "-DCMAKE_C_FLAGS=${CMAKE_C_FLAGS}" + "-DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG}" + "-DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE}" + "-DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS}" + "-DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE}" + "-DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG}" "-Dprotobuf_WITH_ZLIB=ON" "-DZLIB_ROOT:FILEPATH=${ZLIB_ROOT}" ${EXTERNAL_OPTIONAL_ARGS}) SET(OPTIONAL_CACHE_ARGS "-DZLIB_ROOT:STRING=${ZLIB_ROOT}") ENDIF() + IF(WIN32) + SET(OPTIONAL_ARGS ${OPTIONAL_ARGS} "-DCMAKE_GENERATOR_PLATFORM=x64") + ENDIF() - SET(PROTOBUF_REPO "https://github.com/google/protobuf.git") + SET(PROTOBUF_REPO "https://github.com/protocolbuffers/protobuf.git") SET(PROTOBUF_TAG "9f75c5aa851cd877fb0d93ccc31b8567a6706546") - IF(MOBILE_INFERENCE) - # The reason why the official version is not used is described in - # https://github.com/PaddlePaddle/Paddle/issues/6114 - SET(PROTOBUF_REPO "https://github.com/qingqing01/protobuf.git") - SET(PROTOBUF_TAG "v3.2.0") - IF(NOT BUILD_FOR_HOST) - SET(OPTIONAL_ARGS ${OPTIONAL_ARGS} "-Dprotobuf_BUILD_PROTOC_BINARIES=OFF") - ENDIF() - ENDIF() ExternalProject_Add( ${TARGET_NAME} @@ -229,6 +221,7 @@ FUNCTION(build_protobuf TARGET_NAME BUILD_FOR_HOST) -DCMAKE_BUILD_TYPE=${THIRD_PARTY_BUILD_TYPE} -DCMAKE_INSTALL_PREFIX=${PROTOBUF_INSTALL_DIR} -DCMAKE_INSTALL_LIBDIR=lib + -DBUILD_SHARED_LIBS=OFF CMAKE_CACHE_ARGS -DCMAKE_INSTALL_PREFIX:PATH=${PROTOBUF_INSTALL_DIR} -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} @@ -238,19 +231,7 @@ FUNCTION(build_protobuf TARGET_NAME BUILD_FOR_HOST) ) ENDFUNCTION() -IF(NOT MOBILE_INFERENCE) - SET(PROTOBUF_VERSION 3.1) -ELSE() - SET(PROTOBUF_VERSION 3.2) -ENDIF() -IF(CMAKE_CROSSCOMPILING) - build_protobuf(protobuf_host TRUE) - LIST(APPEND external_project_dependencies protobuf_host) - - SET(PROTOBUF_PROTOC_EXECUTABLE ${protobuf_host_PROTOC_EXECUTABLE} - CACHE FILEPATH "protobuf executable." FORCE) -ENDIF() - +SET(PROTOBUF_VERSION 3.1.0) IF(NOT PROTOBUF_FOUND) build_protobuf(extern_protobuf FALSE) @@ -264,20 +245,7 @@ IF(NOT PROTOBUF_FOUND) SET(PROTOBUF_PROTOC_LIBRARY ${extern_protobuf_PROTOC_LIBRARY} CACHE FILEPATH "protoc library." FORCE) - IF(WITH_C_API) - INSTALL(DIRECTORY ${PROTOBUF_INCLUDE_DIR} DESTINATION third_party/protobuf) - IF(ANDROID) - INSTALL(FILES ${PROTOBUF_LITE_LIBRARY} DESTINATION third_party/protobuf/lib/${ANDROID_ABI}) - ELSE() - INSTALL(FILES ${PROTOBUF_LITE_LIBRARY} DESTINATION third_party/protobuf/lib) - ENDIF() - ENDIF() - - IF(CMAKE_CROSSCOMPILING) - PROMPT_PROTOBUF_LIB(protobuf_host extern_protobuf) - ELSE() - SET(PROTOBUF_PROTOC_EXECUTABLE ${extern_protobuf_PROTOC_EXECUTABLE} - CACHE FILEPATH "protobuf executable." FORCE) - PROMPT_PROTOBUF_LIB(extern_protobuf) - ENDIF() + SET(PROTOBUF_PROTOC_EXECUTABLE ${extern_protobuf_PROTOC_EXECUTABLE} + CACHE FILEPATH "protobuf executable." FORCE) + PROMPT_PROTOBUF_LIB(extern_protobuf) ENDIF(NOT PROTOBUF_FOUND) diff --git a/cmake/external/pslib.cmake b/cmake/external/pslib.cmake new file mode 100644 index 0000000000000000000000000000000000000000..0287e5cf2a835ed65c5fc26ff69d2653d5db217e --- /dev/null +++ b/cmake/external/pslib.cmake @@ -0,0 +1,72 @@ +# Copyright (c) 2017 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +IF(NOT ${WITH_PSLIB}) + return() +ENDIF(NOT ${WITH_PSLIB}) + +IF(WIN32 OR APPLE) + MESSAGE(WARNING + "Windows or Mac is not supported with PSLIB in Paddle yet." + "Force WITH_PSLIB=OFF") + SET(WITH_PSLIB OFF CACHE STRING "Disable PSLIB package in Windows and MacOS" FORCE) + return() +ENDIF() + +INCLUDE(ExternalProject) + +SET(PSLIB_PROJECT "extern_pslib") +IF((NOT DEFINED PSLIB_VER) OR (NOT DEFINED PSLIB_URL)) + MESSAGE(STATUS "use pre defined download url") + SET(PSLIB_VER "0.1.0" CACHE STRING "" FORCE) + SET(PSLIB_NAME "pslib" CACHE STRING "" FORCE) + SET(PSLIB_URL "https://raw.githubusercontent.com/PaddlePaddle/Fleet/release/${PSLIB_VER}/${PSLIB_NAME}.tar.gz" CACHE STRING "" FORCE) +ENDIF() +MESSAGE(STATUS "PSLIB_NAME: ${PSLIB_NAME}, PSLIB_URL: ${PSLIB_URL}") +SET(PSLIB_SOURCE_DIR "${THIRD_PARTY_PATH}/pslib") +SET(PSLIB_DOWNLOAD_DIR "${PSLIB_SOURCE_DIR}/src/${PSLIB_PROJECT}") +SET(PSLIB_DST_DIR "pslib") +SET(PSLIB_INSTALL_ROOT "${THIRD_PARTY_PATH}/install") +SET(PSLIB_INSTALL_DIR ${PSLIB_INSTALL_ROOT}/${PSLIB_DST_DIR}) +SET(PSLIB_ROOT ${PSLIB_INSTALL_DIR}) +SET(PSLIB_INC_DIR ${PSLIB_ROOT}/include) +SET(PSLIB_LIB_DIR ${PSLIB_ROOT}/lib) +SET(PSLIB_LIB ${PSLIB_LIB_DIR}/libps.so) +SET(PSLIB_IOMP_LIB ${PSLIB_LIB_DIR}/libiomp5.so) #todo what is this +SET(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_RPATH}" "${PSLIB_ROOT}/lib") + +INCLUDE_DIRECTORIES(${PSLIB_INC_DIR}) + +FILE(WRITE ${PSLIB_DOWNLOAD_DIR}/CMakeLists.txt + "PROJECT(PSLIB)\n" + "cmake_minimum_required(VERSION 3.0)\n" + "install(DIRECTORY ${PSLIB_NAME}/include ${PSLIB_NAME}/lib \n" + " DESTINATION ${PSLIB_DST_DIR})\n") + +ExternalProject_Add( + ${PSLIB_PROJECT} + ${EXTERNAL_PROJECT_LOG_ARGS} + PREFIX ${PSLIB_SOURCE_DIR} + DOWNLOAD_DIR ${PSLIB_DOWNLOAD_DIR} + DOWNLOAD_COMMAND wget --no-check-certificate ${PSLIB_URL} -c -q -O ${PSLIB_NAME}.tar.gz + && tar zxvf ${PSLIB_NAME}.tar.gz + DOWNLOAD_NO_PROGRESS 1 + UPDATE_COMMAND "" + CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=${PSLIB_INSTALL_ROOT} + CMAKE_CACHE_ARGS -DCMAKE_INSTALL_PREFIX:PATH=${PSLIB_INSTALL_ROOT} +) + +ADD_LIBRARY(pslib SHARED IMPORTED GLOBAL) +SET_PROPERTY(TARGET pslib PROPERTY IMPORTED_LOCATION ${PSLIB_LIB}) +ADD_DEPENDENCIES(pslib ${PSLIB_PROJECT}) diff --git a/cmake/external/pslib_brpc.cmake b/cmake/external/pslib_brpc.cmake new file mode 100644 index 0000000000000000000000000000000000000000..22c8c1b463764b6e107c5f3da25d51b36c2ce59f --- /dev/null +++ b/cmake/external/pslib_brpc.cmake @@ -0,0 +1,72 @@ +# Copyright (c) 2017 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +IF(NOT ${WITH_PSLIB_BRPC}) + return() +ENDIF(NOT ${WITH_PSLIB_BRPC}) + +IF(WIN32 OR APPLE) + MESSAGE(WARNING + "Windows or Mac is not supported with PSLIB_BRPC in Paddle yet." + "Force WITH_PSLIB_BRPC=OFF") + SET(WITH_PSLIB_BRPC OFF CACHE STRING "Disable PSLIB_BRPC package in Windows and MacOS" FORCE) + return() +ENDIF() + +INCLUDE(ExternalProject) + +SET(PSLIB_BRPC_PROJECT "extern_pslib_brpc") +IF((NOT DEFINED PSLIB_BRPC_NAME) OR (NOT DEFINED PSLIB_BRPC_URL)) + MESSAGE(STATUS "use pre defined download url") + SET(PSLIB_BRPC_VER "0.1.0" CACHE STRING "" FORCE) + SET(PSLIB_BRPC_NAME "pslib_brpc" CACHE STRING "" FORCE) + SET(PSLIB_BRPC_URL "https://raw.githubusercontent.com/PaddlePaddle/Fleet/release/${PSLIB_BRPC_VER}/${PSLIB_BRPC_NAME}.tar.gz" CACHE STRING "" FORCE) +ENDIF() +MESSAGE(STATUS "PSLIB_BRPC_NAME: ${PSLIB_BRPC_NAME}, PSLIB_BRPC_URL: ${PSLIB_BRPC_URL}") +SET(PSLIB_BRPC_SOURCE_DIR "${THIRD_PARTY_PATH}/pslib_brpc") +SET(PSLIB_BRPC_DOWNLOAD_DIR "${PSLIB_BRPC_SOURCE_DIR}/src/${PSLIB_BRPC_PROJECT}") +SET(PSLIB_BRPC_DST_DIR "pslib_brpc") +SET(PSLIB_BRPC_INSTALL_ROOT "${THIRD_PARTY_PATH}/install") +SET(PSLIB_BRPC_INSTALL_DIR ${PSLIB_BRPC_INSTALL_ROOT}/${PSLIB_BRPC_DST_DIR}) +SET(PSLIB_BRPC_ROOT ${PSLIB_BRPC_INSTALL_DIR}) +SET(PSLIB_BRPC_INC_DIR ${PSLIB_BRPC_ROOT}/include) +SET(PSLIB_BRPC_LIB_DIR ${PSLIB_BRPC_ROOT}/lib) +SET(PSLIB_BRPC_LIB ${PSLIB_BRPC_LIB_DIR}/libbrpc.a) +SET(PSLIB_BRPC_IOMP_LIB ${PSLIB_BRPC_LIB_DIR}/libiomp5.so) #todo what is this +SET(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_RPATH}" "${PSLIB_BRPC_ROOT}/lib") + +INCLUDE_DIRECTORIES(${PSLIB_BRPC_INC_DIR}) + +FILE(WRITE ${PSLIB_BRPC_DOWNLOAD_DIR}/CMakeLists.txt + "PROJECT(PSLIB_BRPC)\n" + "cmake_minimum_required(VERSION 3.0)\n" + "install(DIRECTORY ${PSLIB_BRPC_NAME}/include ${PSLIB_BRPC_NAME}/lib \n" + " DESTINATION ${PSLIB_BRPC_DST_DIR})\n") + +ExternalProject_Add( + ${PSLIB_BRPC_PROJECT} + ${EXTERNAL_PROJECT_LOG_ARGS} + PREFIX ${PSLIB_BRPC_SOURCE_DIR} + DOWNLOAD_DIR ${PSLIB_BRPC_DOWNLOAD_DIR} + DOWNLOAD_COMMAND wget --no-check-certificate ${PSLIB_BRPC_URL} -c -q -O ${PSLIB_BRPC_NAME}.tar.gz + && tar zxvf ${PSLIB_BRPC_NAME}.tar.gz + DOWNLOAD_NO_PROGRESS 1 + UPDATE_COMMAND "" + CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=${PSLIB_BRPC_INSTALL_ROOT} + CMAKE_CACHE_ARGS -DCMAKE_INSTALL_PREFIX:PATH=${PSLIB_BRPC_INSTALL_ROOT} +) + +ADD_LIBRARY(pslib_brpc SHARED IMPORTED GLOBAL) +SET_PROPERTY(TARGET pslib_brpc PROPERTY IMPORTED_LOCATION ${PSLIB_BRPC_LIB}) +ADD_DEPENDENCIES(pslib_brpc ${PSLIB_BRPC_PROJECT}) diff --git a/cmake/external/pybind11.cmake b/cmake/external/pybind11.cmake index c885877a2bcd622dc8d1f3ca2821a0bed84561bf..3a10ea945d3d1a3fa88954196905872b2668d5a0 100644 --- a/cmake/external/pybind11.cmake +++ b/cmake/external/pybind11.cmake @@ -26,7 +26,7 @@ ExternalProject_Add( extern_pybind ${EXTERNAL_PROJECT_LOG_ARGS} GIT_REPOSITORY "https://github.com/pybind/pybind11.git" - GIT_TAG "v2.1.1" + GIT_TAG "v2.2.4" PREFIX ${PYBIND_SOURCE_DIR} UPDATE_COMMAND "" CONFIGURE_COMMAND "" diff --git a/cmake/external/python.cmake b/cmake/external/python.cmake index f17b8d46dc2d8ded81ced7de5827d5e7fd5109f0..623c53f4f75bbd217c157bcdda0cb12c510269ee 100644 --- a/cmake/external/python.cmake +++ b/cmake/external/python.cmake @@ -18,8 +18,50 @@ ENDIF() INCLUDE(python_module) -FIND_PACKAGE(PythonInterp ${PY_VERSION}) -FIND_PACKAGE(PythonLibs ${PY_VERSION}) +FIND_PACKAGE(PythonInterp ${PY_VERSION} REQUIRED) +FIND_PACKAGE(PythonLibs ${PY_VERSION} REQUIRED) + +if(WIN32) + execute_process(COMMAND "${PYTHON_EXECUTABLE}" "-c" +"from distutils import sysconfig as s;import sys;import struct; +print(sys.prefix); +print(s.get_config_var('LDVERSION') or s.get_config_var('VERSION')); +" + RESULT_VARIABLE _PYTHON_SUCCESS + OUTPUT_VARIABLE _PYTHON_VALUES + ERROR_VARIABLE _PYTHON_ERROR_VALUE) + + if(NOT _PYTHON_SUCCESS MATCHES 0) + set(PYTHONLIBS_FOUND FALSE) + return() + endif() + + # Convert the process output into a list + string(REGEX REPLACE ";" "\\\\;" _PYTHON_VALUES ${_PYTHON_VALUES}) + string(REGEX REPLACE "\n" ";" _PYTHON_VALUES ${_PYTHON_VALUES}) + list(GET _PYTHON_VALUES 0 PYTHON_PREFIX) + list(GET _PYTHON_VALUES 1 PYTHON_LIBRARY_SUFFIX) + + # Make sure all directory separators are '/' + string(REGEX REPLACE "\\\\" "/" PYTHON_PREFIX ${PYTHON_PREFIX}) + + set(PYTHON_LIBRARY + "${PYTHON_PREFIX}/libs/Python${PYTHON_LIBRARY_SUFFIX}.lib") + + # when run in a venv, PYTHON_PREFIX points to it. But the libraries remain in the + # original python installation. They may be found relative to PYTHON_INCLUDE_DIR. + if(NOT EXISTS "${PYTHON_LIBRARY}") + get_filename_component(_PYTHON_ROOT ${PYTHON_INCLUDE_DIR} DIRECTORY) + set(PYTHON_LIBRARY + "${_PYTHON_ROOT}/libs/Python${PYTHON_LIBRARY_SUFFIX}.lib") + endif() + + # raise an error if the python libs are still not found. + if(NOT EXISTS "${PYTHON_LIBRARY}") + message(FATAL_ERROR "Python libraries not found") + endif() + SET(PYTHON_LIBRARIES "${PYTHON_LIBRARY}") +endif(WIN32) # Fixme: Maybe find a static library. Get SHARED/STATIC by FIND_PACKAGE. ADD_LIBRARY(python SHARED IMPORTED GLOBAL) @@ -37,6 +79,5 @@ IF(PYTHONINTERP_FOUND) "please use pip to upgrade protobuf. pip install -U protobuf") ENDIF() ENDIF(PYTHONINTERP_FOUND) - INCLUDE_DIRECTORIES(${PYTHON_INCLUDE_DIR}) INCLUDE_DIRECTORIES(${PYTHON_NUMPY_INCLUDE_DIR}) diff --git a/cmake/external/rocprim.cmake b/cmake/external/rocprim.cmake new file mode 100644 index 0000000000000000000000000000000000000000..914c06491890574bcdf4374d8e0fd5498e780113 --- /dev/null +++ b/cmake/external/rocprim.cmake @@ -0,0 +1,44 @@ +if (NOT WITH_AMD_GPU) + return() +endif() + +# rocprim is "ROCm Parallel Primitives" for short. +# It is a header-only library providing HIP and HC parallel primitives +# for developing performant GPU-accelerated code on AMD ROCm platform. + +if("x${HCC_HOME}" STREQUAL "x") + set(HCC_HOME "/opt/rocm/hcc") +endif() + +INCLUDE(ExternalProject) + +SET(ROCPRIM_SOURCE_DIR ${THIRD_PARTY_PATH}/rocprim) +SET(ROCPRIM_INSTALL_DIR ${THIRD_PARTY_PATH}/install/rocprim) +SET(ROCPRIM_INCLUDE_DIR ${ROCPRIM_INSTALL_DIR}/include) + +ExternalProject_Add( + extern_rocprim + GIT_REPOSITORY "https://github.com/ROCmSoftwarePlatform/rocPRIM.git" + GIT_TAG 5bd41b96ab8d8343330fb2c3e1b96775bde3b3fc + PREFIX ${ROCPRIM_SOURCE_DIR} + UPDATE_COMMAND "" + CMAKE_ARGS -DCMAKE_CXX_COMPILER=${HCC_HOME}/bin/hcc + CMAKE_ARGS -DONLY_INSTALL=ON + CMAKE_ARGS -DBUILD_TEST=OFF + CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=${ROCPRIM_INSTALL_DIR} + + INSTALL_DIR ${ROCPRIM_INSTALL_DIR} + ${EXTERNAL_PROJECT_LOG_ARGS} +) + +INCLUDE_DIRECTORIES(${ROCPRIM_INCLUDE_DIR}) + +if (${CMAKE_VERSION} VERSION_LESS "3.3.0") + set(dummyfile ${CMAKE_CURRENT_BINARY_DIR}/rocprim_dummy.c) + file(WRITE ${dummyfile} "const char *dummy_rocprim = \"${dummyfile}\";") + add_library(rocprim STATIC ${dummyfile}) +else() + add_library(rocprim INTERFACE) +endif() + +add_dependencies(rocprim extern_rocprim) diff --git a/cmake/external/snappy.cmake b/cmake/external/snappy.cmake index af09ed4d5d6e21cc50aba5198a7e9ea56f49451a..1e01057aa606af78cd722d3619a710cb35817174 100644 --- a/cmake/external/snappy.cmake +++ b/cmake/external/snappy.cmake @@ -12,10 +12,6 @@ # See the License for the specific language governing permissions and # limitations under the License. -if(MOBILE_INFERENCE OR RPI) - return() -endif() - include (ExternalProject) # NOTE: snappy is needed when linking with recordio @@ -24,7 +20,11 @@ set(SNAPPY_SOURCES_DIR ${THIRD_PARTY_PATH}/snappy) set(SNAPPY_INSTALL_DIR ${THIRD_PARTY_PATH}/install/snappy) set(SNAPPY_INCLUDE_DIR "${SNAPPY_INSTALL_DIR}/include" CACHE PATH "snappy include directory." FORCE) -set(SNAPPY_LIBRARIES "${SNAPPY_INSTALL_DIR}/lib/libsnappy.a") +if(WIN32) + SET(SNAPPY_CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /wd4244 /wd4267") +else() + SET(SNAPPY_CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS}) +endif() ExternalProject_Add( extern_snappy @@ -34,8 +34,12 @@ ExternalProject_Add( UPDATE_COMMAND "" CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} - -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS=${SNAPPY_CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} -DCMAKE_INSTALL_PREFIX=${SNAPPY_INSTALL_DIR} -DCMAKE_INSTALL_LIBDIR=${SNAPPY_INSTALL_DIR}/lib -DCMAKE_POSITION_INDEPENDENT_CODE=ON @@ -48,6 +52,16 @@ ExternalProject_Add( -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} ) +IF(WIN32) + IF(NOT EXISTS "${SNAPPY_INSTALL_DIR}/lib/libsnappy.lib") + add_custom_command(TARGET extern_snappy POST_BUILD + COMMAND cmake -E copy ${SNAPPY_INSTALL_DIR}/lib/snappy.lib ${SNAPPY_INSTALL_DIR}/lib/libsnappy.lib + ) + ENDIF() + set(SNAPPY_LIBRARIES "${SNAPPY_INSTALL_DIR}/lib/libsnappy.lib") +else(WIN32) + set(SNAPPY_LIBRARIES "${SNAPPY_INSTALL_DIR}/lib/libsnappy.a") +endif (WIN32) add_library(snappy STATIC IMPORTED GLOBAL) set_property(TARGET snappy PROPERTY IMPORTED_LOCATION ${SNAPPY_LIBRARIES}) diff --git a/cmake/external/snappystream.cmake b/cmake/external/snappystream.cmake index 6df636d7fa8757ade73892bda03a80ba9767472b..392f186b7ce3821f313ed6fc3dd5a97c2a7adebd 100644 --- a/cmake/external/snappystream.cmake +++ b/cmake/external/snappystream.cmake @@ -12,42 +12,47 @@ # See the License for the specific language governing permissions and # limitations under the License. -IF(MOBILE_INFERENCE OR RPI) - return() -ENDIF() - include (ExternalProject) -# NOTE: snappy is needed when linking with recordio - set(SNAPPYSTREAM_SOURCES_DIR ${THIRD_PARTY_PATH}/snappy_stream) set(SNAPPYSTREAM_INSTALL_DIR ${THIRD_PARTY_PATH}/install/snappy_stream) set(SNAPPYSTREAM_INCLUDE_DIR "${SNAPPYSTREAM_INSTALL_DIR}/include" CACHE PATH "snappy stream include directory." FORCE) -set(SNAPPYSTREAM_LIBRARIES "${SNAPPYSTREAM_INSTALL_DIR}/lib/libsnappystream.a") - -ExternalProject_Add( - extern_snappystream - GIT_REPOSITORY "https://github.com/hoxnox/snappystream.git" - GIT_TAG "0.2.8" - PREFIX ${SNAPPYSTREAM_SOURCES_DIR} - UPDATE_COMMAND "" - CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} - -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} - -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} - -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} - -DCMAKE_INSTALL_PREFIX=${SNAPPY_INSTALL_DIR} - -DCMAKE_INSTALL_LIBDIR=${SNAPPY_INSTALL_DIR}/lib - -DCMAKE_POSITION_INDEPENDENT_CODE=ON - -DCMAKE_BUILD_TYPE=${THIRD_PARTY_BUILD_TYPE} - -DSNAPPY_ROOT=${SNAPPY_INSTALL_DIR} - ${EXTERNAL_OPTIONAL_ARGS} - CMAKE_CACHE_ARGS - -DCMAKE_INSTALL_PREFIX:PATH=${SNAPPYSTREAM_INSTALL_DIR} - -DCMAKE_INSTALL_LIBDIR:PATH=${SNAPPYSTREAM_INSTALL_DIR}/lib - -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} - DEPENDS snappy -) +if(WIN32) + # Fix me, VS2015 come without VLA support + set(SNAPPYSTREAM_LIBRARIES "${SNAPPYSTREAM_INSTALL_DIR}/lib/snappystream.lib") + MESSAGE(WARNING, "In windows, snappystream has no compile support for windows, + please build it manually and put it at " ${SNAPPYSTREAM_INSTALL_DIR}) +else(WIN32) + set(SNAPPYSTREAM_LIBRARIES "${SNAPPYSTREAM_INSTALL_DIR}/lib/libsnappystream.a") + + ExternalProject_Add( + extern_snappystream + GIT_REPOSITORY "https://github.com/hoxnox/snappystream.git" + GIT_TAG "0.2.8" + PREFIX ${SNAPPYSTREAM_SOURCES_DIR} + UPDATE_COMMAND "" + CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} + -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} + -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} + -DCMAKE_INSTALL_PREFIX=${SNAPPY_INSTALL_DIR} + -DCMAKE_INSTALL_LIBDIR=${SNAPPY_INSTALL_DIR}/lib + -DCMAKE_POSITION_INDEPENDENT_CODE=ON + -DCMAKE_BUILD_TYPE=${THIRD_PARTY_BUILD_TYPE} + -DSNAPPY_ROOT=${SNAPPY_INSTALL_DIR} + ${EXTERNAL_OPTIONAL_ARGS} + CMAKE_CACHE_ARGS + -DCMAKE_INSTALL_PREFIX:PATH=${SNAPPYSTREAM_INSTALL_DIR} + -DCMAKE_INSTALL_LIBDIR:PATH=${SNAPPYSTREAM_INSTALL_DIR}/lib + -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} + DEPENDS snappy + ) +endif(WIN32) add_library(snappystream STATIC IMPORTED GLOBAL) set_property(TARGET snappystream PROPERTY IMPORTED_LOCATION ${SNAPPYSTREAM_LIBRARIES}) diff --git a/cmake/external/swig.cmake b/cmake/external/swig.cmake deleted file mode 100644 index de07703695eb14e76eedd3758d55cb98edd1e02b..0000000000000000000000000000000000000000 --- a/cmake/external/swig.cmake +++ /dev/null @@ -1,65 +0,0 @@ -# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -IF(NOT WITH_SWIG_PY) - return() -ENDIF() - -FIND_PACKAGE(SWIG) - -IF(NOT SWIG_FOUND) - # build swig as an external project - INCLUDE(ExternalProject) - - SET(SWIG_SOURCES_DIR ${THIRD_PARTY_PATH}/swig) - SET(SWIG_INSTALL_DIR ${THIRD_PARTY_PATH}/install/swig) - SET(SWIG_TARGET_VERSION "3.0.2") - SET(SWIG_DOWNLOAD_SRC_MD5 "62f9b0d010cef36a13a010dc530d0d41") - SET(SWIG_DOWNLOAD_WIN_MD5 "3f18de4fc09ab9abb0d3be37c11fbc8f") - - IF(WIN32) - # swig.exe available as pre-built binary on Windows: - ExternalProject_Add(swig - URL http://prdownloads.sourceforge.net/swig/swigwin-${SWIG_TARGET_VERSION}.zip - URL_MD5 ${SWIG_DOWNLOAD_WIN_MD5} - SOURCE_DIR ${SWIG_SOURCES_DIR} - CONFIGURE_COMMAND "" - BUILD_COMMAND "" - INSTALL_COMMAND "" - UPDATE_COMMAND "" - ) - SET(SWIG_DIR ${SWIG_SOURCES_DIR} CACHE FILEPATH "SWIG Directory" FORCE) - SET(SWIG_EXECUTABLE ${SWIG_SOURCES_DIR}/swig.exe CACHE FILEPATH "SWIG Executable" FORCE) - ELSE(WIN32) - # swig uses bison find it by cmake and pass it down - FIND_PACKAGE(BISON) - - # From SWIG configure - ExternalProject_Add(swig - GIT_REPOSITORY https://github.com/swig/swig.git - GIT_TAG rel-3.0.10 - PREFIX ${SWIG_SOURCES_DIR} - CONFIGURE_COMMAND cd && ./autogen.sh && ./configure - --prefix=${SWIG_INSTALL_DIR} --without-pcre - BUILD_COMMAND cd && make - INSTALL_COMMAND cd && make install - UPDATE_COMMAND "" - ) - - SET(SWIG_DIR ${SWIG_INSTALL_DIR}/share/swig/${SWIG_TARGET_VERSION}) - SET(SWIG_EXECUTABLE ${SWIG_INSTALL_DIR}/bin/swig) - ENDIF(WIN32) - - LIST(APPEND external_project_dependencies swig) -ENDIF(NOT SWIG_FOUND) diff --git a/cmake/external/threadpool.cmake b/cmake/external/threadpool.cmake index 0159815fed81bdff6de3e561af569e9edc75f947..1f56bc7ab056ef0dd95d603ebe3461ef044b2a79 100644 --- a/cmake/external/threadpool.cmake +++ b/cmake/external/threadpool.cmake @@ -26,5 +26,3 @@ else() endif() add_dependencies(simple_threadpool extern_threadpool) - -LIST(APPEND external_project_dependencies simple_threadpool) diff --git a/cmake/external/warpctc.cmake b/cmake/external/warpctc.cmake index 07e1137e16afc1e4e9ab9640e1ccaea8008a0cd2..6f2af8670f25c00ac0970fe4ae2b0c5b03aa0d9e 100644 --- a/cmake/external/warpctc.cmake +++ b/cmake/external/warpctc.cmake @@ -12,10 +12,6 @@ # See the License for the specific language governing permissions and # limitations under the License. -IF(MOBILE_INFERENCE) - return() -ENDIF() - INCLUDE(ExternalProject) SET(WARPCTC_SOURCES_DIR ${THIRD_PARTY_PATH}/warpctc) @@ -26,25 +22,33 @@ SET(WARPCTC_INCLUDE_DIR "${WARPCTC_INSTALL_DIR}/include" # Used in unit test test_WarpCTCLayer SET(WARPCTC_LIB_DIR "${WARPCTC_INSTALL_DIR}/lib" CACHE PATH "Warp-ctc Library Directory" FORCE) -SET(WARPCTC_LIBRARIES "${WARPCTC_INSTALL_DIR}/lib/libwarpctc${CMAKE_SHARED_LIBRARY_SUFFIX}" - CACHE FILEPATH "Warp-ctc Library" FORCE) -IF(CMAKE_CXX_COMPILER_ID STREQUAL "Clang" OR CMAKE_CXX_COMPILER_ID STREQUAL "AppleClang" ) +IF(CMAKE_CXX_COMPILER_ID STREQUAL "Clang" OR CMAKE_CXX_COMPILER_ID STREQUAL "AppleClang" OR WIN32) SET(USE_OMP OFF) ELSE() SET(USE_OMP ON) ENDIF() +IF(WIN32) + SET(WARPCTC_REPOSITORY "https://github.com/wopeizl/warp-ctc.git") +ELSE() + SET(WARPCTC_REPOSITORY "https://github.com/dzhwinter/warp-ctc.git") +ENDIF() + ExternalProject_Add( extern_warpctc ${EXTERNAL_PROJECT_LOG_ARGS} - GIT_REPOSITORY "https://github.com/dzhwinter/warp-ctc.git" + GIT_REPOSITORY ${WARPCTC_REPOSITORY} PREFIX ${WARPCTC_SOURCES_DIR} UPDATE_COMMAND "" CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} - -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} -DCMAKE_INSTALL_PREFIX=${WARPCTC_INSTALL_DIR} -DWITH_GPU=${WITH_GPU} -DWITH_OMP=${USE_OMP} @@ -59,6 +63,18 @@ ExternalProject_Add( -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON -DCMAKE_INSTALL_PREFIX:PATH=${WARPCTC_INSTALL_DIR} ) +IF(WIN32) + IF(NOT EXISTS "${WARPCTC_INSTALL_DIR}/lib/warpctc${CMAKE_SHARED_LIBRARY_SUFFIX}") + add_custom_command(TARGET extern_warpctc POST_BUILD + COMMAND cmake -E copy ${WARPCTC_INSTALL_DIR}/bin/warpctc${CMAKE_SHARED_LIBRARY_SUFFIX} ${WARPCTC_INSTALL_DIR}/lib/warpctc${CMAKE_SHARED_LIBRARY_SUFFIX} + ) + ENDIF() + SET(WARPCTC_LIBRARIES "${WARPCTC_INSTALL_DIR}/lib/warpctc${CMAKE_SHARED_LIBRARY_SUFFIX}" + CACHE FILEPATH "Warp-ctc Library" FORCE) +else(WIN32) + SET(WARPCTC_LIBRARIES "${WARPCTC_INSTALL_DIR}/lib/libwarpctc${CMAKE_SHARED_LIBRARY_SUFFIX}" + CACHE FILEPATH "Warp-ctc Library" FORCE) +ENDIF(WIN32) MESSAGE(STATUS "warp-ctc library: ${WARPCTC_LIBRARIES}") INCLUDE_DIRECTORIES(${WARPCTC_INCLUDE_DIR}) # For warpctc code to include its headers. @@ -67,5 +83,3 @@ INCLUDE_DIRECTORIES(${THIRD_PARTY_PATH}/install) # For Paddle code to include wa ADD_LIBRARY(warpctc SHARED IMPORTED GLOBAL) SET_PROPERTY(TARGET warpctc PROPERTY IMPORTED_LOCATION ${WARPCTC_LIBRARIES}) ADD_DEPENDENCIES(warpctc extern_warpctc) - -LIST(APPEND external_project_dependencies warpctc) diff --git a/cmake/external/xbyak.cmake b/cmake/external/xbyak.cmake index 384c2f9328296ce6a8a6293be6cc47e5063dd3c4..1d61154c0d45dea795902d6544deb796693db263 100644 --- a/cmake/external/xbyak.cmake +++ b/cmake/external/xbyak.cmake @@ -55,4 +55,3 @@ else() endif() add_dependencies(xbyak ${XBYAK_PROJECT}) -list(APPEND external_project_dependencies xbyak) diff --git a/cmake/external/xxhash.cmake b/cmake/external/xxhash.cmake new file mode 100644 index 0000000000000000000000000000000000000000..23b1e02108642df561948a6faa3152effb7ca932 --- /dev/null +++ b/cmake/external/xxhash.cmake @@ -0,0 +1,73 @@ +INCLUDE(ExternalProject) + +set(XXHASH_SOURCE_DIR ${THIRD_PARTY_PATH}/xxhash) +set(XXHASH_INSTALL_DIR ${THIRD_PARTY_PATH}/install/xxhash) +set(XXHASH_INCLUDE_DIR "${XXHASH_INSTALL_DIR}/include") + +IF(WITH_STATIC_LIB) + SET(BUILD_CMD make lib) +ELSE() + IF(APPLE) + SET(BUILD_CMD sed -i \"\" "s/-Wstrict-prototypes -Wundef/-Wstrict-prototypes -Wundef -fPIC/g" ${XXHASH_SOURCE_DIR}/src/extern_xxhash/Makefile && make lib) + ELSE(APPLE) + SET(BUILD_CMD sed -i "s/-Wstrict-prototypes -Wundef/-Wstrict-prototypes -Wundef -fPIC/g" ${XXHASH_SOURCE_DIR}/src/extern_xxhash/Makefile && make lib) + ENDIF(APPLE) +ENDIF() + +if(WIN32) + ExternalProject_Add( + extern_xxhash + ${EXTERNAL_PROJECT_LOG_ARGS} + GIT_REPOSITORY "https://github.com/Cyan4973/xxHash" + GIT_TAG "v0.6.5" + PREFIX ${XXHASH_SOURCE_DIR} + DOWNLOAD_NAME "xxhash" + UPDATE_COMMAND "" + BUILD_IN_SOURCE 1 + PATCH_COMMAND + CONFIGURE_COMMAND + ${CMAKE_COMMAND} ${XXHASH_SOURCE_DIR}/src/extern_xxhash/cmake_unofficial + -DCMAKE_INSTALL_PREFIX:PATH=${XXHASH_INSTALL_DIR} + -DCMAKE_BUILD_TYPE:STRING=${CMAKE_BUILD_TYPE} + -DCMAKE_VERBOSE_MAKEFILE:BOOL=OFF + -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON + -DBUILD_XXHSUM=OFF + -DCMAKE_GENERATOR_PLATFORM=x64 + -DBUILD_SHARED_LIBS=OFF + ${OPTIONAL_CACHE_ARGS} + TEST_COMMAND "" + ) +else() + ExternalProject_Add( + extern_xxhash + ${EXTERNAL_PROJECT_LOG_ARGS} + GIT_REPOSITORY "https://github.com/Cyan4973/xxHash" + GIT_TAG "v0.6.5" + PREFIX ${XXHASH_SOURCE_DIR} + DOWNLOAD_NAME "xxhash" + UPDATE_COMMAND "" + CONFIGURE_COMMAND "" + BUILD_IN_SOURCE 1 + PATCH_COMMAND + BUILD_COMMAND ${BUILD_CMD} + INSTALL_COMMAND export PREFIX=${XXHASH_INSTALL_DIR}/ && make install + TEST_COMMAND "" + ) +endif() + +if (WIN32) + IF(NOT EXISTS "${XXHASH_INSTALL_DIR}/lib/libxxhash.lib") + add_custom_command(TARGET extern_xxhash POST_BUILD + COMMAND cmake -E copy ${XXHASH_INSTALL_DIR}/lib/xxhash.lib ${XXHASH_INSTALL_DIR}/lib/libxxhash.lib + ) + ENDIF() + set(XXHASH_LIBRARIES "${XXHASH_INSTALL_DIR}/lib/libxxhash.lib") +else() + set(XXHASH_LIBRARIES "${XXHASH_INSTALL_DIR}/lib/libxxhash.a") +endif () +INCLUDE_DIRECTORIES(${XXHASH_INCLUDE_DIR}) + +add_library(xxhash STATIC IMPORTED GLOBAL) +set_property(TARGET xxhash PROPERTY IMPORTED_LOCATION ${XXHASH_LIBRARIES}) +include_directories(${XXHASH_INCLUDE_DIR}) +add_dependencies(xxhash extern_xxhash) diff --git a/cmake/external/zlib.cmake b/cmake/external/zlib.cmake index c3d73235453c8c9fd2859c3ab142888e8bda2dbe..5569fefe992d10ad4820e51e677f40271d0214e7 100644 --- a/cmake/external/zlib.cmake +++ b/cmake/external/zlib.cmake @@ -19,12 +19,6 @@ SET(ZLIB_INSTALL_DIR ${THIRD_PARTY_PATH}/install/zlib) SET(ZLIB_ROOT ${ZLIB_INSTALL_DIR} CACHE FILEPATH "zlib root directory." FORCE) SET(ZLIB_INCLUDE_DIR "${ZLIB_INSTALL_DIR}/include" CACHE PATH "zlib include directory." FORCE) -IF(WIN32) - SET(ZLIB_LIBRARIES "${ZLIB_INSTALL_DIR}/lib/zlibstatic.lib" CACHE FILEPATH "zlib library." FORCE) -ELSE(WIN32) - SET(ZLIB_LIBRARIES "${ZLIB_INSTALL_DIR}/lib/libz.a" CACHE FILEPATH "zlib library." FORCE) -ENDIF(WIN32) - INCLUDE_DIRECTORIES(${ZLIB_INCLUDE_DIR}) # For zlib code to include its own headers. INCLUDE_DIRECTORIES(${THIRD_PARTY_PATH}/install) # For Paddle code to include zlib.h. @@ -49,18 +43,17 @@ ExternalProject_Add( -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} ) +IF(WIN32) + IF(NOT EXISTS "${ZLIB_INSTALL_DIR}/lib/libz.lib") + add_custom_command(TARGET extern_zlib POST_BUILD + COMMAND cmake -E copy ${ZLIB_INSTALL_DIR}/lib/zlibstatic.lib ${ZLIB_INSTALL_DIR}/lib/libz.lib + ) + ENDIF() + SET(ZLIB_LIBRARIES "${ZLIB_INSTALL_DIR}/lib/libz.lib" CACHE FILEPATH "zlib library." FORCE) +ELSE(WIN32) + SET(ZLIB_LIBRARIES "${ZLIB_INSTALL_DIR}/lib/libz.a" CACHE FILEPATH "zlib library." FORCE) +ENDIF(WIN32) ADD_LIBRARY(zlib STATIC IMPORTED GLOBAL) SET_PROPERTY(TARGET zlib PROPERTY IMPORTED_LOCATION ${ZLIB_LIBRARIES}) ADD_DEPENDENCIES(zlib extern_zlib) - -LIST(APPEND external_project_dependencies zlib) - -IF(WITH_C_API) - INSTALL(DIRECTORY ${ZLIB_INCLUDE_DIR} DESTINATION third_party/zlib) - IF(ANDROID) - INSTALL(FILES ${ZLIB_LIBRARIES} DESTINATION third_party/zlib/lib/${ANDROID_ABI}) - ELSE() - INSTALL(FILES ${ZLIB_LIBRARIES} DESTINATION third_party/zlib/lib) - ENDIF() -ENDIF() diff --git a/cmake/flags.cmake b/cmake/flags.cmake index 343e44ab4bc21c1a656048b675062f1b897bbc77..36b533aa4f7815896fb48c33fefad892b8d0d29c 100644 --- a/cmake/flags.cmake +++ b/cmake/flags.cmake @@ -21,7 +21,7 @@ function(CheckCompilerCXX11Flag) if (${CMAKE_CXX_COMPILER_VERSION} VERSION_LESS 3.3) message(FATAL_ERROR "Unsupported Clang version. Clang >= 3.3 required.") endif() - endif() + endif() endif() endfunction() @@ -129,6 +129,9 @@ set(COMMON_FLAGS -Wno-error=parentheses-equality # Warnings in pybind11 -Wno-error=ignored-attributes # Warnings in Eigen, gcc 6.3 -Wno-error=terminate # Warning in PADDLE_ENFORCE + -Wno-error=int-in-bool-context # Warning in Eigen gcc 7.2 + -Wimplicit-fallthrough=0 # Warning in tinyformat.h + -Wno-error=maybe-uninitialized # Warning in boost gcc 7.2 ) set(GPU_COMMON_FLAGS @@ -144,19 +147,12 @@ set(GPU_COMMON_FLAGS -Wno-error=unused-function # Warnings in Numpy Header. -Wno-error=array-bounds # Warnings in Eigen::array ) - -else(NOT WIN32) -set(COMMON_FLAGS - "/w") #disable all warnings. -set(GPU_COMMON_FLAGS - "/w") #disable all warnings +set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -m64") endif(NOT WIN32) if (APPLE) - if(NOT CMAKE_CROSSCOMPILING) - # On Mac OS X build fat binaries with x86_64 architectures by default. - set (CMAKE_OSX_ARCHITECTURES "x86_64" CACHE STRING "Build architectures for OSX" FORCE) - endif() + # On Mac OS X build fat binaries with x86_64 architectures by default. + set (CMAKE_OSX_ARCHITECTURES "x86_64" CACHE STRING "Build architectures for OSX" FORCE) # On Mac OS X register class specifier is deprecated and will cause warning error on latest clang 10.0 set (COMMON_FLAGS -Wno-deprecated-register) endif(APPLE) @@ -192,8 +188,7 @@ safe_set_static_flag() CMAKE_CXX_FLAGS_MINSIZEREL CMAKE_CXX_FLAGS_RELWITHDEBINFO CMAKE_C_FLAGS CMAKE_C_FLAGS_DEBUG CMAKE_C_FLAGS_RELEASE CMAKE_C_FLAGS_MINSIZEREL CMAKE_C_FLAGS_RELWITHDEBINFO) - if(${flag_var} MATCHES "/W3") - string(REGEX REPLACE "/W3" "/w" ${flag_var} "${${flag_var}}") - endif(${flag_var} MATCHES "/W3") + string(REGEX REPLACE "(^| )/W[0-9]( |$)" " " ${flag_var} "${${flag_var}}") + set(flag_var "${flag_var} /w") endforeach(flag_var) endif(WIN32) diff --git a/cmake/generic.cmake b/cmake/generic.cmake index 34581e43e86631a556f03ef08fc424698b4a99dc..6679a09dfc9dd00cfe3b5c5da3e12bd1c1389432 100644 --- a/cmake/generic.cmake +++ b/cmake/generic.cmake @@ -90,11 +90,11 @@ # including binary directory for generated headers. include_directories(${CMAKE_CURRENT_BINARY_DIR}) -if(NOT APPLE AND NOT ANDROID) - find_package(Threads REQUIRED) - link_libraries(${CMAKE_THREAD_LIBS_INIT}) - set(CMAKE_CXX_LINK_EXECUTABLE "${CMAKE_CXX_LINK_EXECUTABLE} -pthread -ldl -lrt") -endif(NOT APPLE AND NOT ANDROID) +if(NOT APPLE) + find_package(Threads REQUIRED) + link_libraries(${CMAKE_THREAD_LIBS_INIT}) + set(CMAKE_CXX_LINK_EXECUTABLE "${CMAKE_CXX_LINK_EXECUTABLE} -pthread -ldl -lrt") +endif(NOT APPLE) set_property(GLOBAL PROPERTY FLUID_MODULES "") # find all fluid modules is used for paddle fluid static library @@ -110,6 +110,18 @@ function(find_fluid_modules TARGET_NAME) endif() endfunction(find_fluid_modules) + +function(common_link TARGET_NAME) + if (WITH_PROFILER) + target_link_libraries(${TARGET_NAME} gperftools::profiler) + endif() + + if (WITH_JEMALLOC) + target_link_libraries(${TARGET_NAME} jemalloc::jemalloc) + endif() +endfunction() + + # find all third_party modules is used for paddle static library # for reduce the dependency when building the inference libs. set_property(GLOBAL PROPERTY FLUID_THIRD_PARTY) @@ -220,7 +232,7 @@ function(merge_static_libs TARGET_NAME) # Get the file names of the libraries to be merged set(libfiles ${libfiles} $) endforeach() - # msvc will put libarary in directory of "/Release/xxxlib" by default + # msvc will put libarary in directory of "/Release/xxxlib" by default # COMMAND cmake -E remove "${CMAKE_CURRENT_BINARY_DIR}/${CMAKE_BUILD_TYPE}/${TARGET_NAME}.lib" add_custom_command(TARGET ${TARGET_NAME} POST_BUILD COMMAND cmake -E make_directory "${CMAKE_CURRENT_BINARY_DIR}/${CMAKE_BUILD_TYPE}" @@ -259,10 +271,26 @@ function(cc_library TARGET_NAME) list(APPEND cc_library_DEPS dynload_mklml) endif() add_dependencies(${TARGET_NAME} mklml) - target_link_libraries(${TARGET_NAME} "-L${MKLML_LIB_DIR} -liomp5 -Wl,--as-needed") + if(WIN32) + target_link_libraries(${TARGET_NAME} ${MKLML_IOMP_LIB}) + else(WIN32) + target_link_libraries(${TARGET_NAME} "-L${MKLML_LIB_DIR} -liomp5 -Wl,--as-needed") + endif(WIN32) + endif() + # remove link to python, see notes at: + # https://github.com/pybind/pybind11/blob/master/docs/compiling.rst#building-manually + if("${cc_library_DEPS};" MATCHES "python;") + list(REMOVE_ITEM cc_library_DEPS python) + add_dependencies(${TARGET_NAME} python) + if(WIN32) + target_link_libraries(${TARGET_NAME} ${PYTHON_LIBRARIES}) + else() + target_link_libraries(${TARGET_NAME} "-Wl,-undefined,dynamic_lookup") + endif(WIN32) endif() target_link_libraries(${TARGET_NAME} ${cc_library_DEPS}) add_dependencies(${TARGET_NAME} ${cc_library_DEPS}) + common_link(${TARGET_NAME}) endif() # cpplint code style @@ -276,11 +304,50 @@ function(cc_library TARGET_NAME) if(cc_library_DEPS) merge_static_libs(${TARGET_NAME} ${cc_library_DEPS}) else() - message(FATAL "Please specify source file or library in cc_library.") + message(FATAL_ERROR "Please specify source files or libraries in cc_library(${TARGET_NAME} ...).") endif() endif(cc_library_SRCS) endfunction(cc_library) +# The link operation under windows may exceeds the maximum characters limit, simply break the link command +# into multiple link opeartion can fix that, say +# original: +# lib /out:target.lib a.lib b.lib c.lib d.lib +# after: +# 1. lib /out:dummy_lib_1.lib a.lib b.lib +# 2. lib /out:dummy_lib_2.lib c.lib d.lib +# 1. lib /out:target.lib dummy_lib_1.lib dummy_lib_2.lib +function(sep_library TARGET_NAME) + set(options STATIC static SHARED shared) + set(oneValueArgs "") + set(multiValueArgs SRCS DEPS) + cmake_parse_arguments(sep_library "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) + set(dummy_index 1) + set(dummy_offset 1) + # the dummy target would be consisted of limit size libraries + set(dummy_limit 50) + list(LENGTH sep_library_DEPS sep_all_len) + foreach(v ${sep_library_DEPS}) + list(APPEND dummy_list ${v}) + list(LENGTH dummy_list listlen ) + if ((${listlen} GREATER ${dummy_limit}) OR (${dummy_offset} EQUAL ${sep_all_len})) + message("create dummy library ${TARGET_NAME}_dummy_lib_${dummy_index} for ${TARGET_NAME}") + cc_library(${TARGET_NAME}_dummy_lib_${dummy_index} STATIC DEPS ${dummy_list}) + foreach(i ${dummy_list}) + list(REMOVE_AT dummy_list 0) + endforeach() + list(APPEND ${TARGET_NAME}_dummy_list ${TARGET_NAME}_dummy_lib_${dummy_index}) + MATH(EXPR dummy_index "${dummy_index}+1") + endif() + MATH(EXPR dummy_offset "${dummy_offset}+1") + endforeach() + if(${sep_library_SHARED}) + cc_library(${TARGET_NAME} SHARED SRCS ${sep_library_SRCS} DEPS ${${TARGET_NAME}_dummy_list}) + else(${sep_library_SHARED}) + cc_library(${TARGET_NAME} STATIC SRCS ${sep_library_SRCS} DEPS ${${TARGET_NAME}_dummy_list}) + endif(${sep_library_SHARED}) +endfunction(sep_library) + function(cc_binary TARGET_NAME) set(options "") set(oneValueArgs "") @@ -290,7 +357,10 @@ function(cc_binary TARGET_NAME) if(cc_binary_DEPS) target_link_libraries(${TARGET_NAME} ${cc_binary_DEPS}) add_dependencies(${TARGET_NAME} ${cc_binary_DEPS}) + common_link(${TARGET_NAME}) endif() + get_property(os_dependency_modules GLOBAL PROPERTY OS_DEPENDENCY_MODULES) + target_link_libraries(${TARGET_NAME} ${os_dependency_modules}) endfunction(cc_binary) function(cc_test TARGET_NAME) @@ -300,8 +370,16 @@ function(cc_test TARGET_NAME) set(multiValueArgs SRCS DEPS ARGS) cmake_parse_arguments(cc_test "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) add_executable(${TARGET_NAME} ${cc_test_SRCS}) - target_link_libraries(${TARGET_NAME} ${cc_test_DEPS} paddle_gtest_main lod_tensor memory gtest gflags glog) + if(WIN32) + if("${cc_test_DEPS};" MATCHES "python;") + list(REMOVE_ITEM cc_test_DEPS python) + target_link_libraries(${TARGET_NAME} ${PYTHON_LIBRARIES}) + endif() + endif(WIN32) + get_property(os_dependency_modules GLOBAL PROPERTY OS_DEPENDENCY_MODULES) + target_link_libraries(${TARGET_NAME} ${cc_test_DEPS} ${os_dependency_modules} paddle_gtest_main lod_tensor memory gtest gflags glog) add_dependencies(${TARGET_NAME} ${cc_test_DEPS} paddle_gtest_main lod_tensor memory gtest gflags glog) + common_link(${TARGET_NAME}) add_test(NAME ${TARGET_NAME} COMMAND ${TARGET_NAME} ${cc_test_ARGS} WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}) @@ -310,6 +388,7 @@ function(cc_test TARGET_NAME) endif() set_property(TEST ${TARGET_NAME} PROPERTY ENVIRONMENT FLAGS_cpu_deterministic=true) set_property(TEST ${TARGET_NAME} PROPERTY ENVIRONMENT FLAGS_init_allocated_mem=true) + set_property(TEST ${TARGET_NAME} PROPERTY ENVIRONMENT FLAGS_limit_of_tmp_allocation=4294967296) # 4G set_property(TEST ${TARGET_NAME} PROPERTY ENVIRONMENT FLAGS_cudnn_deterministic=true) # No unit test should exceed 10 minutes. set_tests_properties(${TARGET_NAME} PROPERTIES TIMEOUT 600) @@ -360,6 +439,7 @@ function(nv_binary TARGET_NAME) if(nv_binary_DEPS) target_link_libraries(${TARGET_NAME} ${nv_binary_DEPS}) add_dependencies(${TARGET_NAME} ${nv_binary_DEPS}) + common_link(${TARGET_NAME}) endif() endif() endfunction(nv_binary) @@ -371,14 +451,17 @@ function(nv_test TARGET_NAME) set(multiValueArgs SRCS DEPS) cmake_parse_arguments(nv_test "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) cuda_add_executable(${TARGET_NAME} ${nv_test_SRCS}) - target_link_libraries(${TARGET_NAME} ${nv_test_DEPS} paddle_gtest_main lod_tensor memory gtest gflags glog) + get_property(os_dependency_modules GLOBAL PROPERTY OS_DEPENDENCY_MODULES) + target_link_libraries(${TARGET_NAME} ${nv_test_DEPS} paddle_gtest_main lod_tensor memory gtest gflags glog ${os_dependency_modules}) add_dependencies(${TARGET_NAME} ${nv_test_DEPS} paddle_gtest_main lod_tensor memory gtest gflags glog) + common_link(${TARGET_NAME}) add_test(${TARGET_NAME} ${TARGET_NAME}) if (nv_test_SERIAL) set_property(TEST ${TARGET_NAME} PROPERTY RUN_SERIAL 1) endif() set_property(TEST ${TARGET_NAME} PROPERTY ENVIRONMENT FLAGS_cpu_deterministic=true) set_property(TEST ${TARGET_NAME} PROPERTY ENVIRONMENT FLAGS_init_allocated_mem=true) + set_property(TEST ${TARGET_NAME} PROPERTY ENVIRONMENT FLAGS_limit_of_tmp_allocation=4294967296) # 4G set_property(TEST ${TARGET_NAME} PROPERTY ENVIRONMENT FLAGS_cudnn_deterministic=true) endif() endfunction(nv_test) @@ -401,25 +484,29 @@ function(hip_library TARGET_NAME) else() add_library(${TARGET_NAME} STATIC ${_cmake_options} ${_generated_files} ${_sources}) set_target_properties(${TARGET_NAME} PROPERTIES LINKER_LANGUAGE CXX) - target_link_libraries(${TARGET_NAME} /opt/rocm/hip/lib/libhip_hcc.so /opt/rocm/hip/lib/libhip_device.a) - find_fluid_modules(${TARGET_NAME}) + target_link_libraries(${TARGET_NAME} /opt/rocm/hip/lib/libhip_hcc.so /opt/rocm/hip/lib/libhip_device.a /opt/rocm/rccl/lib/librccl.so /opt/rocm/hiprand/lib/libhiprand.so) + find_fluid_modules(${TARGET_NAME}) endif() - if (hip_library_DEPS) - add_dependencies(${TARGET_NAME} ${hip_library_DEPS}) - target_link_libraries(${TARGET_NAME} ${hip_library_DEPS}) + if("${hip_library_DEPS}" MATCHES "ARCHIVE_START") + # Support linking flags: --whole-archive (Linux) / -force_load (MacOS). + # WARNING: Please don't use ARCHIVE_START&ARCHIVE_END if TARGET_NAME will be linked by other libraries. + target_circle_link_libraries(${TARGET_NAME} ${hip_library_DEPS}) + list(REMOVE_ITEM hip_library_DEPS ARCHIVE_START ARCHIVE_END) + else() + target_link_libraries(${TARGET_NAME} ${hip_library_DEPS}) endif() # cpplint code style foreach(source_file ${hip_library_SRCS}) - string(REGEX REPLACE "\\.[^.]*$" "" source ${source_file}) - if(EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${source}.h) - list(APPEND hip_library_HEADERS ${CMAKE_CURRENT_SOURCE_DIR}/${source}.h) - endif() + string(REGEX REPLACE "\\.[^.]*$" "" source ${source_file}) + if(EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${source}.h) + list(APPEND hip_library_HEADERS ${CMAKE_CURRENT_SOURCE_DIR}/${source}.h) + endif() endforeach() else(hip_library_SRCS) if (hip_library_DEPS) - merge_static_libs(${TARGET_NAME} ${hip_library_DEPS}) + merge_static_libs(${TARGET_NAME} ${hip_library_DEPS}) else() - message(FATAL "Please specify source file or library in nv_library.") + message(FATAL "Please specify source file or library in nv_library.") endif() endif(hip_library_SRCS) endif() @@ -435,6 +522,7 @@ function(hip_binary TARGET_NAME) if(hip_binary_DEPS) target_link_libraries(${TARGET_NAME} ${hip_binary_DEPS}) add_dependencies(${TARGET_NAME} ${hip_binary_DEPS}) + common_link(${TARGET_NAME}) endif() endif() endfunction(hip_binary) @@ -452,8 +540,10 @@ function(hip_test TARGET_NAME) endif() add_executable(${TARGET_NAME} ${_cmake_options} ${_generated_files} ${_sources}) set_target_properties(${TARGET_NAME} PROPERTIES LINKER_LANGUAGE HIP) - target_link_libraries(${TARGET_NAME} ${hip_test_DEPS} paddle_gtest_main memory gtest gflags) + get_property(os_dependency_modules GLOBAL PROPERTY OS_DEPENDENCY_MODULES) + target_link_libraries(${TARGET_NAME} ${hip_test_DEPS} paddle_gtest_main memory gtest gflags ${os_dependency_modules}) add_dependencies(${TARGET_NAME} ${hip_test_DEPS} paddle_gtest_main memory gtest gflags) + common_link(${TARGET_NAME}) add_test(${TARGET_NAME} ${TARGET_NAME}) endif() endfunction(hip_test) @@ -496,6 +586,7 @@ function(go_library TARGET_NAME) endif() if(go_library_DEPS) add_dependencies(${TARGET_NAME} ${go_library_DEPS}) + common_link(${TARGET_NAME}) endif(go_library_DEPS) # The "source file" of the library is `${dummyfile}` which never @@ -566,12 +657,6 @@ function(paddle_protobuf_generate_cpp SRCS HDRS) set(${SRCS}) set(${HDRS}) - if (MOBILE_INFERENCE) - set(EXTRA_FLAG "lite:") - else() - set(EXTRA_FLAG "") - endif() - foreach(FIL ${ARGN}) get_filename_component(ABS_FIL ${FIL} ABSOLUTE) get_filename_component(FIL_WE ${FIL} NAME_WE) @@ -588,7 +673,7 @@ function(paddle_protobuf_generate_cpp SRCS HDRS) COMMAND ${CMAKE_COMMAND} -E make_directory "${CMAKE_CURRENT_BINARY_DIR}" COMMAND ${PROTOBUF_PROTOC_EXECUTABLE} -I${CMAKE_CURRENT_SOURCE_DIR} - --cpp_out "${EXTRA_FLAG}${CMAKE_CURRENT_BINARY_DIR}" ${ABS_FIL} + --cpp_out "${CMAKE_CURRENT_BINARY_DIR}" ${ABS_FIL} DEPENDS ${ABS_FIL} protoc COMMENT "Running C++ protocol buffer compiler on ${FIL}" VERBATIM ) @@ -625,9 +710,10 @@ function(py_test TARGET_NAME) set(oneValueArgs "") set(multiValueArgs SRCS DEPS ARGS ENVS) cmake_parse_arguments(py_test "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) + add_test(NAME ${TARGET_NAME} - COMMAND env FLAGS_init_allocated_mem=true FLAGS_cudnn_deterministic=true - FLAGS_cpu_deterministic=true + COMMAND ${CMAKE_COMMAND} -E env FLAGS_init_allocated_mem=true FLAGS_cudnn_deterministic=true + FLAGS_cpu_deterministic=true FLAGS_limit_of_tmp_allocation=4294967296 # 4G PYTHONPATH=${PADDLE_BINARY_DIR}/python ${py_test_ENVS} ${PYTHON_EXECUTABLE} -u ${py_test_SRCS} ${py_test_ARGS} WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}) @@ -659,7 +745,7 @@ function(grpc_library TARGET_NAME) #FIXME(putcn): the follwoing line is supposed to generate *.pb.h and cc, but # somehow it didn't. line 602 to 604 is to patching this. Leaving this here # for now to enable dist CI. - protobuf_generate_cpp(grpc_proto_srcs grpc_proto_hdrs "${ABS_PROTO}") + paddle_protobuf_generate_cpp(grpc_proto_srcs grpc_proto_hdrs "${ABS_PROTO}") set(grpc_grpc_srcs "${CMAKE_CURRENT_BINARY_DIR}/${PROTO_WE}.grpc.pb.cc") set(grpc_grpc_hdrs "${CMAKE_CURRENT_BINARY_DIR}/${PROTO_WE}.grpc.pb.h") cc_library("${TARGET_NAME}_proto" SRCS "${grpc_proto_srcs}") @@ -702,7 +788,7 @@ function(brpc_library TARGET_NAME) get_filename_component(PROTO_WE ${brpc_library_PROTO} NAME_WE) get_filename_component(PROTO_PATH ${ABS_PROTO} PATH) - protobuf_generate_cpp(brpc_proto_srcs brpc_proto_hdrs "${ABS_PROTO}") + paddle_protobuf_generate_cpp(brpc_proto_srcs brpc_proto_hdrs "${ABS_PROTO}") cc_library("${TARGET_NAME}_proto" SRCS "${brpc_proto_srcs}") cc_library("${TARGET_NAME}" SRCS "${brpc_library_SRCS}" DEPS "${TARGET_NAME}_proto" "${brpc_library_DEPS}") endfunction() diff --git a/cmake/hip.cmake b/cmake/hip.cmake index bfe491bd6b7602959d3dd60bd06c67993593cc9b..c3a748db502037f926dc241e4c3bc26a83ad3468 100644 --- a/cmake/hip.cmake +++ b/cmake/hip.cmake @@ -3,28 +3,38 @@ if(NOT WITH_AMD_GPU) endif() include_directories("/opt/rocm/include") +include_directories("/opt/rocm/hip/include") +include_directories("/opt/rocm/miopen/include") include_directories("/opt/rocm/hipblas/include") include_directories("/opt/rocm/hiprand/include") include_directories("/opt/rocm/rocrand/include") include_directories("/opt/rocm/rccl/include") include_directories("/opt/rocm/thrust") -list(APPEND EXTERNAL_LIBS "-L/opt/rocm/lib/ -lhip_hcc") - -set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -fPIC -DPADDLE_WITH_HIP -std=c++14" ) +set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -fPIC -DPADDLE_WITH_HIP -std=c++11" ) if(WITH_DSO) set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DPADDLE_USE_DSO") endif(WITH_DSO) -if(WITH_DOUBLE) - set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DPADDLE_TYPE_DOUBLE") -endif(WITH_DOUBLE) - if(WITH_TESTING) set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DPADDLE_WITH_TESTING") endif(WITH_TESTING) +if(WITH_DISTRIBUTE) + set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DPADDLE_WITH_DISTRIBUTE") +endif(WITH_DISTRIBUTE) + +if(WITH_GRPC) + set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DPADDLE_WITH_GRPC") +endif(WITH_GRPC) + +if(WITH_MKLDNN) + set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DPADDLE_WITH_MKLDNN") +endif(WITH_MKLDNN) + +set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DANY_IMPL_ANY_CAST_MOVEABLE") + if(CMAKE_BUILD_TYPE STREQUAL "Debug") list(APPEND HIP_HCC_FLAGS ${CMAKE_CXX_FLAGS_DEBUG}) elseif(CMAKE_BUILD_TYPE STREQUAL "RelWithDebInfo") diff --git a/cmake/inference_lib.cmake b/cmake/inference_lib.cmake index 67cca09b64c1ed7a503a886e78347d786eae0de7..b7c32f80db0dcb826f3f67ffb55da1c715785add 100644 --- a/cmake/inference_lib.cmake +++ b/cmake/inference_lib.cmake @@ -22,183 +22,247 @@ function(copy TARGET) list(LENGTH copy_lib_SRCS copy_lib_SRCS_len) list(LENGTH copy_lib_DSTS copy_lib_DSTS_len) - if(NOT ${copy_lib_SRCS_len} EQUAL ${copy_lib_DSTS_len}) + if (NOT ${copy_lib_SRCS_len} EQUAL ${copy_lib_DSTS_len}) message(FATAL_ERROR "${TARGET} source numbers are not equal to destination numbers") - endif() + endif () math(EXPR len "${copy_lib_SRCS_len} - 1") add_custom_target(${TARGET} DEPENDS ${copy_lib_DEPS}) - foreach(index RANGE ${len}) + foreach (index RANGE ${len}) list(GET copy_lib_SRCS ${index} src) list(GET copy_lib_DSTS ${index} dst) - add_custom_command(TARGET ${TARGET} PRE_BUILD - COMMAND mkdir -p "${dst}" - COMMAND cp -r "${src}" "${dst}" - COMMENT "copying ${src} -> ${dst}") - endforeach() + if (WIN32) + if(IS_DIRECTORY ${src}) + get_filename_component(last_path ${src} NAME) + string(APPEND dst "/" ${last_path}) + add_custom_command(TARGET ${TARGET} PRE_BUILD + COMMAND ${CMAKE_COMMAND} -E make_directory "${dst}" + ) + if(EXISTS ${src}) + add_custom_command(TARGET ${TARGET} PRE_BUILD + COMMAND cmake -E copy_directory "${src}" "${dst}" + COMMENT "copying ${src} -> ${dst}") + else() + message(WARNING "${src} not exist!") + endif() + else() + # windows cmd shell will not expand wildcard automatically. + # below expand the files, and copy them by rules. + file(GLOB src_files ${src}) + if (NOT "${src_files}" STREQUAL "") + list(REMOVE_DUPLICATES src_files) + endif () + add_custom_command(TARGET ${TARGET} PRE_BUILD + COMMAND ${CMAKE_COMMAND} -E make_directory "${dst}" + ) + foreach (src_file ${src_files}) + add_custom_command(TARGET ${TARGET} PRE_BUILD + COMMAND ${CMAKE_COMMAND} -E copy "${src_file}" "${dst}" + COMMENT "copying ${src_file} -> ${dst}") + endforeach () + endif() + else (WIN32) # not windows + add_custom_command(TARGET ${TARGET} PRE_BUILD + COMMAND mkdir -p "${dst}" + COMMAND cp -r "${src}" "${dst}" + COMMENT "copying ${src} -> ${dst}") + endif (WIN32) # not windows + endforeach () endfunction() # third party set(dst_dir "${FLUID_INSTALL_DIR}/third_party/eigen3") copy(eigen3_lib - SRCS ${EIGEN_INCLUDE_DIR}/Eigen/Core ${EIGEN_INCLUDE_DIR}/Eigen/src ${EIGEN_INCLUDE_DIR}/unsupported/Eigen - DSTS ${dst_dir}/Eigen ${dst_dir}/Eigen ${dst_dir}/unsupported - DEPS eigen3 -) + SRCS ${EIGEN_INCLUDE_DIR}/Eigen/Core ${EIGEN_INCLUDE_DIR}/Eigen/src ${EIGEN_INCLUDE_DIR}/unsupported/Eigen + DSTS ${dst_dir}/Eigen ${dst_dir}/Eigen ${dst_dir}/unsupported + DEPS eigen3 + ) set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/gflags") copy(gflags_lib - SRCS ${GFLAGS_INCLUDE_DIR} ${GFLAGS_LIBRARIES} - DSTS ${dst_dir} ${dst_dir}/lib - DEPS gflags -) + SRCS ${GFLAGS_INCLUDE_DIR} ${GFLAGS_LIBRARIES} + DSTS ${dst_dir} ${dst_dir}/lib + DEPS gflags + ) set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/glog") copy(glog_lib - SRCS ${GLOG_INCLUDE_DIR} ${GLOG_LIBRARIES} - DSTS ${dst_dir} ${dst_dir}/lib - DEPS glog -) + SRCS ${GLOG_INCLUDE_DIR} ${GLOG_LIBRARIES} + DSTS ${dst_dir} ${dst_dir}/lib + DEPS glog + ) set(dst_dir "${FLUID_INSTALL_DIR}/third_party/boost/") copy(boost_lib - SRCS ${BOOST_INCLUDE_DIR}/boost - DSTS ${dst_dir} - DEPS boost -) + SRCS ${BOOST_INCLUDE_DIR}/boost + DSTS ${dst_dir} + DEPS boost + ) -if(NOT PROTOBUF_FOUND) +set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/xxhash") +copy(xxhash_lib + SRCS ${XXHASH_INCLUDE_DIR} ${XXHASH_LIBRARIES} + DSTS ${dst_dir} ${dst_dir}/lib + DEPS xxhash + ) + +if (NOT PROTOBUF_FOUND OR WIN32) set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/protobuf") copy(protobuf_lib - SRCS ${PROTOBUF_INCLUDE_DIR} ${PROTOBUF_LIBRARY} - DSTS ${dst_dir} ${dst_dir}/lib - DEPS extern_protobuf - ) -endif() + SRCS ${PROTOBUF_INCLUDE_DIR} ${PROTOBUF_LIBRARY} + DSTS ${dst_dir} ${dst_dir}/lib + DEPS extern_protobuf + ) +endif () -if(NOT CBLAS_FOUND) - set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/openblas") - copy(openblas_lib - SRCS ${CBLAS_INSTALL_DIR}/lib ${CBLAS_INSTALL_DIR}/include - DSTS ${dst_dir} ${dst_dir} - DEPS extern_openblas - ) -elseif (WITH_MKLML) +if (WITH_MKLML) set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/mklml") copy(mklml_lib - SRCS ${MKLML_LIB} ${MKLML_IOMP_LIB} ${MKLML_INC_DIR} - DSTS ${dst_dir}/lib ${dst_dir}/lib ${dst_dir} - DEPS mklml - ) -endif() + SRCS ${MKLML_LIB} ${MKLML_IOMP_LIB} ${MKLML_INC_DIR} + DSTS ${dst_dir}/lib ${dst_dir}/lib ${dst_dir} + DEPS mklml + ) +elseif (NOT CBLAS_FOUND OR WIN32) + set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/openblas") + copy(openblas_lib + SRCS ${CBLAS_INSTALL_DIR}/lib ${CBLAS_INSTALL_DIR}/include + DSTS ${dst_dir} ${dst_dir} + DEPS extern_openblas + ) +endif () -if(WITH_MKLDNN) - set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/mkldnn") - copy(mkldnn_lib - SRCS ${MKLDNN_INC_DIR} ${MKLDNN_SHARED_LIB} - DSTS ${dst_dir} ${dst_dir}/lib - DEPS mkldnn - ) +if (WITH_GPU AND NOT WIN32) + set(dgc_dir "${FLUID_INSTALL_DIR}/third_party/install/dgc") + copy(dgc_lib + SRCS ${DGC_INSTALL_DIR}/lib ${DGC_INSTALL_DIR}/include + DSTS ${dgc_dir} ${dgc_dir} + DEPS dgc) endif() -if (NOT WIN32) -if(NOT MOBILE_INFERENCE AND NOT RPI) - set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/snappy") - copy(snappy_lib - SRCS ${SNAPPY_INCLUDE_DIR} ${SNAPPY_LIBRARIES} - DSTS ${dst_dir} ${dst_dir}/lib - DEPS snappy) - - set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/snappystream") - copy(snappystream_lib - SRCS ${SNAPPYSTREAM_INCLUDE_DIR} ${SNAPPYSTREAM_LIBRARIES} - DSTS ${dst_dir} ${dst_dir}/lib - DEPS snappystream) - - set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/zlib") - copy(zlib_lib - SRCS ${ZLIB_INCLUDE_DIR} ${ZLIB_LIBRARIES} - DSTS ${dst_dir} ${dst_dir}/lib - DEPS zlib) -endif() -endif(NOT WIN32) + +if (WITH_MKLDNN) + set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/mkldnn") + copy(mkldnn_lib + SRCS ${MKLDNN_INC_DIR} ${MKLDNN_SHARED_LIB} + DSTS ${dst_dir} ${dst_dir}/lib + DEPS mkldnn_shared_lib + ) +endif () + +if (WITH_NGRAPH) + set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/ngraph") + copy(ngraph_lib + SRCS ${NGRAPH_INC_DIR} ${NGRAPH_LIB_DIR} + DSTS ${dst_dir} ${dst_dir} + DEPS ngraph + ) +endif () + +set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/snappy") +copy(snappy_lib + SRCS ${SNAPPY_INCLUDE_DIR} ${SNAPPY_LIBRARIES} + DSTS ${dst_dir} ${dst_dir}/lib + DEPS snappy) + +set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/snappystream") +copy(snappystream_lib + SRCS ${SNAPPYSTREAM_INCLUDE_DIR} ${SNAPPYSTREAM_LIBRARIES} + DSTS ${dst_dir} ${dst_dir}/lib + DEPS snappystream) + +set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/zlib") +copy(zlib_lib + SRCS ${ZLIB_INCLUDE_DIR} ${ZLIB_LIBRARIES} + DSTS ${dst_dir} ${dst_dir}/lib + DEPS zlib) # paddle fluid module set(src_dir "${PADDLE_SOURCE_DIR}/paddle/fluid") set(dst_dir "${FLUID_INSTALL_DIR}/paddle/fluid") set(module "framework") if (NOT WIN32) -set(framework_lib_deps framework_py_proto) -endif(NOT WIN32) + set(framework_lib_deps framework_py_proto) +endif (NOT WIN32) copy(framework_lib DEPS ${framework_lib_deps} - SRCS ${src_dir}/${module}/*.h ${src_dir}/${module}/details/*.h ${PADDLE_BINARY_DIR}/paddle/fluid/framework/framework.pb.h - ${src_dir}/${module}/ir/*.h - DSTS ${dst_dir}/${module} ${dst_dir}/${module}/details ${dst_dir}/${module} ${dst_dir}/${module}/ir -) + SRCS ${src_dir}/${module}/*.h ${src_dir}/${module}/details/*.h ${PADDLE_BINARY_DIR}/paddle/fluid/framework/framework.pb.h + ${src_dir}/${module}/ir/*.h + DSTS ${dst_dir}/${module} ${dst_dir}/${module}/details ${dst_dir}/${module} ${dst_dir}/${module}/ir + ) set(module "memory") copy(memory_lib - SRCS ${src_dir}/${module}/*.h ${src_dir}/${module}/detail/*.h - DSTS ${dst_dir}/${module} ${dst_dir}/${module}/detail -) + SRCS ${src_dir}/${module}/*.h ${src_dir}/${module}/detail/*.h ${src_dir}/${module}/allocation/*.h + DSTS ${dst_dir}/${module} ${dst_dir}/${module}/detail ${dst_dir}/${module}/allocation + ) set(inference_deps paddle_fluid_shared paddle_fluid) set(module "inference/api") if (WITH_ANAKIN AND WITH_MKL) copy(anakin_inference_lib DEPS paddle_inference_api inference_anakin_api - SRCS - ${PADDLE_BINARY_DIR}/paddle/fluid/inference/api/libinference_anakin_api* # compiled anakin api - ${ANAKIN_INSTALL_DIR} # anakin release - DSTS ${FLUID_INSTALL_DIR}/third_party/install/anakin ${FLUID_INSTALL_DIR}/third_party/install/anakin) - list(APPEND inference_deps anakin_inference_lib) -endif() + SRCS + ${PADDLE_BINARY_DIR}/paddle/fluid/inference/api/libinference_anakin_api* # compiled anakin api + ${ANAKIN_INSTALL_DIR} # anakin release + DSTS ${FLUID_INSTALL_DIR}/third_party/install/anakin ${FLUID_INSTALL_DIR}/third_party/install/anakin) + list(APPEND inference_deps anakin_inference_lib) +endif () + +if (TENSORRT_FOUND) + copy(tensorrt_lib DEPS ${inference_deps} + SRCS ${TENSORRT_ROOT}/include/Nv*.h ${TENSORRT_ROOT}/lib/libnvinfer* + DSTS ${FLUID_INSTALL_DIR}/third_party/install/tensorrt/include ${FLUID_INSTALL_DIR}/third_party/install/tensorrt/lib) +endif () + set(module "inference") +if(WIN32) + set(paddle_fluid_lib ${PADDLE_BINARY_DIR}/paddle/fluid/inference/${CMAKE_BUILD_TYPE}/libpaddle_fluid.*) +else(WIN32) + set(paddle_fluid_lib ${PADDLE_BINARY_DIR}/paddle/fluid/inference/libpaddle_fluid.*) +endif(WIN32) copy(inference_lib DEPS ${inference_deps} - SRCS ${src_dir}/${module}/*.h ${PADDLE_BINARY_DIR}/paddle/fluid/inference/libpaddle_fluid.* - ${src_dir}/${module}/api/paddle_inference_api.h - ${PADDLE_BINARY_DIR}/paddle/fluid/inference/api/paddle_inference_pass.h - DSTS ${dst_dir}/${module} ${dst_dir}/${module} ${dst_dir}/${module} ${dst_dir}/${module} -) + SRCS ${src_dir}/${module}/*.h ${paddle_fluid_lib} + ${src_dir}/${module}/api/paddle_*.h + DSTS ${dst_dir}/${module} ${dst_dir}/${module} ${dst_dir}/${module} + ) set(module "platform") copy(platform_lib DEPS profiler_py_proto - SRCS ${src_dir}/${module}/*.h ${src_dir}/${module}/dynload/*.h ${src_dir}/${module}/details/*.h - DSTS ${dst_dir}/${module} ${dst_dir}/${module}/dynload ${dst_dir}/${module}/details -) + SRCS ${src_dir}/${module}/*.h ${src_dir}/${module}/dynload/*.h ${src_dir}/${module}/details/*.h + DSTS ${dst_dir}/${module} ${dst_dir}/${module}/dynload ${dst_dir}/${module}/details + ) set(module "string") copy(string_lib - SRCS ${src_dir}/${module}/*.h ${src_dir}/${module}/tinyformat/*.h - DSTS ${dst_dir}/${module} ${dst_dir}/${module}/tinyformat -) + SRCS ${src_dir}/${module}/*.h ${src_dir}/${module}/tinyformat/*.h + DSTS ${dst_dir}/${module} ${dst_dir}/${module}/tinyformat + ) set(module "pybind") copy(pybind_lib - SRCS ${CMAKE_CURRENT_BINARY_DIR}/paddle/fluid/${module}/pybind.h - DSTS ${dst_dir}/${module} -) + SRCS ${CMAKE_CURRENT_BINARY_DIR}/paddle/fluid/${module}/pybind.h + DSTS ${dst_dir}/${module} + ) # CMakeCache Info copy(cmake_cache - SRCS ${CMAKE_CURRENT_BINARY_DIR}/CMakeCache.txt - DSTS ${FLUID_INSTALL_DIR}) + SRCS ${CMAKE_CURRENT_BINARY_DIR}/CMakeCache.txt + DSTS ${FLUID_INSTALL_DIR}) # This command generates a complete fluid library for both train and inference -add_custom_target(fluid_lib_dist DEPENDS ${fluid_lib_dist_dep}) +add_custom_target(fluid_lib_dist DEPENDS ${fluid_lib_dist_dep}) # Following commands generate a inference-only fluid library # third_party, version.txt and CMakeCache.txt are the same position with ${FLUID_INSTALL_DIR} copy(third_party DEPS fluid_lib_dist - SRCS ${FLUID_INSTALL_DIR}/third_party ${FLUID_INSTALL_DIR}/CMakeCache.txt - DSTS ${FLUID_INFERENCE_INSTALL_DIR} ${FLUID_INFERENCE_INSTALL_DIR} -) + SRCS ${FLUID_INSTALL_DIR}/third_party ${FLUID_INSTALL_DIR}/CMakeCache.txt + DSTS ${FLUID_INFERENCE_INSTALL_DIR} ${FLUID_INFERENCE_INSTALL_DIR} + ) -# only need libpaddle_fluid.so/a and paddle_inference_api.h for inference-only library +# only need libpaddle_fluid.so/a and paddle_*.h for inference-only library copy(inference_api_lib DEPS fluid_lib_dist - SRCS ${FLUID_INSTALL_DIR}/paddle/fluid/inference/libpaddle_fluid.* - ${FLUID_INSTALL_DIR}/paddle/fluid/inference/paddle_inference_api.h + SRCS ${paddle_fluid_lib} + ${FLUID_INSTALL_DIR}/paddle/fluid/inference/paddle_*.h DSTS ${FLUID_INFERENCE_INSTALL_DIR}/paddle/lib ${FLUID_INFERENCE_INSTALL_DIR}/paddle/include ) @@ -206,20 +270,20 @@ add_custom_target(inference_lib_dist DEPENDS third_party inference_api_lib) # paddle fluid version function(version version_file) - execute_process( - COMMAND ${GIT_EXECUTABLE} log --pretty=format:%H -1 - WORKING_DIRECTORY ${PADDLE_SOURCE_DIR} - OUTPUT_VARIABLE PADDLE_GIT_COMMIT) - file(WRITE ${version_file} - "GIT COMMIT ID: ${PADDLE_GIT_COMMIT}\n" - "WITH_MKL: ${WITH_MKL}\n" - "WITH_MKLDNN: ${WITH_MKLDNN}\n" - "WITH_GPU: ${WITH_GPU}\n") - if(WITH_GPU) - file(APPEND ${version_file} - "CUDA version: ${CUDA_VERSION}\n" - "CUDNN version: v${CUDNN_MAJOR_VERSION}\n") - endif() + execute_process( + COMMAND ${GIT_EXECUTABLE} log --pretty=format:%H -1 + WORKING_DIRECTORY ${PADDLE_SOURCE_DIR} + OUTPUT_VARIABLE PADDLE_GIT_COMMIT) + file(WRITE ${version_file} + "GIT COMMIT ID: ${PADDLE_GIT_COMMIT}\n" + "WITH_MKL: ${WITH_MKL}\n" + "WITH_MKLDNN: ${WITH_MKLDNN}\n" + "WITH_GPU: ${WITH_GPU}\n") + if (WITH_GPU) + file(APPEND ${version_file} + "CUDA version: ${CUDA_VERSION}\n" + "CUDNN version: v${CUDNN_MAJOR_VERSION}\n") + endif () endfunction() version(${FLUID_INSTALL_DIR}/version.txt) version(${FLUID_INFERENCE_INSTALL_DIR}/version.txt) diff --git a/cmake/operators.cmake b/cmake/operators.cmake new file mode 100644 index 0000000000000000000000000000000000000000..c17e718f4279f24c85db8be1177e5b5e82b13e08 --- /dev/null +++ b/cmake/operators.cmake @@ -0,0 +1,227 @@ +set(PART_CUDA_KERNEL_FILES) +function(op_library TARGET) + # op_library is a function to create op library. The interface is same as + # cc_library. But it handle split GPU/CPU code and link some common library + # for ops. + set(cc_srcs) + set(cu_srcs) + set(hip_cu_srcs) + set(miopen_hip_cc_srcs) + set(cu_cc_srcs) + set(cudnn_cu_cc_srcs) + set(CUDNN_FILE) + set(mkldnn_cc_srcs) + set(MKLDNN_FILE) + set(op_common_deps operator op_registry math_function) + set(options "") + set(oneValueArgs "") + set(multiValueArgs SRCS DEPS) + set(pybind_flag 0) + cmake_parse_arguments(op_library "${options}" "${oneValueArgs}" + "${multiValueArgs}" ${ARGN}) + + list(LENGTH op_library_SRCS op_library_SRCS_len) + if (${op_library_SRCS_len} EQUAL 0) + if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${TARGET}.cc) + list(APPEND cc_srcs ${TARGET}.cc) + endif() + if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${TARGET}.cu.cc) + list(APPEND cu_cc_srcs ${TARGET}.cu.cc) + endif() + if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${TARGET}.cu) + list(APPEND cu_srcs ${TARGET}.cu) + endif() + if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${TARGET}.part.cu) + set(PART_CUDA_KERNEL_FILES ${CMAKE_CURRENT_SOURCE_DIR}/${TARGET}.part.cu + ${PART_CUDA_KERNEL_FILES} PARENT_SCOPE) + list(APPEND cu_srcs ${CMAKE_CURRENT_SOURCE_DIR}/${TARGET}.part.cu) + endif() + + if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${TARGET}.hip.cu) + list(APPEND hip_cu_srcs ${TARGET}.hip.cu) + endif() + string(REPLACE "_op" "_cudnn_op" CUDNN_FILE "${TARGET}") + if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${CUDNN_FILE}.cu.cc) + list(APPEND cudnn_cu_cc_srcs ${CUDNN_FILE}.cu.cc) + endif() + if(WITH_AMD_GPU) + string(REPLACE "_op" "_miopen_op" MIOPEN_FILE "${TARGET}") + if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${MIOPEN_FILE}.hip.cc) + list(APPEND miopen_hip_cc_srcs ${MIOPEN_FILE}.hip.cc) + endif() + endif() + if(WITH_MKLDNN) + string(REPLACE "_op" "_mkldnn_op" MKLDNN_FILE "${TARGET}") + if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/mkldnn/${MKLDNN_FILE}.cc) + list(APPEND mkldnn_cc_srcs mkldnn/${MKLDNN_FILE}.cc) + endif() + endif() + else() + foreach(src ${op_library_SRCS}) + if (${src} MATCHES ".*\\.hip.cu$") + list(APPEND hip_cu_srcs ${src}) + elseif (${src} MATCHES ".*\\.cu$") + list(APPEND cu_srcs ${src}) + elseif(${src} MATCHES ".*_cudnn_op.cu.cc$") + list(APPEND cudnn_cu_cc_srcs ${src}) + elseif(WITH_AMD_GPU AND ${src} MATCHES ".*_miopen_op.hip.cc$") + list(APPEND miopen_hip_cc_srcs ${src}) + elseif(WITH_MKLDNN AND ${src} MATCHES ".*_mkldnn_op.cc$") + list(APPEND mkldnn_cc_srcs ${src}) + elseif(${src} MATCHES ".*\\.cu.cc$") + list(APPEND cu_cc_srcs ${src}) + elseif(${src} MATCHES ".*\\.cc$") + list(APPEND cc_srcs ${src}) + else() + message(FATAL_ERROR "${TARGET} Source file ${src} should only be .cc or .cu") + endif() + endforeach() + endif() + + list(LENGTH cc_srcs cc_srcs_len) + if (${cc_srcs_len} EQUAL 0) + message(FATAL_ERROR "The op library ${TARGET} should contains at least one .cc file") + endif() + if (WIN32) + # remove windows unsupported op, because windows has no nccl, no warpctc such ops. + foreach(windows_unsupport_op "nccl_op" "gen_nccl_id_op") + if ("${TARGET}" STREQUAL "${windows_unsupport_op}") + return() + endif() + endforeach() + endif(WIN32) + set(OP_LIBRARY ${TARGET} ${OP_LIBRARY} CACHE INTERNAL "op libs") + + list(LENGTH op_library_DEPS op_library_DEPS_len) + if (${op_library_DEPS_len} GREATER 0) + set(DEPS_OPS ${TARGET} ${DEPS_OPS} PARENT_SCOPE) + endif() + if (WITH_GPU) + nv_library(${TARGET} SRCS ${cc_srcs} ${cu_cc_srcs} ${cudnn_cu_cc_srcs} ${mkldnn_cc_srcs} ${cu_srcs} DEPS ${op_library_DEPS} + ${op_common_deps}) + elseif (WITH_AMD_GPU) + hip_library(${TARGET} SRCS ${cc_srcs} ${hip_cu_srcs} ${miopen_hip_cc_srcs} ${mkldnn_cc_srcs} DEPS ${op_library_DEPS} + ${op_common_deps}) + else() + cc_library(${TARGET} SRCS ${cc_srcs} ${mkldnn_cc_srcs} DEPS ${op_library_DEPS} + ${op_common_deps}) + endif() + + # Define operators that don't need pybind here. + foreach(manual_pybind_op "compare_op" "logical_op" "nccl_op" +"tensor_array_read_write_op" "tensorrt_engine_op" "conv_fusion_op" +"fusion_transpose_flatten_concat_op" "fusion_conv_inception_op" "sync_batch_norm_op" "dgc_op") + if ("${TARGET}" STREQUAL "${manual_pybind_op}") + set(pybind_flag 1) + endif() + endforeach() + + # The registration of USE_OP, please refer to paddle/fluid/framework/op_registry.h. + # Note that it's enough to just adding one operator to pybind in a *_op.cc file. + # And for detail pybind information, please see generated paddle/pybind/pybind.h. + file(READ ${TARGET}.cc TARGET_CONTENT) + string(REGEX MATCH "REGISTER_OPERATOR\\(.*REGISTER_OPERATOR\\(" multi_register "${TARGET_CONTENT}") + string(REGEX MATCH "REGISTER_OPERATOR\\([a-z0-9_]*," one_register "${multi_register}") + if (one_register STREQUAL "") + string(REPLACE "_op" "" TARGET "${TARGET}") + else () + string(REPLACE "REGISTER_OPERATOR(" "" TARGET "${one_register}") + string(REPLACE "," "" TARGET "${TARGET}") + endif() + + # pybind USE_NO_KERNEL_OP + # HACK: if REGISTER_OP_CPU_KERNEL presents the operator must have kernel + string(REGEX MATCH "REGISTER_OP_CPU_KERNEL" regex_result "${TARGET_CONTENT}") + string(REPLACE "_op" "" TARGET "${TARGET}") + if (${pybind_flag} EQUAL 0 AND regex_result STREQUAL "") + file(APPEND ${pybind_file} "USE_NO_KERNEL_OP(${TARGET});\n") + set(pybind_flag 1) + endif() + + # pybind USE_CPU_ONLY_OP + list(LENGTH cu_srcs cu_srcs_len) + list(LENGTH cu_cc_srcs cu_cc_srcs_len) + list(LENGTH mkldnn_cc_srcs mkldnn_cc_srcs_len) + list(LENGTH hip_cu_srcs hip_cu_srcs_len) + list(LENGTH miopen_hip_cc_srcs miopen_hip_cc_srcs_len) + if (${pybind_flag} EQUAL 0 AND ${mkldnn_cc_srcs_len} EQUAL 0 AND ${cu_srcs_len} EQUAL 0 AND ${cu_cc_srcs_len} EQUAL 0 AND + ${hip_cu_srcs_len} EQUAL 0 AND ${miopen_hip_cc_srcs_len} EQUAL 0) + file(APPEND ${pybind_file} "USE_CPU_ONLY_OP(${TARGET});\n") + set(pybind_flag 1) + endif() + + # pybind USE_OP_DEVICE_KERNEL for CUDNN + list(LENGTH cudnn_cu_cc_srcs cudnn_cu_cc_srcs_len) + if (WITH_GPU AND ${cudnn_cu_cc_srcs_len} GREATER 0) + if(${TARGET} STREQUAL "activation") + file(APPEND ${pybind_file} "USE_OP_DEVICE_KERNEL(relu, CUDNN);\n") + else() + file(APPEND ${pybind_file} "USE_OP_DEVICE_KERNEL(${TARGET}, CUDNN);\n") + endif() + endif() + + # pybind USE_OP_DEVICE_KERNEL for MIOPEN + if (WITH_AMD_GPU AND ${miopen_hip_cc_srcs_len} GREATER 0) + file(APPEND ${pybind_file} "USE_OP_DEVICE_KERNEL(${TARGET}, MIOPEN);\n") + endif() + + # pybind USE_OP_DEVICE_KERNEL for MKLDNN + if (WITH_MKLDNN AND ${mkldnn_cc_srcs_len} GREATER 0) + # Append first implemented MKLDNN activation operator + if (${MKLDNN_FILE} STREQUAL "activation_mkldnn_op") + file(APPEND ${pybind_file} "USE_OP_DEVICE_KERNEL(relu, MKLDNN);\n") + elseif(${MKLDNN_FILE} STREQUAL "conv_mkldnn_op") + file(APPEND ${pybind_file} "USE_OP_DEVICE_KERNEL_WITH_CUSTOM_TYPE(conv2d, MKLDNN, FP32);\n") + file(APPEND ${pybind_file} "USE_OP_DEVICE_KERNEL_WITH_CUSTOM_TYPE(conv2d, MKLDNN, S8);\n") + file(APPEND ${pybind_file} "USE_OP_DEVICE_KERNEL_WITH_CUSTOM_TYPE(conv2d, MKLDNN, U8);\n") + + else() + file(APPEND ${pybind_file} "USE_OP_DEVICE_KERNEL(${TARGET}, MKLDNN);\n") + endif() + endif() + + # pybind USE_OP + if (${pybind_flag} EQUAL 0) + # NOTE(*): activation use macro to regist the kernels, set use_op manually. + if(${TARGET} STREQUAL "activation") + file(APPEND ${pybind_file} "USE_OP(relu);\n") + elseif(${TARGET} STREQUAL "fake_dequantize") + file(APPEND ${pybind_file} "USE_OP(fake_dequantize_max_abs);\n") + elseif(${TARGET} STREQUAL "fake_quantize") + file(APPEND ${pybind_file} "USE_OP(fake_quantize_abs_max);\n") + elseif(${TARGET} STREQUAL "tensorrt_engine_op") + message(STATUS "Pybind skips [tensorrt_engine_op], for this OP is only used in inference") + elseif(${TARGET} STREQUAL "fc") + # HACK: fc only have mkldnn and cpu, which would mismatch the cpu only condition + file(APPEND ${pybind_file} "USE_CPU_ONLY_OP(${TARGET});\n") + else() + file(APPEND ${pybind_file} "USE_OP(${TARGET});\n") + endif() + endif() +endfunction() + + +function(register_operators) + set(options "") + set(oneValueArgs "") + set(multiValueArgs EXCLUDES DEPS) + cmake_parse_arguments(register_operators "${options}" "${oneValueArgs}" + "${multiValueArgs}" ${ARGN}) + + file(GLOB OPS RELATIVE "${CMAKE_CURRENT_SOURCE_DIR}" "*_op.cc") + string(REPLACE "_mkldnn" "" OPS "${OPS}") + string(REPLACE ".cc" "" OPS "${OPS}") + list(REMOVE_DUPLICATES OPS) + list(LENGTH register_operators_DEPS register_operators_DEPS_len) + + foreach(src ${OPS}) + list(FIND register_operators_EXCLUDES ${src} _index) + if (${_index} EQUAL -1) + if (${register_operators_DEPS_len} GREATER 0) + op_library(${src} DEPS ${register_operators_DEPS}) + else() + op_library(${src}) + endif() + endif() + endforeach() +endfunction() diff --git a/cmake/rdma.cmake b/cmake/rdma.cmake deleted file mode 100644 index b698f3bdc3ff586a72badee3e0109e29285b457f..0000000000000000000000000000000000000000 --- a/cmake/rdma.cmake +++ /dev/null @@ -1,82 +0,0 @@ -# user should download rdma first from subversion repository - -# execute following instruction to download svn mannally -# svn co https://svn.baidu.com/sys/ip/trunk/rdma/sockrdmav1 rdma/ -# svn co https://svn.baidu.com/sys/ip/trunk/rdma/thirdparty rdma/ -# we use static output in svn repositories to avoid implict bugs from not standard runtime env. - -if(WITH_RDMA) - set(RDMA_ROOT $ENV{RDMA_ROOT} CACHE PATH "Folder contains RDMA sock library and thirdparty library") - - function(generate_rdma_links) - #redirect to current DIR to isolate the pollution from system runtime environment - #it can benifits unified control for different gcc environment. - #e.g, by default gcc48 did not refer /usr/lib64 which could contain low version - #runtime libraries that will crash process while loading it. That redirect trick - #can fix it. - execute_process( - COMMAND mkdir -p librdma - COMMAND ln -s -f /usr/lib64/libibverbs.so.1.0.0 librdma/libibverbs.so.1 - COMMAND ln -s -f /usr/lib64/libibverbs.so.1.0.0 librdma/libibverbs.so - COMMAND ln -s -f /usr/lib64/librdmacm.so.1.0.0 librdma/librdmacm.so.1 - COMMAND ln -s -f /usr/lib64/librdmacm.so.1.0.0 librdma/librdmacm.so - COMMAND ln -s -f /lib64/libnl.so.1.1.4 librdma/libnl.so.1 - COMMAND ln -s -f /lib64/libnl.so.1.1.4 librdma/libnl.so - WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR} - ) - endfunction(generate_rdma_links) - - #check and set headers - find_path(RDMA_INC_SXISOCK sxi_sock.h PATHS ${RDMA_ROOT}/sockrdmav1/output/include) - find_path(RDMA_INC_XIO libxio.h PATHS ${RDMA_ROOT}/thirdparty/output/accelio) - find_path(RDMA_INC_EVENT event2 PATHS ${RDMA_ROOT}/thirdparty/output/libevent) - find_path(RDMA_INC_NUMA numa.h PATHS ${RDMA_ROOT}/thirdparty/output/libnuma) - - #check and set libs - find_library(RDMA_LIB_SXISOCK NAMES sxisock PATHS ${RDMA_ROOT}/sockrdmav1/output) - find_library(RDMA_LIB_XIO NAMES xio PATHS ${RDMA_ROOT}/thirdparty/output/accelio) - find_library(RDMA_LIB_EVENT NAMES event PATHS ${RDMA_ROOT}/thirdparty/output/libevent) - find_library(RDMA_LIB_EVENT_CORE NAMES event_core PATHS ${RDMA_ROOT}/thirdparty/output/libevent) - find_library(RDMA_LIB_EVENT_EXTRA NAMES event_extra PATHS ${RDMA_ROOT}/thirdparty/output/libevent) - find_library(RDMA_LIB_EVENT_PTHREADS NAMES event_pthreads PATHS ${RDMA_ROOT}/thirdparty/output/libevent) - find_library(RDMA_LIB_NUMA NAMES numa PATHS ${RDMA_ROOT}/thirdparty/output/libnuma) - - if( - RDMA_INC_SXISOCK AND - RDMA_INC_XIO AND - RDMA_INC_EVENT AND - RDMA_INC_NUMA AND - RDMA_LIB_SXISOCK AND - RDMA_LIB_XIO AND - RDMA_LIB_EVENT AND - RDMA_LIB_EVENT_CORE AND - RDMA_LIB_EVENT_EXTRA AND - RDMA_LIB_EVENT_PTHREADS AND - RDMA_LIB_NUMA - ) - - set(RDMA_INC_DIR - ${RDMA_INC_SXISOCK} - ${RDMA_INC_XIO} - ${RDMA_INC_EVENT} - ${RDMA_INC_NUMA}) - set(RDMA_LIBS - ${RDMA_LIB_SXISOCK} - ${RDMA_LIB_XIO} - ${RDMA_LIB_EVENT} - ${RDMA_LIB_EVENT_CORE} - ${RDMA_LIB_EVENT_EXTRA} - ${RDMA_LIB_EVENT_PTHREADS} - ${RDMA_LIB_NUMA} - ) - set(RDMA_LD_FLAGS "-L./librdma -libverbs -lrdmacm -Xlinker -rpath ./librdma") - include_directories("${RDMA_INC_DIR}") - else() - #if this module is not called, RDMA_INC_DIR RDMA_LIBS will be null, so top module always refer this variable - message(FATAL_ERROR, "RDMA libraries are not found, try to set RDMA_ROOT or check all related libraries.") - endif() -else(WITH_RDMA) - set(RDMA_LIBS "") - set(RDMA_LD_FLAGS "") - add_definitions(-DPADDLE_DISABLE_RDMA) -endif(WITH_RDMA) diff --git a/cmake/simd.cmake b/cmake/simd.cmake index 3eacf4d86aa0385eddb690d72e85e3384929bb99..566dc75fda019eb66759eb403f60e16f18cffef1 100644 --- a/cmake/simd.cmake +++ b/cmake/simd.cmake @@ -89,7 +89,9 @@ CHECK_CXX_SOURCE_RUNS(" #include int main() { - __m512i a = _mm512_undefined_epi32(); + __m512i a = _mm512_set_epi32 (-1, 2, -3, 4, -1, 2, -3, 4, + 13, -5, 6, -7, 9, 2, -6, 3); + __m512i result = _mm512_abs_epi32 (a); return 0; }" AVX512F_FOUND) diff --git a/cmake/system.cmake b/cmake/system.cmake index c91ef91127511da9ac8b9e11349f4a23aaedd613..65db05bebe957d740e391847d980e211b0e9e750 100644 --- a/cmake/system.cmake +++ b/cmake/system.cmake @@ -74,21 +74,6 @@ MARK_AS_ADVANCED(HOST_SYSTEM CPU_CORES) MESSAGE(STATUS "Found Paddle host system: ${HOST_SYSTEM}, version: ${HOST_SYSTEM_VERSION}") MESSAGE(STATUS "Found Paddle host system's CPU: ${CPU_CORES} cores") -# configuration for cross-compiling -IF(DEFINED CMAKE_SYSTEM_NAME) - INCLUDE(cross_compiling/host) - IF(${CMAKE_SYSTEM_NAME} STREQUAL "Android") - SET(ANDROID TRUE) - INCLUDE(cross_compiling/android) - ELSEIF(${CMAKE_SYSTEM_NAME} STREQUAL "RPi") - SET(RPI TRUE) - INCLUDE(cross_compiling/raspberry_pi) - ELSEIF(${CMAKE_SYSTEM_NAME} STREQUAL "iOS") - SET(IOS TRUE) - INCLUDE(cross_compiling/ios) - ENDIF() -ENDIF() - # external dependencies log output SET(EXTERNAL_PROJECT_LOG_ARGS LOG_DOWNLOAD 0 # Wrap download in script to log output diff --git a/cmake/tensorrt.cmake b/cmake/tensorrt.cmake index fa0e834a1dfd6e60f0ec07945be9a4d84017316f..3bf12094e4c32e69f908cbe6cefc7871fc9bb568 100644 --- a/cmake/tensorrt.cmake +++ b/cmake/tensorrt.cmake @@ -33,5 +33,6 @@ if(TENSORRT_FOUND) message(STATUS "Current TensorRT header is ${TENSORRT_INCLUDE_DIR}/NvInfer.h. " "Current TensorRT version is v${TENSORRT_MAJOR_VERSION}. ") include_directories(${TENSORRT_INCLUDE_DIR}) - list(APPEND EXTERNAL_LIBS ${TENSORRT_LIBRARY}) + link_directories(${TENSORRT_LIBRARY}) + add_definitions(-DPADDLE_WITH_TENSORRT) endif() diff --git a/cmake/util.cmake b/cmake/util.cmake index 0dc33ce385175d1e2dc454d41db467d4b9d9cf9a..02667dbce69ed159193ff88f38069dd08cdcf678 100644 --- a/cmake/util.cmake +++ b/cmake/util.cmake @@ -53,118 +53,3 @@ function(target_circle_link_libraries TARGET_NAME) "-Wl,--end-group") endif() endfunction() - -# compile_cu_as_cpp -# Make a cu file compiled as C++ -# Arguments: Source files -macro(compile_cu_as_cpp) - foreach(s ${ARGN}) - set_source_files_properties(${s} PROPERTIES LANGUAGE CXX) - set_source_files_properties(${s} PROPERTIES COMPILE_FLAGS "-x c++") - endforeach() -endmacro() - -# link_paddle_exe -# add paddle library for a paddle executable, such as trainer, pserver. -# -# It will handle WITH_PYTHON etc. -function(link_paddle_exe TARGET_NAME) - if(WITH_RDMA) - generate_rdma_links() - endif() - - if(MOBILE_INFERENCE) - target_circle_link_libraries(${TARGET_NAME} - ARCHIVE_START - paddle_gserver - paddle_function - ARCHIVE_END - paddle_math - paddle_utils - paddle_parameter - paddle_proto - paddle_cuda - ${EXTERNAL_LIBS} - ${CMAKE_THREAD_LIBS_INIT} - ${CMAKE_DL_LIBS} - ${RDMA_LD_FLAGS} - ${RDMA_LIBS}) - else() - target_circle_link_libraries(${TARGET_NAME} - ARCHIVE_START - paddle_gserver - paddle_function - ARCHIVE_END - paddle_pserver - paddle_trainer_lib - paddle_network - paddle_math - paddle_utils - paddle_parameter - paddle_proto - paddle_cuda - paddle_optimizer - ${EXTERNAL_LIBS} - ${CMAKE_THREAD_LIBS_INIT} - ${CMAKE_DL_LIBS} - ${RDMA_LD_FLAGS} - ${RDMA_LIBS}) - endif() - - if(ANDROID) - target_link_libraries(${TARGET_NAME} log) - endif(ANDROID) - - if(WITH_MKLML AND MKLML_LIB_DIR AND MKLML_IOMP_LIB) - target_link_libraries(${TARGET_NAME} "-L${MKLML_LIB_DIR} -liomp5 -Wl,--as-needed") - endif() - - add_dependencies(${TARGET_NAME} ${external_project_dependencies}) -endfunction() - -# link_paddle_test -# Link a paddle unittest for target -# TARGET_NAME: the unittest target name -# Rest Arguemnts: not used. -function(link_paddle_test TARGET_NAME) - link_paddle_exe(${TARGET_NAME}) - target_link_libraries(${TARGET_NAME} - paddle_test_main - paddle_test_util - ${GTEST_LIBRARIES}) -endfunction() - -# add_unittest_without_exec -# -# create a paddle unittest. not specifically define how to run this unittest. -# TARGET_NAME: the unittest target name, same as executable file name -# Rest Arguments: the source files to compile this unittest. -macro(add_unittest_without_exec TARGET_NAME) - add_executable(${TARGET_NAME} ${ARGN}) - link_paddle_test(${TARGET_NAME}) -endmacro() - -# add_unittest -# create a paddle unittest and just to execute this binary to make unittest. -# -# TARGET_NAME: the unittest target name, same as executable file name -# Rest Arguments: the source files to compile this unittest. -macro(add_unittest TARGET_NAME) - add_unittest_without_exec(${TARGET_NAME} ${ARGN}) - add_test(${TARGET_NAME} ${TARGET_NAME}) -endmacro() - -# add_simple_unittest -# create a paddle unittest with file name. It just compile ${TARGET_NAME}.cpp to -# ${TARGET_NAME} and then execute it. -macro(add_simple_unittest TARGET_NAME) - add_unittest(${TARGET_NAME} ${TARGET_NAME}.cpp) -endmacro() - -# Creates C resources file from files in given resource file -function(create_resources res_file output_file) - add_custom_command( - OUTPUT ${output_file} - COMMAND python ARGS ${PADDLE_SOURCE_DIR}/cmake/make_resource.py ${res_file} ${output_file} - DEPENDS ${res_file} ${PADDLE_SOURCE_DIR}/cmake/make_resource.py) -endfunction() diff --git a/cmake/version.cmake b/cmake/version.cmake index ac10bdf067be549fe90112aef73fd6e1fbe0ac48..dd57d4ab9969ce530f93ca1694350b1a26b5b543 100644 --- a/cmake/version.cmake +++ b/cmake/version.cmake @@ -31,8 +31,23 @@ while ("${PADDLE_VERSION}" STREQUAL "") set(tmp_version "${GIT_TAG_NAME}~1") endif() else() - # otherwise, we always set PADDLE_VERSION to 0.0.0 to represent latest - set(PADDLE_VERSION "0.0.0") + execute_process( + COMMAND ${GIT_EXECUTABLE} describe --exact-match --tags ${tmp_version} + WORKING_DIRECTORY ${PADDLE_SOURCE_DIR} + OUTPUT_VARIABLE GIT_EXACT_TAG_NAME + RESULT_VARIABLE GIT_EXACT_TAG_RESULT + ERROR_QUIET OUTPUT_STRIP_TRAILING_WHITESPACE) + if (NOT ${GIT_EXACT_TAG_NAME}) + # Check if current branch is tag branch + if (${GIT_EXACT_TAG_NAME} MATCHES "v${TAG_VERSION_REGEX}") + string(REPLACE "v" "" PADDLE_VERSION ${GIT_EXACT_TAG_NAME}) + else() + set(PADDLE_VERSION "0.0.0") + endif() + else() + # otherwise, we always set PADDLE_VERSION to 0.0.0 to represent latest + set(PADDLE_VERSION "0.0.0") + endif() endif() else() set(PADDLE_VERSION "0.0.0") diff --git a/doc/CMakeLists.txt b/doc/CMakeLists.txt deleted file mode 100644 index a777a4974cc377db103a470698f817612a4e9a32..0000000000000000000000000000000000000000 --- a/doc/CMakeLists.txt +++ /dev/null @@ -1,9 +0,0 @@ -add_custom_target(paddle_apis ALL - DEPENDS paddle_v2_apis) - -add_custom_target(paddle_docs ALL - DEPENDS paddle_v2_docs paddle_v2_docs_cn - paddle_mobile_docs paddle_mobile_docs_cn) - -add_subdirectory(v2) -add_subdirectory(mobile) diff --git a/doc/about/about_us.rst b/doc/about/about_us.rst deleted file mode 100644 index f67d8b8130030db8d7e7d10b30271a913bd6272a..0000000000000000000000000000000000000000 --- a/doc/about/about_us.rst +++ /dev/null @@ -1,53 +0,0 @@ -========= -关于我们 -========= - -什么是PaddlePaddle --------------------- - -- PaddlePaddle是百度自主研发并开源的深度学习框架,它能够让开发者和企业安全、快速地实现自己的AI想法 - -- 项目团队汇聚了全球顶级的深度学习科学家,致力于为开发者和企业提供最好的深度学习研发体验 - -- 框架具有易学、易用、安全、高效四大特性,是最适合中国开发者和企业的深度学习工具 - -PaddlePaddle的技术特色 -------------------------- - -- 新一代深度学习框架: PaddlePaddle是基于“深度学习编程语言”的新一代深度学习框架,在保证性能的同时,极大的提升了框架对模型的表达能力,能够描述任意潜在可能出现的模型 - -- 对大规模计算更加友好:经过百度内多种大规模计算业务的打磨,PaddlePaddle在分布式计算上表现优异,基于EDL技术能够节约大量计算资源,同时也能支持大规模稀疏模型的训练 - -- 提供可视化的深度学习:通过Visual DL可以帮助开发者方便的观测训练整体趋势、数据样本质量和中间结果、参数分布和变化趋势、以及模型的结构,帮助开发者更便捷的完成编程过程 - -提供基于PaddlePaddle的教育体系 --------------------------------- - -- 深度学习课程:百度与中国市场顶级的教育、培训机构共同开发了深度学习精品课程以及学习教材,帮助开发者从零掌握深度学习 - -- 深度学习实训:对于目的是科研和学习的用户,PaddlePaddle提供了无需安装、线上运行的开发环境,并提供算法、算力、数据支持 - -- 线下培训:提供丰富、高质量的线下教育活动,如青年教师培训、线下实战营、沙龙等多种形式的培训和交流 - - -提供基于PaddlePaddle的AI服务 ------------------------------- - -- EadyDL:可以帮助零算法基础的企业快速完成一个深度学习任务,只需少量的数据即可得到优质的模型 - -- AI市场:提供标准化的AI 能力、产品的交易机制,帮助企业快速找到所需,有效开展AI业务 - -- 深度学习竞赛: PaddlePaddle汇聚顶尖深度学习开发者,企业可以发布自己的商业问题,通过竞赛方式快速找到最优的解决方案 - -你对PaddlePaddle有任何的问题都可以通过以下方式联系到我们 ------------------------------------------------------------ - -- 学习/使用问题:可以在 `PaddlePaddle开源社区 `_,以及 `PaddlePaddle中文社区 `_ 向我们反馈 - -- 对PaddlePaddle框架发展的建议:可发送邮件至Paddle-better@baidu.com - -我们期待与你一起打造世界顶级深度学习框架,共同推动AI技术的进步 - - - -PaddlePaddle团队 diff --git a/doc/mobile/CMakeLists.txt b/doc/mobile/CMakeLists.txt deleted file mode 100644 index 7b34ba8d0768427802b11614c6962f3c3f6ef4e3..0000000000000000000000000000000000000000 --- a/doc/mobile/CMakeLists.txt +++ /dev/null @@ -1,52 +0,0 @@ -if(NOT DEFINED SPHINX_THEME) - set(SPHINX_THEME default) -endif() - -if(NOT DEFINED SPHINX_THEME_DIR) - set(SPHINX_THEME_DIR) -endif() - -# configured documentation tools and intermediate build results -set(BINARY_BUILD_DIR_EN "${CMAKE_CURRENT_BINARY_DIR}/en/_build") - -# Sphinx cache with pickled ReST documents -set(SPHINX_CACHE_DIR_EN "${CMAKE_CURRENT_BINARY_DIR}/en/_doctrees") - -# HTML output director -set(SPHINX_HTML_DIR_EN "${CMAKE_CURRENT_BINARY_DIR}/en/html") - -set(IMPORT_PADDLE_STRING "") -set(IMPORT_PADDLEV2_STRING "") - -configure_file( - "${CMAKE_CURRENT_SOURCE_DIR}/../templates/conf.py.en.in" - "${BINARY_BUILD_DIR_EN}/conf.py" - @ONLY) - -sphinx_add_target(paddle_mobile_docs - html - ${BINARY_BUILD_DIR_EN} - ${SPHINX_CACHE_DIR_EN} - ${CMAKE_CURRENT_SOURCE_DIR} - ${SPHINX_HTML_DIR_EN}) - -# configured documentation tools and intermediate build results -set(BINARY_BUILD_DIR_CN "${CMAKE_CURRENT_BINARY_DIR}/cn/_build") - -# Sphinx cache with pickled ReST documents -set(SPHINX_CACHE_DIR_CN "${CMAKE_CURRENT_BINARY_DIR}/cn/_doctrees") - -# HTML output director -set(SPHINX_HTML_DIR_CN "${CMAKE_CURRENT_BINARY_DIR}/cn/html") - -configure_file( - "${CMAKE_CURRENT_SOURCE_DIR}/../templates/conf.py.cn.in" - "${BINARY_BUILD_DIR_CN}/conf.py" - @ONLY) - -sphinx_add_target(paddle_mobile_docs_cn - html - ${BINARY_BUILD_DIR_CN} - ${SPHINX_CACHE_DIR_CN} - ${CMAKE_CURRENT_SOURCE_DIR} - ${SPHINX_HTML_DIR_CN}) diff --git a/doc/mobile/cross_compiling_for_android_cn.md b/doc/mobile/cross_compiling_for_android_cn.md deleted file mode 100644 index 0607748b751e9f2d606236d9e98868335379b05c..0000000000000000000000000000000000000000 --- a/doc/mobile/cross_compiling_for_android_cn.md +++ /dev/null @@ -1,187 +0,0 @@ -# Android平台编译指南 - -用户可通过如下两种方式,交叉编译Android平台上适用的PaddlePaddle库: - -- [基于Docker容器的编译方式](#基于docker容器的编译方式) -- [基于Linux交叉编译环境的编译方式](#基于linux交叉编译环境的编译方式) - -## 基于Docker容器的编译方式 -Docker能在所有主要操作系统(包括Linux,Mac OS X和Windows)上运行,因此,使用基于Docker容器的编译方式,用户可在自己熟悉的开发平台上编译Android平台上适用的PaddlePaddle库。 - -### 构建PaddlePaddle的Android开发镜像 -我们把PaddlePaddle的交叉编译环境打包成一个镜像,称为开发镜像,里面涵盖了交叉编译Android版PaddlePaddle库需要的所有编译工具。 - -```bash -$ git clone https://github.com/PaddlePaddle/Paddle.git -$ cd Paddle -$ docker build -t username/paddle-android:dev . -f Dockerfile.android -``` - -用户也可以使用PaddlePaddle提供的官方开发镜像: - -```bash -$ docker pull paddlepaddle/paddle:latest-dev-android -``` - -对于国内用户,我们提供了加速访问的镜像源: - -```bash -$ docker pull docker.paddlepaddlehub.com/paddle:latest-dev-android -``` - -### 编译PaddlePaddle C-API库 -构建好开发镜像后,即可使用开发镜像来编译Android版PaddlePaddle C-API库。 -Android的Docker开发镜像向用户提供两个可配置的参数: - - -- - - - - - - - - - - - - - - - - - - - - - - -
ArgumentOptional ValuesDefault
ANDROID_ABIarmeabi-v7a, arm64-v8aarmeabi-v7a
ANDROID_API>= 1621
- -- 编译`armeabi-v7a`,`Android API 21`的PaddlePaddle库 - -```bash -$ docker run -it --rm -v $PWD:/paddle -w /paddle -e "ANDROID_ABI=armeabi-v7a" -e "ANDROID_API=21" username/paddle-android:dev ./paddle/scripts/paddle_build.sh build_android -``` - -- 编译`arm64-v8a`,`Android API 21`的PaddlePaddle库 - -```bash -$ docker run -it --rm -v $PWD:/paddle -w /paddle -e "ANDROID_ABI=arm64-v8a" -e "ANDROID_API=21" username/paddle-android:dev ./paddle/scripts/paddle_build.sh build_android -``` - -执行上述`docker run`命令时,容器执行[paddle/scripts/paddle_build.sh build_android](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/paddle_build.sh)脚本。该脚本中记录了交叉编译Android版PaddlePaddle库常用的CMake配置,并且会根据`ANDROID_ABI`和`ANDROID_API`自动构建独立工具链、进行编译和安装。由于arm64架构要求Android API不小于21。因此当`ANDROID_ABI=arm64-v8a`,`ANDROID_API<21`时,Docker容器中将默认使用`Android API 21`的编译工具链。用户可以参考下文[配置交叉编译参数](#配置交叉编译参数)章节,根据个人的需求修改定制Docker容器所执行的脚本。编译安装结束之后,PaddlePaddle的C-API库将被安装到`$PWD/install_android`目录,所依赖的第三方库同时也被安装到`$PWD/install_android/third_party`目录。 - -## 基于Linux交叉编译环境的编译方式 -本文档将以Linux x86-64平台为例,介绍交叉编译Android平台上适用的PaddlePaddle库的方法和步骤。 - -### 准备交叉编译环境 - -从源码交叉编译PaddlePaddle,用户需要提前准备好交叉编译环境。Android平台上使用的C/C++交叉编译工具链为[Android NDK](https://developer.android.com/ndk/downloads/index.html?hl=zh-cn),用户可自行前往下载预编译好的版本,也可通过以下命令获取: - -```bash -wget -q https://dl.google.com/android/repository/android-ndk-r14b-linux-x86_64.zip -unzip -q android-ndk-r14b-linux-x86_64.zip -``` - -Android NDK中包含了所有Android API级别、所有架构(arm/arm64/x86/mips)需要用到的编译工具和系统库。用户可根据自己的编译目标架构、所需支持的最低Android API级别,构建[独立工具链](https://developer.android.google.cn/ndk/guides/standalone_toolchain.html?hl=zh-cn)。 - -- 构建`armeabi-v7a`、 `Android API 21`的独立工具链: - -```bash -your/path/to/android-ndk-r14b-linux-x86_64/build/tools/make-standalone-toolchain.sh \ - --arch=arm --platform=android-21 --install-dir=your/path/to/arm_standalone_toolchain -``` - -此命令将在`your/path/to/arm_standalone_toolchain`目录生成一套独立编译工具链,面向架构为32位ARM架构,支持的最小的Android API级别为21,支持编译器`arm-linux-androideabi-gcc (GCC) 4.9`和`clang 3.8`。 - -- 构建`arm64-v8a`、 `Android API 21`的独立工具链: - -```bash -your/path/to/android-ndk-r14b-linux-x86_64/build/tools/make-standalone-toolchain.sh \ - --arch=arm64 --platform=android-21 --install-dir=your/path/to/arm64_standalone_toolchain -``` - -此命令将在`your/path/to/arm64_standalone_toolchain`目录生成一套独立编译工具链,面向架构为64位ARM64架构,支持的最小Android API级别为21,支持编译器`arm-linux-androideabi-gcc (GCC) 4.9`和`clang 3.8`。 - -### 配置交叉编译参数 - -CMake系统对交叉编译提供了支持[cmake-toolchains](https://cmake.org/cmake/help/v3.0/manual/cmake-toolchains.7.html#cross-compiling)。为了简化cmake配置,PaddlePaddle为交叉编译提供了工具链配置文档[cmake/cross_compiling/android.cmake](https://github.com/PaddlePaddle/Paddle/blob/develop/cmake/cross_compiling/android.cmake),以提供一些默认的编译器和编译参数相关配置。注意,从CMake 3.7版本开始,CMake官方对Android平台的交叉编译提供了通用的支持。PaddlePaddle若检测到用户使用的CMake版本不低于3.7时,将会将用户传进来的配置参数传递CMake系统,交由CMake系统本身来处理。有关参数配置的详细说明见[cmake-toolchains](https://cmake.org/cmake/help/v3.7/manual/cmake-toolchains.7.html#cross-compiling)。 - -交叉编译Android版本的PaddlePaddle库时,有一些必须配置的参数: -- `CMAKE_SYSTEM_NAME`,CMake编译的目标平台,必须设置为`Android`。在设置`CMAKE_SYSTEM_NAME=Android`后,PaddlePaddle的CMake系统才认为是在交叉编译Android系统的版本,并自动编译PaddlePaddle所需的所有第三方库。此外,还会强制设置一些PaddlePaddle参数的值(`WITH_GPU=OFF`、`WITH_AVX=OFF`、`WITH_PYTHON=OFF`、`WITH_RDMA=OFF`、`WITH_MKL=OFF`、`WITH_GOLANG=OFF`)。 -- `WITH_C_API`,必须设置为`ON`。在Android平台上只支持使用C-API来预测。 -- `WITH_SWIG_PY`,必须设置为`OFF`。在Android平台上不支持通过swig调用来训练或者预测。 - -Android平台可选配置参数: - -- `ANDROID_STANDALONE_TOOLCHAIN`,独立工具链所在的绝对路径,或者相对于构建目录的相对路径。PaddlePaddle的CMake系统将根据该值自动推导和设置需要使用的交叉编译器、sysroot、以及Android API级别;否则,用户需要在cmake时手动设置这些值。无默认值。 -- `ANDROID_TOOLCHAIN`,目标工具链。可设置`gcc/clang`,默认值为`clang`。 - - CMake 3.7以上,将会始终使用`clang`工具链;CMake 3.7以下,可设置`ANDROID_TOOLCHAIN=gcc`以使用`gcc`工具链。 - - Android官方提供的`clang`编译器要求系统支持`GLIBC 2.15`以上。 -- `ANDROID_ABI`,目标架构ABI。目前支持`armeabi-v7a`和`arm64-v8a`,默认值为`armeabi-v7a`。 -- `ANDROID_NATIVE_API_LEVEL`,工具链的Android API级别。若没有显式设置,PaddlePaddle将根据`ANDROID_STANDALONE_TOOLCHAIN`的值自动推导得到。 -- `ANROID_ARM_MODE`,是否使用ARM模式。 - - `ANDROID_ABI=armeabi-v7a`时,可设置`ON/OFF`,默认值为`ON`; - - `ANDROID_ABI=arm64-v8a`时,不需要设置。 -- `ANDROID_ARM_NEON`,是否使用NEON指令。 - - `ANDROID_ABI=armeabi-v7a`时,可设置`ON/OFF`,默认值为`ON`; - - `ANDROID_ABI=arm64-v8a`时,不需要设置。 - -其他配置参数: - -- `USE_EIGEN_FOR_BLAS`,是否使用Eigen库进行矩阵计算。可设置`ON/OFF`,默认值为`OFF`。 -- `HOST_C/CXX_COMPILER`,宿主机的C/C++编译器。在编译宿主机版protoc可执行文件和目标机版OpenBLAS库时需要用到。默认设置成环境变量`CC/CXX`的值;若环境变量`CC/CXX`没有设置,则设置成`cc/c++`编译器。 - -常用的cmake配置如下: - -```bash -cmake -DCMAKE_SYSTEM_NAME=Android \ - -DANDROID_STANDALONE_TOOLCHAIN=your/path/to/arm_standalone_toolchain \ - -DANDROID_ABI=armeabi-v7a \ - -DANDROID_ARM_NEON=ON \ - -DANDROID_ARM_MODE=ON \ - -DUSE_EIGEN_FOR_BLAS=ON \ - -DCMAKE_INSTALL_PREFIX=your/path/to/install \ - -DWITH_C_API=ON \ - -DWITH_SWIG_PY=OFF \ - .. -``` - -``` -cmake -DCMAKE_SYSTEM_NAME=Android \ - -DANDROID_STANDALONE_TOOLCHAIN=your/path/to/arm64_standalone_toolchain \ - -DANDROID_ABI=arm64-v8a \ - -DUSE_EIGEN_FOR_BLAS=OFF \ - -DCMAKE_INSTALL_PREFIX=your/path/to/install \ - -DWITH_C_API=ON \ - -DWITH_SWIG_PY=OFF \ - .. -``` - -用户还可根据自己的需求设置其他编译参数。 - -- 设置`CMAKE_BUILD_TYPE`为`MinSizeRel`,最小化生成的库的大小。 -- 设置`CMAKE_BUILD_TYPE`为`Release`,获得最快的执行速度, -- 用户亦可以通过手动设置`CMAKE_C/CXX_FLAGS`来影响PaddlePaddle的编译过程。 - -**性能TIPS**,为了达到最快的计算速度,在CMake参数配置上,有以下建议: - -- 设置`CMAKE_BUILD_TYPE`为`Release` -- 使用`clang`编译工具链 -- `armeabi-v7a`时,设置`USE_EIGEN_BLAS=ON`,使用Eigen进行矩阵计算;`arm64-v8a`时,设置`USE_EIGEN_FOR_BLAS=OFF`,使用OpenBLAS进行矩阵计算 - -### 编译和安装 - -CMake配置完成后,执行以下命令,PaddlePaddle将自动下载和编译所有第三方依赖库、编译和安装PaddlePaddle预测库。 - -```bash -make -make install -``` - -注意:如果你曾经在源码目录下编译过其他平台的PaddlePaddle库,请先使用`rm -rf`命令删除`third_party`目录和`build`目录,以确保所有的第三方依赖库和PaddlePaddle代码都是针对新的CMake配置重新编译的。 - -执行完安装命令后,`your/path/to/install`目录中会包含`include`、`lib`和`third_party`目录,其中`include`中包含C-API的头文件,`lib`中包含若干个不同Android ABI的PaddlePaddle库,`third_party`中包含所依赖的所有第三方库。自此,PaddlePaddle的已经安装完成,用户可将`your/path/to/install`目录下的生成文件用于深度学习相关Android App中,调用方法见C-API文档。 diff --git a/doc/mobile/cross_compiling_for_android_en.md b/doc/mobile/cross_compiling_for_android_en.md deleted file mode 100644 index 572063e8012efee2d2e142eb57e459e0e8c6382c..0000000000000000000000000000000000000000 --- a/doc/mobile/cross_compiling_for_android_en.md +++ /dev/null @@ -1,189 +0,0 @@ -# Build PaddlePaddle for Android - -There are two approaches to build PaddlePaddle for Android: - -- [Cross-Compiling Using Docker](#cross-compiling-using-docker) -- [Cross-Compiling on Linux](#cross-compiling-on-linux) - -## Cross-Compiling Using Docker - -Docker-based cross-compiling is the recommended approach because Docker runs on all major operating systems, including Linux, Mac OS X, and Windows. - -### Build the Docker Image - -The following steps pack all the tools that we need to build PaddlePaddle into a Docker image. - -```bash -$ git clone https://github.com/PaddlePaddle/Paddle.git -$ cd Paddle -$ docker build -t paddle:dev-android . -f Dockerfile.android -``` - -Users can directly use the published Docker image. - -```bash -$ docker pull paddlepaddle/paddle:latest-dev-android -``` - -For users in China, we provide a faster mirror. - -```bash -$ docker pull docker.paddlepaddlehub.com/paddle:latest-dev-android -``` - -### Build the Inference Library - -We can run the Docker image we just created to build the inference library of PaddlePaddle for Android using the command below: - -```bash -$ docker run -it --rm -v $PWD:/paddle -w /paddle -e "ANDROID_ABI=armeabi-v7a" -e "ANDROID_API=21" paddle:dev-android ./paddle/scripts/paddle_build.sh build_android -``` - -The Docker image accepts two arguments `ANDROID_ABI` and `ANDROID_API`: - - -- - - - - - - - - - - - - - - - - - - - - - - -
ArgumentOptional ValuesDefault
ANDROID_ABIarmeabi-v7a, arm64-v8aarmeabi-v7a
ANDROID_API>= 1621
- -The ARM-64 architecture (`arm64-v8a`) requires at least level 21 of Android API. - -The build command, [`paddle/scripts/paddle_build.sh build_android`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/paddle_build.sh) generates the [Android cross-compiling standalone toolchain](https://developer.android.com/ndk/guides/standalone_toolchain.html) based on the argument: `ANDROID_ABI` or `ANDROID_API`. For information about other configuration arguments, please continue reading. - -The above command generates and outputs the inference library in `$PWD/install_android` and puts third-party libraries in `$PWD/install_android/third_party`. - -## Cross-Compiling on Linux - -The Linux-base approach to cross-compile is to run steps in `Dockerfile.android` manually on a Linux x64 computer. - -### Setup the Environment - -To build for Android's, we need [Android NDK]( -https://developer.android.com/ndk/downloads/index.html): - -```bash -wget -q https://dl.google.com/android/repository/android-ndk-r14b-linux-x86_64.zip -unzip -q android-ndk-r14b-linux-x86_64.zip -``` - -Android NDK includes everything we need to build the [*standalone toolchain*](https://developer.android.com/ndk/guides/standalone_toolchain.html), which in then used to build PaddlePaddle for Android. (We plan to remove the intermediate stage of building the standalone toolchain in the near future.) - -- To build the standalone toolchain for `armeabi-v7a` and Android API level 21: - -```bash -your/path/to/android-ndk-r14b-linux-x86_64/build/tools/make-standalone-toolchain.sh \ - --arch=arm --platform=android-21 --install-dir=your/path/to/arm_standalone_toolchain -``` - - The generated standalone toolchain will be in `your/path/to/arm_standalone_toolchain`. - -- To build the standalone toolchain for `arm64-v8a` and Android API level 21: - -```bash -your/path/to/android-ndk-r14b-linux-x86_64/build/tools/make-standalone-toolchain.sh \ - --arch=arm64 --platform=android-21 --install-dir=your/path/to/arm64_standalone_toolchain -``` - - The generated standalone toolchain will be in `your/path/to/arm64_standalone_toolchain`. - -### Cross-Compiling Arguments - -CMake supports [choosing the toolchain](https://cmake.org/cmake/help/v3.0/manual/cmake-toolchains.7.html#cross-compiling). PaddlePaddle provides [`android.cmake`](https://github.com/PaddlePaddle/Paddle/blob/develop/cmake/cross_compiling/android.cmake), which configures the Android cross-compiling toolchain for CMake. `android.cmake` is not required for CMake >= 3.7, which support Android cross-compiling. PaddlePaddle detects the CMake version, for those newer than 3.7, it uses [the official version](https://cmake.org/cmake/help/v3.7/manual/cmake-toolchains.7.html#cross-compiling). - -Some other CMake arguments you need to know: - -- `CMAKE_SYSTEM_NAME` must be `Android`. This tells PaddlePaddle's CMake system to cross-compile third-party dependencies. This also changes some other CMake arguments like `WITH_GPU=OFF`, `WITH_AVX=OFF`, `WITH_PYTHON=OFF`, `WITH_RDMA=OFF`, `WITH_MKL=OFF` and `WITH_GOLANG=OFF`. -- `WITH_C_API` must be `ON`, to build the C-based inference library for Android. -- `WITH_SWIG_PY` must be `OFF` because the Android platform doesn't support SWIG-based API. - -Some Android-specific arguments: - -- `ANDROID_STANDALONE_TOOLCHAIN`: the absolute path of the Android standalone toolchain, or the path relative to the CMake build directory. PaddlePaddle's CMake extensions would derive the cross-compiler, sysroot and Android API level from this argument. -- `ANDROID_TOOLCHAIN`: could be `gcc` or `clang`. The default value is `clang`. - - For CMake >= 3.7, it should anyway be `clang`. For older versions, it could be `gcc`. - - Android's official `clang` requires `glibc` >= 2.15. -- `ANDROID_ABI`: could be `armeabi-v7a` or `arm64-v8a`. The default value is `armeabi-v7a`. -- `ANDROID_NATIVE_API_LEVEL`: could be derived from the value of `ANDROID_STANDALONE_TOOLCHAIN`. -- `ANROID_ARM_MODE`: - - could be `ON` or `OFF`, and defaults to `ON`, when `ANDROID_ABI=armeabi-v7a`; - - no need to specify when `ANDROID_ABI=arm64-v8a`. -- `ANDROID_ARM_NEON`: indicates if to use NEON instructions. - - could be `ON` or `OFF`, and defaults to `ON`, when `ANDROID_ABI=armeabi-v7a`; - - no need to specify when `ANDROID_ABI=arm64-v8a`. - -Other useful arguments: - -- `USE_EIGEN_FOR_BLAS`: indicates if using Eigen. Could be `ON` or `OFF`, defaults to `OFF`. -- `HOST_C/CXX_COMPILER`: specifies the host compiler, which is used to build the host-specific protoc and target-specific OpenBLAS. It defaults to the value of the environment variable `CC/C++`, or `cc/c++`. - -Some frequent configurations for your reference: - -```bash -cmake -DCMAKE_SYSTEM_NAME=Android \ - -DANDROID_STANDALONE_TOOLCHAIN=your/path/to/arm_standalone_toolchain \ - -DANDROID_ABI=armeabi-v7a \ - -DANDROID_ARM_NEON=ON \ - -DANDROID_ARM_MODE=ON \ - -DUSE_EIGEN_FOR_BLAS=ON \ - -DCMAKE_INSTALL_PREFIX=your/path/to/install \ - -DWITH_C_API=ON \ - -DWITH_SWIG_PY=OFF \ - .. -``` - -``` -cmake -DCMAKE_SYSTEM_NAME=Android \ - -DANDROID_STANDALONE_TOOLCHAIN=your/path/to/arm64_standalone_toolchain \ - -DANDROID_ABI=arm64-v8a \ - -DUSE_EIGEN_FOR_BLAS=OFF \ - -DCMAKE_INSTALL_PREFIX=your/path/to/install \ - -DWITH_C_API=ON \ - -DWITH_SWIG_PY=OFF \ - .. -``` - - -There are some other arguments you might want to configure. - -- `CMAKE_BUILD_TYPE=MinSizeRel` minimizes the size of library. -- `CMAKE_BUILD_TYPE-Release` optimizes the runtime performance. - -Our own tip for performance optimization to use clang and Eigen or OpenBLAS: - -- `CMAKE_BUILD_TYPE=Release` -- `ANDROID_TOOLCHAIN=clang` -- `USE_EIGEN_BLAS=ON` for `armeabi-v7a`, or `USE_EIGEN_FOR_BLAS=OFF` for `arm64-v8a`. - -### Build and Install - -After running `cmake`, we can run `make; make install` to build and install. - -Before building, you might want to remove the `third_party` and `build` directories including pre-built libraries for other architectures. - -After building,in the directory `CMAKE_INSTALL_PREFIX`, you will find three sub-directories: - -- `include`: the header file of the inference library, -- `lib`: the inference library built for various Android ABIs, -- `third_party`: dependent third-party libraries built for Android. diff --git a/doc/mobile/cross_compiling_for_ios_cn.md b/doc/mobile/cross_compiling_for_ios_cn.md deleted file mode 100644 index d5196d9a4c93c7692d2a624ec7d0650e32806338..0000000000000000000000000000000000000000 --- a/doc/mobile/cross_compiling_for_ios_cn.md +++ /dev/null @@ -1,117 +0,0 @@ -# iOS平台编译指南 -交叉编译iOS平台上适用的PaddlePaddle库,需要在MacOS系统上进行。本文的将介绍在MacOS上,从源码交叉编译iOS平台上适用的PaddlePaddle库。 - -## 准备交叉编译环境 -Apple官方为iOS开发提供了完整的交叉编译工具和集成开发环境,用户从App Store下载安装Xcode即可。也可自行前往官网下载,[Xcode](https://developer.apple.com/cn/xcode/)。安装完成之后,可在命令行执行`xcodebuild -version`,判断是否安装成功。 - -```bash -$ xcodebuild -version -Xcode 9.0 -Build version 9A235 -``` - -## 配置交叉编译参数 - -PaddlePaddle为交叉编译提供了工具链配置文档[cmake/cross_compiling/ios.cmake](https://github.com/PaddlePaddle/Paddle/blob/develop/cmake/cross_compiling/ios.cmake),以提供一些默认的编译器和编译参数配置。 - -交叉编译iOS版本的PaddlePaddle库时,有一些必须配置的参数: - -- `CMAKE_SYSTEM_NAME`,CMake编译的目标平台,必须设置为`iOS`。在设置`CMAKE_SYSTEM_NAME=iOS`后,PaddlePaddle的CMake系统会自动编译所有的第三方依赖库,并且强制设置一些PaddlePaddle参数的值(`WITH_C_API=ON`、`WITH_GPU=OFF`、`WITH_AVX=OFF`、`WITH_PYTHON=OFF`、`WITH_RDMA=OFF`)。 -- `WITH_C_API`,是否编译C-API预测库,必须设置为ON。在iOS平台上只支持使用C-API来预测。 -- `WITH_SWIG_PY`,必须设置为`OFF`。在iOS平台上不支持通过swig调用来训练或者预测。 - -iOS平台可选配置参数: - -- `IOS_PLATFORM`,可设置为`OS`(默认值)或`SIMULATOR`。 - - `OS`,构建目标为`arm`架构的iPhone或者iPad等物理设备。 - - `SIMULATOR`,构建目标为`x86`架构的模拟器平台。 -- `IOS_ARCH`,目标架构。针对不同的`IOS_PLATFORM`,可设置的目标架构如下表所示,默认编译所有架构: - - - - - - - - - - - - - - - - - - - - - - -
IOS_PLATFORMIOS_ARCH
OSarmv7, armv7s, arm64
SIMULATORi386, x86_64
- -- `IOS_DEPLOYMENT_TARGET`,最小的iOS部署版本,默认值为`7.0`。 -- `IOS_ENABLE_BITCODE`,是否使能[Bitcode](https://developer.apple.com/library/content/documentation/IDEs/Conceptual/AppDistributionGuide/AppThinning/AppThinning.html#//apple_ref/doc/uid/TP40012582-CH35-SW3),可设置`ON/OFF`,默认值为`ON`。 -- `IOS_USE_VECLIB_FOR_BLAS`,是否使用[vecLib](https://developer.apple.com/documentation/accelerate/veclib)框架进行BLAS矩阵计算,可设置`ON/OFF`,默认值为`OFF`。 -- `IOS_DEVELOPMENT_ROOT`,`Developer`目录,可显式指定为`/path/to/platform/Developer`。若未显式指定,PaddlePaddle将会根据`IOS_PLATFORM`自动选择`Xcode`对应`platform`的`Developer`目录。 -- `IOS_SDK_ROOT`,所使用`SDK`的根目录,可显式指定为`/path/to/platform/Developer/SDKs/SDK`。若未显式指定,PaddlePaddle将会自动选择`IOS_DEVELOPMENT_ROOT`目录下最新的`SDK`版本。 - -其他配置参数: - -- `USE_EIGEN_FOR_BLAS`,是否使用Eigen库进行矩阵计算,在`IOS_USE_VECLIB_FOR_BLAS=OFF`时有效。可设置`ON/OFF`,默认值为`OFF`。 -- `HOST_C/CXX_COMPILER`,宿主机的C/C++编译器。默认值为环境变量`CC/CXX`的值;若环境变量`CC/CXX`未设置,则使用`cc/c++`编译器。 - -常用的cmake配置如下: - -```bash -cmake -DCMAKE_SYSTEM_NAME=iOS \ - -DIOS_PLATFORM=OS \ - -DIOS_ARCH="armv7;arm64" \ - -DIOS_ENABLE_BITCODE=ON \ - -DIOS_USE_VECLIB_FOR_BLAS=ON \ - -DCMAKE_INSTALL_PREFIX=your/path/to/install \ - -DWITH_C_API=ON \ - -DWITH_TESTING=OFF \ - -DWITH_SWIG_PY=OFF \ - .. -``` - -```bash -cmake -DCMAKE_SYSTEM_NAME=iOS \ - -DIOS_PLATFORM=SIMULATOR \ - -DIOS_ARCH="x86_64" \ - -DIOS_USE_VECLIB_FOR_BLAS=ON \ - -DCMAKE_INSTALL_PREFIX=your/path/to/install \ - -DWITH_C_API=ON \ - -DWITH_TESTING=OFF \ - -DWITH_SWIG_PY=OFF \ - .. -``` - -用户还可根据自己的需求设置其他编译参数。比如希望最小化生成库的大小,可以设置`CMAKE_BUILD_TYPE`为`MinSizeRel`;若希望得到最快的执行速度,则可设置`CMAKE_BUILD_TYPE`为`Release`。亦可以通过手动设置`CMAKE_C/CXX_FLAGS`来影响PaddlePaddle的编译过程。 - -**性能TIPS**,为了达到最快的计算速度,在CMake参数配置上,有以下建议: - -- 设置`CMAKE_BUILD_TYPE`为`Release` -- 设置`IOS_USE_VECLIB_FOR_BLAS=ON`,调用`vecLib`框架提供的BLAS函数进行矩阵计算。 - -## 编译和安装 - -CMake配置完成后,执行以下命令,PaddlePaddle将自动下载和编译所有第三方依赖库、编译和安装PaddlePaddle预测库。 - -``` -$ make -$ make install -``` - -注意:如果你曾在源码目录下编译过其他平台的PaddlePaddle库,请先使用`rm -rf`命令删除`third_party`目录和`build`目录,以确保所有的第三方依赖库和PaddlePaddle代码都是针对新的CMake配置重新编译的。 - -执行完安装命令后,`your/path/to/install`目录中会包含以下内容: - -- `include`目录,其中包含所有C-API的头文件 -- `lib`目录,其中包含PaddlePaddle的C-API静态库 -- `third_party`目录,其中包含所依赖的所有第三方库 - -注意,如果PaddlePaddle库需要同时支持真机和模拟器,则需要分别编译真机和模拟器版本,然后使用`lipo`工具合并fat库。 - -自此,PaddlePaddle库已经安装完成,用户可将合成的fat库用于深度学习相关的iOS App中,调用方法见C-API文档。 diff --git a/doc/mobile/cross_compiling_for_ios_en.md b/doc/mobile/cross_compiling_for_ios_en.md deleted file mode 100644 index 19bfe86c511c7e43b462f94c8cabba420b3007f1..0000000000000000000000000000000000000000 --- a/doc/mobile/cross_compiling_for_ios_en.md +++ /dev/null @@ -1,120 +0,0 @@ -# Build PaddlePaddle for iOS - -This tutorial will walk you through cross compiling the PaddlePaddle library for iOS from the source in MacOS. - -## Preparation - -Apple provides Xcode for cross-compiling and IDE for iOS development. Download from App store or [here](https://developer.apple.com/cn/xcode/). To verify your installation, run command as follows - -```bash -$ xcodebuild -version -Xcode 9.0 -Build version 9A235 -``` - -## Cross-compiling configurations - -PaddlePaddle provides cross-compiling toolchain configuration documentation [cmake/cross_compiling/ios.cmake](https://github.com/PaddlePaddle/Paddle/blob/develop/cmake/cross_compiling/ios.cmake), which has some default settings for frequently used compilers. - -There are some mandatory environment variables need to be set before cross compiling PaddlePaddle for iOS: - -- `CMAKE_SYSTEM_NAME`, CMake compiling target platform name, has to be `iOS`. PaddlePaddle CMake will compile all the third party dependencies and enforce some parameters (`WITH_C_API=ON`, `WITH_GPU=OFF`, `WITH_AVX=OFF`, `WITH_PYTHON=OFF`,`WITH_RDMA=OFF`) when this variable is set with value `iOS`. - -- `WITH_C_API`, Whether to compile inference C-API library, has to be `ON`, since C-API is the only supported interface for inferencing in iOS. -- `WITH_SWIG_PY`, has to be `OFF`. It's not supported to inference or train via swig in iOS. - -Optional environment variables for iOS are: - -- `IOS_PLATFORM`, either `OS` (default) or `SIMULATOR`. - - `OS`, build targets ARM-based physical devices like iPhone or iPad. - - `SIMULATOR`, build targets x86 architecture simulators. -- `IOS_ARCH`, target architecture. By default, all architecture types will be compiled. If you need to specify the architecture to compile for, please find valid values for different `IOS_PLATFORM` settings from the table below: - - - - - - - - - - - - - - - - - - - - - - -
IOS_PLATFORMIOS_ARCH
OSarmv7, armv7s, arm64
SIMULATORi386, x86_64
- -- `IOS_DEPLOYMENT_TARGET`, minimum iOS version to deployment, `7.0` by default. -- `IOS_ENABLE_BITCODE`, whether to enable [Bitcode](https://developer.apple.com/library/content/documentation/IDEs/Conceptual/AppDistributionGuide/AppThinning/AppThinning.html#//apple_ref/doc/uid/TP40012582-CH35-SW3), values can be `ON/OFF`, `ON` by default. -- `IOS_USE_VECLIB_FOR_BLAS`, whether to use [vecLib](https://developer.apple.com/documentation/accelerate/veclib) framework for BLAS computing. values can be `ON/OFF`, `OFF` by default. -- `IOS_DEVELOPMENT_ROOT`, the path to `Developer` directory, can be explicitly set with your `/path/to/platform/Developer`. If left blank, PaddlePaddle will automatically pick the Xcode corresponding `platform`'s `Developer` directory based on your `IOS_PLATFORM` value. -- `IOS_SDK_ROOT`, the path to `SDK` root, can be explicitly set with your `/path/to/platform/Developer/SDKs/SDK`. if left black, PaddlePaddle will pick the latest SDK in the directory of `IOS_DEVELOPMENT_ROOT`. - -other settings: - -- `USE_EIGEN_FOR_BLAS`, whether to use Eigen for matrix computing. effective when `IOS_USE_VECLIB_FOR_BLAS=OFF`. Values can be `ON/OFF`, `OFF` by default. -- `HOST_C/CXX_COMPILER`, host C/C++ compiler. Uses value from environment variable `CC/CXX` by default or `cc/c++` if `CC/CXX` doesn't exist. - -some typical cmake configurations: - -```bash -cmake -DCMAKE_SYSTEM_NAME=iOS \ - -DIOS_PLATFORM=OS \ - -DIOS_ARCH="armv7;arm64" \ - -DIOS_ENABLE_BITCODE=ON \ - -DIOS_USE_VECLIB_FOR_BLAS=ON \ - -DCMAKE_INSTALL_PREFIX=your/path/to/install \ - -DWITH_C_API=ON \ - -DWITH_TESTING=OFF \ - -DWITH_SWIG_PY=OFF \ - .. -``` - -```bash -cmake -DCMAKE_SYSTEM_NAME=iOS \ - -DIOS_PLATFORM=SIMULATOR \ - -DIOS_ARCH="x86_64" \ - -DIOS_USE_VECLIB_FOR_BLAS=ON \ - -DCMAKE_INSTALL_PREFIX=your/path/to/install \ - -DWITH_C_API=ON \ - -DWITH_TESTING=OFF \ - -DWITH_SWIG_PY=OFF \ - .. -``` - -You can set other compiling parameters for your own need. I.E. if you are trying to minimize the library size, set `CMAKE_BUILD_TYPE` with `MinSizeRel`; or if the performance is your concern, set `CMAKE_BUILD_TYPE` with `Release`. You can even manipulate the PaddlePaddle compiling procedure by manually set `CMAKE_C/CXX_FLAGS` values. - -**TIPS for a better performance**: - -- set `CMAKE_BUILD_TYPE` with `Release` -- set `IOS_USE_VECLIB_FOR_BLAS` with `ON` - -## Build and install - -After CMake, run following commands, PaddlePaddle will download the compile 3rd party dependencies, compile and install PaddlePaddle inference library. - -``` -$ make -$ make install -``` - -Please Note: if you compiled PaddlePaddle in the source directory for other platforms, do remove `third_party` and `build` directory within the source with `rm -rf` to ensure that all the 3rd party libraries dependencies and PaddlePaddle is newly compiled with current CMake configuration. - -`your/path/to/install` directory will have following directories after `make install`: - -- `include`, contains all the C-API header files. -- `lib`, contains PaddlePaddle C-API static library. -- `third_party` contains all the 3rd party libraries. - -Please note: if PaddlePaddle library need to support both physical devices and simulators, you will need to compile correspondingly, then merge fat library with `lipo`. - -Now you will have PaddlePaddle library compiled and installed, the fat library can be used in deep learning related iOS APPs. Please refer to C-API documentation for usage guides. diff --git a/doc/mobile/cross_compiling_for_raspberry_cn.md b/doc/mobile/cross_compiling_for_raspberry_cn.md deleted file mode 100644 index f8ef9dc8031613831437745995268f3abc392f5b..0000000000000000000000000000000000000000 --- a/doc/mobile/cross_compiling_for_raspberry_cn.md +++ /dev/null @@ -1,62 +0,0 @@ -# Raspberry Pi平台编译指南 - -通常有两个方法来构建基于 Rasspberry Pi 的版本: - -1. 通过ssh等方式登录到Raspberry Pi系统上来构建。所需的开发工具和第三方库可以参考 [`/Dockerfile`](https://github.com/PaddlePaddle/Paddle/blob/develop/Dockerfile)。 - -1. 另一个方法是交叉编译。这篇文档介绍在 Linux/x64 上交叉编译Raspberry Pi平台上适用的PaddlePaddle的方法和步骤。 - -## 安装交叉编译器 - -克隆下面 Github repo - -```bash -git clone https://github.com/raspberrypi/tools.git -``` - -即可在 `./tools/tree/master/arm-bcm2708/gcc-linaro-arm-linux-gnueabihf-raspbian-x64` 目录里找到交叉编译器 arm-linux-gnueabihf-gcc 4.8.3。运行该编译工具链需要一台 Linux x64 机器上以及 2.14版本以上的 glibc。 - -## 配置交叉编译参数 - -CMake[支持交叉编译](https://cmake.org/cmake/help/v3.0/manual/cmake-toolchains.7.html#cross-compiling)。PaddlePaddle for Raspberry Pi的配置信息在[cmake/cross_compiling/raspberry_pi.cmake](https://github.com/PaddlePaddle/Paddle/blob/develop/cmake/cross_compiling/raspberry_pi.cmake)。 - -交叉编译Raspberry Pi版本PaddlePaddle库时,有一些必须配置的参数: - -- `CMAKE_SYSTEM_NAME`:CMake编译的目标平台,必须配置为`RPi`。在设置`CMAKE_SYSTEM_NAME=RPi`后,PaddlePaddle的CMake系统才认为在是在交叉编译Raspberry Pi系统的版本,并自动编译宿主机版protoc可执行文件、目标机版protobuf库、以及目标机版OpenBLAS库。 - -- `RPI_TOOLCHAIN`:编译工具链所在的绝对路径,或者相对于构建目录的相对路径。PaddlePaddle的CMake系统将根据该值自动设置需要使用的交叉编译器;否则,用户需要在cmake时手动设置这些值。无默认值。 - -- `RPI_ARM_NEON`:是否使用NEON指令。目前必须设置成`ON`,默认值为`ON`。 - -- `HOST_C/CXX_COMPILER`,宿主机的C/C++编译器。在编译宿主机版protoc可执行文件和目标机版OpenBLAS库时需要用到。默认设置成环境变量`CC`的值;若环境变量`CC`没有设置,则设置成`cc`编译器。 - -一个常用的CMake配置如下: - -``` -cmake -DCMAKE_SYSTEM_NAME=RPi \ - -DRPI_TOOLCHAIN=your/path/to/arm-bcm2708/gcc-linaro-arm-linux-gnueabihf-raspbian-x64 \ - -DRPI_ARM_NEON=ON \ - -DCMAKE_INSTALL_PREFIX=your/path/to/install \ - -DWITH_GPU=OFF \ - -DWITH_C_API=ON \ - -DWITH_PYTHON=OFF \ - -DWITH_SWIG_PY=OFF \ - .. -``` - -其中`WITH_C_API=ON`表示需要构建推理库。 - -用户还可根据自己的需求设置其他编译参数。比如希望最小化生成的库的大小,可以设置`CMAKE_BUILD_TYPE`为`MinSizeRel`;若希望最快的执行速度,则可设置`CMAKE_BUILD_TYPE`为`Release`。 - -## 编译和安装 - -CMake配置完成后,执行以下命令,PaddlePaddle将自动下载和编译所有第三方依赖库、编译和安装PaddlePaddle。 - -```bash -make -make install -``` - -注意:如果你曾经在源码目录下编译过其他平台的PaddlePaddle库,请先使用`rm -rf`命令删除`third_party`目录和`build`目录,以确保所有的第三方依赖库和PaddlePaddle代码都是针对新的CMake配置重新编译的。 - -执行完安装命令后,`your/path/to/install`目录中会包含`include`和`lib`目录,其中`include`中包含C-API的头文件,`lib`中包含一个Raspberry Pi版本的库。 diff --git a/doc/mobile/cross_compiling_for_raspberry_en.md b/doc/mobile/cross_compiling_for_raspberry_en.md deleted file mode 100644 index 3c1a5950ff9553bb725d5a96e3fdf2e5e9f6f95c..0000000000000000000000000000000000000000 --- a/doc/mobile/cross_compiling_for_raspberry_en.md +++ /dev/null @@ -1,62 +0,0 @@ -# Build PaddlePaddle for Raspberry Pi - -You may use any of the following two approaches to build the inference library of PaddlePaddle for Raspberry Pi: - -1. Build using SSH: Log in to a Raspberry Pi using SSH and build the library. The required development tools and third-party dependencies are listed in here: [`/Dockerfile`](https://github.com/PaddlePaddle/Paddle/blob/develop/Dockerfile). - -1. Cross-compile: We talk about how to cross-compile PaddlePaddle for Raspberry Pi on a Linux/x64 machine, in more detail in this article. - -## The Cross-Compiling Toolchain - -Step 1. Clone the Github repo by running the following command. - -```bash -git clone https://github.com/raspberrypi/tools.git -``` - -Step 2. Use the pre-built cross-compiler found in `./tools/tree/master/arm-bcm2708/gcc-linaro-arm-linux-gnueabihf-raspbian-x64`. To run it on a Linux computer, glibc version >= 2.14 is needed. - -## CMake Arguments - -CMake supports [cross-compiling](https://cmake.org/cmake/help/v3.0/manual/cmake-toolchains.7.html#cross-compiling). All CMake configuration arguments required for the cross-compilation for Raspberry Pi can be found in [`cmake/cross_compiling/raspberry_pi.cmake`](https://github.com/PaddlePaddle/Paddle/blob/develop/cmake/cross_compiling/raspberry_pi.cmake). - -Some important arguments that need to be set: - -- `CMAKE_SYSTEM_NAME`: The target platform. Must be `RPi`. - -- `RPI_TOOLCHAIN`: The absolute path of the cross-compiling toolchain. - -- `RPI_ARM_NEON`: Use ARM NEON Intrinsics. This is a required argument and set default to `ON`. - -- `HOST_C/CXX_COMPILER`: The C/C++ compiler for the host. It is used to build building tools running on the host, for example, protoc. - -A commonly-used CMake configuration is as follows: - -``` -cmake -DCMAKE_SYSTEM_NAME=RPi \ - -DRPI_TOOLCHAIN=your/path/to/arm-bcm2708/gcc-linaro-arm-linux-gnueabihf-raspbian-x64 \ - -DRPI_ARM_NEON=ON \ - -DCMAKE_INSTALL_PREFIX=your/path/to/install \ - -DWITH_GPU=OFF \ - -DWITH_C_API=ON \ - -DWITH_PYTHON=OFF \ - -DWITH_SWIG_PY=OFF \ - .. -``` - -To build the inference library, please set the argument WITH\_C\_API to ON: `WITH_C_API=ON`. - -You can add more arguments. For example, to minimize the size of the generated inference library, you may use `CMAKE_BUILD_TYPE=MinSizeRel`. For performance optimization, you may use `CMAKE_BUILD_TYPE=Release`. - -## Build and Install - -The following commands build the inference library of PaddlePaddle for Raspberry Pi and third-party dependencies. - -```bash -make -make install -``` - - The intermediate files will be stored in `build`. Third-party libraries will be located in `build/third_party`. If you have already built it for other platforms like Android or iOS, you may want to clear these directories by running the command: `rm -rf build`. - -The infernece library will be in `your/path/to/install/lib`, with related header files in `your/path/to/install/include`. diff --git a/doc/mobile/index_cn.rst b/doc/mobile/index_cn.rst deleted file mode 100644 index 56d1515005f6e40b084c6b2184c6a0b3e3a00496..0000000000000000000000000000000000000000 --- a/doc/mobile/index_cn.rst +++ /dev/null @@ -1,9 +0,0 @@ -移动端 -====== - -.. toctree:: - :maxdepth: 1 - - cross_compiling_for_android_cn.md - cross_compiling_for_ios_cn.md - cross_compiling_for_raspberry_cn.md diff --git a/doc/mobile/index_en.rst b/doc/mobile/index_en.rst deleted file mode 100644 index e0acdff0284e3bc84b2cc4a34a142ee01754f940..0000000000000000000000000000000000000000 --- a/doc/mobile/index_en.rst +++ /dev/null @@ -1,9 +0,0 @@ -Mobile -====== - -.. toctree:: - :maxdepth: 1 - - cross_compiling_for_android_en.md - cross_compiling_for_ios_en.md - cross_compiling_for_raspberry_en.md diff --git a/doc/survey/cluster_bootstrapping_tools.md b/doc/survey/cluster_bootstrapping_tools.md deleted file mode 100644 index 1cd9962700bb49866f1ed6987abc28b27888a23f..0000000000000000000000000000000000000000 --- a/doc/survey/cluster_bootstrapping_tools.md +++ /dev/null @@ -1,71 +0,0 @@ -# Cluster bootstrapping tool survey -## Abstract -In order to bring up a cluster from bare metal machine to a fully functional kubernetes cluster for Paddlepaddle to run, we need to utilize some tools. Here we are going to compare [Sextant](https://github.com/k8sp/sextant) and [Tectonic installer](https://github.com/coreos/tectonic-installer) - -## Basic assumptions -Here are some basic assumptions before we move on to details -1. You are an administrator of a bare metal machine cluster, which means: - * you have full control to each of the machines. - * you have full control to the network which machines are connected to. -2. Machines can be booted from network with PEX or iPXE -3. You understand the [general procedure to bring up a cluster](#appendix-general-procedure-to-bring-up-a-cluster) - -if your cluster is able to mark above items with checkmarks, then keep reading. - -## Comparing Sextant and Tectonic installer -### Sextant -Sextant is an end2end solution to bring up a bare metal cluster to a fully functional k8s cluster, it integrates DHCP, name service, PEX, cloud-config-service, docker registry services altogether. - -#### Pros -1. End2End: basically all admin need to do is to config the cluster.yaml and power on the cluster. -2. Offline cluster configuration: Sextant has 2 phases during working with it, config time and deploy time. when admin is configuring, it requires admin's machine has internet connectivity, which will download some images, etc. But in deploy time, it's completely OK to go offline since all dependencies are ready during config time. -3. docker registry integrated. -4. GPU machine took care of. - -### Cons -1. k8s API server is not deployed with high availability in considering by default. -2. No grouping support. -3. No API interface, a one-off service. - - -### Tectonic installer -First of all, Tectonic is not free, it requires coreos.com account as a step of installation, and free user can only create less than 10 nodes. - -Tectonic is a suite of software which wraps around k8s and providing more utility regarding dev ops, ie, -Tectonic installer as it's named, it installs Tectonic to a bare metal cluster which means it's not totally an equivalent of Sextant. At the "booting a cluster" part, it mostly utilizes [Matchbox](https://github.com/coreos/matchbox), which is a general cluster bootstrapper. - -Matchbox's Approach is similar to Sexstant. - -### Pros -1. supports grouping machines. -2. supports running provisioning service in rtk. (not a big deal though). -3. supports http/gRPC API interface. -4. supports multi-template. - -### Cons -1. Not an e2e solution to bring up a cluster, need a lot of extra work and other software. -2. [Not fully supporting](https://github.com/coreos/matchbox/issues/550) centOS deployment yet. - -## Conclusion -Sextant is a better solution overall for paddle cloud deploying to a bare metal cluster. It would be great if Sextant can also 1) deploy k8s api server with high availability by default; 2) not designed as a one-off service. - - - -## Appendix: General procedure to bring up a cluster -It's physically impossible for a cluster admin to manually install OS and applications into cluster nodes one by one, here is what an admin would do in cloud industry: -1. setup a bootstrap machine with static IP in the cluster, which has following services: - * DHCP: assigns ip address for rest of the nodes. - * name service: to map node name to a IP - * PXE related services: the booting related info will be delivered to newly booted machines as their IP is assigned via DHCP service, PXE service will provide further booting and installing info and image with TFTP and http protocol. - * cluster config service: this is for providing cluster node with OS config via http - * optional docker registry: a built-in docker registry makes the whole cluster independent from connecting internet, and speeds up software distribution. -2. New node powers on, it will - * broadcast the request for an IP address - * DHCP server assigns the IP address, and deliver the PXE booting related info to the node. - * cluster node will request config files with booting info delivered with DHCP via the TFTP service, and in most of the cases, the config file will point to a http service for the booting image. - * Since PXE is configured with initrd, it will utilize the cloud config service and do further installations like coreOS or K8s installations. - * then restart the node. - -For further understanding, following 2 links from Matchbox are some good readings: -* [Machine lifecycle](https://github.com/coreos/matchbox/blob/master/Documentation/machine-lifecycle.md) -* [PXE booting](https://github.com/coreos/matchbox/blob/master/Documentation/network-booting.md) diff --git a/doc/survey/dynamic_graph.md b/doc/survey/dynamic_graph.md deleted file mode 100644 index 7f62eeadff43af1f0a3c81e284a6508bf063b21e..0000000000000000000000000000000000000000 --- a/doc/survey/dynamic_graph.md +++ /dev/null @@ -1,379 +0,0 @@ -# Automatic Differentiation with the Tape - -## Automatic Differentiation - -A key challenge in deep learning is to automatically derive the backward pass given the forward pass as a program, which has been long studied in the field of [automatic differentiation](https://arxiv.org/pdf/1502.05767.pdf), or autodiff, before the prosperity of deep learning. - -## Program Transformation v.s. Backtracking - -Given the forward pass program, there are two strategies to derive the backward pass: - -1. by transforming the forward pass program without executing it, or -1. by backtracking the execution process of the forward pass program. - -This article is about the latter strategy. - -## The Tape and Dynamic Networks - -We refer to the trace of the execution of the forward pass program as a *tape* [[1]](http://www.bcl.hamilton.ie/~barak/papers/toplas-reverse.pdf). When we train a deep learning model, the tape changes every iteration as the input data change, so we'd have to re-derive the backward pass, which is time-consuming, but also eases the case that the forward program includes control flows like if-else and for/while. With these control flows, the execution trace might change with iterations. Such changes are known as *dynamic networks* in the field of deep learning. - -## Typical Systems - -Deep learning systems that utilize the idea of dynamic networks gained their popularities in recent years. This article surveys the following typical systems: - -- [DyNet](https://dynet.readthedocs.io/en/latest/) -- [PyTorch](https://pytorch.org/) -- Chainer -- Autograd from HIPS - -Before diving into these systems, let us pose an example forward pass program: - -```python -x = Variable(randn(20, 1))) -label = Variable(randint(1)) -W_1, W_2 = Variable(randn(20, 20)), Variable(randn(10, 20)) -h = matmul(W_1, x) -pred = matmul(W_2, h) -loss = softmax(pred, label) -loss.backward() -``` - -## The Representation of Tapes - -### DyNet: the Tape as a List - -DyNet uses a linear data structure, a list, to represent the tape. During the execution of the above example, it is a list of operators: `matmul`, `matmul`, and `softmax`. The list also includes information needed to do the backward pass, such as pointers to the inputs and outputs. Then the tape is played in reverse order at `loss.backward().` - -
- -digraph g { - graph [ - rankdir = "LR" - ]; - node [ - fontsize = "16" - shape = "ellipse" - ]; - edge []; - "node0" [ - label = " type: matmul | input: W_1, x | output: h" - shape = "record" - ]; - "node1" [ - label = " type: matmul | input: W_2, h | output: pred" - shape = "record" - ]; - "node2" [ - label = " type: softmax | input: pred, label | output: loss" - shape = "record" - ]; - "node0":f0 -> "node1":f0 []; - "node1":f0 -> "node2":f0 []; -} -
- -![Alt text](https://g.gravizo.com/svg?digraph%20g%20{%20graph%20[%20rankdir%20=%20%22LR%22%20];%20node%20[%20fontsize%20=%20%2216%22%20shape%20=%20%22ellipse%22%20];%20edge%20[];%20%22node0%22%20[%20label%20=%20%22%3Cf0%3E%20type:%20matmul%20|%20%3Cf1%3E%20input:%20W_1,%20x%20|%20%3Cf2%3E%20output:%20h%22%20shape%20=%20%22record%22%20];%20%22node1%22%20[%20label%20=%20%22%3Cf0%3E%20type:%20matmul%20|%20%3Cf1%3E%20input:%20W_2,%20h%20|%20%3Cf2%3E%20output:%20pred%22%20shape%20=%20%22record%22%20];%20%22node2%22%20[%20label%20=%20%22%3Cf0%3E%20type:%20softmax%20|%20%3Cf1%3E%20input:%20pred,%20label%20|%20%3Cf2%3E%20output:%20loss%22%20shape%20=%20%22record%22%20];%20%22node0%22:f0%20-%3E%20%22node1%22:f0%20[%20id%20=%200%20];%20%22node1%22:f0%20-%3E%20%22node2%22:f0%20[%20id%20=%201%20];%20}) - -### PyTorch: the Tape as a Graph - -The graph is composed of `Variable`s and `Function`s. During the forward execution, a `Variable` records its creator function, e.g. `h.creator = matmul`. And a Function records its inputs' previous/dependent functions `prev_func` through `creator`, e.g. `matmul.prev_func = matmul1`. At `loss.backward()`, a topological sort is performed on all `prev_func`s. Then the grad op is performed by the sorted order. Please be aware that a `Function` might have more than one `prev_func`s. - -
- -digraph g { - graph [ - rankdir = "LR" - ]; - - subgraph function { - node [ - fontsize = "16" - style = filled - shape = "record" - ]; - "matmul0" [ label = " type: matmul | prev_func: None" ]; - "matmul1" [ label = " type: matmul | prev_func: matmul" ]; - "softmax" [ label = " type: softmax | prev_func: matmul" ]; - } - - subgraph variable { - node [ - fontsize = "16" - shape = "Mrecord" - style = filled - fillcolor = white - ]; - "x" [ label = " x | creator: None" ]; - "label" [ label = " label | creator: None" ]; - "W_1" [ label = " W_1 | creator: None" ]; - "W_2" [ label = " W_2 | creator: None" ]; - "h" [ label = " h | creator: None" ]; - "pred" [ label = " pred | creator: matmul" ]; - "loss" [ label = " loss | creator: softmax" ]; - } - - subgraph data_flow { - "x":f0 -> "matmul0":f0; - "W_1":f0 -> "matmul0":f0; - "matmul0":f0 -> "h":f0; - - "h":f0 -> "matmul1":f0; - "W_2":f0 -> "matmul1":f0; - "matmul1":f0 -> "pred":f0; - - "pred":f0 -> "softmax":f0; - "label":f0 -> "softmax":f0; - "softmax":f0 -> "loss":f0; - } - - subgraph prev_func { - edge [color="red", arrowsize="0.6", penwidth="1", constraint=false]; - "matmul1":f1 -> "matmul0":f0; - "softmax":f1 -> "matmul1":f0; - label = "prev_func"; - } -} -
- -![Alt text](https://g.gravizo.com/svg?digraph%20g%20{%20graph%20[%20rankdir%20=%20%22LR%22%20];%20subgraph%20function%20{%20node%20[%20fontsize%20=%20%2216%22%20style%20=%20filled%20shape%20=%20%22record%22%20];%20%22matmul0%22%20[%20label%20=%20%22%3Cf0%3E%20type:%20matmul%20|%20prev_func:%20None%22%20];%20%22matmul1%22%20[%20label%20=%20%22%3Cf0%3E%20type:%20matmul%20|%20prev_func:%20matmul%22%20];%20%22softmax%22%20[%20label%20=%20%22%3Cf0%3E%20type:%20softmax%20|%20prev_func:%20matmul%22%20];%20}%20subgraph%20variable%20{%20node%20[%20fontsize%20=%20%2216%22%20shape%20=%20%22Mrecord%22%20style%20=%20filled%20fillcolor%20=%20white%20];%20%22x%22%20[%20label%20=%20%22%3Cf0%3E%20x%20|%20%3Cf1%3E%20creator:%20None%22%20];%20%22label%22%20[%20label%20=%20%22%3Cf0%3E%20label%20|%20%3Cf1%3E%20creator:%20None%22%20];%20%22W_1%22%20[%20label%20=%20%22%3Cf0%3E%20W_1%20|%20%3Cf1%3E%20creator:%20None%22%20];%20%22W_2%22%20[%20label%20=%20%22%3Cf0%3E%20W_2%20|%20%3Cf1%3E%20creator:%20None%22%20];%20%22h%22%20[%20label%20=%20%22%3Cf0%3E%20h%20|%20%3Cf1%3E%20creator:%20None%22%20];%20%22pred%22%20[%20label%20=%20%22%3Cf0%3E%20pred%20|%20%3Cf1%3E%20creator:%20matmul%22%20];%20%22loss%22%20[%20label%20=%20%22%3Cf0%3E%20loss%20|%20%3Cf1%3E%20creator:%20softmax%22%20];%20}%20subgraph%20data_flow%20{%20%22x%22:f0%20-%3E%20%22matmul0%22:f0;%20%22W_1%22:f0%20-%3E%20%22matmul0%22:f0;%20%22matmul0%22:f0%20-%3E%20%22h%22:f0;%20%22h%22:f0%20-%3E%20%22matmul1%22:f0;%20%22W_2%22:f0%20-%3E%20%22matmul1%22:f0;%20%22matmul1%22:f0%20-%3E%20%22pred%22:f0;%20%22pred%22:f0%20-%3E%20%22softmax%22:f0;%20%22label%22:f0%20-%3E%20%22softmax%22:f0;%20%22softmax%22:f0%20-%3E%20%22loss%22:f0;%20}%20subgraph%20prev_func%20{%20edge%20[color=%22red%22,%20arrowsize=%220.6%22,%20penwidth=%221%22,%20constraint=false];%20%22matmul1%22:f1%20-%3E%20%22matmul0%22:f0;%20%22softmax%22:f1%20-%3E%20%22matmul1%22:f0;%20label%20=%20%22prev_func%22;%20}%20}) - -Chainer and Autograd use the similar techniques to record the forward pass. For details, please refer to the appendix. - -## Comparison: List v.s. Graph - -The list of DyNet could be considered the result of the topological sort of the graph of PyTorch. Or, the graph is the raw representation of the tape, which gives us the chance to *prune* part of the graph that is irrelevant with the backward pass before the topological sort [[2]](https://openreview.net/pdf?id=BJJsrmfCZ). Consider the following example, PyTorch only does backward on `SmallNet` while DyNet does both `SmallNet` and `BigNet`: - -```python -result = BigNet(data) -loss = SmallNet(data) -loss.backward() -``` - -## Lazy v.s. Immediate Evaluation - -Another difference between DyNet and PyTorch is that DyNet lazily evaluates the forward pass, whereas PyTorch executes it immediately. Consider the following example: - -```python -for epoch in range(num_epochs): - for in_words, out_label in training_data: - dy.renew_cg() - W = dy.parameter(W_p) - b = dy.parameter(b_p) - score_sym = dy.softmax(W*dy.concatenate([E[in_words[0]],E[in_words[1]]])+b) - loss_sym = dy.pickneglogsoftmax(score_sym, out_label) - loss_val = loss_sym.value() - loss_sym.backward() -``` - -The computation of `lookup`, `concat`, `matmul` and `softmax` didn't happen until the call of `loss_sym.value()`. This defered execution is useful because it allows some graph-like optimization possible, e.g. kernel fusion. - -PyTorch chooses immediate evaluation. It avoids ever materializing a "forward graph"/"tape" (no need to explicitly call `dy.renew_cg()` to reset the list), recording only what is necessary to differentiate the computation, i.e. `creator` and `prev_func`. - - -## Fluid: Learning the Lessons - -Please refer to `paddle/contrib/dynamic/`. - -## Appendix - -### Overview - -| Framework | Has Tape | Core in C++ | First Release Date | -|-----------|----------|-------------|--------------------| -| Autograd | No | No | Mar 5, 2015 | -| Chainer | No | No | Jun 5, 2015 | -| Pytorch | No | Yes | Aug 31, 2016 | -| Dynet | Yes | Yes | Oct 12, 2016 | - -### Source Code -#### Autograd -[Backward code](https://github.com/HIPS/autograd/blob/442205dfefe407beffb33550846434baa90c4de7/autograd/core.py#L8-L40). In the forward pass, a graph of VJPNode is constructed. -```python -# User API -def make_grad(fun, x): - start_node = VJPNode.new_root() - end_value, end_node = trace(start_node, fun, x) - return backward_pass(g, end_node), end_value - -# trace the forward pass by creating VJPNodes -def trace(start_node, fun, x): - with trace_stack.new_trace() as t: - start_box = new_box(x, t, start_node) - end_box = fun(start_box) - return end_box._value, end_box._node - -def backward_pass(g, end_node): - outgrads = {end_node : (g, False)} - for node in toposort(end_node): - outgrad = outgrads.pop(node) - ingrads = node.vjp(outgrad[0]) - for parent, ingrad in zip(node.parents, ingrads): - outgrads[parent] = add_outgrads(outgrads.get(parent), ingrad) - return outgrad[0] - -# Every VJPNode corresponds to a op_grad -class VJPNode(Node): - __slots__ = ['parents', 'vjp'] - def __init__(self, value, fun, args, kwargs, parent_argnums, parents): - self.parents = parents - vjpmaker = primitive_vjps[fun] - self.vjp = vjpmaker(parent_argnums, value, args, kwargs) -``` -#### Chainer -Example Code -```python -# (1) Function Set definition, creates FunctionNode -model = FunctionSet( - l1=F.Linear(784, 100), - l2=F.Linear(100, 100), - l3=F.Linear(100, 10)).to_gpu() - -# (2) Optimizer Setup -opt = optimizers.SGD() -opt.setup(model) - -# (3) Forward computation -def forward(x, t): - h1 = F.relu(model.l1(x)) - h2 = F.relu(model.l2(h1)) - y = model.l3(h2) - return F.softmax_cross_entropy(y, t) - -# (4) Training loop -for epoch in xrange(n_epoch): - for i in xrange(0, N, b_size): - x = Variable(to_gpu(...)) - t = Variable(to_gpu(...)) - opt.zero_grads() - loss = forward(x, t) - loss.backward() - opt.update() -``` -In `forward(x, t)`, a graph of [`VariableNode`](https://github.com/chainer/chainer/blob/master/chainer/variable.py#L110) and [`FunctionNode`](https://github.com/chainer/chainer/blob/a69103a4aa59d5b318f39b01dbcb858d465b89cf/chainer/function_node.py#L19) is constructed. Every output's `VariableNode.creator` is pointed to the `FunctionNode`. -```python -class FunctionNode(object): - ... - def apply(self, inputs): - outputs = self.forward(inputs) - ret = tuple([variable.Variable(y, requires_grad=requires_grad) - for y in outputs]) - # Topological ordering - self.rank = max([x.rank for x in inputs]) if input_vars else 0 - # Add backward edges - for y in ret: - y.creator_node = self - self.inputs = tuple([x.node for x in input_vars]) - self.outputs = tuple([y.node for y in ret]) - - return ret -``` -`loss.backward()` will calculate the accumulated gradient of all variables. All the backward of `FunctionNode`s will be called based on the topological order. -```python -class VariableNode(object): - ... - def backward(self, retain_grad, loss_scale): - if self.creator_node is None: - return - - cand_funcs = [] - seen_set = set() - grads = {} - - # Initialize error by 1, if this is a loss variable - if self.data.size == 1 and self._grad_var is None: - self.grad = numpy.ones_like(self.data) - grads[self._node] = self._grad_var - - def add_cand(cand): - if cand not in seen_set: - # Negate since heapq is min-heap. This is a global variable - heapq.heappush(cand_funcs, (-cand.rank, len(seen_set), cand)) - seen_set.add(cand) - - add_cand(self.creator_node) - - while cand_funcs: - _, _, func = heapq.heappop(cand_funcs) - gxs = func.backward_accumulate(func.inputs, func.outputs, func.outputs.grad) - - for x, gx in enumerate(gxs): - if x in grads: - grads[x] += gx - else: - grads[x] = gx - - if x.creator_node is not None: - add_cand(x.creator_node) -``` - -#### PyTorch -Example Code -```python -x = Variable(torch.ones(5, 5)) -y = Variable(torch.ones(5, 5) * 4) -z = x ** 2 + x * 2 + x * y + y -z.backward(torch.ones(5, 5)) -``` -The trace is done by `Variable.creator` and `Function.previous_functions`. -```python -class Variable(object): - def __init__(self, tensor, creator=None, requires_grad=True): - if creator is None: - creator = Leaf(self, requires_grad) - self.data = tensor - self.creator = creator - self._grad = None - - def backward(self, gradient=None): - if gradient is None: - if self.data.numel() != 1: - raise RuntimeError('backward should be called only on a scalar (i.e. 1-element tensor) or with gradient w.r.t. the variable') - gradient = self.data.new(1).fill_(1) - self._execution_engine.run_backward(self, gradient) - -class Function(obejct): - # ... - def _do_forward(self, *input): - unpacked_input = tuple(arg.data for arg in input) - raw_output = self.forward(*unpacked_input) - - # mark output.creator = self for backward trace - output = tuple(Variable(tensor, self) for tensor in raw_output) - - self.previous_functions = [(arg.creator, id(arg)) for arg in input] - self.output_ids = {id(var): i for i, var in enumerate(output)} - return output - - def _do_backward(self, grad_output): - return self.backwaerd(grad_output) -``` -The [backward](https://github.com/pytorch/pytorch/blob/v0.1.1/torch/autograd/engine.py) is similar to Autograd. - -#### DyNet -Example code -```python -model = dy.model() -W_p = model.add_parameters((20, 100)) -b_p = model.add_parameters(20) -E = model.add_lookup_parameters((20000, 50)) -for epoch in range(num_epochs): - for in_words, out_label in training_data: - dy.renew_cg() # init tape - W = dy.parameter(W_p) - b = dy.parameter(b_p) - score_sym = dy.softmax(W*dy.concatenate([E[in_words[0]],E[in_words[1]]])+b) - loss_sym = dy.pickneglogsoftmax(score_sym, out_label) - loss_val = loss_sym.value() - loss_sym.backward() -``` -[forward](https://github.com/clab/dynet/blob/740a9626a13a2732544de142e256ad0d0a166658/dynet/exec.cc#L84-L158), [backward](https://github.com/clab/dynet/blob/740a9626a13a2732544de142e256ad0d0a166658/dynet/exec.cc#L166-L284). The trace is done by creating a tape of expressions in every iteration. Backward is done by traverse the tape in the reverse order. -```c++ -void SimpleExecutionEngine::backward(VariableIndex from_where, bool full) { - ... - for (int i = num_nodes - 1; i >= 0; --i) { - // each node corresponds to an op - node->backward(xs, node_fx, node_dEdfx, ai, node_dEdxai); - } - ... -} -``` diff --git a/doc/survey/op_fusion_design.md b/doc/survey/op_fusion_design.md deleted file mode 100644 index d6e48f4f58269b67450cb012f6dcc59e1083abba..0000000000000000000000000000000000000000 --- a/doc/survey/op_fusion_design.md +++ /dev/null @@ -1,20 +0,0 @@ -# Operator fusion -Fusing multiple operators together is an important method to optimize the program execution, particularly for GPU or other specialized accelerators. An obvious benefit is to avoid the overhead of saving the intermediate result back into global memory. - -There are generally two ways to fuse operators, fusing directly connected operators and fusing non directly connected operators. The first method is mainly used by [NNVM Compiler](https://github.com/dmlc/tvm/) and [XLA](https://www.tensorflow.org/performance/xla/). The second method is mainly used by Dynet and TensorFlow Fold to do auto-batching. The principle of fusing operator is according to some rules to combine multiple operations into one, for example, `Y = X * W` and `Z = Y + B` can be fused to `Z = X * W + B`, and `Y1 = X1 * W` and `Y2 = X2 * W` can be fused to `[Y1;Y2] = [X1;X2] * W`. In order to get a short-term profit, we decided to try to manually specify these rules. - -## Challenge -The challenge of fusing operators is: - - how to make the rules. - - how to implement these rules efficiently. - -### How to make the rules? - -The problem of determining the best single location for a fusion operator is an NP-hard combinatorial problem. After analysis the operators of the DL model, we found there are two group of operators can be fused explicitly, one is the simple and adjacent operations, for example, `tmp = x + y` and `z = Relu(tmp)`, and the other is the operators that have the same function, for example, a serials of `SGD` or `Momentum`. They usually appear in the model in a large number. So we should think about how to fuse them separately first. - -### How to implement these rules efficiently? -#### How to fuse the adjacent operations efficiently? -Here we use a template function to represent the fused operations. The pros of using a template function are that it is simple and efficient, and the cons are that it is not easy to expand, and it can only be used to express some simple operations. So taking into account our current needs, the template function is more appropriate. - -#### How to fuse the operators that have the same function efficiently? -We take SGD operator as an example, the training model may have hundreds of parameters and correspondingly have the same number of SGD operators. The expression(`w = w - lr*w_g`) of those operators is the same, so during of training, the executor will execute this expression hundreds time in CPU or other specialized accelerators. If we can fuse them and make the address of all `w` and all `w_g` continuous respectively, we only need execute one time. For some accelerators, the time of launching kernel is not neglected, so the time of hundreds of times of launching and executing kernel may be larger than launching and executing only once. There usually are many operators that similar to `SGD` in the DL model, such as `AllReduce` and `FC`. diff --git a/doc/templates/conf.py.cn.in b/doc/templates/conf.py.cn.in deleted file mode 100644 index 890f70615538af23cd05b9ffd685e870a5644cdb..0000000000000000000000000000000000000000 --- a/doc/templates/conf.py.cn.in +++ /dev/null @@ -1,151 +0,0 @@ -# -*- coding: utf-8 -*- -# -# documentation build configuration file, created by -# sphinx-quickstart on Thu Jul 23 19:40:08 2015. -# -# This file is execfile()d with the current directory set to its -# containing dir. -# -# Note that not all possible configuration values are present in this -# autogenerated file. -# -# All configuration values have a default; values that are commented out -# serve to show the default. -import sys -import os, subprocess -sys.path.insert(0, os.path.abspath('@PADDLE_BINARY_DIR@/python')) -import shlex -from recommonmark import parser, transform -@IMPORT_PADDLE_STRING@ -@IMPORT_PADDLEV2_STRING@ - -MarkdownParser = parser.CommonMarkParser -AutoStructify = transform.AutoStructify - -# If extensions (or modules to document with autodoc) are in another directory, -# add these directories to sys.path here. If the directory is relative to the -# documentation root, use os.path.abspath to make it absolute, like shown here. -templates_path = ["@PADDLE_SOURCE_DIR@/doc/templates"] - -# -- General configuration ------------------------------------------------ - -# General information about the project. -project = u'PaddlePaddle' -author = u'%s developers' % project -copyright = u'2016, %s' % author -github_doc_root = '' - -# add markdown parser -MarkdownParser.github_doc_root = github_doc_root -source_parsers = { - '.md': MarkdownParser, - '.Rmd': MarkdownParser, -} -os.environ['PADDLE_BUILD_DOC'] = '1' - -# Add any Sphinx extension module names here, as strings. They can be -# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom ones -extensions = [ - 'sphinx.ext.autodoc', - 'sphinx.ext.autosummary', - 'sphinx.ext.mathjax', - 'sphinx.ext.napoleon', - 'sphinx.ext.graphviz' -] -mathjax_path="https://cdn.bootcss.com/mathjax/2.7.0/MathJax.js" -table_styling_embed_css = True - -autodoc_member_order = 'bysource' - -# The suffix(es) of source filenames. -# You can specify multiple suffix as a list of string: -# source_suffix = ['.rst', '.md'] -source_suffix = ['.rst', '.md', '.Rmd'] - -# The encoding of source files. -source_encoding = 'utf-8' - -# The master toctree document. -master_doc = 'index_cn' - -# The language for content autogenerated by Sphinx. Refer to documentation -# for a list of supported languages. -# -# This is also used if you do content translation via gettext catalogs. -# Usually you set "language" from the command line for these cases. -language = 'zh_CN' - -# There are two options for replacing |today|: either, you set today to some -# non-false value, then it is used: -#today = '' -# Else, today_fmt is used as the format for a strftime call. -#today_fmt = '%B %d, %Y' - -# List of patterns, relative to source directory, that match files and -# directories to ignore when looking for source files. -exclude_patterns = ['_build', '**/*_en*', '*_en*', 'api/*'] - -# The reST default role (used for this markup: `text`) to use for all -# documents. -#default_role = None - -# If true, '()' will be appended to :func: etc. cross-reference text. -#add_function_parentheses = True - -# If true, the current module name will be prepended to all description -# unit titles (such as .. function::). -#add_module_names = True - -# If true, sectionauthor and moduleauthor directives will be shown in the -# output. They are ignored by default. -#show_authors = False - -# The name of the Pygments (syntax highlighting) style to use. -pygments_style = 'sphinx' - -# A list of ignored prefixes for module index sorting. -#modindex_common_prefix = [] - -# If true, keep warnings as "system message" paragraphs in the built documents. -#keep_warnings = False - -# If true, `todo` and `todoList` produce output, else they produce nothing. -todo_include_todos = False - -# -- Options for HTML output ---------------------------------------------- - -# The theme to use for HTML and HTML Help pages. See the documentation for -# a list of builtin themes. -html_theme = 'sphinx_rtd_theme' - -# Add any paths that contain custom static files (such as style sheets) here, -# relative to this directory. They are copied after the builtin static files, -# so a file named "default.css" will overwrite the builtin "default.css". -#html_static_path = [] - -# Output file base name for HTML help builder. -htmlhelp_basename = project + 'doc' - -# -- Options for LaTeX output --------------------------------------------- -latex_elements = { -} - -# Grouping the document tree into LaTeX files. List of tuples -# (source start file, target name, title, -# author, documentclass [howto, manual, or own class]). -latex_documents = [ - (master_doc, '%s.tex' % project, project, - author, 'manual'), -] - -# Use the .. admonition:: directive for Notes sections. -# False to use the .. rubric:: directive instead. -napoleon_use_admonition_for_notes = True - -def setup(app): - # Add hook for building doxygen xml when needed - # no c++ API for now - app.add_config_value('recommonmark_config', { - 'url_resolver': lambda url: github_doc_root + url, - }, True) - app.add_transform(AutoStructify) diff --git a/doc/templates/conf.py.en.in b/doc/templates/conf.py.en.in deleted file mode 100644 index 5b09464cb991f96127edec40f7dbbc97a8d82582..0000000000000000000000000000000000000000 --- a/doc/templates/conf.py.en.in +++ /dev/null @@ -1,152 +0,0 @@ -# -*- coding: utf-8 -*- -# -# documentation build configuration file, created by -# sphinx-quickstart on Thu Jul 23 19:40:08 2015. -# -# This file is execfile()d with the current directory set to its -# containing dir. -# -# Note that not all possible configuration values are present in this -# autogenerated file. -# -# All configuration values have a default; values that are commented out -# serve to show the default. -import sys -import os, subprocess -sys.path.insert(0, os.path.abspath('@PADDLE_BINARY_DIR@/python')) -import shlex -from recommonmark import parser, transform -@IMPORT_PADDLE_STRING@ -@IMPORT_PADDLEV2_STRING@ - - -MarkdownParser = parser.CommonMarkParser -AutoStructify = transform.AutoStructify - -# If extensions (or modules to document with autodoc) are in another directory, -# add these directories to sys.path here. If the directory is relative to the -# documentation root, use os.path.abspath to make it absolute, like shown here. -templates_path = ["@PADDLE_SOURCE_DIR@/doc/templates"] - -# -- General configuration ------------------------------------------------ - -# General information about the project. -project = u'PaddlePaddle' -author = u'%s developers' % project -copyright = u'2016, %s' % author -github_doc_root = '' - -# add markdown parser -MarkdownParser.github_doc_root = github_doc_root -source_parsers = { - '.md': MarkdownParser, - '.Rmd': MarkdownParser, -} -os.environ['PADDLE_BUILD_DOC'] = '1' - -# Add any Sphinx extension module names here, as strings. They can be -# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom ones -extensions = [ - 'sphinx.ext.autodoc', - 'sphinx.ext.autosummary', - 'sphinx.ext.mathjax', - 'sphinx.ext.napoleon', -] - - -autodoc_member_order = 'bysource' - - -# The suffix(es) of source filenames. -# You can specify multiple suffix as a list of string: -# source_suffix = ['.rst', '.md'] -source_suffix = ['.rst', '.md', '.Rmd'] - -# The encoding of source files. -source_encoding = 'utf-8' - -# The master toctree document. -master_doc = 'index_en' - -# The language for content autogenerated by Sphinx. Refer to documentation -# for a list of supported languages. -# -# This is also used if you do content translation via gettext catalogs. -# Usually you set "language" from the command line for these cases. -language = None - -# There are two options for replacing |today|: either, you set today to some -# non-false value, then it is used: -#today = '' -# Else, today_fmt is used as the format for a strftime call. -#today_fmt = '%B %d, %Y' - -# List of patterns, relative to source directory, that match files and -# directories to ignore when looking for source files. -exclude_patterns = ['_build', '**/*_cn*', '*_cn*', 'api/*'] - -# The reST default role (used for this markup: `text`) to use for all -# documents. -#default_role = None - -# If true, '()' will be appended to :func: etc. cross-reference text. -#add_function_parentheses = True - -# If true, the current module name will be prepended to all description -# unit titles (such as .. function::). -#add_module_names = True - -# If true, sectionauthor and moduleauthor directives will be shown in the -# output. They are ignored by default. -#show_authors = False - -# The name of the Pygments (syntax highlighting) style to use. -pygments_style = 'sphinx' - -# A list of ignored prefixes for module index sorting. -#modindex_common_prefix = [] - -# If true, keep warnings as "system message" paragraphs in the built documents. -#keep_warnings = False - -# If true, `todo` and `todoList` produce output, else they produce nothing. -todo_include_todos = False - -# -- Options for HTML output ---------------------------------------------- - -# The theme to use for HTML and HTML Help pages. See the documentation for -# a list of builtin themes. -html_theme = 'sphinx_rtd_theme' - -# Add any paths that contain custom static files (such as style sheets) here, -# relative to this directory. They are copied after the builtin static files, -# so a file named "default.css" will overwrite the builtin "default.css". -#html_static_path = [] - -# Output file base name for HTML help builder. -htmlhelp_basename = project + 'doc' - -# -- Options for LaTeX output --------------------------------------------- -latex_elements = { -} - -# Grouping the document tree into LaTeX files. List of tuples -# (source start file, target name, title, -# author, documentclass [howto, manual, or own class]). -latex_documents = [ - (master_doc, '%s.tex' % project, project, - author, 'manual'), -] - -# Use the .. admonition:: directive for Notes sections. -# False to use the .. rubric:: directive instead. -napoleon_use_admonition_for_notes = True - -def setup(app): - # Add hook for building doxygen xml when needed - # no c++ API for now - app.add_config_value('recommonmark_config', { - 'url_resolver': lambda url: github_doc_root + url, - 'enable_eval_rst': True, - }, True) - app.add_transform(AutoStructify) diff --git a/doc/templates/layout.html b/doc/templates/layout.html deleted file mode 100644 index 5091eb32eaeff77bd40f5d348e887b99b6eff4ea..0000000000000000000000000000000000000000 --- a/doc/templates/layout.html +++ /dev/null @@ -1,23 +0,0 @@ -{# layout.html #} -{# Import the theme's layout. #} -{% extends "!layout.html" %} - -{# SIDE NAV, TOGGLES ON MOBILE #} -{% block menu %} - -{% endblock %} - -{%- block extrahead %} - -{% endblock %} diff --git a/doc/v2/CMakeLists.txt b/doc/v2/CMakeLists.txt deleted file mode 100644 index d230a1b9217eea6740419822f350096e361a4435..0000000000000000000000000000000000000000 --- a/doc/v2/CMakeLists.txt +++ /dev/null @@ -1,54 +0,0 @@ -if(NOT DEFINED SPHINX_THEME) - set(SPHINX_THEME default) -endif() - -if(NOT DEFINED SPHINX_THEME_DIR) - set(SPHINX_THEME_DIR) -endif() - -# configured documentation tools and intermediate build results -set(BINARY_BUILD_DIR_EN "${CMAKE_CURRENT_BINARY_DIR}/en/_build") - -# Sphinx cache with pickled ReST documents -set(SPHINX_CACHE_DIR_EN "${CMAKE_CURRENT_BINARY_DIR}/en/_doctrees") - -# HTML output director -set(SPHINX_HTML_DIR_EN "${CMAKE_CURRENT_BINARY_DIR}/en/html") - -set(IMPORT_PADDLE_STRING "") -set(IMPORT_PADDLEV2_STRING "") - -configure_file( - "${CMAKE_CURRENT_SOURCE_DIR}/../templates/conf.py.en.in" - "${BINARY_BUILD_DIR_EN}/conf.py" - @ONLY) - -sphinx_add_target(paddle_v2_docs - html - ${BINARY_BUILD_DIR_EN} - ${SPHINX_CACHE_DIR_EN} - ${CMAKE_CURRENT_SOURCE_DIR} - ${SPHINX_HTML_DIR_EN}) - -# configured documentation tools and intermediate build results -set(BINARY_BUILD_DIR_CN "${CMAKE_CURRENT_BINARY_DIR}/cn/_build") - -# Sphinx cache with pickled ReST documents -set(SPHINX_CACHE_DIR_CN "${CMAKE_CURRENT_BINARY_DIR}/cn/_doctrees") - -# HTML output directory -set(SPHINX_HTML_DIR_CN "${CMAKE_CURRENT_BINARY_DIR}/cn/html") - -configure_file( - "${CMAKE_CURRENT_SOURCE_DIR}/../templates/conf.py.cn.in" - "${BINARY_BUILD_DIR_CN}/conf.py" - @ONLY) - -sphinx_add_target(paddle_v2_docs_cn - html - ${BINARY_BUILD_DIR_CN} - ${SPHINX_CACHE_DIR_CN} - ${CMAKE_CURRENT_SOURCE_DIR} - ${SPHINX_HTML_DIR_CN}) - -add_subdirectory(api) diff --git a/doc/v2/api/CMakeLists.txt b/doc/v2/api/CMakeLists.txt deleted file mode 100644 index 0c74522cb089b17c8419e9058f76631b0fe0df93..0000000000000000000000000000000000000000 --- a/doc/v2/api/CMakeLists.txt +++ /dev/null @@ -1,25 +0,0 @@ -# configured documentation tools and intermediate build results -set(BINARY_BUILD_DIR_EN "${CMAKE_CURRENT_BINARY_DIR}/en/_build") - -# Sphinx cache with pickled ReST documents -set(SPHINX_CACHE_DIR_EN "${CMAKE_CURRENT_BINARY_DIR}/en/_doctrees") - -# HTML output director -set(SPHINX_HTML_DIR_EN "${CMAKE_CURRENT_BINARY_DIR}/en/html") - -set(IMPORT_PADDLE_STRING "import paddle") -set(IMPORT_PADDLEV2_STRING "import paddle.v2") - -configure_file( - "${CMAKE_CURRENT_SOURCE_DIR}/../../templates/conf.py.en.in" - "${BINARY_BUILD_DIR_EN}/conf.py" - @ONLY) - -sphinx_add_target(paddle_v2_apis - html - ${BINARY_BUILD_DIR_EN} - ${SPHINX_CACHE_DIR_EN} - ${CMAKE_CURRENT_SOURCE_DIR} - ${SPHINX_HTML_DIR_EN}) - -add_dependencies(paddle_v2_apis gen_proto_py framework_py_proto copy_paddle_pybind paddle_python) diff --git a/doc/v2/api/config/activation.rst b/doc/v2/api/config/activation.rst deleted file mode 100644 index 5317e66b64bbd85c61f19700a9d2c1d239dee573..0000000000000000000000000000000000000000 --- a/doc/v2/api/config/activation.rst +++ /dev/null @@ -1,108 +0,0 @@ -=========== -Activation -=========== - -Abs -=== - -.. automodule:: paddle.v2.activation - :members: Abs - :noindex: - -Exp -=== - -.. automodule:: paddle.v2.activation - :members: Exp - :noindex: - -Identity -======== - -.. automodule:: paddle.v2.activation - :members: Identity - :noindex: - -Linear -====== - -.. automodule:: paddle.v2.activation - :members: Linear - :noindex: - -Log -=== - -.. automodule:: paddle.v2.activation - :members: Log - :noindex: - -Square -====== - -.. automodule:: paddle.v2.activation - :members: Square - :noindex: - -Sigmoid -======= - -.. automodule:: paddle.v2.activation - :members: Sigmoid - :noindex: - -Softmax -======= - -.. automodule:: paddle.v2.activation - :members: Softmax - :noindex: - -SequenceSoftmax -=============== - -.. automodule:: paddle.v2.activation - :members: SequenceSoftmax - :noindex: - -Relu -==== - -.. automodule:: paddle.v2.activation - :members: Relu - :noindex: - -BRelu -===== - -.. automodule:: paddle.v2.activation - :members: BRelu - :noindex: - -SoftRelu -======== - -.. automodule:: paddle.v2.activation - :members: SoftRelu - :noindex: - -Tanh -==== - -.. automodule:: paddle.v2.activation - :members: Tanh - :noindex: - -STanh -===== - -.. automodule:: paddle.v2.activation - :members: STanh - :noindex: - -SoftSign -======== - -.. automodule:: paddle.v2.activation - :members: SoftSign - :noindex: diff --git a/doc/v2/api/config/attr.rst b/doc/v2/api/config/attr.rst deleted file mode 100644 index a93f41b86779200d8bac651614f4d61f4895875f..0000000000000000000000000000000000000000 --- a/doc/v2/api/config/attr.rst +++ /dev/null @@ -1,6 +0,0 @@ -Parameter Attribute -=================== - -.. automodule:: paddle.v2.attr - :members: - :noindex: diff --git a/doc/v2/api/config/evaluators.rst b/doc/v2/api/config/evaluators.rst deleted file mode 100644 index 458d892e825a7a9bbe7843ad5c508bd5a31f5f0f..0000000000000000000000000000000000000000 --- a/doc/v2/api/config/evaluators.rst +++ /dev/null @@ -1,110 +0,0 @@ -.. _api_v2: - -========== -Evaluators -========== - -Classification -============== - -classification_error --------------------- -.. automodule:: paddle.v2.evaluator - :members: classification_error - :noindex: - -auc ---- -.. automodule:: paddle.v2.evaluator - :members: auc - :noindex: - -ctc_error ---------- -.. automodule:: paddle.v2.evaluator - :members: ctc_error - :noindex: - -chunk ------ -.. automodule:: paddle.v2.evaluator - :members: chunk - :noindex: - -precision_recall ----------------- -.. automodule:: paddle.v2.evaluator - :members: precision_recall - :noindex: - -Rank -==== - -pnpair ------- -.. automodule:: paddle.v2.evaluator - :members: pnpair - :noindex: - -Utils -===== - -sum ---- -.. automodule:: paddle.v2.evaluator - :members: sum - :noindex: - -column_sum ----------- -.. automodule:: paddle.v2.evaluator - :members: column_sum - :noindex: - -Print -===== - -classification_error_printer ----------------------------- -.. automodule:: paddle.v2.evaluator - :members: classification_error_printer - :noindex: - -gradient_printer ----------------- -.. automodule:: paddle.v2.evaluator - :members: gradient_printer - :noindex: - -maxid_printer -------------- -.. automodule:: paddle.v2.evaluator - :members: maxid_printer - :noindex: - -maxframe_printer ----------------- -.. automodule:: paddle.v2.evaluator - :members: maxframe_printer - :noindex: - -seqtext_printer ---------------- -.. automodule:: paddle.v2.evaluator - :members: seqtext_printer - :noindex: - -value_printer -------------- -.. automodule:: paddle.v2.evaluator - :members: value_printer - :noindex: - -Detection -========== - -detection_map -------------- -.. automodule:: paddle.v2.evaluator - :members: detection_map - :noindex: diff --git a/doc/v2/api/config/layer.rst b/doc/v2/api/config/layer.rst deleted file mode 100644 index 5a0cfadfce84df41defdf518b7c3a6222d5b30a1..0000000000000000000000000000000000000000 --- a/doc/v2/api/config/layer.rst +++ /dev/null @@ -1,552 +0,0 @@ -.. _api_v2.layer: - -====== -Layers -====== - -Data layer -=========== - -.. _api_v2.layer_data: - -data ----- -.. autofunction:: paddle.v2.layer.data - :noindex: - -Fully Connected Layers -====================== - -.. _api_v2.layer_fc: - -fc --- -.. autofunction:: paddle.v2.layer.fc - :noindex: - -selective_fc ------------- -.. autofunction:: paddle.v2.layer.selective_fc - :noindex: - -Conv Layers -=========== - -conv_operator -------------- -.. autofunction:: paddle.v2.layer.conv_operator - :noindex: - -conv_projection ---------------- -.. autofunction:: paddle.v2.layer.conv_projection - :noindex: - -conv_shift ----------- -.. autofunction:: paddle.v2.layer.conv_shift - :noindex: - -img_conv --------- -.. autofunction:: paddle.v2.layer.img_conv - :noindex: - -.. _api_v2.layer_context_projection: - -context_projection ------------------- -.. autofunction:: paddle.v2.layer.context_projection - :noindex: - -row_conv --------- -.. autofunction:: paddle.v2.layer.row_conv - :noindex: - -Image Pooling Layer -=================== - -img_pool --------- -.. autofunction:: paddle.v2.layer.img_pool - :noindex: - -spp ---- -.. autofunction:: paddle.v2.layer.spp - :noindex: - -maxout ------- -.. autofunction:: paddle.v2.layer.maxout - :noindex: - -roi_pool --------- -.. autofunction:: paddle.v2.layer.roi_pool - :noindex: - -pad ----- -.. autofunction:: paddle.v2.layer.pad - :noindex: - -Norm Layer -========== - -img_cmrnorm ------------ -.. autofunction:: paddle.v2.layer.img_cmrnorm - :noindex: - -batch_norm ----------- -.. autofunction:: paddle.v2.layer.batch_norm - :noindex: - -sum_to_one_norm ---------------- -.. autofunction:: paddle.v2.layer.sum_to_one_norm - :noindex: - -cross_channel_norm ------------------- -.. autofunction:: paddle.v2.layer.cross_channel_norm - :noindex: - -row_l2_norm ------------ -.. autofunction:: paddle.v2.layer.row_l2_norm - :noindex: - -Recurrent Layers -================ - -recurrent ---------- -.. autofunction:: paddle.v2.layer.recurrent - :noindex: - -lstmemory ---------- -.. autofunction:: paddle.v2.layer.lstmemory - :noindex: - -grumemory ---------- -.. autofunction:: paddle.v2.layer.grumemory - :noindex: - -gated_unit ------------ -.. autofunction:: paddle.v2.layer.gated_unit - :noindex: - -Recurrent Layer Group -===================== - -memory ------- -.. autofunction:: paddle.v2.layer.memory - :noindex: - -recurrent_group ---------------- -.. autofunction:: paddle.v2.layer.recurrent_group - :noindex: - -lstm_step ---------- -.. autofunction:: paddle.v2.layer.lstm_step - :noindex: - -gru_step --------- -.. autofunction:: paddle.v2.layer.gru_step - :noindex: - -beam_search ------------- -.. autofunction:: paddle.v2.layer.beam_search - :noindex: - -get_output ----------- -.. autofunction:: paddle.v2.layer.get_output - :noindex: - -Mixed Layer -=========== - -.. _api_v2.layer_mixed: - -mixed ------ -.. autofunction:: paddle.v2.layer.mixed - :noindex: - -.. _api_v2.layer_embedding: - -embedding ---------- -.. autofunction:: paddle.v2.layer.embedding - :noindex: - -scaling_projection ------------------- -.. autofunction:: paddle.v2.layer.scaling_projection - :noindex: - -dotmul_projection ------------------ -.. autofunction:: paddle.v2.layer.dotmul_projection - :noindex: - -dotmul_operator ---------------- -.. autofunction:: paddle.v2.layer.dotmul_operator - :noindex: - -full_matrix_projection ----------------------- -.. autofunction:: paddle.v2.layer.full_matrix_projection - :noindex: - -identity_projection -------------------- -.. autofunction:: paddle.v2.layer.identity_projection - :noindex: - -slice_projection -------------------- -.. autofunction:: paddle.v2.layer.slice_projection - :noindex: - -table_projection ----------------- -.. autofunction:: paddle.v2.layer.table_projection - :noindex: - -trans_full_matrix_projection ----------------------------- -.. autofunction:: paddle.v2.layer.trans_full_matrix_projection - :noindex: - -Aggregate Layers -================ - -AggregateLevel --------------- -.. autoclass:: paddle.v2.layer.AggregateLevel - :noindex: - -.. _api_v2.layer_pooling: - -pooling -------- -.. autofunction:: paddle.v2.layer.pooling - :noindex: - -.. _api_v2.layer_last_seq: - -last_seq --------- -.. autofunction:: paddle.v2.layer.last_seq - :noindex: - -.. _api_v2.layer_first_seq: - -first_seq ---------- -.. autofunction:: paddle.v2.layer.first_seq - :noindex: - -sub_seq ---------- -.. autofunction:: paddle.v2.layer.sub_seq - :noindex: - -concat ------- -.. autofunction:: paddle.v2.layer.concat - :noindex: - -seq_concat ----------- -.. autofunction:: paddle.v2.layer.seq_concat - :noindex: - -seq_slice ---------- -.. autofunction:: paddle.v2.layer.seq_slice - :noindex: - -sub_nested_seq --------------- -.. autofunction:: paddle.v2.layer.sub_nested_seq - :noindex: - -Reshaping Layers -================ - -block_expand ------------- -.. autofunction:: paddle.v2.layer.block_expand - :noindex: - -.. _api_v2.layer_expand: - -ExpandLevel ------------ -.. autoclass:: paddle.v2.layer.ExpandLevel - :noindex: - -expand ------- -.. autofunction:: paddle.v2.layer.expand - :noindex: - -repeat ------- -.. autofunction:: paddle.v2.layer.repeat - :noindex: - -rotate ------- -.. autofunction:: paddle.v2.layer.rotate - :noindex: - -seq_reshape ------------ -.. autofunction:: paddle.v2.layer.seq_reshape - :noindex: - -Math Layers -=========== - -addto ------ -.. autofunction:: paddle.v2.layer.addto - :noindex: - -linear_comb ------------ -.. autofunction:: paddle.v2.layer.linear_comb - :noindex: - -interpolation -------------- -.. autofunction:: paddle.v2.layer.interpolation - :noindex: - -bilinear_interp ---------------- -.. autofunction:: paddle.v2.layer.bilinear_interp - :noindex: - -dropout --------- -.. autofunction:: paddle.v2.layer.dropout - :noindex: - -dot_prod ---------- -.. autofunction:: paddle.v2.layer.dot_prod - :noindex: - -out_prod --------- -.. autofunction:: paddle.v2.layer.out_prod - :noindex: - -power ------ -.. autofunction:: paddle.v2.layer.power - :noindex: - -scaling -------- -.. autofunction:: paddle.v2.layer.scaling - :noindex: - -clip ----- -.. autofunction:: paddle.v2.layer.clip - :noindex: - -resize ------- -.. autofunction:: paddle.v2.layer.resize - :noindex: - -slope_intercept ---------------- -.. autofunction:: paddle.v2.layer.slope_intercept - :noindex: - -tensor ------- -.. autofunction:: paddle.v2.layer.tensor - :noindex: - -.. _api_v2.layer_cos_sim: - -cos_sim -------- -.. autofunction:: paddle.v2.layer.cos_sim - :noindex: - -l2_distance ------------ -.. autofunction:: paddle.v2.layer.l2_distance - :noindex: - -trans ------ -.. autofunction:: paddle.v2.layer.trans - :noindex: - -scale_shift ------------ -.. autofunction:: paddle.v2.layer.scale_shift - :noindex: - -factorization_machine ---------------------- -.. autofunction:: paddle.v2.layer.factorization_machine - :noindex: - -Sampling Layers -=============== - -maxid ------ -.. autofunction:: paddle.v2.layer.max_id - :noindex: - -sampling_id ------------ -.. autofunction:: paddle.v2.layer.sampling_id - :noindex: - -multiplex ---------- -.. autofunction:: paddle.v2.layer.multiplex - :noindex: - -.. _api_v2.layer_costs: - -Cost Layers -=========== - -cross_entropy_cost ------------------- -.. autofunction:: paddle.v2.layer.cross_entropy_cost - :noindex: - -cross_entropy_with_selfnorm_cost --------------------------------- -.. autofunction:: paddle.v2.layer.cross_entropy_with_selfnorm_cost - :noindex: - -multi_binary_label_cross_entropy_cost -------------------------------------- -.. autofunction:: paddle.v2.layer.multi_binary_label_cross_entropy_cost - :noindex: - -classification_cost -------------------- -.. autofunction:: paddle.v2.layer.classification_cost - :noindex: - -huber_regression_cost -------------------------- -.. autofunction:: paddle.v2.layer.huber_regression_cost - :noindex: - -huber_classification_cost -------------------------- -.. autofunction:: paddle.v2.layer.huber_classification_cost - :noindex: - -lambda_cost ------------ -.. autofunction:: paddle.v2.layer.lambda_cost - :noindex: - -square_error_cost ------------------ -.. autofunction:: paddle.v2.layer.square_error_cost - :noindex: - -rank_cost ---------- -.. autofunction:: paddle.v2.layer.rank_cost - :noindex: - -sum_cost ---------- -.. autofunction:: paddle.v2.layer.sum_cost - :noindex: - -crf ---- -.. autofunction:: paddle.v2.layer.crf - :noindex: - -crf_decoding ------------- -.. autofunction:: paddle.v2.layer.crf_decoding - :noindex: - -ctc ---- -.. autofunction:: paddle.v2.layer.ctc - :noindex: - -warp_ctc --------- -.. autofunction:: paddle.v2.layer.warp_ctc - :noindex: - -nce ---- -.. autofunction:: paddle.v2.layer.nce - :noindex: - -hsigmoid ---------- -.. autofunction:: paddle.v2.layer.hsigmoid - :noindex: - -smooth_l1_cost --------------- -.. autofunction:: paddle.v2.layer.smooth_l1_cost - :noindex: - -multibox_loss --------------- -.. autofunction:: paddle.v2.layer.multibox_loss - :noindex: - -detection_output ----------------- -.. autofunction:: paddle.v2.layer.detection_output - :noindex: - -Check Layer -============ - -eos ---- -.. autofunction:: paddle.v2.layer.eos - :noindex: - -Activation -========== - -prelu --------- -.. autofunction:: paddle.v2.layer.prelu - :noindex: diff --git a/doc/v2/api/config/networks.rst b/doc/v2/api/config/networks.rst deleted file mode 100644 index 048379cf01f4aec5e73e2fe3ddfa728f3c17a5d1..0000000000000000000000000000000000000000 --- a/doc/v2/api/config/networks.rst +++ /dev/null @@ -1,132 +0,0 @@ -======== -Networks -======== - -The v2.networks module contains pieces of neural network that combine multiple layers. - -NLP -=== - -sequence_conv_pool ------------------- -.. automodule:: paddle.v2.networks - :members: sequence_conv_pool - :noindex: - -.. _api_trainer_config_helpers_network_text_conv_pool: - -text_conv_pool --------------- -.. automodule:: paddle.v2.networks - :members: text_conv_pool - :noindex: - -Images -====== - -img_conv_bn_pool ----------------- -.. automodule:: paddle.v2.networks - :members: img_conv_bn_pool - :noindex: - -img_conv_group --------------- -.. automodule:: paddle.v2.networks - :members: img_conv_group - :noindex: - -.. _api_trainer_config_helpers_network_simple_img_conv_pool: - -simple_img_conv_pool --------------------- -.. automodule:: paddle.v2.networks - :members: simple_img_conv_pool - :noindex: - -small_vgg ---------- -.. automodule:: paddle.v2.networks - :members: small_vgg - :noindex: - -vgg_16_network ---------------- -.. automodule:: paddle.v2.networks - :members: vgg_16_network - :noindex: - -Recurrent -========= - -LSTM ----- - -lstmemory_unit -`````````````` -.. automodule:: paddle.v2.networks - :members: lstmemory_unit - :noindex: - -lstmemory_group -``````````````` -.. automodule:: paddle.v2.networks - :members: lstmemory_group - :noindex: - -simple_lstm -``````````` -.. automodule:: paddle.v2.networks - :members: simple_lstm - :noindex: - -bidirectional_lstm -`````````````````` -.. automodule:: paddle.v2.networks - :members: bidirectional_lstm - :noindex: - -GRU ---- - -gru_unit -```````` -.. automodule:: paddle.v2.networks - :members: gru_unit - :noindex: - -gru_group -````````` -.. automodule:: paddle.v2.networks - :members: gru_group - :noindex: - -simple_gru -`````````` -.. automodule:: paddle.v2.networks - :members: simple_gru - :noindex: - -simple_gru2 -``````````` -.. automodule:: paddle.v2.networks - :members: simple_gru2 - :noindex: - -bidirectional_gru -`````````````````` -.. automodule:: paddle.v2.networks - :members: bidirectional_gru - :noindex: - -simple_attention ----------------- -.. automodule:: paddle.v2.networks - :members: simple_attention - :noindex: - -dot_product_attention ---------------------- -.. automodule:: paddle.v2.networks - :members: dot_product_attention - :noindex: diff --git a/doc/v2/api/config/optimizer.rst b/doc/v2/api/config/optimizer.rst deleted file mode 100644 index b32373fdef52a7aa9d64b12cda3f76cb2abf351b..0000000000000000000000000000000000000000 --- a/doc/v2/api/config/optimizer.rst +++ /dev/null @@ -1,45 +0,0 @@ -========== -Optimizer -========== - -Momentum -======== -.. automodule:: paddle.v2.optimizer - :members: Momentum - :noindex: - -Adam -==== -.. automodule:: paddle.v2.optimizer - :members: Adam - :noindex: - -Adamax -====== -.. automodule:: paddle.v2.optimizer - :members: Adamax - :noindex: - -AdaGrad -======= -.. automodule:: paddle.v2.optimizer - :members: AdaGrad - :noindex: - -DecayedAdaGrad -============== -.. automodule:: paddle.v2.optimizer - :members: DecayedAdaGrad - :noindex: - -AdaDelta -======== -.. automodule:: paddle.v2.optimizer - :members: AdaDelta - :noindex: - -RMSProp -======= -.. automodule:: paddle.v2.optimizer - :members: RMSProp - :noindex: diff --git a/doc/v2/api/config/pooling.rst b/doc/v2/api/config/pooling.rst deleted file mode 100644 index d26b365c9284632210a1532853e39feedc70758b..0000000000000000000000000000000000000000 --- a/doc/v2/api/config/pooling.rst +++ /dev/null @@ -1,46 +0,0 @@ -======= -Pooling -======= - -BasePool -======== -.. automodule:: paddle.v2.pooling - :members: BasePool - :noindex: - -Avg -=== -.. automodule:: paddle.v2.pooling - :members: Avg - :noindex: - -Max -=== -.. automodule:: paddle.v2.pooling - :members: Max - :noindex: - -Sum -=== -.. automodule:: paddle.v2.pooling - :members: Sum - :noindex: - -SquareRootN -=========== -.. automodule:: paddle.v2.pooling - :members: SquareRootN - :noindex: - -CudnnAvg -======== -.. automodule:: paddle.v2.pooling - :members: CudnnAvg - :noindex: - -CudnnMax -======== -.. automodule:: paddle.v2.pooling - :members: CudnnMax - :noindex: - diff --git a/doc/v2/api/data.rst b/doc/v2/api/data.rst deleted file mode 100644 index b56c7332cc284649c7e04328e51a7faa78593a39..0000000000000000000000000000000000000000 --- a/doc/v2/api/data.rst +++ /dev/null @@ -1,10 +0,0 @@ -================================== -Data Reader Interface and DataSets -================================== - -.. toctree:: - :maxdepth: 1 - - data/data_reader.rst - data/image.rst - data/dataset.rst diff --git a/doc/v2/api/data/data_reader.rst b/doc/v2/api/data/data_reader.rst deleted file mode 100644 index 1a35d0bbc8f9d751f49c7e1fc26feb1bcb3ae7f0..0000000000000000000000000000000000000000 --- a/doc/v2/api/data/data_reader.rst +++ /dev/null @@ -1,72 +0,0 @@ -===================== -Data Reader Interface -===================== - - -DataTypes -========= - -.. autofunction:: paddle.v2.data_type.dense_array - :noindex: - -.. autofunction:: paddle.v2.data_type.integer_value - :noindex: - -.. autofunction:: paddle.v2.data_type.integer_value_sequence - :noindex: - -.. autofunction:: paddle.v2.data_type.integer_value_sub_sequence - :noindex: - -.. autofunction:: paddle.v2.data_type.sparse_binary_vector - :noindex: - -.. autofunction:: paddle.v2.data_type.sparse_binary_vector_sequence - :noindex: - -.. autofunction:: paddle.v2.data_type.sparse_binary_vector_sub_sequence - :noindex: - -.. autofunction:: paddle.v2.data_type.sparse_float_vector - :noindex: - -.. autofunction:: paddle.v2.data_type.sparse_float_vector_sequence - :noindex: - -.. autofunction:: paddle.v2.data_type.sparse_float_vector_sub_sequence - :noindex: - -.. autofunction:: paddle.v2.data_type.sparse_non_value_slot - :noindex: - -.. autofunction:: paddle.v2.data_type.sparse_value_slot - :noindex: - -.. autoclass:: paddle.v2.data_type.InputType - :members: - :noindex: - -DataFeeder -========== - -.. automodule:: paddle.v2.data_feeder - :members: - :noindex: - -Reader -====== - -.. automodule:: paddle.reader - :members: - :noindex: - -.. automodule:: paddle.reader.creator - :members: - :noindex: - -minibatch -========= - -.. automodule:: paddle.v2.minibatch - :members: - :noindex: diff --git a/doc/v2/api/data/dataset.rst b/doc/v2/api/data/dataset.rst deleted file mode 100644 index e7c8be4452bf55e0967d750c2e624e8e316e9330..0000000000000000000000000000000000000000 --- a/doc/v2/api/data/dataset.rst +++ /dev/null @@ -1,82 +0,0 @@ -Dataset -======= - -.. automodule:: paddle.dataset - :members: - :noindex: - -mnist -+++++ - -.. automodule:: paddle.dataset.mnist - :members: - :noindex: - -cifar -+++++ - -.. automodule:: paddle.dataset.cifar - :members: - :noindex: - -conll05 -+++++++ - -.. automodule:: paddle.dataset.conll05 - :members: get_dict,get_embedding,test - :noindex: - -imdb -++++ - -.. automodule:: paddle.dataset.imdb - :members: - :noindex: - -imikolov -++++++++ - -.. automodule:: paddle.dataset.imikolov - :members: - :noindex: - -movielens -+++++++++ - -.. automodule:: paddle.dataset.movielens - :members: - :noindex: - -.. autoclass:: paddle.dataset.movielens.MovieInfo - :noindex: - -.. autoclass:: paddle.dataset.movielens.UserInfo - :noindex: - -sentiment -+++++++++ - -.. automodule:: paddle.dataset.sentiment - :members: - :noindex: - -uci_housing -+++++++++++ - -.. automodule:: paddle.dataset.uci_housing - :members: - :noindex: - -wmt14 -+++++ - -.. automodule:: paddle.dataset.wmt14 - :members: - :noindex: - -wmt16 -+++++ - -.. automodule:: paddle.dataset.wmt16 - :members: - :noindex: diff --git a/doc/v2/api/data/image.rst b/doc/v2/api/data/image.rst deleted file mode 100644 index 97651ffa6be56cf3ecaca2caca38a353fa5c1f49..0000000000000000000000000000000000000000 --- a/doc/v2/api/data/image.rst +++ /dev/null @@ -1,5 +0,0 @@ -Image Interface -=============== - -.. automodule:: paddle.v2.image - :members: diff --git a/doc/v2/api/index_en.rst b/doc/v2/api/index_en.rst deleted file mode 100644 index 5813509dce46677444f0234db8e0eaa4f113e3a0..0000000000000000000000000000000000000000 --- a/doc/v2/api/index_en.rst +++ /dev/null @@ -1,9 +0,0 @@ -API -=== - -.. toctree:: - :maxdepth: 1 - - model_configs.rst - data.rst - run_logic.rst diff --git a/doc/v2/api/model_configs.rst b/doc/v2/api/model_configs.rst deleted file mode 100644 index 992b559cbd87244612521d4c96f84f997d6c4196..0000000000000000000000000000000000000000 --- a/doc/v2/api/model_configs.rst +++ /dev/null @@ -1,13 +0,0 @@ -Model Configuration -=================== - -.. toctree:: - :maxdepth: 1 - - config/activation.rst - config/layer.rst - config/evaluators.rst - config/optimizer.rst - config/pooling.rst - config/networks.rst - config/attr.rst diff --git a/doc/v2/api/overview.rst b/doc/v2/api/overview.rst deleted file mode 100644 index a6f21428de1e4906e4af9433bc1c994f2b2c8b8e..0000000000000000000000000000000000000000 --- a/doc/v2/api/overview.rst +++ /dev/null @@ -1,12 +0,0 @@ -V2 API Overview -================ - -The PaddlePaddle V2 API is designed to provide a modern user interface for PaddlePaddle V1(the original layer-based platform of PaddlePaddle), -it proposes some high-level concepts such as `Layers `_ , `Optimizer `_ , `Evaluator `_ and `Data Reader `_ to make the model configuration more familiar to users. - -A model is composed of the computation described by a group of `Layers`, with `Evaluator` to define the error, `Optimizer` to update the parameters and `Data Reader` to feed in the data. - -We also provide the `interface for Training and Inference `_ to help control the training and inference phrase, -it has several easy to use methods to better expose the internal running details, different `events `_ are available to users by writing some callbacks. - -All in all, the V2 API gives a higher abstraction and make PaddlePaddle programs require fiew lines of code. diff --git a/doc/v2/api/run_logic.rst b/doc/v2/api/run_logic.rst deleted file mode 100644 index 5c97651f6536d89d2b5926d4b2907a547aa86b55..0000000000000000000000000000000000000000 --- a/doc/v2/api/run_logic.rst +++ /dev/null @@ -1,31 +0,0 @@ -====================== -Training and Inference -====================== - -Parameters -========== - -.. automodule:: paddle.v2.parameters - :members: Parameters - :noindex: - -Trainer -======= - -.. automodule:: paddle.v2.trainer - :members: SGD - :noindex: - -Event -===== - -.. automodule:: paddle.v2.event - :members: - :noindex: - -Inference -========= - -.. autofunction:: paddle.v2.infer - :noindex: - \ No newline at end of file diff --git a/doc/v2/build_and_install/build_from_source_cn.rst b/doc/v2/build_and_install/build_from_source_cn.rst deleted file mode 100644 index d0dacb104f148c2aeb323365cbd6f014ae00ed5a..0000000000000000000000000000000000000000 --- a/doc/v2/build_and_install/build_from_source_cn.rst +++ /dev/null @@ -1,225 +0,0 @@ -从源码编译 -====================== - -.. _requirements: - -需要的软硬件 ----------------- - -为了编译PaddlePaddle,我们需要 - -1. 一台电脑,可以装的是 Linux, Windows 或者 MacOS 操作系统 -2. Docker - -不需要依赖其他任何软件了。即便是 Python 和 GCC 都不需要,因为我们会把所有编译工具都安装进一个 Docker 镜像里。 - -.. _build_step: - -编译方法 ----------------- - -PaddlePaddle需要使用Docker环境完成编译,这样可以免去单独安装编译依赖的步骤,可选的不同编译环境Docker镜像 -可以在 `这里 `__ 找到,您也可以 -在 `这里 `__ 找到 paddle_manylinux_devel -镜像的编译以及使用方法。或者参考下述可选步骤,从源码中构建用于编译PaddlePaddle的Docker镜像。 - -如果您选择不使用Docker镜像,则需要在本机安装下面章节列出的 :ref:`编译依赖 <_compile_deps>` 之后才能开始编译的步骤。 - -编译PaddlePaddle,需要执行: - -.. code-block:: bash - - # 1. 获取源码 - git clone https://github.com/PaddlePaddle/Paddle.git - cd Paddle - # 2. 可选步骤:源码中构建用于编译PaddlePaddle的Docker镜像 - docker build -t paddle:dev . - # 3. 执行下面的命令编译CPU-Only的二进制 - docker run -it -v $PWD:/paddle -w /paddle -e "PYTHON_ABI=cp27-cp27mu" -e "WITH_GPU=OFF" -e "WITH_TESTING=OFF" paddlepaddle/paddle_manylinux_devel:cuda8.0_cudnn5 ./paddle/scripts/paddle_build.sh build - # 4. 或者也可以使用为上述可选步骤构建的镜像(必须先执行第2步) - docker run -it -v $PWD:/paddle -w /paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=OFF" paddle:dev ./paddle/scripts/paddle_build.sh build - -注: - -- 上述命令把当前目录(源码树根目录)映射为 container 里的 :code:`/paddle` 目录。 - -- 如果您使用的是 manylinux 的镜像进行编译, 那么您需要通过环境变量 :code:`PYTHON_ABI` 来指定一个 `Python ABI `__. -PaddlePaddle目前支持的 Python ABI 有 :code:`cp27-cp27m` 和 :code:`cp27-cp27mu`. - -编译完成后会在build/python/dist目录下生成输出的whl包,可以选在在当前机器安装也可以拷贝到目标机器安装: - -.. code-block:: bash - - pip install build/python/dist/*.whl - -如果机器中已经安装过PaddlePaddle,有两种方法: - -.. code-block:: bash - - 1. 先卸载之前的版本,再重新安装 - pip uninstall paddlepaddle - pip install build/python/dist/*.whl - - 2. 直接升级到更新的版本 - pip install build/python/dist/*.whl -U - -.. _run_test: - -执行单元测试 ----------------- - -如果您期望在编译完成后立即执行所有的单元测试,可以按照下面的方法: - -设置 :code:`RUN_TEST=ON` 和 :code:`WITH_TESTING=ON` 就会在完成编译之后,立即执行单元测试。 -开启 :code:`WITH_GPU=ON` 可以指定同时执行GPU上的单元测试。 - -.. code-block:: bash - - docker run -it -v $PWD:/paddle -w /paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=ON" -e "RUN_TEST=ON" paddlepaddle/paddle_manylinux_devel:cuda8.0_cudnn5 ./paddle/scripts/paddle_build.sh test - -如果期望执行其中一个单元测试,(比如 :code:`test_sum_op` ): - -.. code-block:: bash - - docker run -it -v $PWD:/paddle -w /paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=ON" -e "RUN_TEST=OFF" paddlepaddle/paddle_manylinux_devel:cuda8.0_cudnn5 /bin/bash - ./paddle/scripts/paddle_build.sh build - cd build - ctest -R test_sum_op -V - -.. _faq_docker: - -常见问题 ----------------- - -- 什么是 Docker? - - 如果您没有听说 Docker,可以把它想象为一个类似 virtualenv 的系统,但是虚拟的不仅仅是 Python 的运行环境。 - -- Docker 还是虚拟机? - - 有人用虚拟机来类比 Docker。需要强调的是:Docker 不会虚拟任何硬件,Docker container 里运行的编译工具实际上都是在本机的 CPU 和操作系统上直接运行的,性能和把编译工具安装在本机运行一样。 - -- 为什么用 Docker? - - 把工具和配置都安装在一个 Docker image 里可以标准化编译环境。这样如果遇到问题,其他人可以复现问题以便帮助。 - - 另外,对于习惯使用Windows和MacOS的开发者来说,使用Docker就不用配置交叉编译环境了。 - -- 我可以选择不用Docker吗? - - 当然可以。大家可以用把开发工具安装进入 Docker image 一样的方式,把这些工具安装到本机。这篇文档介绍基于 Docker 的开发流程,是因为这个流程比其他方法都更简便。 - -- 学习 Docker 有多难? - - 理解 Docker 并不难,大概花十分钟看一下 `如何使用Docker `_ 。这可以帮您省掉花一小时安装和配置各种开发工具,以及切换机器时需要新安装的辛苦。别忘了 PaddlePaddle 更新可能导致需要新的开发工具。更别提简化问题复现带来的好处了。 - -- 我可以用 IDE 吗? - - 当然可以,因为源码就在本机上。IDE 默认调用 make 之类的程序来编译源码,我们只需要配置 IDE 来调用 Docker 命令编译源码即可。 - - 很多 PaddlePaddle 开发者使用 Emacs。他们在自己的 `~/.emacs` 配置文件里加两行 - - .. code-block:: emacs - - (global-set-key "\C-cc" 'compile) - (setq compile-command "docker run --rm -it -v $(git rev-parse --show-toplevel):/paddle paddle:dev") - - 就可以按 `Ctrl-C` 和 `c` 键来启动编译了。 - -- 可以并行编译吗? - - 是的。我们的 Docker image 运行一个 `Paddle编译Bash脚本 `_ 。这个脚本调用 `make -j$(nproc)` 来启动和 CPU 核一样多的进程来并行编译。 - -- Docker 需要 sudo - - 如果用自己的电脑开发,自然也就有管理员权限(sudo)了。如果用公用的电脑开发,需要请管理员安装和配置好 Docker。此外,PaddlePaddle 项目在努力开始支持其他不需要 sudo 的集装箱技术,比如 rkt。 - -- 在 Windows/MacOS 上编译很慢 - - Docker 在 Windows 和 MacOS 都可以运行。不过实际上是运行在一个 Linux 虚拟机上。可能需要注意给这个虚拟机多分配一些 CPU 和内存,以保证编译高效。具体做法请参考 `如何为Windows/Mac计算机上的Docker增加内存和虚拟机 `_ 。 - -- 磁盘不够 - - 本文中的例子里,`docker run` 命令里都用了 `--rm` 参数,这样保证运行结束之后的 containers 不会保留在磁盘上。可以用 `docker ps -a` 命令看到停止后但是没有删除的 containers。`docker build` 命令有时候会产生一些中间结果,是没有名字的 images,也会占用磁盘。可以参考 `如何删除Docker Container `_ 来清理这些内容。 - - -.. _compile_deps: - -附录:编译依赖 ----------------- - -PaddlePaddle编译需要使用到下面的依赖(包含但不限于),其他的依赖软件,会自动在编译时下载。 - -.. csv-table:: PaddlePaddle编译依赖 - :header: "依赖", "版本", "说明" - :widths: 10, 15, 30 - - "CMake", ">=3.2", "" - "GCC", "4.8.2", "推荐使用CentOS的devtools2" - "Python", "2.7.x", "依赖libpython2.7.so" - "pip", ">=9.0", "" - "numpy", "", "" - "SWIG", ">=2.0", "" - "Go", ">=1.8", "可选" - - -.. _build_options: - -附录:编译选项 ----------------- - -PaddlePaddle的编译选项,包括生成CPU/GPU二进制文件、链接何种BLAS库等。 -用户可在调用cmake的时候设置它们,详细的cmake使用方法可以参考 -`官方文档 `_ 。 - -在cmake的命令行中,通过使用 ``-D`` 命令设置该类编译选项,例如: - -.. code-block:: bash - - cmake .. -DWITH_GPU=OFF - -.. csv-table:: 编译选项说明 - :header: "选项", "说明", "默认值" - :widths: 1, 7, 2 - - "WITH_GPU", "是否支持GPU", "ON" - "WITH_C_API", "是否仅编译CAPI", "OFF" - "WITH_DOUBLE", "是否使用双精度浮点数", "OFF" - "WITH_DSO", "是否运行时动态加载CUDA动态库,而非静态加载CUDA动态库。", "ON" - "WITH_AVX", "是否编译含有AVX指令集的PaddlePaddle二进制文件", "ON" - "WITH_PYTHON", "是否内嵌PYTHON解释器", "ON" - "WITH_STYLE_CHECK", "是否编译时进行代码风格检查", "ON" - "WITH_TESTING", "是否开启单元测试", "OFF" - "WITH_DOC", "是否编译中英文文档", "OFF" - "WITH_SWIG_PY", "是否编译PYTHON的SWIG接口,该接口可用于预测和定制化训练", "Auto" - "WITH_GOLANG", "是否编译go语言的可容错parameter server", "OFF" - "WITH_MKL", "是否使用MKL数学库,如果为否则是用OpenBLAS", "ON" - -BLAS -+++++ - -PaddlePaddle支持 `MKL `_ 和 -`OpenBlAS `_ 两种BLAS库。默认使用MKL。如果使用MKL并且机器含有AVX2指令集, -还会下载MKL-DNN数学库,详细参考 `mkldnn设计文档 `_ 。 - -如果关闭MKL,则会使用OpenBLAS作为BLAS库。 - -CUDA/cuDNN -+++++++++++ - -PaddlePaddle在编译时/运行时会自动找到系统中安装的CUDA和cuDNN库进行编译和执行。 -使用参数 :code:`-DCUDA_ARCH_NAME=Auto` 可以指定开启自动检测SM架构,加速编译。 - -PaddlePaddle可以使用cuDNN v5.1之后的任何一个版本来编译运行,但尽量请保持编译和运行使用的cuDNN是同一个版本。 -我们推荐使用最新版本的cuDNN。 - -编译选项的设置 -++++++++++++++ - -PaddePaddle通过编译时指定路径来实现引用各种BLAS/CUDA/cuDNN库。cmake编译时,首先在系统路径( :code:`/usr/lib:/usr/local/lib` )中搜索这几个库,同时也会读取相关路径变量来进行搜索。 通过使用 ``-D`` 命令可以设置,例如 - -.. code-block:: bash - - cmake .. -DWITH_GPU=ON -DWITH_TESTING=OFF -DCUDNN_ROOT=/opt/cudnnv5 - -**注意:这几个编译选项的设置,只在第一次cmake的时候有效。如果之后想要重新设置,推荐清理整个编译目录(** :code:`rm -rf` )**后,再指定。** diff --git a/doc/v2/build_and_install/build_from_source_en.rst b/doc/v2/build_and_install/build_from_source_en.rst deleted file mode 100644 index 664b68da8b7dd3e005ebf3ec34de77729e5ab355..0000000000000000000000000000000000000000 --- a/doc/v2/build_and_install/build_from_source_en.rst +++ /dev/null @@ -1,237 +0,0 @@ -Build from Sources -========================== - -.. _requirements: - -Requirements ----------------- - -To build PaddlePaddle, you need - -1. A computer -- Linux, Windows, MacOS. -2. Docker. - -Nothing else. Not even Python and GCC, because you can install all build tools into a Docker image. -We run all the tools by running this image. - -.. _build_step: - -How To Build ----------------- - -You need to use Docker to build PaddlePaddle -to avoid installing dependencies by yourself. We have several pre-built -Docker images `here `_ , -you can also find how to build and use paddle_manylinux_devel Docker image from -`here `__ -Or you can build your own image from source as the optional step below: - -If you don't wish to use docker,you need to install several compile dependencies manually as :ref:`Compile Dependencies <_compile_deps>` shows to start compilation. - -.. code-block:: bash - - # 1. clone the source code - git clone https://github.com/PaddlePaddle/Paddle.git - cd Paddle - # 2. Optional: build development docker image from source - docker build -t paddle:dev . - # 3. Run the following command to build a CPU-Only binaries - docker run -it -v $PWD:/paddle -w /paddle -e "PYTHON_ABI=cp27-cp27mu" -e "WITH_GPU=OFF" -e "WITH_TESTING=OFF" paddlepaddle/paddle_manylinux_devel:cuda8.0_cudnn5 ./paddle/scripts/paddle_build.sh build - # 4. Or, use your built Docker image to build PaddlePaddle (must run step 2) - docker run -it -v $PWD:/paddle -w /paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=OFF" paddle:dev ./paddle/scripts/paddle_build.sh build - -NOTE: - -- The above command try to mount the current working directory (root directory of source code) -into :code:`/paddle` directory inside docker container. - -- You need to pass in the required environment variable :code:`PYTHON_ABI` to specify a `Python ABI `__. -Currently PaddlePaddle supported Python ABIs include :code:`cp27-cp27m` and :code:`cp27-cp27mu` . - -When the compile finishes, you can get the output whl package under -build/python/dist, then you can choose to install the whl on local -machine or copy it to the target machine. - -.. code-block:: bash - - pip install build/python/dist/*.whl - -If the machine has installed PaddlePaddle before, there are two methods: - -.. code-block:: bash - - 1. uninstall and reinstall - pip uninstall paddlepaddle - pip install build/python/dist/*.whl - - 2. upgrade directly - pip install build/python/dist/*.whl -U - -.. _run_test: - -Run Tests ----------------- - -If you wish to run the tests, you may follow the below steps: - -When using Docker, set :code:`RUN_TEST=ON` and :code:`WITH_TESTING=ON` will run test immediately after the build. -Set :code:`WITH_GPU=ON` Can also run tests on GPU. - -.. code-block:: bash - - docker run -it -v $PWD:/paddle -w /paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=ON" -e "RUN_TEST=ON" paddlepaddle/paddle_manylinux_devel:cuda8.0_cudnn5 ./paddle/scripts/paddle_build.sh test - -If you wish to run only one unit test, like :code:`test_sum_op`: - -.. code-block:: bash - - docker run -it -v $PWD:/paddle -w /paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=ON" -e "RUN_TEST=OFF" paddlepaddle/paddle_manylinux_devel:cuda8.0_cudnn5 /bin/bash - ./paddle/scripts/paddle_build.sh build - cd build - ctest -R test_sum_op -V - -.. _faq_docker: - -Frequently Asked Questions ---------------------------- - -- What is Docker? - - If you haven't heard of it, consider it something like Python's virtualenv. - -- Docker or virtual machine? - - Some people compare Docker with VMs, but Docker doesn't virtualize any hardware nor running a guest OS, which means there is no compromise on the performance. - -- Why Docker? - - Using a Docker image of build tools standardizes the building environment, which makes it easier for others to reproduce your problems and to help. - - Also, some build tools don't run on Windows or Mac or BSD, but Docker runs almost everywhere, so developers can use whatever computer they want. - -- Can I choose not to use Docker? - - Sure, you don't have to install build tools into a Docker image; instead, you can install them on your local computer. This document exists because Docker would make the development way easier. - -- How difficult is it to learn Docker? - - It takes you ten minutes to read `an introductory article `_ and saves you more than one hour to install all required build tools, configure them, especially when new versions of PaddlePaddle require some new tools. Not even to mention the time saved when other people trying to reproduce the issue you have. - -- Can I use my favorite IDE? - - Yes, of course. The source code resides on your local computer, and you can edit it using whatever editor you like. - - Many PaddlePaddle developers are using Emacs. They add the following few lines into their `~/.emacs` configure file: - - .. code-block:: emacs - - (global-set-key "\C-cc" 'compile) - (setq compile-command "docker run --rm -it -v $(git rev-parse --show-toplevel):/paddle paddle:dev") - - so they could type `Ctrl-C` and `c` to build PaddlePaddle from source. - -- Does Docker do parallel building? - - Our building Docker image runs a `Bash script `_ , which calls `make -j$(nproc)` to starts as many processes as the number of your CPU cores. - -- Docker requires sudo - - An owner of a computer has the administrative privilege, a.k.a., sudo, and Docker requires this privilege to work properly. If you use a shared computer for development, please ask the administrator to install and configure Docker. We will do our best to support rkt, another container technology that doesn't require sudo. - -- Docker on Windows/MacOS builds slowly - - On Windows and MacOS, Docker containers run in a Linux VM. You might want to give this VM some more memory and CPUs so to make the building efficient. Please refer to `this issue `_ for details. - -- Not enough disk space - - Examples in this article use option `--rm` with the `docker run` command. This option ensures that stopped containers do not exist on hard disks. We can use `docker ps -a` to list all containers, including stopped. Sometimes `docker build` generates some intermediate dangling images, which also take disk space. To clean them, please refer to `this article `_ . - -.. _compile_deps: - -Appendix: Compile Dependencies -------------------------------- - -PaddlePaddle need the following dependencies when compiling, other dependencies -will be downloaded automatically. - -.. csv-table:: PaddlePaddle Compile Dependencies - :header: "Dependency", "Version", "Description" - :widths: 10, 15, 30 - - "CMake", ">=3.2", "" - "GCC", "4.8.2", "Recommend devtools2 for CentOS" - "Python", "2.7.x", "Need libpython2.7.so" - "pip", ">=9.0", "" - "numpy", "", "" - "SWIG", ">=2.0", "" - "Go", ">=1.8", "Optional" - - -.. _build_options: - -Appendix: Build Options -------------------------- - -Build options include whether build binaries for CPU or GPU, which BLAS -library to use etc. You may pass these settings when running cmake. -For detailed cmake tutorial please refer to `here `__ 。 - - -You can add :code:`-D` argument to pass such options, like: - -.. code-block:: bash - - cmake .. -DWITH_GPU=OFF - -.. csv-table:: Bool Type Options - :header: "Option", "Description", "Default" - :widths: 1, 7, 2 - - "WITH_GPU", "Build with GPU support", "ON" - "WITH_C_API", "Build only CAPI", "OFF" - "WITH_DOUBLE", "Build with double precision", "OFF" - "WITH_DSO", "Dynamically load CUDA libraries", "ON" - "WITH_AVX", "Build with AVX support", "ON" - "WITH_PYTHON", "Build with integrated Python interpreter", "ON" - "WITH_STYLE_CHECK", "Check code style when building", "ON" - "WITH_TESTING", "Build unit tests", "OFF" - "WITH_DOC", "Build documentations", "OFF" - "WITH_SWIG_PY", "Build Python SWIG interface for V2 API", "Auto" - "WITH_GOLANG", "Build fault-tolerant parameter server written in go", "OFF" - "WITH_MKL", "Use MKL as BLAS library, else use OpenBLAS", "ON" - - -BLAS -+++++ - -PaddlePaddle supports `MKL `_ and -`OpenBlAS `_ as BLAS library。By default it uses MKL. -If you are using MKL and your machine supports AVX2, MKL-DNN will also be downloaded -and used, for more `details `_ . - -If you choose not to use MKL, then OpenBlAS will be used. - -CUDA/cuDNN -+++++++++++ - -PaddlePaddle will automatically find CUDA and cuDNN when compiling and running. -parameter :code:`-DCUDA_ARCH_NAME=Auto` can be used to detect SM architecture -automatically in order to speed up the build. - -PaddlePaddle can build with any version later than cuDNN v5.1, and we intend to -keep on with latest cuDNN versions. Be sure to run with the same version of cuDNN -you built. - -Pass Compile Options -++++++++++++++++++++++ - -You can pass compile options to use intended BLAS/CUDA/Cudnn libraries. -When running cmake command, it will search system paths like -:code:`/usr/lib:/usr/local/lib` and then search paths that you -passed to cmake, i.e. - -.. code-block:: bash - - cmake .. -DWITH_GPU=ON -DWITH_TESTING=OFF -DCUDNN_ROOT=/opt/cudnnv5 - -**NOTE: These options only take effect when running cmake for the first time, you need to clean the cmake cache or clean the build directory (** :code:`rm -rf` **) if you want to change it.** diff --git a/doc/v2/build_and_install/docker_install_cn.rst b/doc/v2/build_and_install/docker_install_cn.rst deleted file mode 100644 index 106c86bace075764c84bc2a7f7cb09d466fa8794..0000000000000000000000000000000000000000 --- a/doc/v2/build_and_install/docker_install_cn.rst +++ /dev/null @@ -1,146 +0,0 @@ -使用Docker安装运行 -================================ - -使用Docker安装和运行PaddlePaddle可以无需考虑依赖环境即可运行。并且也可以在Windows的docker中运行。 -您可以在 `Docker官网 `_ 获得基本的Docker安装和使用方法。 - -如果您在使用Windows,可以参考 -`这篇 `_ -教程,完成在Windows上安装和使用Docker。 - -在了解Docker的基本使用方法之后,即可开始下面的步骤: - -.. _docker_pull: - -获取PaddlePaddle的Docker镜像 ------------------------------- - -执行下面的命令获取最新的PaddlePaddle Docker镜像,版本为cpu_avx_mkl: - - .. code-block:: bash - - docker pull paddlepaddle/paddle - -对于国内用户,我们提供了加速访问的镜像源: - - .. code-block:: bash - - docker pull docker.paddlepaddlehub.com/paddle - -下载GPU版本(cuda8.0_cudnn5_avx_mkl)的Docker镜像: - - .. code-block:: bash - - docker pull paddlepaddle/paddle:latest-gpu - docker pull docker.paddlepaddlehub.com/paddle:latest-gpu - -选择下载使用不同的BLAS库的Docker镜像: - - .. code-block:: bash - - # 默认是使用MKL的镜像 - docker pull paddlepaddle/paddle - # 使用OpenBLAS的镜像 - docker pull paddlepaddle/paddle:latest-openblas - -下载指定版本的Docker镜像,可以从 `DockerHub网站 `_ 获取可选的tag,并执行下面的命令: - - .. code-block:: bash - - docker pull paddlepaddle/paddle:[tag] - # 比如: - docker pull docker.paddlepaddlehub.com/paddle:0.11.0-gpu - -.. _docker_run: - -在Docker中执行PaddlePaddle训练程序 ----------------------------------- - -假设您已经在当前目录(比如在/home/work)编写了一个PaddlePaddle的程序 :code:`train.py` (可以参考 -`PaddlePaddleBook `_ -编写),就可以使用下面的命令开始执行训练: - - .. code-block:: bash - - cd /home/work - docker run -it -v $PWD:/work paddlepaddle/paddle /work/train.py - -上述命令中, :code:`-it` 参数说明容器已交互式运行; :code:`-v $PWD:/work` -指定将当前路径(Linux中$PWD变量会展开为当前路径的绝对路径)挂载到容器内部的 :code:`/work` -目录; :code:`paddlepaddle/paddle` 指定需要使用的容器; 最后 :code:`/work/train.py` -为容器内执行的命令,即运行训练程序。 - -当然,您也可以进入到Docker容器中,以交互式的方式执行或调试您的代码: - - .. code-block:: bash - - docker run -it -v $PWD:/work paddlepaddle/paddle /bin/bash - cd /work - python train.py - -**注:PaddlePaddle Docker镜像为了减小体积,默认没有安装vim,您可以在容器中执行** :code:`apt-get install -y vim` **安装后,在容器中编辑代码。** - -.. _docker_run_book: - -使用Docker启动PaddlePaddle Book教程 ------------------------------------ - -使用Docker可以快速在本地启动一个包含了PaddlePaddle官方Book教程的Jupyter Notebook,可以通过网页浏览。 -PaddlePaddle Book是为用户和开发者制作的一个交互式的Jupyter Notebook。 -如果您想要更深入了解deep learning,PaddlePaddle Book一定是您最好的选择。 -大家可以通过它阅读教程,或者制作和分享带有代码、公式、图表、文字的交互式文档。 - -我们提供可以直接运行PaddlePaddle Book的Docker镜像,直接运行: - - .. code-block:: bash - - docker run -p 8888:8888 paddlepaddle/book - -国内用户可以使用下面的镜像源来加速访问: - - .. code-block:: bash - - docker run -p 8888:8888 docker.paddlepaddlehub.com/book - -然后在浏览器中输入以下网址: - - .. code-block:: text - - http://localhost:8888/ - -就这么简单,享受您的旅程! - -.. _docker_run_gpu: - -使用Docker执行GPU训练 ------------------------------- - -为了保证GPU驱动能够在镜像里面正常运行,我们推荐使用 -`nvidia-docker `_ 来运行镜像。 -请不要忘记提前在物理机上安装GPU最新驱动。 - - .. code-block:: bash - - nvidia-docker run -it -v $PWD:/work paddlepaddle/paddle:latest-gpu /bin/bash - -**注: 如果没有安装nvidia-docker,可以尝试以下的方法,将CUDA库和Linux设备挂载到Docker容器内:** - - .. code-block:: bash - - export CUDA_SO="$(\ls /usr/lib64/libcuda* | xargs -I{} echo '-v {}:{}') $(\ls /usr/lib64/libnvidia* | xargs -I{} echo '-v {}:{}')" - export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}') - docker run ${CUDA_SO} ${DEVICES} -it paddlepaddle/paddle:latest-gpu - -**关于AVX:** - -AVX是一种CPU指令集,可以加速PaddlePaddle的计算。最新的PaddlePaddle Docker镜像默认 -是开启AVX编译的,所以,如果您的电脑不支持AVX,需要单独 -`编译 <./build_from_source_cn.html>`_ PaddlePaddle为no-avx版本。 - -以下指令能检查Linux电脑是否支持AVX: - - .. code-block:: bash - - if cat /proc/cpuinfo | grep -i avx; then echo Yes; else echo No; fi - -如果输出是No,就需要选择使用no-AVX的镜像 diff --git a/doc/v2/build_and_install/docker_install_en.rst b/doc/v2/build_and_install/docker_install_en.rst deleted file mode 100644 index 25aecb8d0da9feb00006da6259b529b7011d91cb..0000000000000000000000000000000000000000 --- a/doc/v2/build_and_install/docker_install_en.rst +++ /dev/null @@ -1,153 +0,0 @@ -Run in Docker Containers -================================= - -Run PaddlePaddle in Docker container so that you don't need to care about -runtime dependencies, also you can run under Windows system. You can get -tutorials at `here `_ . - -If you are using Windows, please refer to -`this `_ -tutorial to start running docker under windows. - -After you've read above tutorials you may proceed the following steps. - -.. _docker_pull: - -Pull PaddlePaddle Docker Image ------------------------------- - -Run the following command to download the latest Docker images, the version is cpu_avx_mkl: - - .. code-block:: bash - - docker pull paddlepaddle/paddle - -For users in China, we provide a faster mirror: - - .. code-block:: bash - - docker pull docker.paddlepaddlehub.com/paddle - -Download GPU version (cuda8.0_cudnn5_avx_mkl) images: - - .. code-block:: bash - - docker pull paddlepaddle/paddle:latest-gpu - docker pull docker.paddlepaddlehub.com/paddle:latest-gpu - -Choose between different BLAS version: - - .. code-block:: bash - - # image using MKL by default - docker pull paddlepaddle/paddle - # image using OpenBLAS - docker pull paddlepaddle/paddle:latest-openblas - - -If you want to use legacy versions, choose a tag from -`DockerHub `_ -and run: - - .. code-block:: bash - - docker pull paddlepaddle/paddle:[tag] - # i.e. - docker pull docker.paddlepaddlehub.com/paddle:0.11.0-gpu - -.. _docker_run: - -Launch your training program in Docker --------------------------------------- - -Assume that you have already written a PaddlePaddle program -named :code:`train.py` under directory :code:`/home/work` (refer to -`PaddlePaddleBook `_ -for more samples), then run the following command: - - .. code-block:: bash - - cd /home/work - docker run -it -v $PWD:/work paddlepaddle/paddle /work/train.py - -In the above command, :code:`-it` means run the container interactively; -:code:`-v $PWD:/work` means mount the current directory ($PWD will expand -to current absolute path in Linux) under :code:`/work` in the container. -:code:`paddlepaddle/paddle` to specify image to use; finnally -:code:`/work/train.py` is the command to run inside docker. - -Also, you can go into the container shell, run or debug your code -interactively: - - .. code-block:: bash - - docker run -it -v $PWD:/work paddlepaddle/paddle /bin/bash - cd /work - python train.py - -**NOTE: We did not install vim in the default docker image to reduce the image size, you can run** :code:`apt-get install -y vim` **to install it if you need to edit python files.** - -.. _docker_run_book: - -PaddlePaddle Book ------------------- - -You can create a container serving PaddlePaddle Book using Jupyter Notebook in -one minute using Docker. PaddlePaddle Book is an interactive Jupyter Notebook -for users and developers.If you want to -dig deeper into deep learning, PaddlePaddle Book definitely is your best choice. - -We provide a packaged book image, simply issue the command: - - .. code-block:: bash - - docker run -p 8888:8888 paddlepaddle/book - -For users in China, we provide a faster mirror: - - .. code-block:: bash - - docker run -p 8888:8888 docker.paddlepaddlehub.com/book - -Then, you would back and paste the address into the local browser: - - .. code-block:: text - - http://localhost:8888/ - -That's all. Enjoy your journey! - -.. _docker_run_gpu: - -Train with Docker with GPU ------------------------------- - -We recommend using -`nvidia-docker `_ -to run GPU training jobs. Please ensure you have latest -GPU driver installed before move on. - - .. code-block:: bash - - nvidia-docker run -it -v $PWD:/work paddlepaddle/paddle:latest-gpu /bin/bash - -**NOTE: If you don't have nvidia-docker installed, try the following method to mount CUDA libs and devices into the container.** - - .. code-block:: bash - - export CUDA_SO="$(\ls /usr/lib64/libcuda* | xargs -I{} echo '-v {}:{}') $(\ls /usr/lib64/libnvidia* | xargs -I{} echo '-v {}:{}')" - export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}') - docker run ${CUDA_SO} ${DEVICES} -it paddlepaddle/paddle:latest-gpu - -**About AVX:** - -AVX is a kind of CPU instruction can accelerate PaddlePaddle's calculations. -The latest PaddlePaddle Docker image turns AVX on by default, so, if your -computer doesn't support AVX, you'll probably need to -`build <./build_from_source_en.html>`_ with :code:`WITH_AVX=OFF`. - -The following command will tell you whether your computer supports AVX. - - .. code-block:: bash - - if cat /proc/cpuinfo | grep -i avx; then echo Yes; else echo No; fi diff --git a/doc/v2/build_and_install/index_cn.rst b/doc/v2/build_and_install/index_cn.rst deleted file mode 100644 index 1a9305ac4b6578c14a962f223c647a71e3b8a72b..0000000000000000000000000000000000000000 --- a/doc/v2/build_and_install/index_cn.rst +++ /dev/null @@ -1,56 +0,0 @@ -安装与编译 -========== - -.. _install_steps: - -PaddlePaddle针对不同的用户群体提供了多种安装方式。 - -专注深度学习模型开发 --------------------- - -PaddlePaddle提供了多种python wheel包,可通过pip一键安装: - -.. toctree:: - :maxdepth: 1 - - pip_install_cn.rst - -这是最便捷的安装方式,请根据机器配置和系统选择对应的安装包。 - -关注底层框架 -------------- - -PaddlePaddle提供了基于Docker的安装方式,请参照以下教程: - -.. toctree:: - :maxdepth: 1 - - docker_install_cn.rst - -我们推荐在Docker中运行PaddlePaddle,该方式具有以下优势: - -- 无需单独安装第三方依赖 -- 方便分享运行时环境,易于问题的复现 - -对于有定制化二进制文件需求的用户,我们同样提供了从源码编译安装PaddlePaddle的方法: - -.. toctree:: - :maxdepth: 1 - - build_from_source_cn.rst - -.. warning:: - - 需要提醒的是,这种安装方式会涉及到一些第三方库的下载、编译及安装,整个安装过程耗时较长。 - - -常见问题汇总 --------------- - -如果在安装过程中遇到了问题,请先尝试在下面的页面寻找答案: - -:ref:`常见问题解答 ` - -如果问题没有得到解决,欢迎向PaddlePaddle社区反馈问题: - -`创建issue `_ diff --git a/doc/v2/build_and_install/index_en.rst b/doc/v2/build_and_install/index_en.rst deleted file mode 100644 index 7990bacbd6966e88e8763e9c5709e410f7e9fed4..0000000000000000000000000000000000000000 --- a/doc/v2/build_and_install/index_en.rst +++ /dev/null @@ -1,56 +0,0 @@ -install and Compile -====================== - -.. _install_steps: - -PaddlePaddle provides various methods of installation for many different users - -Focus on Deep Learning Model Development ----------------------------------------- - -PaddlePaddle provides lots of packages of python wheel , that pip can install: - -.. toctree:: - :maxdepth: 1 - - pip_install_en.rst - -This is the most convenient way of installation. Please choose the right installation package with machine configure and system. - -Follow the Bottom Frame ------------------------- - -PaddlePaddle also supports installation using Docker. Please refer to the tutorial below: - -.. toctree:: - :maxdepth: 1 - - docker_install_en.rst - -We recommend running PaddlePaddle in Docker. This method has the following advantages: - -- Does not require installation of third-party dependencies. -- Easy to share runtime environment. - -Lastly, users can also compile and install PaddlePaddle from source code. The instructions are below: - -.. toctree:: - :maxdepth: 1 - - build_from_source_en.rst - -.. warning:: - - One caveat with this approach is that developers will have to download, compile and install all third-party dependencies. Thus this process of installation is more time consuming. - - -FAQ ------------ - -For any problems during installation, please refer to the page below for answers: - -:ref:`常见问题解答 ` - -If the problem still persists, you are welcome to seek assistance from the PaddlePaddle community: - -`创建issue `_ diff --git a/doc/v2/build_and_install/paddleci.png b/doc/v2/build_and_install/paddleci.png deleted file mode 100644 index 16087ce059aa3c07ce8c927d983eb86351915825..0000000000000000000000000000000000000000 Binary files a/doc/v2/build_and_install/paddleci.png and /dev/null differ diff --git a/doc/v2/build_and_install/pip_install_cn.rst b/doc/v2/build_and_install/pip_install_cn.rst deleted file mode 100644 index 095da19cd41d29bfa72ab23abd24bec45f925a86..0000000000000000000000000000000000000000 --- a/doc/v2/build_and_install/pip_install_cn.rst +++ /dev/null @@ -1,105 +0,0 @@ -使用pip安装 -================================ - -PaddlePaddle可以使用常用的Python包管理工具 -`pip `_ -完成安装,并可以在大多数主流的Linux操作系统以及MacOS上执行。 - -.. _pip_install: - -使用pip安装 ------------------------------- - -执行下面的命令即可在当前机器上安装PaddlePaddle的运行时环境,并自动下载安装依赖软件。 - - .. code-block:: bash - - pip install paddlepaddle - -当前的默认版本为0.12.0,cpu_avx_openblas,您可以通过指定版本号来安装其它版本,例如: - - .. code-block:: bash - - pip install paddlepaddle==0.11.0 - - -如果需要安装支持GPU的版本(cuda8.0_cudnn5_avx_openblas),需要执行: - - .. code-block:: bash - - pip install paddlepaddle-gpu - -当前的默认版本也是0.12.0,PaddlePaddle针对不同需求提供了更多版本的安装包,部分列表如下: - -================================= ======================================== -版本号 版本说明 -================================= ======================================== -paddlepaddle-gpu==0.12.0 使用CUDA 8.0和cuDNN 5编译的0.12.0版本 -paddlepaddle-gpu==0.11.0.post87 使用CUDA 8.0和cuDNN 7编译的0.11.0版本 -paddlepaddle-gpu==0.11.0.post8 使用CUDA 8.0和cuDNN 5编译的0.11.0版本 -paddlepaddle-gpu==0.11.0 使用CUDA 7.5和cuDNN 5编译的0.11.0版本 -================================= ======================================== - -您可以在 `Release History `_ 中找到paddlepaddle-gpu的各个发行版本。 - -如果需要获取并安装最新的(开发分支)PaddlePaddle,可以从我们的CI系统中下载最新的whl安装包和c-api开发包并安装, -您可以从下面的表格中找到需要的版本: - -如果在点击下面链接时出现如下登陆界面,点击“Log in as guest”即可开始下载: - -.. image:: paddleci.png - :scale: 50 % - :align: center - -.. csv-table:: 各个版本最新的whl包 - :header: "版本说明", "cp27-cp27mu", "cp27-cp27m" - :widths: 1, 3, 3 - - "cpu_avx_mkl", "`paddlepaddle-latest-cp27-cp27mu-linux_x86_64.whl `__", "`paddlepaddle-latest-cp27-cp27m-linux_x86_64.whl `__" - "cpu_avx_openblas", "`paddlepaddle-latest-cp27-cp27mu-linux_x86_64.whl `__", "`paddlepaddle-latest-cp27-cp27m-linux_x86_64.whl `__" - "cpu_noavx_openblas", "`paddlepaddle-latest-cp27-cp27mu-linux_x86_64.whl `__", "`paddlepaddle-latest-cp27-cp27m-linux_x86_64.whl `_" - "cuda8.0_cudnn5_avx_mkl", "`paddlepaddle_gpu-latest-cp27-cp27mu-linux_x86_64.whl `__", "`paddlepaddle_gpu-latest-cp27-cp27m-linux_x86_64.whl `__" - "cuda8.0_cudnn7_avx_mkl", "`paddlepaddle_gpu-latest-cp27-cp27mu-linux_x86_64.whl `__", "`paddlepaddle_gpu-latest-cp27-cp27m-linux_x86_64.whl `__" - "cuda9.0_cudnn7_avx_mkl", "`paddlepaddle_gpu-latest-cp27-cp27mu-linux_x86_64.whl `__", "`paddlepaddle_gpu-latest-cp27-cp27m-linux_x86_64.whl `__" - -.. _pip_dependency: - -运行环境依赖 ------------------------------- - -PaddlePaddle安装包由于不仅仅包含.py程序,而且包含了C++编写的部分,所以我们确保发布的二进制包可以支持主流的Linux操作系统,比如CentOS 6以上,Ubuntu 14.04以上,MacOS 10.12以上。 - -PaddlePaddle发布的安装包会尽量对齐 `manylinux1 `_ 标准,通常使用CentOS 5作为编译环境。但由于CUDA库通常需要CentOS 6以上,而且CentOS 5即将停止维护,所以我们默认使用CentOS 6作为标准编译环境。 - -.. csv-table:: PaddlePaddle环境依赖 - :header: "依赖", "版本", "说明" - :widths: 10, 15, 30 - - "操作系统", "Linux, MacOS", "CentOS 6以上,Ubuntu 14.04以上,MacOS 10.12以上" - "Python", "2.7.x", "暂时不支持Python3" - "libc.so", "GLIBC_2.7", "glibc至少包含GLIBC_2.7以上的符号" - "libstdc++.so", "GLIBCXX_3.4.11, CXXABI_1.3.3", "至少包含GLIBCXX_3.4.11, CXXABI_1.3.3以上的符号" - "libgcc_s.so", "GCC_3.3", "至少包含GCC_3.3以上的符号" - -.. _pip_faq: - -安装常见问题和解决方法 ------------------------------- - -- paddlepaddle*.whl is not a supported wheel on this platform. - - 出现这个问题的主要原因是,没有找到和当前系统匹配的paddlepaddle安装包。请检查Python版本是否为2.7系列。另外最新的pip官方源中的安装包默认是manylinux1标准,需要使用最新的pip (>9.0.0) 才可以安装。可以使用下面的命令更新您的pip: - - .. code-block:: bash - - pip install --upgrade pip - - 如果仍然存在问题,可以执行: - - .. code-block:: bash - - python -c "import pip; print(pip.pep425tags.get_supported())" - - 获取当前系统支持的安装包格式,并检查和需安装的包是否匹配。pypi安装包可以在 `这个 `_ 链接中找到。 - - 如果系统支持的是 linux_x86_64 而安装包是 manylinux1_x86_64 ,需要升级pip版本到最新; 如果系统支持 manylinux1_x86_64 而安装包(本地)是 linux_x86_64 ,可以重命名这个whl包为 manylinux1_x86_64 再安装。 diff --git a/doc/v2/build_and_install/pip_install_en.rst b/doc/v2/build_and_install/pip_install_en.rst deleted file mode 100644 index 8406e4aa1fbb953c3b615b10d1bcb2c45974dde0..0000000000000000000000000000000000000000 --- a/doc/v2/build_and_install/pip_install_en.rst +++ /dev/null @@ -1,123 +0,0 @@ -Install using pip -================================ - -You can use current widely used Python package management -tool `pip `_ -to install PaddlePaddle. This method can be used in -most of current Linux systems or MacOS. - -.. _pip_install: - -Install using pip ------------------------------- - -Run the following command to install PaddlePaddle on the current -machine, it will also download requirements. - - .. code-block:: bash - - pip install paddlepaddle - -the default version is 0.12.0, cpu_avx_openblas, you can specify the versions to satisfy your demands, like: - - .. code-block:: bash - - pip install paddlepaddle==0.11.0 - -If you need to install a GPU-enabled version (cuda8.0_cudnn5_avx_openblas), you need to run: - - .. code-block:: bash - - pip install paddlepaddle-gpu - -The default version is also 0.12.0, PaddlePaddle provides several versions of packages for different needs, as shown in the table: - -================================= ======================================== -版本号 版本说明 -================================= ======================================== -paddlepaddle-gpu==0.12.0 0.12.0 built with CUDA 8.0 and cuDNN 5 -paddlepaddle-gpu==0.11.0.post87 0.11.0 built with CUDA 8.0 and cuDNN 7 -paddlepaddle-gpu==0.11.0.post8 0.11.0 built with CUDA 8.0 and cuDNN 5 -paddlepaddle-gpu==0.11.0 0.11.0 built with CUDA 7.5 and cuDNN 5 -================================= ======================================== - -You can find all versions released of paddlepaddle-gpu in `Release History `_ . - -If you wish to install the latest develop branch PaddlePaddle, -you can download the latest whl package from our CI system. Access -the below links, log in as guest, then click at the "Artifact" -tab, you'll find the download link of whl packages. - -If the links below shows up the login form, just click "Log in as guest" to start the download: - -.. image:: paddleci.png - :scale: 50 % - :align: center - -.. csv-table:: whl package of each version - :header: "version", "cp27-cp27mu", "cp27-cp27m" - :widths: 1, 3, 3 - - "cpu_avx_mkl", "`paddlepaddle-latest-cp27-cp27mu-linux_x86_64.whl `__", "`paddlepaddle-latest-cp27-cp27m-linux_x86_64.whl `__" - "cpu_avx_openblas", "`paddlepaddle-latest-cp27-cp27mu-linux_x86_64.whl `__", "`paddlepaddle-latest-cp27-cp27m-linux_x86_64.whl `__" - "cpu_noavx_openblas", "`paddlepaddle-latest-cp27-cp27mu-linux_x86_64.whl `__", "`paddlepaddle-latest-cp27-cp27m-linux_x86_64.whl `__" - "cuda8.0_cudnn5_avx_mkl", "`paddlepaddle_gpu-latest-cp27-cp27mu-linux_x86_64.whl `__", "`paddlepaddle_gpu-latest-cp27-cp27m-linux_x86_64.whl `__" - "cuda8.0_cudnn7_avx_mkl", "`paddlepaddle_gpu-latest-cp27-cp27mu-linux_x86_64.whl `__", "`paddlepaddle_gpu-latest-cp27-cp27m-linux_x86_64.whl `__" - "cuda9.0_cudnn7_avx_mkl", "`paddlepaddle_gpu-latest-cp27-cp27mu-linux_x86_64.whl `__", "`paddlepaddle_gpu-latest-cp27-cp27m-linux_x86_64.whl `__" - -.. _pip_dependency: - -Runtime Dependency ------------------------------- - -PaddlePaddle installation packages (whl) does not only contain .py files, -but also binaries built from C++ code. We ensure that PaddlePaddle can -run on current mainline Linux distributions, like CentOS 6, Ubuntu 14.04 -and MacOS 10.12. - -PaddlePaddle whl packages are trying to satisfy -`manylinux1 `_ -standard, which uses CentOS 5 as default build environment. But CUDA libraries -seems only run on CentOS 6 at least, also, CentOS 5 is about to end its lifetime, -so we use CentOS 6 as default build environment. - -.. csv-table:: PaddlePaddle Runtime Deps - :header: "Dependency", "version", "description" - :widths: 10, 15, 30 - - "OS", "Linux, MacOS", "CentOS 6 or later,Ubuntu 14.04 or later,MacOS 10.12 or later" - "Python", "2.7.x", "Currently Python3 is not supported" - "libc.so", "GLIBC_2.7", "glibc at least include GLIBC_2.7 symbols" - "libstdc++.so", "GLIBCXX_3.4.11, CXXABI_1.3.3", "At least include GLIBCXX_3.4.11, CXXABI_1.3.3 symbols" - "libgcc_s.so", "GCC_3.3", "At least include GCC_3.3 symbols" - -.. _pip_faq: - -FAQ ------------------------------- - -- paddlepaddle*.whl is not a supported wheel on this platform. - - The main cause of this issue is that your current platform is - not supported. Please check that you are using Python 2.7 series. - Besides, pypi only supports manylinux1 standard, you'll need to - upgrade your pip to >9.0.0. Then run the below command: - - .. code-block:: bash - - pip install --upgrade pip - - If the problem still exists, run the following command: - - .. code-block:: bash - - python -c "import pip; print(pip.pep425tags.get_supported())" - - Then you'll get supported package suffixes, then check if it matches - the file name of the whl package. You can find default whl package at - `here `_ - - If your system supports linux_x86_64 but the whl package is manylinux1_x86_64, - you'll need to update pip to the latest version; If your system supports - manylinux1_x86_64 but the whl package is linux_x86_64 you can rename the - file to manylinux1_x86_64 suffix and then install. diff --git a/doc/v2/design/cluster_train/README.md b/doc/v2/design/cluster_train/README.md deleted file mode 100644 index 177a5f5d54bd924fab34795219ce1f7b270c8e25..0000000000000000000000000000000000000000 --- a/doc/v2/design/cluster_train/README.md +++ /dev/null @@ -1,182 +0,0 @@ -# Design Doc: Distributed Training - -## Objective - -In [this slides](https://www.slideshare.net/cxwangyi/paddlepaddle-a-complete-solution-for-businesses), we explained that we'd like PaddlePaddle running on general-purpose clusters like those managed by Kubernetes, so to address demands for AI from both Internet and non-Internet industries. - -This poses technical challenges to PaddlePaddle: - -1. Support fault-recovery. -1. Support both offline and online training. -1. [Serverless computing](https://en.wikipedia.org/wiki/Serverless_computing) of distributed training. - - -## Training Job - -A training job will be created once user asks Paddle cloud to train a model. The training job is made up of different processes that collaboratively consume data and produce a trained model. There are three kinds of processes: - -1. the *master server process*, which dispatches tasks to -1. one or more *trainer processes*, which run distributed training and synchronize gradients/models via -1. one or more *parameter server processes*, where each holds a shard of the global model, and receive the uploaded gradients from every *trainer process*, so they can run the optimize functions to update their parameters. - -Their relation is illustrated in the following graph: - - - -By coordinating these processes, PaddlePaddle supports use both Synchronize Stochastic Gradient Descent (sync SGD) and Asynchronous Stochastic Gradient Descent (async SGD) to train user-defined neural network topologies. - -When training with sync SGD, parameter servers wait for all trainers to finish gradients update and then send the updated parameters to trainers, training can not proceed until the trainer received the updated parameters. This creates a synchronization point between trainers. When training with async SGD, each trainer upload gradient and download new parameters individually, without the synchronization with other trainers. Using asyc SGD will be faster in terms of time per pass, but have more noise in gradient since trainers are likely to have a stale model. - -### Master Server Process - -The master server process will: - -- Partition a dataset into [tasks](#task) and dispatch tasks to trainers. -- Keep track of training progress on the dataset with [task queue](#task-queue). A training job will iterate on the dataset for a full pass until it goes into next pass. - - -#### Task - -A task is a data shard to be trained. The total number of tasks will be much bigger than the total number of trainers. The number of data instances inside a task will be much bigger than the mini-batch size. - -#### Task Queue - -The master server has three task queues to track training progress. As illustrated in the graph below, Job A and Job B both have one master server. Each master server process has three task queues. - - - -- The todo queue holds tasks to be dispatched. When a job starts, the master server fills in the todo queue with all tasks. -- The pending queue holds tasks that are currently training by trainers. -- the done queue holds tasks that are already trained. - -The life cycle of a single task is illustrated below: - - - -1. When a new pass of training starts, all tasks will be placed in the todo queue. -1. Upon trainer requests for new task, the master server will dispatch a task from todo queue to it, put the task in the pending queue and wait for completion. -1. The trainer will work on its task and tell the master server once the task is completed and ask for new task. The master server will dispatch a new task to that trainer. -1. If a task fails for any reason in trainer, or takes longer than a specific period of time, the master server will move the task back to the todo queue. The timeout count for that task will increase by one. If the timeout count is above a threshold, the task is likely to cause a trainer to crash, then it will be discarded. -1. The master server will move completed task to the done queue. When the todo queue is empty, the master server will start a new pass by moving all tasks in the done queue to todo queue and reset the timeout counter of all tasks to zero. - -### Trainer Process - -The trainer process will: - -- Request tasks from the master. -- Work on the tasks -- Upload gradient to parameter servers, and update local model by downloading new parameters from parameter servers. - -### Parameter Server Process - -Parameter server processes hold the parameters collaboratively. The parameters are partitioned on different parameter servers. - -The parameter server will: - -- Receive gradient from the trainers, update its parameters, and give the trainers the latest parameters. -- Periodically save its parameters to distributed file system by overriding the previous save. - -### Optimization Algorithms - -The communication pattern between the trainers and the parameter servers depends on the category of optimization algorithm: - -- Synchronous Stochastic Gradient Descent (sync-SGD) - - Parameter server will wait for all trainer finish n-th mini-batch calculation and send their gradients before broadcasting new parameters to every trainer. Every trainer will wait for the new parameters before starting n+1-th mini-batch. - -- Asynchronous Stochastic Gradient Descent (async-SGD) - - There will no synchronization between different trainers, and parameter server updates its parameter as soon as it receives new gradient: - - - Each trainer uploads its accumulated gradient every n mini-batches. - - Every m mini-batches, the trainer downloads new parameters from parameter server. - - n and m do not have to be equal. - -## Fault Tolerant - -The training job will pause if the master server processes is dead, or any of the parameter server process is dead. They will be started by [Kubernetes](https://kubernetes.io/) and recover in few minutes. Please refer to [fault recovery](#fault-recovery). - -The training job will continue to make progress if there is at least one training process running. The strategy depends on the type of optimization algorithm: - -- sync-SGD - - TODO - -- async-SGD - - Since async-SGD does not require synchronization between mini-batches, the system will by definition make process if at least one trainer is running. - -## Fault Recovery - -PaddlePaddle uses [etcd](https://github.com/coreos/etcd) to keep track of the states of processes. Because etcd is a distributed reliable key-value store, the restarted process can recover its states from etcd. The model parameters are periodically saved into distributed file system, so a restarted parameter server can recover its parameters from the saved file. - -Now we will introduce how each process recovers from a failure, the graph below shows how etcd is used: - - - -### Master Server Process - -When the master is started by the Kubernetes, it executes the following steps at startup: - -1. Grabs a unique *master* lock in etcd, which prevents concurrent master instantiations. -1. Recovers the task queues from etcd if they already exist, otherwise, the master will create them. -1. Write its ip address to */master/addr* so that trainers can discover it. -1. Listens to trainers' request of task, dispatch one upon request, and updates task queue using an etcd transaction to ensure lock is held during the update. - -When the master server process is dead for any reason, Kubernetes will restart it. It will be online again with all states recovered from etcd in few minutes. - -### Trainer Process - -When the trainer is started by the Kubernetes, it executes the following steps at startup: - -1. Watches the available parameter server prefix keys `/ps/` on etcd and waits until the count of parameter servers reaches the desired count */ps_desired*. -1. Finds and watches */master/addr* to get master's address. -1. Requests for tasks from the master to start training. - -When a trainer fails, Kuberentes would try to restart it. The recovered trainer would fetch tasks from master and go on training. - -### Parameter Server Process - -When the parameter server is started by Kubernetes, it executes the following steps at startup: - -1. Read desired total number of parameter servers from etcd `/ps_desired` -1. Search through etcd keys `/ps/` (`/ps/0`, `/ps/1`, ...) to find the first non-existant key whose index is smaller than the total number of parameter servers. Set the key using a transaction to avoid concurrent writes. The parameter server's index is inferred from the key name. - - The desired number of parameter servers is 3: - - - - The third parameter server joined: - - - -1. The parameter server can load parameters if there are already saved parameters in the save path (inferred from its index). -1. Now the parameter server is ready for the trainers' requests. - -If the parameter server's etcd lease expires, the parameter server will kill itself. - - -## Parameter Server Checkpointing -See [here](./checkpointing.md) - -## Store and dispatching trainning data -See [here](./data_dispatch.md) - - -## Dynamic Scaling - -### Trainer Scaling - -TODO - -### Parameter Server Scaling - -Not planned for v1. - -## Training Dataset Format - -TODO - -## User Interface - -TODO diff --git a/doc/v2/design/cluster_train/checkpointing.md b/doc/v2/design/cluster_train/checkpointing.md deleted file mode 100644 index c87ef2c7d2636208866d05456d5d44316d0bb200..0000000000000000000000000000000000000000 --- a/doc/v2/design/cluster_train/checkpointing.md +++ /dev/null @@ -1,44 +0,0 @@ -## 模型参数检查点(Checkpointing) -模型数据检查点的实现,可以有效的避免parameter server的单点或多点同时故障。模型参数检查点通过定期向磁盘上保存一份存储在parameter server内存中的模型数据的完整镜像,来保证训练过程可以从中间状态重新启动。在一个不可中断并缺少备份的训练任务中,可以通过阶段性的保存每个parameter server的数据快照(snapshot)到 ***分布式存储服务*** 达到容灾的目的,比如每隔10分钟最新的快照,并删除更早的快照。在出现单点故障时,只需要恢复这台节点,或者将这台节点迁移到另一个节点并启动即可恢复训练任务。 - - - -### 快照保存的设计如下: - -说明: - -* parameter server在集群中启动后,自动挂载分布式存储目录,并把快照保存到这个目录下。 -* ***注:每个parameter server的检查点各自独立保存,暂时不考虑多个parameter server同步的保存一个特定时间点的全局检查点,因为这样做也没法保证消除随机性。*** - -检查点保存程序流程: - -1. 如果满足条件"每隔10分钟"时,parameter server会获取parameters内存的`read_lock`,启动一个新的线程开始保存检查点。如果已经正在执行保存检查点的线程,则忽略。由于对parameters的更新需要获取parameters内存的`write_lock`,所以在写入快照的过程中,parameter server会暂停参数更新并等待。 -2. parameter server生成一个UUID,向指定的目录中一个新的文件(文件名为此UUID)写入快照数据。在快照写入完成后,计算这个文件的MD5 sum。然后在etcd的`/checkpoints/[pserver_id]`中写入json内容:`{"uuid": [UUID], "md5", "MD5 sum", "timestamp": xxxx}`。 -3. 删除磁盘目录中不是当前uuid的快照文件。 -4. 释放对paramters内存的锁定,停止保存检查点的线程。 - -这里需要用户额外注意,在您的实际环境中,训练任务的运行可能会占满trainer和parameter server之间的网络带宽,如果parameter server此时还需要通过网络访问分布式存储以保存快照,可能会造成网络拥塞,而出现阶段性的运行停滞。 - -### 从快照恢复 - -在parameter server第一次启动或任意时间parameter server故障后被Kubernetes重新启动,则需要回滚到上一个检查点: - - 1. 从etcd中读取节点:`/checkpoints/[pserver_id]`获取最新的检查点的文件uuid - 1. 从磁盘文件中加载uuid文件名的检查点快照文件,并加载其中的参数 - 1. 如果上面两步出现错误,则使用启动参数定义的初始化方法初始化参数 - 1. 开始提供服务 - -## TODO List -### 推测执行/加速执行(TODO) -在异构集群中,如果存在某些trainer执行速度过慢会影响整体集群的速度(如图中Trainer 1),此时master将负责启动一个新的Trainer(Accelerate Trainer 2),使用同样的训练数据block。哪个trainer先完成block的训练,则把另一个慢速的kill掉。 - -### 动态扩容/缩容 -目前只考虑动态扩容trainer数量,可以减小系统复杂性。 - -## 术语 -* model: 指深度学习训练之后得到的所有参数,使用这个神经网络可以完成对新数据的预测 -* parameters: 神经网络中的参数,包括权重w和偏置b。一个神经网络的模型由大量的参数组成 -* shard: 分片,通常指将一个整体拆分成多份的其中的一份。 -* model shard: 将一个神经网络参数拆分成多份,每个shard分别存储在其中一台parameter server之上 -* parameter block: 多个parameter block构成一个model shard -* 单点故障: 任意时刻只可能同时有一台服务器故障。由于集群中同时存在两台机器故障的概率极低((平均故障率*平均故障修复时间)^2)只对特殊在线系统考虑两台以上同时故障的容灾。 diff --git a/doc/v2/design/cluster_train/data_dispatch.md b/doc/v2/design/cluster_train/data_dispatch.md deleted file mode 100644 index 1f5d22ff5e6abcb576d16cbe7391da1967a1ab8e..0000000000000000000000000000000000000000 --- a/doc/v2/design/cluster_train/data_dispatch.md +++ /dev/null @@ -1,160 +0,0 @@ -## 训练数据的存储和分发 - -### 概念解释 - -### 流程介绍 -生产环境中的训练数据集通常体积很大,并被存储在诸如Hadoop HDFS,Ceph,AWS S3之类的分布式存储之上。这些分布式存储服务通常会把数据切割成多个分片分布式的存储在多个节点之上。这样就可以在云端执行多种数据类计算任务,包括: - -* 数据预处理任务 -* Paddle训练任务 -* 在线模型预测服务 -
- -
- -在上图中显示了在一个实际生产环境中的应用(人脸识别)的数据流图。生产环境的日志数据会通过实时流的方式(Kafka)和离线数据的方式(HDFS)存储,并在集群中运行多个分布式数据处理任务,比如流式数据处理(online data process),离线批处理(offline data process)完成数据的预处理,提供给paddle作为训练数据。用户也可以上传labeled data到分布式存储补充训练数据。在paddle之上运行的深度学习训练输出的模型会提供给在线人脸识别的应用使用。 - -### 训练数据存储 -我们选择[CephFS](http://docs.ceph.com/docs/master/cephfs/)作为存储系统。 - -- 无论是从[PFSClient](../file_manager/README.md)的角度,还是从[Pod](https://kubernetes.io/docs/concepts/workloads/pods/pod/)中运行任务的角度,统一用`/pfs/$DATACENTER/home/$USER`来访问用户自己的数据。 -- `/pfs/$DATACENTER/common`下存放公共数据集合 - - 做只读挂载 - -
- -
- -### 文件预处理 - - -在开始训练之前, 数据集需要预先被转换成PaddlePaddle分布式训练使用的存储格[RecordIO](https://github.com/PaddlePaddle/Paddle/issues/1947)。我们提供两个转换方式: - -1. 用户在本地转换好再上传 -1. 用户上传数据后,在机群上运行转换程序 - -转换生成的文件名会是以下格式: - -```text -name_prefix-aaaaa-of-bbbbb -``` - -"aaaaa"和"bbbbb"都是五位的数字,每一个文件是数据集的一个shard,"aaaaa"代表shard的index,"bbbbb"代表这个shard的最大index。 - -比如ImageNet这个数据集可能被分成1000个shard,它们的文件名是: -```text -imagenet-00000-of-00999 -imagenet-00001-of-00999 -... -imagenet-00999-of-00999 -``` - -#### 转换库 - -无论是在本地或是云端转换,我们都提供Python的转换库,接口是: -```python -def convert(output_path, reader, num_shards, name_prefix) -``` - -- `output_path`: directory in which output files will be saved. -- `reader`: a [data reader](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/reader/README.md#data-reader-interface), from which the convert program will read data instances. -- `num_shards`: the number of shards that the dataset will be partitioned into. -- `name_prefix`: the name prefix of generated files. - -`reader`每次输出一个data instance,这个instance可以是单个值,或者用tuple表示的多个值: - -```python -yield 1 # 单个值 -yield numpy.random.uniform(-1, 1, size=28*28) # 单个值 -yield numpy.random.uniform(-1, 1, size=28*28), 0 # 多个值 -``` - -每个值的类型可以是整形、浮点型数据、字符串,或者由它们组成的list,以及numpy.ndarray。如果是其它类型,会被Pickle序列化成字符串。 - -### 示例程序 - -#### 使用转换库 - -以下`reader_creator`生成的`reader`每次输出一个data instance,每个data instance包涵两个值:numpy.ndarray类型的值和整型的值: -```python -def reader_creator(): - def reader(): - for i in range(1000): - yield numpy.random.uniform(-1, 1, size=28*28), 0 # 多个值 - return reader -``` - -把`reader_creator`生成的`reader`传入`convert`函数即可完成转换: -```python -convert("./", reader_creator(), 100, random_images) -``` - -以上命令会在当前目录下生成100个文件: -```text -random_images-00000-of-00099 -random_images-00001-of-00099 -... -random_images-00099-of-00099 -``` - -#### 进行训练 - - -PaddlePaddle提供专用的[data reader creator](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/reader/README.md#python-data-reader-design-doc),生成给定`RecordIO`文件对应的data reader。**无论在本地还是在云端,reader的使用方式都是一致的**: - -```python -# ... -reader = paddle.reader.creator.RecordIO("/pfs/datacenter_name/home/user_name/random_images-*-of-*") -batch_reader = paddle.batch(paddle.dataset.mnist.train(), 128) -trainer.train(batch_reader, ...) -``` - -以上代码的reader输出的data instance与生成数据集时,reader输出的data instance是一模一样的。 - -### 上传训练文件 - -使用下面命令,可以把本地的数据上传到存储集群中。 - -```bash -paddle pfs cp filename /pfs/$DATACENTER/home/$USER/folder/ -``` - -比如,把之前示例中转换完毕的random_images数据集上传到云端的`/home/`可以用以下指令: - -```bash -paddle pfs cp random_images-*-of-* /pfs/$DATACENTER/home/$USER/folder/ -``` - -需要`$DATACENTER`的配置写到配置文件中,例如 - -``` -# config file -[datacenter_1] -username=user -usercert=user.pem -userkey=user-key.pem -endpoint=datacenter1.paddlepaddle.org - -[datacenter_2] -username=user -usercert=user.pem -userkey=user-key.pem -endpoint=datacenter2.paddlepaddle.org -``` -## TODO -### 文件访问的权限 -控制用户权限 - -- 用户可以把自己的数据分享给别人 - -### 文件访问方式 -不用mount的方式来访问数据,而是直接用API的接口远程访问 - -例如: - -``` -f = open('/pfs/datacenter_name/home/user_name/test1.dat') -``` - - -### 支持用户自定义的数据预处理job diff --git a/doc/v2/design/cluster_train/large_model_dist_train.md b/doc/v2/design/cluster_train/large_model_dist_train.md deleted file mode 100644 index edb0245ea083e791b7f32ac57a330698299fceda..0000000000000000000000000000000000000000 --- a/doc/v2/design/cluster_train/large_model_dist_train.md +++ /dev/null @@ -1,101 +0,0 @@ -# Alalysis of large model distributed training in Paddle - -***NOTE: This is only some note for how we implemeted this scheme in V1, not a new design.*** - -## What is it - -We often encounter cases that the embedding layer parameters(sparse) are so large that we can not store it in the trainer's memory when training. So we need to put them to several servers, and fetch them row by row instead of fetch all of the parameters. - -## How to use - -Specify command-line argument like `--loadsave_parameters_in_pserver=true --ports_num_for_sparse=1 --use_old_updater=1` when starting the paddle trainer. And also add something like `--ports_num_for_sparse=1 --pserver_num_threads=5` when starting pserver processes. - -Accrodingly, configure your embedding layers like: - -```python -SPARSE_REMOTE=True - -w1 = data_layer(name="w1", size=dict_size) -emb1 = embedding_layer(input=w1, size=32, param_attr=ParameterAttribute(sparse_update=SPARSE_REMOTE)) -w2 = data_layer(name="w2", size=dict_size) -emb2 = embedding_layer(input=w2, size=32, param_attr=ParameterAttribute(sparse_update=SPARSE_REMOTE)) -... -``` - -## Implementation details - -```c++ -enum MatType { - MAT_NORMAL, - MAT_NORMAL_SHARED, - MAT_VALUE_SHARED, - MAT_SPARSE_ROW_IDS, - MAT_SPARSE_ROW_AUTO_GROW, - MAT_CACHE_ROW, - MAT_SPARSE_ROW, - MAT_SPARSE_ROW_PREFETCH, - MAT_SPARSE_ROW_PREFETCH_FULL_SIZE, -}; -``` - -`MAT_SPARSE_ROW_PREFETCH` is what we use when configured to fetch only row of matrix when training. - -In `trainer_internal.cpp:L93 trainOneBatch`: - -```c++ - if (config_->getOptConfig().use_sparse_remote_updater()) { - REGISTER_TIMER("prefetch"); - gradientMachine_->prefetch(inArgs); - parameterUpdater_->getParametersRemote(); - } -``` - -When doing actual network forward and backward, at the beginning of each batch, the trainer will try to download one row of data from pserver. - -In `legacy/trainer/RemoteParameterUpdater.cpp`: `parameterUpdater_->getParametersRemote();`: - -```c++ -if (fullSize) { - ... -} else { -getParams = [&] { - parameterClient_->getParameterSparse( - /* recvParameterType= */ PARAMETER_VALUE, sendBackParameterType); -}; -applyL1 = [](Parameter& para, real decayRate) { - para.getMat(PARAMETER_VALUE)->applyL1(/*lr=*/1.0f, decayRate); -}; -} -``` - -Calling `parameterClient_->getParameterSparse` will do remote call to pserver's `getParameterSparse`: - -```c++ -void ParameterServer2::getParameterSparse(const SendParameterRequest& request, - std::vector& inputBuffers, - SendParameterResponse* response, - std::vector* outputBuffers) { - (void)inputBuffers; - auto& buffer = *readWriteBuffer_; - size_t numReals = 0; - for (const auto& block : request.blocks()) { - numReals += getParameterConfig(block).dims(1); - } - buffer.resize(numReals); - - VLOG(3) << "pserver: getParameterSparse, numReals=" << numReals; - - ReadLockGuard guard(parameterMutex_); - size_t offset = 0; - for (const auto& block : request.blocks()) { - size_t width = getParameterConfig(block).dims(1); - Buffer buf = {buffer.data() + offset, width}; - int type = request.send_back_parameter_type(); - sendBackParameterSparse(block, type, response, &buf, width, outputBuffers); - offset += width; - } -} -``` - -`getParameterConfig(block).dims(1)` returns the width of the current "parameter block"(a shard of parameter object), -then `getParameterSparse` remote call returns only one row of data to the client. diff --git a/doc/v2/design/cluster_train/master_server.md b/doc/v2/design/cluster_train/master_server.md deleted file mode 100644 index 4bf3c506f101361875043f8bfd97972b8c981a22..0000000000000000000000000000000000000000 --- a/doc/v2/design/cluster_train/master_server.md +++ /dev/null @@ -1,91 +0,0 @@ -# Design Doc: Master Server - -For an overview of master server's role, please refer to [distributed training design doc](./README.md). In this design doc we will discuss the master server in more details. The master will be implemented in [Go](https://golang.org/). - -## Dataset - - - -A dataset is a list of files in *RecordIO* format. A RecordIO file consists of chunks, whereas each chunk consists some records. - -## Task Queue - -As mentioned in [distributed training design doc](./README.md), a *task* is a data shard that the master server assigns to the trainer process to train on. A task consists of one or multiple *chunks* from one or multiple files. The master server maintains *task queues* to track the training progress. - -### Task Queue Creation - -1. Each trainer will make an RPC call (using Go's [rpc](https://golang.org/pkg/net/rpc/) package) to the master server, telling it the RecordIO files representing the dataset specified by the user. Since every trainer will tell the master server the same dataset, only the first RPC call will be honored. - - The RPC interface is: - ```go - func (m *RPCServer) ReportDataset(Paths []string, dummy *int) error { - } - ``` -1. The master server will scan through each RecordIO file to generate the *chunk index* and know how many chunks does each file have. A chunk can be referenced by the file path and the index of the chunk within the file. The chunk index is in memory data structure that enables fast access to each chunk, and the index of the chunk with the file is an integer start from 0, representing the n-th chunk within the file. - - The definition of the chunk is: - ```go - type Chunk struct { - Idx int // index of the chunk within the file - Path string - Index recordio.Index // chunk index - } - ``` -1. Chunks are grouped into tasks, and tasks are filled into the todo queue. The pending queue and the done queue are initialized with no element. - - The definition of the task is: - ```go - type Task struct { - Index int - Chunks []Chunk - } - ``` - - The elements in the tasks queues is of type `TaskEntry`, containing a timeout counter (described in [task retry logic](#task-retry-logic)), and a task: - ```go - type TaskEntry struct { - NumTimeout int - Task Task - } - ``` - - The definition of task queues is: - ```go - type TaskQueues struct { - Todo []TaskEntry - Pending map[int]TaskEntry // map from task index to task entry - Done []TaskEntry - } - ``` - -### Task Queue Persistence - -The task queues need to be persisted on [etcd](https://github.com/coreos/etcd) for fault recovery. Since the task queues only change once a task is completed or timed out, which is not very frequent, we can afford to synchronize with etcd every time the task queues change. - -We will serialize the task queues data structure with [gob encoding](https://golang.org/pkg/encoding/gob/), compress with gzip, and save into etcd synchronously under key `/task_queues`. - -### Task Dispatch - -The trainer will make an RPC call to master to get a new task when: - -- the trainer first started, or -- the trainer finishes a task. - -The RPC interface is: -```go -func (m *RPCServer) GetTask(finished *Task, result *Task) error { -} -``` -Argument `finished` will be `nil` when the trainer is just started. - -During the RPC call the master will do the following: - -- Make a copy of the task queues, and update the copy reflecting the finished tasks and the new pending tasks. -- Synchronize the copy of task queues with etcd using a transaction conditioned on holding the master lock. -- Replace the task queues with the copy and report to the trainer with the new tasks if succeeded, or discard the copy and report the error to the trainer if failed. - -### Task Retry Logic - -When a task is dispatched to the trainer, the master will schedule a function for execution after the timeout duration (based on the moving average of task completion time). If the task entry in still in the pending queue, its timeout counter will increase by one, and the task will be moved to todo queue. If the timeout counter is above the threshold, the master will log the error and discard the task. - -Please note that since a timed out task could be completed after it has been dispatched for retry, so it is possible for a task to be processed multiple times. We do not try to prevent it from happening since it's fine to train on the same task multiple times due to the stochastic nature of the stochastic gradient decent algorithm. diff --git a/doc/v2/design/cluster_train/pserver_client.md b/doc/v2/design/cluster_train/pserver_client.md deleted file mode 100644 index 474b8c572cd92fc87e9f7f3f2b19d12cccd158de..0000000000000000000000000000000000000000 --- a/doc/v2/design/cluster_train/pserver_client.md +++ /dev/null @@ -1,171 +0,0 @@ -# Design Doc: The Client Library of Parameter Server - -For an overview of trainer's role, please refer to [distributed training design doc](README.md). In this design doc, we will discuss the parameter server's client library, which will manage communication with parameter servers. The library will be implemented in [Go](https://golang.org/) and made available as a static or dynamic library with a C header file. - -## Parameter Partition - -Each parameter will be partitioned into parameter blocks to make the parameters evenly distributed on parameter servers. The partition is done automatically by the client library. The *sparse parameter* require a little different treatment: - -### Sparse Parameter - -The sparse parameter is a parameter that is updated sparsely. The name is somewhat misleading, it does not have a sparse representation, it has the same representation as a dense vector. - -Because a sparse parameter is updated sparsely, the trainer will have to partition the sparse parameter. Because the parameter server will merge all sparse parameter shard into the same file when saving the parameter. It needs special naming convention: - -If a sparse parameter is partitioned into n shards, they should be named as: - -```text -name:sparse-0 -name:sparse-1 -... -name:sparse-n-1 -``` - -The library is unaware of the partition, and treat each parameter independently. Only when saving parameters, the parameter servers will merge the sparse parameters according to the naming convention. - -## Model Optimization Using Gradients - -There are two ways to perform model optimization using gradients: - -- On Client - - The client does multiple steps of forward and backward update. In each step, the gradients are calculated and a new model is generated. After some steps, the client will calculate the difference between the newest model and the old model at step 0. The difference will be updated to parameter servers. Parameter servers will just update parameters using the difference without any optimization using gradients (such as Adam and L1 regularization). - -- On Parameter Server - - The client will send accumulated gradients to parameter servers, the parameter server will do the optimization using gradients. - -## L1 and L2 Regularization - -PaddlePaddle allows L1 or L2 regularizations to be specified per parameter, so when the trainer initializes the parameter it needs include a parameter configuration when L1 or L2 regularization is necessary. - -## Parameter Initialization - -The parameters on parameter servers need to be initialized. To provide maximum flexibility, the trainer will initialize the parameters. Only one trainer will do the initialization, the other trainers will wait for the completion of initialization and get the parameters from the parameter servers. - -### Trainer Selection - -To select the trainer for initialization, every trainer will try to get a distributed lock, whoever owns the lock will do the initialization. As illustrated below: - - - -### Trainer Selection Process - -The trainer select process is encapsulated in the C API function: -```c -int paddle_begin_init_params(paddle_pserver_client* client, const char* config_proto); -``` -The selected trainer's call to `paddle_begin_init_params` will return with 1, and the other trainers' call to `paddle_begin_init_params` will return 0. `paddle_get_params` will be blocked until initialization is completed. As illustrated below: - - - -## C Interface - -```c -typedef enum { - PADDLE_ELEMENT_TYPE_INT32 = 0, - PADDLE_ELEMENT_TYPE_UINT32 = 1, - PADDLE_ELEMENT_TYPE_INT64 = 2, - PADDLE_ELEMENT_TYPE_UINT64 = 3, - PADDLE_ELEMENT_TYPE_FLOAT32 = 4, - PADDLE_ELEMENT_TYPE_FLOAT64 = 5, -} paddle_element_type; - -typedef struct { - char* name; - paddle_element_type element_type; - unsigned char* content; - int content_len; -} paddle_parameter, paddle_gradient; - -typedef int paddle_pserver_client; - -/** - * @brief creates a pserver client that talks to etcd for coordination. - */ -paddle_pserver_client paddle_new_etcd_pserver_client(char* etcd_addr); - -/** - * @brief creates a pserver client given pserver addresses. - * - * @param pserver_addrs comma-separated pserver addresses. - * @param selected if current pserver client is selected to initialize all parameter servers. - */ -paddle_pserver_client paddle_new_pserver_client(char* pserver_addrs, int selected); -void paddle_pserver_client_release(paddle_pserver_client c); - -/** - * @brief paddle_begin_init_params begins to initialize parameters on - * parameter servers. - * - * paddle_begin_init_params will be called from multiple trainers, - * only one trainer will be selected to initialize the parameters on - * parameter servers. Other trainers need to get the initialized - * parameters from parameter servers using @paddle_get_params. - * - * @return 1 if the trainer is selected to initialize parameter - * servers, otherwise 0. - */ -int paddle_begin_init_params(paddle_pserver_client client); - -/** - * @brief paddle_init_param initializes the parameter on parameter - * servers. - * - * @param param the parameter to initialize. - * @param param_config_proto the configuration for the parameter. - * @param config_len the length of param_config_proto - * @return 0 if successful, otherwise -1. On failure, the trainer - * needs to restart the entire initialization process (starting from - * @paddle_begin_init_param). Or simply exit the program and wait for - * the cluster management system to restart the trainer. - */ -int paddle_init_param(paddle_pserver_client client, paddle_parameter param, const unsigned char* param_config_proto, int config_len); - -/** - * @brief paddle_finish_init_params tells parameter servers client has - * sent all parameters to parameter servers as initialization. - * - * @return 0 if successful, otherwise -1. On failure, the trainer - * needs to restart the entire initialization process (starting from - * @paddle_begin_init_param). Or simply exit the program and wait for - * the cluster management system to restart the trainer. - */ -int paddle_finish_init_params(paddle_pserver_client client); - -/** - * @brief paddle_send_grads sends gradients to parameter servers for - * updating parameters. - * - * @param grads the array of gradients to send. - * @param len the length of the gradient array. - * @param learning_rate the learning rate for the gradients. - * @return 0 if successful, otherwise -1. - */ -int paddle_send_grads(paddle_pserver_client client, const paddle_gradient* grads, int len); - -/** - * @brief paddle_get_params gets parameters from parameter servers. - * - * paddle_get_params will block until parameters are initialized on - * the parameter servers. - * - * @param dst the destination array of parameter pointers to save to. - * The parameter pointer must be pre-popullated with required parameter name, - * and the content of parameter must be pre-allocated of the size of required - * parameter on pserver. - * @param len the length of the names array and the paddle_parameter - * array. - * @return 0 if successful, otherwise -1. - */ -int paddle_get_params(paddle_pserver_client client, paddle_parameter** dst, int len); - -/** - * @brief paddle_save_model indicates parameters to save the parameter - * to the given path - * - * @param path the path to save parameters. - * @return 0 if successful, otherwise -1. - */ -int paddle_save_model(paddle_pserver_client client, const char* path); -``` diff --git a/doc/v2/design/cluster_train/remote_parameter_updater.md b/doc/v2/design/cluster_train/remote_parameter_updater.md deleted file mode 100644 index 6e8e5938455b869e0f3367794c41250340b37f77..0000000000000000000000000000000000000000 --- a/doc/v2/design/cluster_train/remote_parameter_updater.md +++ /dev/null @@ -1,21 +0,0 @@ -# Design Doc: Remote Parameter Updater for Cluster Train - -For an overview of distribute training, please refer to [distributed training design doc](README.md). In this design doc, we will discuss the parameter updater that will use parameter server cclient [The Client Library of Parameter Server Design Doc](pserver_client.md) to manage and update parameters. - -## Parameter Updater - -Parameter Updater is used by trainer to manage and update parameter, there are mainly two kind of parameter updater: local and remote, since this design is for cluster train, we will only discuss remote parameter updater here. - -### Remote Parameter Updater - -Remote Parameter Updater manage parameters through remote parameter server with the client that communicate with pserver([The Client Library of Parameter Server Design Doc](pserver_client.md)) - -In PaddlePaddle Python V2 API, trainer is implemented in python, and the trainer will hold a instance of parameter updater and call it's functions directly. In this design, we will also expose the api of RemoteParameterUpdater to python with swig. - -#### Sparse Remote Parameter Updater - -Since we will only implement dense parameter management new, the mechanism for sparse parameter will be discussed in next stage. - -### Interface Design - -TBD diff --git a/doc/v2/design/cluster_train/save_model.md b/doc/v2/design/cluster_train/save_model.md deleted file mode 100644 index b755185c81ad617b9c85c47de0f5f65d2201c658..0000000000000000000000000000000000000000 --- a/doc/v2/design/cluster_train/save_model.md +++ /dev/null @@ -1,111 +0,0 @@ -# Design Doc: Save Model - -## Overview - -The model is the output of the training process. There are two -ways from which user can obtain a model: - -- Save model triggered by user code: user code asks PaddlePaddle to - save a model. -- Convert model from the checkpoint: model being converted from - pservers' periodic checkpoint. In this way, the user can cancel a - job at any time, and still have a relatively fresh model (we - checkpoint around every 5 minutes). - -### Trainer Saving Model vs. Pservers Saving Model - -Both trainers and pservers have access to the model. So the model can -be saved from a trainer or pservers. We need to decide where the model -is saved from. - -#### Dense Update vs. Sparse Update - -There are two types of model update methods: dense update and sparse -update (when the model parameter is configured to be sparse). - -- Dense update - - Every trainer has it's own full copy of the model. Every model - update will update the entire model. - -- Sparse update - - The training input is sparse, and the trainer does not have the - entire model. It will only download the sub-model necessary related - to the input. When updating the model, only the sub-model related to - the training input is updated. - - -#### Pservers Saving Model - -The benefit of letting pservers save model is they have the entire -model all the time. However, since pservers are on different nodes, it -requires a merging process to merge model shards into the same -model. Thus requires the pservers to write models to a distributed -filesystem, making the checkpoint shards visible to the merge program. - -#### Trainer Saving Model - -The benefit of letting one trainer to save the model is it does not -require a distributed filesystem. And it's reusing the same save model -logic when training locally - except when doing sparse update, the -trainer needs to download the entire model during the saving process. - -#### Conclusion - -Given trainer saving model does not require a distributed filesystem, -and is an intuitive extension to trainer saving model when training -locally, we decide to let the trainer save the model when doing -distributed training. - - -### Convert Model from Checkpoint - -TODO - - -## Timeline - -We first implement trainer save the model. Converting the latest -snapshot to a model will be a TODO for future. - - -## Trainer Save Model - -### Trainer Election - -One trainer will be elected as the one to save the model. When using -etcd, trainer ID is a randomly generated UUID, the trainer will -contact the master server requesting to save the model, and find out -if itself is elected. When the master server is not used, unique -trainer IDs will be given by the administrator, the trainer whose ID -is "0" is elected to save the model. - -### Model Save Path - -Each trainer will be given the directory to save the model. The -elected trainer will save the model to -`given-directory/trainerID`. Since the trainer ID is unique, this -would prevent concurrent save to the same file when multiple trainers -are elected to save the model when split-brain problem happens. - -### What Happens When Model Is Saving - -It takes some time to save model, we need to define what will happen -when save model is taking place. - -When doing dense update, the trainer uses the local model. Pservers -does not need to pause model update. - -When doing sparse update. The trainer needs to download the entire -model while saving. To get the most accurate model, the model update -needs to be paused before the download starts and resumed after the -download finishes. Otherwise, the trainer gets a model that is -"polluted": some part of the model is old, some part of the model is -new. - -It's unclear that the "polluted" model will be inferior due to the -stochastic nature of deep learning, and pausing the model update will -add more complexity to the system. Since supporting sparse update is a -TODO item. We defer the evaluation of pause the model update or not -during saving model to the future. diff --git a/doc/v2/design/cluster_train/src/checkpointing.png b/doc/v2/design/cluster_train/src/checkpointing.png deleted file mode 100644 index c221e8474f90f37e31416cbb19c9452207a0d14c..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/checkpointing.png and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/data_dispatch.png b/doc/v2/design/cluster_train/src/data_dispatch.png deleted file mode 100644 index 5bdcc24d6a6d193cb014f8c38b362451fded5e54..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/data_dispatch.png and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/dataset.graffle b/doc/v2/design/cluster_train/src/dataset.graffle deleted file mode 100644 index c10a423ed16a23229a9ee33d11bfc82bb59646c8..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/dataset.graffle and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/dataset.png b/doc/v2/design/cluster_train/src/dataset.png deleted file mode 100644 index 2fb7f1cce3b6dd21489392557826e95a9f207c34..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/dataset.png and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/file_storage.graffle b/doc/v2/design/cluster_train/src/file_storage.graffle deleted file mode 100644 index 50a17e70fa255495337c529a3bf12a5c0024a5be..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/file_storage.graffle and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/file_storage.png b/doc/v2/design/cluster_train/src/file_storage.png deleted file mode 100644 index fccb4e3e7e738224c7f1584326bd5f351ce799aa..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/file_storage.png and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/init_lock.graffle b/doc/v2/design/cluster_train/src/init_lock.graffle deleted file mode 100644 index fa9149f21b1311eed48ef72ec55e556559d0fc94..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/init_lock.graffle and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/init_lock.png b/doc/v2/design/cluster_train/src/init_lock.png deleted file mode 100644 index 92404ee6d6c0f9a7727952bae3c869ba338ecd7f..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/init_lock.png and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/paddle-cloud-in-data-center.png b/doc/v2/design/cluster_train/src/paddle-cloud-in-data-center.png deleted file mode 100644 index da5d1a77562480ad1d886f5f21dbd84001d3d508..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/paddle-cloud-in-data-center.png and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/paddle-etcd.graffle b/doc/v2/design/cluster_train/src/paddle-etcd.graffle deleted file mode 100644 index f973dc9b9dbf72e9bc31e2d32822916cd281f8d9..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/paddle-etcd.graffle and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/paddle-etcd.png b/doc/v2/design/cluster_train/src/paddle-etcd.png deleted file mode 100644 index 57981ceb4b94f0f7d6dfa63f3d28c0402bf9cc31..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/paddle-etcd.png and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/paddle-model-sharding.graffle b/doc/v2/design/cluster_train/src/paddle-model-sharding.graffle deleted file mode 100644 index fba30f0ca2b47f0d202a432821d95e55aac37ec8..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/paddle-model-sharding.graffle and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/paddle-model-sharding.png b/doc/v2/design/cluster_train/src/paddle-model-sharding.png deleted file mode 100644 index 8c3f6724ef46c6527e63a4cd8cb0b50fe0167124..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/paddle-model-sharding.png and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/paddle-ps-0.png b/doc/v2/design/cluster_train/src/paddle-ps-0.png deleted file mode 100644 index 47ef32806f182cab003da77f1556823b3f6d1721..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/paddle-ps-0.png and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/paddle-ps-1.png b/doc/v2/design/cluster_train/src/paddle-ps-1.png deleted file mode 100644 index f3125db73096c52bac6e7c60e1675552857c0774..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/paddle-ps-1.png and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/paddle-ps.graffle b/doc/v2/design/cluster_train/src/paddle-ps.graffle deleted file mode 100644 index 0e536ffdd91cd696008b4c01bad3cb53edebdc16..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/paddle-ps.graffle and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/paddle-task-queues.graffle b/doc/v2/design/cluster_train/src/paddle-task-queues.graffle deleted file mode 100644 index 4263ed8bfd2ef0e55058828bf23f2fac3595e5fd..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/paddle-task-queues.graffle and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/paddle-task-queues.png b/doc/v2/design/cluster_train/src/paddle-task-queues.png deleted file mode 100644 index 5f980266795776752cebd0c346b85c4a75a47780..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/paddle-task-queues.png and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/paddle-task-states.graffle b/doc/v2/design/cluster_train/src/paddle-task-states.graffle deleted file mode 100644 index cf1a0b9246d9386a949d2dbb8c32fe84f72eea83..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/paddle-task-states.graffle and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/paddle-task-states.png b/doc/v2/design/cluster_train/src/paddle-task-states.png deleted file mode 100644 index 4ae43cb66c071aee9eb90d875e2373b29af9c3e0..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/paddle-task-states.png and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/pserver_init.graffle b/doc/v2/design/cluster_train/src/pserver_init.graffle deleted file mode 100644 index 5f3f1f52be8aa7f9049a8fcd6b7c93c8560c1676..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/pserver_init.graffle and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/pserver_init.png b/doc/v2/design/cluster_train/src/pserver_init.png deleted file mode 100644 index dfe491ff98dd7db1c336093c80964a260df2cd90..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/pserver_init.png and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/submit-job.graffle b/doc/v2/design/cluster_train/src/submit-job.graffle deleted file mode 100644 index 677cdfb6d9a32168bf71729eb841fa1ca0dd31d6..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/submit-job.graffle and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/submit-job.png b/doc/v2/design/cluster_train/src/submit-job.png deleted file mode 100644 index 3046a460a7ba708079e88a560debaa215a694680..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/submit-job.png and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/trainer.graffle b/doc/v2/design/cluster_train/src/trainer.graffle deleted file mode 100644 index 43415ed8cf61a5acfa34f8e56b9577f338dbf254..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/trainer.graffle and /dev/null differ diff --git a/doc/v2/design/cluster_train/src/trainer.png b/doc/v2/design/cluster_train/src/trainer.png deleted file mode 100644 index 6537d3d56589ca9f19a77a50a970e4b5275e6ce0..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/cluster_train/src/trainer.png and /dev/null differ diff --git a/doc/v2/design/cluster_train/submit-job.md b/doc/v2/design/cluster_train/submit-job.md deleted file mode 100644 index 8377d5489dc64bd2fdc5bb4f7bc737e7b489000d..0000000000000000000000000000000000000000 --- a/doc/v2/design/cluster_train/submit-job.md +++ /dev/null @@ -1,127 +0,0 @@ -# Submit a Distributed Training Job - -The user can submit a distributed training job with Python code, rather than with a command-line interface. - -## Runtime Environment On Kubernetes - -For a distributed training job, there is two Docker image called *runtime Docker image* and *base Docker image*. The runtime Docker image is the Docker image that gets scheduled by Kubernetes to run during training. The base Docker image is for building the runtime Docker image. - -### Base Docker Image - -Usually, the base Docker image is PaddlePaddle product Docker image including paddle binary files and python package. And of course, users can specify any image name hosted on any docker registry which users have the access right. - -### Runtime Docker Image - -The trainer package which user upload and some Python dependencies are packaged into a runtime Docker image based on base Docker image. - -- Handle Python Dependencies - - You need to provide requirements.txt file in your `trainer-package` folder. Example: - - ```txt - pillow - protobuf==3.1.0 - ``` - More [details](https://pip.readthedocs.io/en/1.1/requirements.html) about requirements, an example project looks like: - ```bash - paddle_example - |-quick_start - |-trainer.py - |-dataset.py - |-requirements.txt - ``` - -## Submit Distributed Training Job With Python Code - - -- `paddle.job.dist_train()` will call the Job Server API `/v1/packages` to upload the trainer package and save them on CephFS, and then call `/v1/trainer/job` to submit the PaddlePaddle distributed job. -- `/v1/trainer/job` will start a building job for preparing the runtime Docker image. When the building job is finished, Job Server will submit the PaddlePaddle distributed job to Kubernetes. -- *NOTE*: For the first version, we will not prepare the runtime Docker image, instead, the package is uploaded to Paddle Cloud, and Paddle Cloud will mount the package in a temporary folder into the base Docker image. We will not support custom Python dependencies in the first version as well. - -You can call `paddle.job.dist_train` and provide distributed training configuration as the parameters: -```python -paddle.job.dist_train( - trainer=dist_trainer(), - paddle_job=PaddleJob( - job_name = "paddle-cloud", - entry_point = "python %s"%__file__, - trainer_package = "/example/word2vec", - image = "yancey1989/paddle-job", - trainers = 10, - pservers = 3, - trainer_cpu = 1, - trainer_gpu = 1, - trainer_mem = "10G", - pserver_cpu = 1, - pserver_mem = "2G" - )) -``` - -The parameter `trainer` of `paddle.job.dist_train` is a function and you can implement it as follows: -```python -def dist_trainer(): - def trainer_creator(): - trainer = paddle.v2.trainer.SGD(...) - trainer.train(...) - return trainer_creator -``` - -The pseudo code of `paddle.job.dist_train` is as follows: -```python -def dist_train(trainer, paddle_job): - # if the code is running on cloud, set PADDLE_ON_CLOUD=YES - if os.getenv("RUNNING_ON_CLOUD", "NO") == "NO": - #submit the paddle job - paddle_job.submit() - else: - #start the training - trainer() -``` -### PaddleJob Parameters -parameter | type | explanation - --- | --- | --- -job_name | str | the unique name for the training job -entry_point | str | entry point for startup trainer process -trainer_package | str | trainer package file path which user have the access right -image|str|the [base image](#base-docker-image) for building the [runtime image](#runtime-docker-image) -pservers|int| Parameter Server process count -trainers|int| Trainer process count -pserver_cpu|int| CPU count for each Parameter Server process -pserver_mem|str| memory allocated for each Parameter Server process, a plain integer using one of these suffixes: E, P, T, G, M, K -trainer_cpu|int| CPU count for each Trainer process -trainer_mem|str| memory allocated for each Trainer process, a plain integer using one of these suffixes: E, P, T, G, M, K -trainer_gpu|int| GPU count for each Trainer process, if you only want CPU, do not set this parameter - -### Deploy Parameter Server, Trainer and Master Process - - Deploy PaddlePaddle Parameter Server processes, it's a Kubernetes ReplicaSet. - - Deploy PaddlePaddle Trainer processes, it's a Kubernetes Job. - - Deploy PaddlePaddle Master processes, it's a Kubernetes ReplicaSet. - -## Job Server - -- RESTful API - - Job server provides RESTful HTTP API for receiving the trainer package and displaying - PaddlePaddle job related informations. - - `POST /v1/package` receive the trainer package and save them on CephFS - - `POST /v1/trainer/job` submit a trainer job - - `GET /v1/jobs/` list all jobs - - `GET /v1/jobs/` the status of a job - - `DELETE /v1/jobs/` delete a job - - `GET /v1/version` job server version - -- Build Runtime Docker Image on Kubernetes - - `paddle.job.dist_train` will upload the trainer package to Job Server, save them on the distributed filesystem, and then start up a job for building the runtime Docker image that gets scheduled by Kubernetes to run during training. - - There are some benefits for building runtime Docker image on JobServer: - - On Paddle Cloud, users will run the trainer code in a Jupyter Notebook which is a Kubernetes Pod, if we want to execute `docker build` in the Pod, we should mount the host's `docker.sock` to the Pod, user's code will connect the host's Docker Engine directly, it's not safe. - - Users only need to upload the training package files, does not need to install docker engine, docker registry as dependencies. - - If we want to change another image type, such as RKT, users do not need to care about it. - -- Deploy Parameter Server, Trainer and Master Processes - - `POST /v1/trainer/job` receives the distributed training parameters, and deploy the job as follows: - - Deploy PaddlePaddle Parameter Server processes, it's a Kubernetes ReplicaSet. - - Deploy PaddlePaddle Trainer processes, it's a Kubernetes Job. - - Deploy PaddlePaddle Master processes, it's a Kubernetes ReplicaSet. diff --git a/doc/v2/design/interface/00.why_plain_c.md b/doc/v2/design/interface/00.why_plain_c.md deleted file mode 100644 index 826ff3141bc2512b525cb44ac0f18b376ce57e92..0000000000000000000000000000000000000000 --- a/doc/v2/design/interface/00.why_plain_c.md +++ /dev/null @@ -1,118 +0,0 @@ -# Paddle多语言接口实现 -## 背景 - -Paddle需要一个多语言接口,这个接口需要做到: - -* 有标准的,良好的文档 - * 例如Python可以使用[Sphinx](http://www.sphinx-doc.org/en/stable/)生成API文档,golang可以使用[GoDoc](https://godoc.org/golang.org/x/tools/cmd/godoc)生成文档。这都需要这个接口按照约定俗成的规则来注释完备。 -* 不同语言的接口适应不同语言的特性 - * 例如Java与Python的错误处理是直接扔出来Exception,而对于golang错误处理应该使用返回值。 - -## 基本要求 - -Paddle的多语言接口实现包括一下几个方面: - -* 我们使用动态库来分发Paddle。在这个动态库中不嵌入任何其他语言的解释器,也不使用其他动态库。 -* 这个动态库使用C99标准的头文件导出一些函数,不使用/导出C++符号。 -* 不导出Paddle内部的结构体、类,仅仅使用`void*`指针作为类型的句柄(handler)。 -* 不使用SWIG这种代码生成器,而是手写多语言绑定。 - - -## 原因 - -### 使用动态库来分发Paddle - -* Paddle的链接方式比较复杂 - * 如果用户要把Paddle的静态库(libpaddle.a)链接到自己的程序里,得使用 `--whole-archive` (for GCC) 或者 `--force_load` (for Clang) 参数,来确保把 libpaddle.a 里所有的符号都写入自己的程序的二进制文件里。这是因为 Paddle 的源码里使用了[object factory design pattern](http://stackoverflow.com/a/1310326/724872)。 -* 编译型语言,例如C/C++使用静态库和动态库难度差不多。但是解释性语言,例如[Python](http://stackoverflow.com/questions/19560594/how-to-import-static-library-in-python)或者[Java](http://stackoverflow.com/questions/24493337/linking-static-library-with-jni),只能调用Paddle的动态库,否则得把Paddle静态库链接到解释器里。 - * 解释性语言实际运行的二进制是解释器本身,如果调用静态库只能将静态库与解释器链接。例如对于Java来说,便是将静态库加入JVM中。这对于通常的Java的开发者来说,是不常见的做法。 - -### 动态库中不嵌入任何其他语言的解释器 - -* 目前Paddle的进程模型是C++内部驱动Python解释器进行模型配置解析和数据读取 -* 我们最终的动态库中不嵌入Python或者其他任何语言的解释器。模型配置解析,数据读取均交由其他语言完成 - -现阶段Paddle有一个问题是,Paddle内嵌的Python解释器和外部使用的Python如果版本不同,会直接报错退出。 - -### Paddle动态库中,不引用其他动态库 - -* 即这个动态库是不依赖于其他任何文件的,可以在任何机器上执行的。 - -### 这个动态库使用C99标准的头文件导出一些函数,不使用/导出C++符号 - -* 由于C++编译器没有[名字修饰](https://en.wikipedia.org/wiki/Name_mangling#C.2B.2B)的规范,不同版本的编译器之间,对于同一段C++代码生成的符号可能不一致。而多语言接口需要直接读取生成的二进制(动态库),需要有稳定的导出符号。 -* C语言是有导出符号的标准的,并且在常见的平台上,都是ABI调用标准的。 -* 大多数语言都支持使用C语言API -* 使用C99而不使用C89,是因为C99支持[Fixed-width integer types](https://en.wikipedia.org/wiki/C_data_types#Fixed-width_integer_types)和[Boolean type](https://en.wikipedia.org/wiki/C_data_types#Boolean_type)。 -* 使用C99而不使用C11的原因是,[C11](https://en.wikipedia.org/wiki/C11_(C_standard_revision))并没有Paddle特别需要的特性,且C99相对于C11使用更加广泛。 - -### 不导出Paddle内部的结构体、类,仅仅使用`void*`指针作为类型的句柄(handler) - -* Paddle内部的类为C++书写,直接导出到C的接口比较困难。 -* 在C-API中使用`void*`来表示Paddle内部类。再在每一个API中自己检查类型。 - -在C的头文件 `paddle_matrix.h` 中: - -```C -typedef void* paddle_matrix; -typedef int paddle_error; - -extern "C" -paddle_error paddle_matrix_get_shape(paddle_matrix matrix, - uint64_t* width, - uint64_t* height); -``` -而在CPP里面实现这个C的接口,文件 `paddle_matrix.cpp` - -```cpp -#include "paddle/legacy/math/matrix.h" -extern "C" -paddle_error paddle_matrix_shape(paddle_matrix matrix, - uint64_t *width, - uint64_t *height) { - auto m = (paddle::capi::CMatrix*)(matrix); - *width = m->width(); - *height = m->height(); -} -``` - -其中`paddle/capi/CMatrix.hpp`文件内容为: - -```cpp -namespace paddle { -namespace math { - -class CMatrix { - std::shared_ptr mat; -}; - -} // namespace math -} // namespace paddle -``` - -### 不使用SWIG这种代码生成器,而是手写多语言绑定 - -* [SWIG](http://www.swig.org/)是一个多语言接口的代码生成器。他的目标是使用C/C++写代码,SWIG直接读取C/C++的头文件,生成各种语言的绑定代码。 - * 对于多语言接口,SWIG需要写一个interface文件。这个文件具有独特的语法,学习成本高。且增加一个第三方语言,就需要对这个第三方语言增加一些定义。有的时候,interface文件的写法非常[tricky](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/api/Paddle.swig#L36)。社区贡献代码学习成本高。 - * SWIG暴露的接口保留了C++的接口样式,很难保证多语言代码风格的一致性。(函数命名,错误处理) - * 因为SWIG在第三方语言中暴露的函数名,类名和C++中完全一致。C++的命名风格并不能适应其他第三方语言。如果使用SWIG我们需要将在interface文件里,将大量的`SomeCppClass`重命名成`some_python_class`,或者`SomeGoTypes`。 - * 对于不同语言,错误处理的方式也不尽相同。例如对于Java或者Python,最常见的错误处理方式是Exception,而对于Golang,错误处理方式是返回值。而SWIG只能简单的暴露C++接口,无法做到对于各种语言错误处理方式的适配。 - * 对于大多数语言,直接使用C语言的.h并不困难。例如Python的[cffi](https://cffi.readthedocs.io/en/latest/overview.html#simple-example-abi-level-in-line)或者[Cython](http://cython.org/), golang的[cgo](https://golang.org/cmd/cgo/)。 - * SWIG支持的语言或者解释器有局限。例如对于Python,使用SWIG只支持CPython解释器,而不支持PyPy解释器。 - - -## 原因列表 - -| 结论 | 对比 | 原因 | -|---| --- | --- | -| 使用动态库 | 不使用静态库 | 解释型语言只能调用动态库,Paddle静态库链接复杂 | -| 不嵌入其他语言解释器 | 不嵌入Python解释器 | Paddle C++目前嵌入Python解释器,会导致不同版本Python在一个进程里的bug | -| 不引用其他动态库 | | Paddle一个动态库可以在任何Linux系统上运行 | -| 使用C99做接口 | 不使用C++做接口 | C有标准的ABI,C99是目前C最广泛的使用标准,且C99支持bool类型和定长整数(uint64_t等)类型 | -| 使用void*作为类句柄 | 不显示的写每个类具体包含什么| 实现简单,并且让接口脱离实现细节 | -| 手写多语言绑定 | 不使用SWIG | 使用SWIG需要多语言绑定的开发人员熟练掌握SWIG配置,社区参与困难。SWIG生成的代码不能保证多语言代码风格的一致性 | - - -## 实现 - -参考[Inference implementation](01.inference_implementation.md) diff --git a/doc/v2/design/interface/01.inference_implementation.md b/doc/v2/design/interface/01.inference_implementation.md deleted file mode 100644 index 9820284523246a062581f322616d196f575c9d29..0000000000000000000000000000000000000000 --- a/doc/v2/design/interface/01.inference_implementation.md +++ /dev/null @@ -1,131 +0,0 @@ -# C-API 模型推断实现文档 - -本文档描述Paddle C-API的实现细节。Paddle C-API是多语言API的基础部分。Paddle需要暴露的API很多。先实现模型推断的API,通过模型推断API的实现作为一个样例,来进行讨论。至于为什么需要C-API,请参考[Why Plain C](./00.why_plain_c.md)。 - -## Table of Contents - * [C-API 模型推断实现文档](#c-api-模型推断实现文档) - * [暴露接口原则](#暴露接口原则) - * [目录结构](#目录结构) - * [实现方式](#实现方式) - * [capi.h](#capih) - * [具体某种类型的头文件](#具体某种类型的头文件) - * [capi_private.h](#capi_privateh) - * [具体某种类型的实现文件](#具体某种类型的实现文件) - * [libpaddle_capi_shared.{so, dylib}](#libpaddle_capi_sharedso-dylib) - * [libpaddle_capi_whole.a](#libpaddle_capi_wholea) - * [examples](#examples) - * [编译选项](#编译选项) - - -## 暴露接口原则 - -1. 所有的接口均为C接口。即使用`extern "C"` -2. 除构造某种类型的函数(`paddle_matrix_create`等),其他函数均返回`paddle_error`。且调用时不能抛出异常或出现运行时错误。 -3. 所有类型名为`paddle_类型名`,所有与类型相关的函数,函数名为`paddle_类型名_函数名` -4. 如果某一个Paddle Core概念(GradientMachine/Matrix)需要被暴露到其他语言,那么 - * 为了暴露的接口尽量简单。只暴露概念的接口,而不暴露概念的实现。即暴露`GradientMachine`或者`Matrix`但不暴露`RecurrentGradientMachine`和`CpuSparseMatrix`。 - * 暴露这个概念必要函数。`必要`是指,即完成某一个任务的最少函数。 -5. 不在`capi`接口层做过多封装。 - * 如果某一个Paddle概念必须要暴露,但是又过于琐碎。不在`capi`这一层进行封装,而是直接修改Paddle Core。让Paddle核心中,这一概念不再琐碎。 - - -## 目录结构 - -```text -Paddle - `-- paddle - `-- capi - `-- examples # The example project for C-API. - `-- tests # unittests for C-API - `-- capi.h # C-API header file. - `-- capi_private.h # The shared header file between implementation sources. - `-- matrix.{h, cpp} - `-- gradient_machine.{h, cpp} - `-- ... -``` - - -Paddle的C-API目录结构如上图表所示。这个目录中除了`capi_private.h`之外的所有头文件,均会被安装到include/paddle路径下。C-API生成的二进制文件会被安装到`lib`目录下。即,安装后的目录结构为 - -```text -`-- include - `-- paddle - `-- capi.h - `-- matrix.h - `-- gradient_machine.h - `-- ... -`-- lib - `-- libpaddle_capi_shared.{so, dylib} # In mac, dynamic libary's file name extention is `dylib` - `-- libpaddle_capi_whole.a # static library for all symbols of Paddle. -``` - -## 实现方式 - -下面分别介绍某一类文件的实现方式。 - -### capi.h - -`capi.h`是用户使用C-API时所唯一需要引入的头文件。在`capi.h`中,引入了类型的头文件,`matrix.h`, `gradient_machine.h`。在引入其他类型的头文件时,使用相对路径的引用方式。即`#include "matrix.h"` - -### 具体某种类型的头文件 - -具体某种类型的头文件,即例如`matrix.h`,`gradient_machine.h`等。在这些头文件中,包含了某种类型的类型定义和暴露的全部函数。 - -这个头文件不假设其他文件的引用顺序,即使用户直接引用某种类型的头文件,也不应该报错(虽然不鼓励这样)。如果某一个类型需要引用另一个类型,例如`gradient_machine`需要引用`matrix`,则直接引入另一种类型的头文件,即`#include "matrix.h"`。 - -### capi_private.h - -`capi_prviate.h`是各个实现中共享的头文件,他主要包含了实际暴露的类型结构。在用户使用C-API时,Paddle的类型全部退化成`void *`,即`typedef paddle_matrix void*`。但,对于每种C-API暴露的类型,均是在`capi_private.h`中实现的结构体。 - -```cpp -struct CMatrix { - int type = MatrixType; - std::shared_ptr mat; -}; -``` - -通常,这个结构体包含两个项目。 - -* `type`是一个类型的标志。对于每种类型,type字段均不尽相同。这样,即使C-API接受的类型全是`void *`,我们也可以确定每一个参数的类型。 - - ```cpp - void some_c_api_function(void* some_instance) { - int* type = (int *) some_instance; - switch (*type) { - case MatrixType: - CMatrix* mat = (CMatrix *) some_instance; - ... - ... - } - } - ``` -* 这个结构体中的另一个项目是,Paddle Core中这一类型接口的智能指针(shared_ptr)。 - * 使用智能指针的原因是: 用户可以安全的释放某个C-API的实例,而不必在意Paddle Core是否还在使用这个实例。 - * 例如,用户通过C-API获得了神经网络的参数实例。当用户使用完这个参数后,直接删除这个参数即可。即便Paddle Core中的模型还在使用这个参数,这个参数也不会一并删除。 - -### 具体某种类型的实现文件 - -具体某种类型的实现文件,即`matrix.cpp`, `gradient_machine.cpp`等文件。在这些文件中,使用C++ 11实现了C-API的接口,并且使用`extern "C"`导出这些接口。在实现过程中,对输入参数的安全性进行了必要的判断,并将C-API接口的参数转发给`Paddle Core`。 - -### libpaddle\_capi_shared.{so, dylib} - -`libpaddle_capi_shared`是C-API导出的动态库。这个动态库的连接参数与Paddle的其他二进制(例如`paddle_trainer`)类似。用户可以直接使用这个动态库来引入Paddle C-API。具体使用方法为`-lpaddle_capi_shared`。 - -### libpaddle\_capi_whole.a - -`libpaddle_capi_whole`是C-API导出的静态库。这个静态库包含了Paddle的全部符号。他是将`libpaddle_gserver.a`, `libpaddle_math.a`, `libpaddle_capi.a`等全部静态库中的目标文件全部打包后产生的文件。具体使用方法为`--whole-archive -lpaddle_capi_whole --no-whole-archive`。 - - -### examples - -在样例中,使用`C99`开发了模型预测的样例代码。具体请参考[example/README.md](../../../paddle/capi/examples/README.md)。 - -## 编译选项 - -C-API的编译选项默认关闭,打开这个编译选项,需要在cmake的时候,设置 - -```bash -cmake ${YOUR_SOURCE_ROOT} -DWITH_C_API=ON -DWITH_PYTHON=OFF -DWITH_SWIG_PY=OFF -``` - -编译C-API的时候推荐Paddle不嵌入Python解释器,也不生成`SWIG`接口,具体原因参考[Why Plain C](./00.why_plain_c.md)。 diff --git a/doc/v2/design/interface/index_cn.rst b/doc/v2/design/interface/index_cn.rst deleted file mode 100644 index 2509a5c5f4182d8ce3a16a3b7bd92c0d7bf5b056..0000000000000000000000000000000000000000 --- a/doc/v2/design/interface/index_cn.rst +++ /dev/null @@ -1,7 +0,0 @@ -多语言接口 ------------- - -.. toctree:: - :maxdepth: 1 - - 00.why_plain_c.md diff --git a/doc/v2/design/interface/index_en.rst b/doc/v2/design/interface/index_en.rst deleted file mode 100644 index 356e58c39c5ef6ee5ee50ab999b85f88628bfb85..0000000000000000000000000000000000000000 --- a/doc/v2/design/interface/index_en.rst +++ /dev/null @@ -1,7 +0,0 @@ -Multilingual Interface ------------------------ - -.. toctree:: - :maxdepth: 1 - - 00.why_plain_c.md diff --git a/doc/v2/design/mkl/image/engine.png b/doc/v2/design/mkl/image/engine.png deleted file mode 100644 index 1f5f65c2cc765a514a3ba9e7b7f468e1dc4b0c3b..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/mkl/image/engine.png and /dev/null differ diff --git a/doc/v2/design/mkl/image/gradients.png b/doc/v2/design/mkl/image/gradients.png deleted file mode 100644 index f031bcf8e4cec14e63075b8b9d2c7bbd9f1b1a3c..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/mkl/image/gradients.png and /dev/null differ diff --git a/doc/v2/design/mkl/image/layers.png b/doc/v2/design/mkl/image/layers.png deleted file mode 100644 index 306f79b7a844610915eb8944128f57d2b7a3065a..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/mkl/image/layers.png and /dev/null differ diff --git a/doc/v2/design/mkl/image/matrix.png b/doc/v2/design/mkl/image/matrix.png deleted file mode 100644 index c33ce9cf0335e47cc8c1253304d0fe179186e6f2..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/mkl/image/matrix.png and /dev/null differ diff --git a/doc/v2/design/mkl/image/overview.png b/doc/v2/design/mkl/image/overview.png deleted file mode 100644 index 8fb7bbb9dd654bf363d701d0c8cd4a557043d188..0000000000000000000000000000000000000000 Binary files a/doc/v2/design/mkl/image/overview.png and /dev/null differ diff --git a/doc/v2/design/mkl/mkl_packed.md b/doc/v2/design/mkl/mkl_packed.md deleted file mode 100644 index 0123315ad4368e68b377f66119949bfd6c1c7860..0000000000000000000000000000000000000000 --- a/doc/v2/design/mkl/mkl_packed.md +++ /dev/null @@ -1,108 +0,0 @@ -# Intel® MKL Packed on PaddlePaddle: Design Doc - - -## Contents - -- [Overview](#overview) -- [Key Points](#key-points) - - [Background](#background) - - [Solution](#solution) -- [Actions](#actions) - - [CMake](#cmake) - - [Layers](#layers) - - [Unit Tests](#unit-tests) - - [Python API](#python-api) - - [Benchmarking](#benchmarking) - - -## Overview -我们计划将 Intel® MKL 中引入的 GEMM Packed APIs\[[1](#references)\] 集成到 PaddlePaddle 中,充分发挥英特尔平台的优势,有效提升PaddlePaddle在英特尔架构上的性能。 -现阶段的优化主要针对 Recurrent Neural Network(以下简称RNN)相关层(包括`RecurrentLayer`, `GatedRecurrentLayer`和`LstmLayer`), 以及 PaddlePaddle V1 API。 - -## Key Points - -### Background -目前PaddlePaddle采用了 Intel® MKL库的[cblas_?gemm](https://software.intel.com/en-us/mkl-developer-reference-c-cblas-gemm)函数,这个函数本身会在计算前将原数据转换为更适合英特尔平台的内部格式。 - -1. 转换耗时 \ -这一数据格式的转换操作(Packing),在问题本身的计算量比较小的时候,显得相对来说较为耗时。例如在DeepSpeech2 \[[2](#references)\] 的Vanilla RNN部分中,矩阵大小是`batch_size * 2048`。 -2. 转换冗余 \ -由于在现有的某些情况下(例如RNN),多次调用 cblas_?gemm 会使用相同的原数据,因此,每次调用时对原数据的重复Packing便成为了冗余。 - -为了最大程度减少多次调用 cblas_?gemm 在Packing上的耗时,Intel® MKL 引入了以下四个API: - * [cblas_?gemm_alloc](https://software.intel.com/en-us/mkl-developer-reference-c-cblas-gemm-alloc) - * [cblas_?gemm_pack](https://software.intel.com/en-us/mkl-developer-reference-c-cblas-gemm-pack) - * [cblas_?gemm_compute](https://software.intel.com/en-us/mkl-developer-reference-c-cblas-gemm-compute) - * [cblas_?gemm_free](https://software.intel.com/en-us/mkl-developer-reference-c-cblas-gemm-free) - -通过使用这些API,我们可以先完成对原数据的Packing操作,再把已转换为Packed格式的数据传递给那些复用同一数据的gemm_compute函数,从而避免了Packing冗余。 - -### Solution -在RNN的情况下,同一次前向、后向(forward/backward)过程中所有时间步(time step)共享同一个权重(weight)。当只做推断(inference)时,各次前向之间也都使用了相同的权重,没有必要在每次前向中每个时间步的计算时对权重进行重复的Packing操作。 - -我们通过使用新引入的GEMM Packed APIs,在层初始化的时候,先完成对权重的Packing操作,然后在前向,后向时复用已经转换过的权重,并在每次权重更新后,对新的权重进行转换用于下次迭代。 - -* 优化前,对于序列长度(sequence length)为`T`的网络模型(model), `N`次迭代执行的转换次数为: - - `inference`: `N * T` - - `training`: `2 * N * T` -* 优化后,对于同样设置的网络模型,其转换次数减少至: - - `inference`: `1` - - `training`: `2 * N` - -## Actions - -添加的相关文件和目录结构如下: - -```txt -PaddlePaddle/Paddle -├── ... -└── paddle/ - ├── ... - └── gserver/ - ├── ... - ├── layers/ - │ ├── ... - │ ├── MKLPackedRecurrentLayer.* - | ├── MKLPackedGatedRecurrentLayer.* - | ├── MKLPackedLstmLayer.* - | └── MKLPackedGemm.h - └── tests/ - ├── ... - └── test_MKLPacked.cpp -``` - -### CMake -在对应的`CMakeLists.txt`中根据`WITH_MKL`是否打开,来决定是否开启MKL Packed相关功能。 - -### Layers -所有的`MKLPacked*Layer`都继承于PaddlePaddle的基类`Layer`, 并添加头文件 `MKLPackedGemm.h`,该文件对相关GEMM Packed APIs做了封装。 - -### Unit Tests -我们会添加`test_MKLPacked.cpp`用于MKL Packed优化后layer的测试。 -对于每一个新加的RNN layer,我们会对比如下2个方面: -1. 对比优化后layer自身,sequence mode(`rnn_use_batch=false`)与batch mode(`rnn_use_batch=true`)的结果。 -2. 对比优化后layer与相对应的PaddlePaddle原有layer, 在batch mode下的结果。 - -### Python API -计划在`paddle/utils.Flags`中添加`use_mkl_packed`的flag,用于选择是否使用相关功能,并且当编译时`WITH_MKL=ON`的情况下,默认设置为`true`。 - -同时,在`python/paddle/trainer/config_parser.py`中对应的layer处,添加`use_mkl_packed`这个选择,方便用户在Python端选择是否启用这个功能。 - -具体实现方式比如: - -```python -use_mkl_packed = bool(int(g_command_config_args.get("use_mkl_packed", 0))) -if use_mkl_packed: - self.layer_type = mkl_packed_* -``` - -所有相关的`layer_type`会以*mkl_packed_*开头,这些会在`MKLPacked*Layer`注册layer的时候保证,以示区分。 - - -### Benchmarking -会添加相应的脚本用于测试和对比在使用MKL Packed recurrent layers 前后的网络性能。 - -## References -1. [Introducing the new Packed APIs for GEMM](https://software.intel.com/en-us/articles/introducing-the-new-packed-apis-for-gemm) -2. [DeepSpeech2 on PaddlePaddle](https://github.com/PaddlePaddle/DeepSpeech#deepspeech2-on-paddlepaddle) - diff --git a/doc/v2/design/mkl/mkldnn.md b/doc/v2/design/mkl/mkldnn.md deleted file mode 100644 index 4876de0045979be20fa45bdc84d2594516f71c03..0000000000000000000000000000000000000000 --- a/doc/v2/design/mkl/mkldnn.md +++ /dev/null @@ -1,237 +0,0 @@ -# Intel® MKL-DNN on PaddlePaddle: Design Doc - -我们计划将英特尔深度神经网络数学库[Intel MKL-DNN](https://github.com/01org/mkl-dnn) -(Intel Math Kernel Library for Deep Neural Networks)集成到PaddlePaddle, -充分展现英特尔平台的优势,有效提升PaddlePaddle在英特尔架构上的性能。 - -
-
-Figure 1. PaddlePaddle on IA -
- -近期目标 - -- 完成常用Layer的MKL-DNN实现。 -- 完成常见深度神经网络VGG,GoogLeNet 和 ResNet的MKL-DNN实现。 - -目前的优化,主要针对PaddlePaddle在重构之前的代码框架以及V1的API。 -具体的完成状态可以参见[这里](https://github.com/PaddlePaddle/Paddle/projects/21)。 - -## Contents - -- [Overview](#overview) -- [Actions](#actions) - - [CMake](#cmake) - - [Matrix](#matrix) - - [Layers](#layers) - - [Activations](#activations) - - [Parameters](#parameters) - - [Gradients](#gradients) - - [Unit Tests](#unit-tests) - - [Python API](#python-api) - - [Benchmarking](#benchmarking) - - [Others](#others) -- [Design Concerns](#design-concerns) - -## Overview - -我们会把MKL-DNN会作为第三方库集成进PaddlePaddle,与其他第三方库一样,会在编译PaddlePaddle的时候下载并编译MKL-DNN。 - -同时,为了进一步提升PaddlePaddle在基本数学运算的计算速度,我们也将MKLML即(MKL small library\[[1](#references)\]) -作为另一个第三方库集成进PaddlePaddle,它只会包括生成好的动态库和头文件。 - -MKL,MKLML以及MKL-DNN三者关系如下表: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
NameOpen SourceLicenseDescriptions
MKLNoProprietaryAccelerate math processing routines
MKLMLNoProprietarySmall package of MKL, especially for Machine Learning
MKL-DNNYesApache 2.0Accelerate primitives processing routines especially for Deep Neural Networks
- -MKLML可以与MKL-DNN共同使用,以此达到最好的性能。 - -
-
-Figure 2. PaddlePaddle with MKL Engines -
- -## Actions - -添加的相关文件和目录结构如下: - -```txt -PaddlePaddle/Paddle -├── ... -├── cmake/ -│ ├── external/ -│ │ ├── ... -│ │ ├── mkldnn.cmake -│ │ └── mklml.cmake -└── paddle/ - ├── ... - ├── math/ - │ ├── ... - │ └── MKLDNNMatrix.* - └── gserver/ - ├── ... - ├── layers/ - │ ├── ... - │ └── MKLDNN*Layer.* - ├── activations/ - │ ├── ... - │ └── MKLDNNActivations.* - └── tests/ - ├── ... - ├── MKLDNNTester.* - └── test_MKLDNN.cpp -``` - -### CMake -在`CMakeLists.txt`中提供一个与MKL有关的总开关:`WITH_MKL`,它负责决定编译时是否使用MKLML和MKL-DNN - -- `WITH_MKLML` 控制是否使用MKLML库。 -当打开`WITH_MKL`时,会自动使用MKLML库作为PaddlePaddle的CBLAS和LAPACK库,同时会开启Intel OpenMP用于提高MKLML的性能。 -编译时会把对应的头文件和库放在`build/third_party/install/mklml/*`目录下对应的地方。 -MKLML的库目前都是动态库,主要包括`libiomp5.so`和`libmklml_intel.so`。 -- `WITH_MKLDNN` 控制是否使用MKL-DNN。 -当开启`WITH_MKL`时,会自动根据硬件配置[[2](#references)]选择是否编译MKL-DNN。 -编译时会把对应的头文件和库放在`build/third_party/install/mkldnn/*`目录下对应的地方。 -MKL-DNN的库目前只有动态库`libmkldnn.so`。 - -### Matrix -目前在PaddlePaddle中数据都是以`NCHW`的格式存储,但是在MKL-DNN中的排列方式不止这一种。 -所以我们定义了一个`MKLDNNMatrix`用于管理MKL-DNN数据的不同格式以及相互之间的转换。 - -
-
-Figure 3. MKLDNNMatrix -
- -### Layers -所有MKL-DNN的Layers都会继承于`MKLDNNLayer`,该类继承于PaddlePaddle的基类`Layer`。 -在`MKLDNNLayer`中会提供一些必要的接口和函数,并且会写好`forward`和`backward`的基本逻辑, -子类只需要使用定义好的接口,实现具体的函数功能即可。 - -
-
-Figure 4. MKLDNNLayer -
- -每个MKLDNNLayer都包含用于内部存储和外部存储的一系列MKLDNNMatrix: - -- 内部存储(internel memory):`inVal_`,`inGrad_`,`outVal_`和`outGrad_`,分别代表输入数据,输入梯度,输出数据和输出梯度。 -- 外部存储(external memory):都是以ext开头,比如`extInVal_`和`extInGrad_`,它们主要是用于, -当数据格式与PaddlePaddle默认的`NCHW`格式不匹配时,转换内存的工作。 -需要注意的是,PaddlePaddle的activation会直接使用`output_.value`和`output_.grad`, -所以`extOutVal_`和`extOutGrad_`必须分别与`output_.value`和`output_.grad`共享内存, -如果不需要外部存储用于转换,那么对应的内部存储也会与它们共享内存。 -- 转换函数(resetXXX): 包括`resetInValue`,`resetInGrad`,`resetOutValue`和`resetOutGrad`, -表示对输入数据,输入梯度,输出数据和输出梯度的转换。 -这些函数会根据输入参数重新设置内部和外部存储,当然这两者也可以相等,即表示不需要转换。 - -注意:每个`MKLDNNlayer`的子类只需要使用内部存储就可以了,所有外部的转换工作都会在reset系列函数中都准备好。 - -### Activations -在重构前的PaddlePaddle中,激活函数是独立于`Layer`的概念,并且输入输出都是共用一块内存, -所以添加了对应的`MKLDNNActivation`来实现,方式类似于`MKLDNNLayer`。 - -### Parameters -对于有参数的层,我们会保证`MKLDNNLayer`使用的参数与PaddlePaddle申请的buffer共用一块内存。 -如果存在数据排列格式不一样的情况时,我们会在网络训练之前把格式转换为MKL-DNN希望的格式, -在训练结束的时候再保存为PaddlePaddle的格式,但是整个训练过程中不需要任何转换。 -这样既使得最终保存的参数格式与PaddlePaddle一致,又可以避免不必要的转换。 - -### Gradients -由于MKL-DNN的操作都是直接覆盖的形式,也就是说输出的结果不会在原来的数据上累加, -这样带来的好处就是不需要一直清空memory,节省了不必要的操作。 -但是注意的是,当网络出现分支且在`backward`的时候,需要累加不同Layer传过来的梯度。 -所以在`MKLDNNlayer`中实现了一个merge的方法,此时每个小分支的`Input Gradient` -会先临时保存在`MKLDNNMatrix`中,由分支处的Layer负责求和,并把结果放到当前层的`output_.grad`中。 -所以整体上,在实现每个子类的时候就不需要关心分支的事情了。 - -
-
-Figure 5. Merge Gradients -
- -### Unit Tests -我们会添加`test_MKLDNN.cpp`和`MKLDNNTester.*`用于MKL-DNN的测试。 -测试分为每个Layer(或Activation)的单元测试和简单网络的整体测试。 -每个测试会对比PaddlePaddle中CPU算出的结果与MKL-DNN的结果,小于某个比较小的阈值认为通过。 - -### Python API -目前只考虑**v1 API**。 - -计划在`python/paddle/trainer/config_parser.py`里面添加`use_mkldnn`这个选择,方便用户选择使用MKL-DNN的layers。 - -具体实现方式比如: - -```python -use_mkldnn = bool(int(g_command_config_args.get("use_mkldnn", 0))) -if use_mkldnn - self.layer_type = mkldnn_* -``` - -所有MKL-DNN的`layer_type`会以*mkldnn_*开头,这些会在`MKLDNN*Layer`注册layer的时候保证,以示区分。 - -同时,会在`paddle/utils.Flags`中添加一个`use_mkldnn`的flag,用于选择是否使用MKL-DNN的相关功能。 - -### Benchmarking -会添加相应的脚本在[这里](https://github.com/PaddlePaddle/Paddle/tree/develop/benchmark/paddle/image),用于测试和对比在使用MKL-DNN前后的CNN网络性能。 -测试的性能对比结果会在[IntelOptimizedPaddle.md](https://github.com/PaddlePaddle/Paddle/blob/develop/benchmark/IntelOptimizedPaddle.md) - -### Others -1. 如果在使用MKL-DNN的情况下,会把CPU的Buffer对齐为4096,具体可以参考MKL-DNN中的[memory](https://github.com/01org/mkl-dnn/blob/master/include/mkldnn.hpp#L673)。 -2. 深入PaddlePaddle,寻找有没有其他可以优化的可能,进一步优化。比如可能会用OpenMP改进SGD的更新性能。 - -## Design Concerns - -为了更好的符合PaddlePaddle的代码风格\[[3](#references)\],同时又尽可能少的牺牲MKL-DNN的性能\[[4](#references)\]。 - -我们总结出一些特别需要注意的点: - -1. 使用**deviceId_**。为了尽可能少的在父类Layer中添加变量或者函数, -我们决定使用已有的`deviceId_`变量来区分layer的属性,定义`-2`为`MKLDNNLayer`特有的设备ID。 -2. 重写父类Layer的**init**函数,修改`deviceId_`为`-2`,代表这个layer是用于跑在MKL-DNN的环境下。 -3. 创建`MKLDNNBase`,定义一些除了layer和memory相关的类和函数。 -包括MKL-DNN会用到`MKLDNNStream`和`CPUEngine`,和未来可能还会用到`FPGAEngine`等。 -4. 如果MKL-DNN layer的后面接有cpu device,那么就会使`output_.value`与`extOutVal_`共享内存, -同时数据格式就是`NCHW`,这样下一个cpu device就能拿到正确的数据。 -在有普通的CPU layer时, `extOutVal_`和`extOutGrad_`的格式始终是`NCHW`或者`NC`。 - -## References -1. [MKL small library](https://github.com/01org/mkl-dnn#linking-your-application)是[Intel MKL](https://software.intel.com/en-us/mkl)的一个子集。 -主要包括了深度学习相关的数学原语与操作,一般由MKL-DNN在发布[新版本](https://github.com/01org/mkl-dnn/releases)时一起更新。 -2. [MKL-DNN System Requirements](https://github.com/01org/mkl-dnn#system-requirements)。 -目前在PaddlePaddle中,仅会在支持AVX2指令集及以上的机器才使用MKL-DNN。 -3. [原来的方案](https://github.com/PaddlePaddle/Paddle/pull/3096)会引入**nextLayer**的信息。 -但是在PaddlePaddle中,无论是重构前的layer还是重构后的op,都不会想要知道next layer/op的信息。 -4. MKL-DNN的高性能格式与PaddlePaddle原有的`NCHW`不同(PaddlePaddle中的cuDNN部分使用的也是`NCHW`,所以不存在这个问题)。 -所以需要引入一个转换方法,并且只需要在必要的时候转换这种格式,才能更好的发挥MKL-DNN的性能。 diff --git a/doc/v2/dev/contribute_to_paddle_cn.md b/doc/v2/dev/contribute_to_paddle_cn.md deleted file mode 100644 index 3244eedf918b93f9351258f1218dfb2d507c1a9c..0000000000000000000000000000000000000000 --- a/doc/v2/dev/contribute_to_paddle_cn.md +++ /dev/null @@ -1,243 +0,0 @@ -# 如何贡献代码 - -我们真诚地感谢您的贡献,欢迎通过 GitHub 的 fork 和 pull request 流程来提交代码。 - -## 代码要求 -- 代码注释请遵守 [Doxygen](http://www.stack.nl/~dimitri/doxygen/) 的样式。 -- 确保编译器选项 `WITH_STYLE_CHECK` 已打开,并且编译能通过代码样式检查。 -- 所有代码必须具有单元测试。 -- 通过所有单元测试。 -- 请遵守[提交代码的一些约定](#提交代码的一些约定)。 - -以下教程将指导您提交代码。 -## [Fork](https://help.github.com/articles/fork-a-repo/) - -跳转到[PaddlePaddle](https://github.com/PaddlePaddle/Paddle) GitHub首页,然后单击 `Fork` 按钮,生成自己目录下的仓库,比如 。 - -## 克隆(Clone) - -将远程仓库 clone 到本地: - -```bash -➜ git clone https://github.com/USERNAME/Paddle -➜ cd Paddle -``` - - -## 创建本地分支 - -Paddle 目前使用[Git流分支模型](http://nvie.com/posts/a-successful-git-branching-model/)进行开发,测试,发行和维护,具体请参考 [Paddle 分支规范](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/releasing_process.md#paddle-分支规范)。 - -所有的 feature 和 bug fix 的开发工作都应该在一个新的分支上完成,一般从 `develop` 分支上创建新分支。 - -使用 `git checkout -b` 创建并切换到新分支。 - -```bash -➜ git checkout -b my-cool-stuff -``` - -值得注意的是,在 checkout 之前,需要保持当前分支目录 clean,否则会把 untracked 的文件也带到新分支上,这可以通过 `git status` 查看。 - -## 使用 `pre-commit` 钩子 - -Paddle 开发人员使用 [pre-commit](http://pre-commit.com/) 工具来管理 Git 预提交钩子。 它可以帮助我们格式化源代码(C++,Python),在提交(commit)前自动检查一些基本事宜(如每个文件只有一个 EOL,Git 中不要添加大文件等)。 - -`pre-commit`测试是 Travis-CI 中单元测试的一部分,不满足钩子的 PR 不能被提交到 Paddle,首先安装并在当前目录运行它: - -```bash -➜ pip install pre-commit -➜ pre-commit install -``` - -Paddle 使用 `clang-format` 来调整 C/C++ 源代码格式,请确保 `clang-format` 版本在 3.8 以上。 - -注:通过`pip install pre-commit`和`conda install -c conda-forge pre-commit`安装的`yapf`稍有不同的,Paddle 开发人员使用的是`pip install pre-commit`。 - -## 开始开发 - -在本例中,我删除了 README.md 中的一行,并创建了一个新文件。 - -通过 `git status` 查看当前状态,这会提示当前目录的一些变化,同时也可以通过 `git diff` 查看文件具体被修改的内容。 - -```bash -➜ git status -On branch test -Changes not staged for commit: - (use "git add ..." to update what will be committed) - (use "git checkout -- ..." to discard changes in working directory) - - modified: README.md - -Untracked files: - (use "git add ..." to include in what will be committed) - - test - -no changes added to commit (use "git add" and/or "git commit -a") -``` - -## 构建和测试 - -编译 PaddlePaddle 的源码以及生成文档需要多种开发工具。为了方便大家,我们的标准开发流程是把这些工具都装进一个Docker image,称为*开发镜像*,通常名字是 `paddle:latest-dev` 或者 `paddle:[version tag]-dev` 如 `paddle:0.11.0-dev`。然后所有用 `cmake && make` 的地方(比如IDE配置里)都用 `docker run paddle:latest-dev`来代替。 - -如要build这个开发镜像,在源码目录树的根目录中运行: - -```bash -➜ docker build -t paddle:latest-dev . -``` - -随后可以用这个开发镜像开始build PaddlePaddle的源码。比如如果要build一个不依赖GPU,但是支持AVX指令集,并且包括unit tests的PaddlePaddle,可以: - -```bash -➜ docker run -v $(pwd):/paddle -e "WITH_GPU=OFF" -e "WITH_AVX=ON" -e "WITH_TESTING=ON" paddle:latest-dev -``` - -这个过程除了编译PaddlePaddle为 `./build/libpaddle.so`,并且输出一个 `./build/paddle.deb`文件之外,还会输出一个 `build/Dockerfile`。我们只需要运行下面命令把编译好的PaddlePaddle打包成一个*生产镜像*(`paddle:prod`): - -```bash -➜ docker build -t paddle:prod -f build/Dockerfile . -``` - -如果要运行所有的单元测试,可以用如下命令: - -```bash -➜ docker run -it -v $(pwd):/paddle paddle:latest-dev bash -c "cd /paddle/build && ctest" -``` - -关于构建和测试的更多信息,请参见[使用Docker安装运行](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/v2/build_and_install/docker_install_cn.rst)。 - -## 提交(commit) - -接下来我们取消对 README.md 文件的改变,然后提交新添加的 test 文件。 - -```bash -➜ git checkout -- README.md -➜ git status -On branch test -Untracked files: - (use "git add ..." to include in what will be committed) - - test - -nothing added to commit but untracked files present (use "git add" to track) -➜ git add test -``` - -Git 每次提交代码,都需要写提交说明,这可以让其他人知道这次提交做了哪些改变,这可以通过`git commit` 完成。 - -```bash -➜ git commit -CRLF end-lines remover...............................(no files to check)Skipped -yapf.................................................(no files to check)Skipped -Check for added large files..............................................Passed -Check for merge conflicts................................................Passed -Check for broken symlinks................................................Passed -Detect Private Key...................................(no files to check)Skipped -Fix End of Files.....................................(no files to check)Skipped -clang-formater.......................................(no files to check)Skipped -[my-cool-stuff c703c041] add test file - 1 file changed, 0 insertions(+), 0 deletions(-) - create mode 100644 233 -``` - -## 保持本地仓库最新 - -在准备发起 Pull Request 之前,需要同步原仓库()最新的代码。 - -首先通过 `git remote` 查看当前远程仓库的名字。 - -```bash -➜ git remote -origin -➜ git remote -v -origin https://github.com/USERNAME/Paddle (fetch) -origin https://github.com/USERNAME/Paddle (push) -``` - -这里 origin 是我们 clone 的远程仓库的名字,也就是自己用户名下的 Paddle,接下来我们创建一个原始 Paddle 仓库的远程主机,命名为 upstream。 - -```bash -➜ git remote add upstream https://github.com/PaddlePaddle/Paddle -➜ git remote -origin -upstream -``` - -获取 upstream 的最新代码并更新当前分支。 - -```bash -➜ git fetch upstream -➜ git pull upstream develop -``` - -## Push 到远程仓库 - -将本地的修改推送到 GitHub 上,也就是 https://github.com/USERNAME/Paddle。 - -```bash -# 推送到远程仓库 origin 的 my-cool-stuff 分支上 -➜ git push origin my-cool-stuff -``` - -## 建立 Issue 并完成 Pull Request - -建立一个 Issue 描述问题,并记录它的编号。 - -切换到所建分支,然后点击 `New pull request`。 - -screen shot 2017-04-26 at 9 09 28 pm - -选择目标分支: - -screen shot 2017-04-26 at 9 11 52 pm - -在 PR 的描述说明中,填写 `resolve #Issue编号` 可以在这个 PR 被 merge 后,自动关闭对应的 Issue,具体请见 。 - -接下来等待 review,如果有需要修改的地方,参照上述步骤更新 origin 中的对应分支即可。 - -## 删除远程分支 - -在 PR 被 merge 进主仓库后,我们可以在 PR 的页面删除远程仓库的分支。 - -screen shot 2017-04-26 at 9 18 24 pm - -也可以使用 `git push origin :分支名` 删除远程分支,如: - -```bash -➜ git push origin :my-cool-stuff -``` - -## 删除本地分支 - -最后,删除本地分支。 - -```bash -# 切换到 develop 分支 -➜ git checkout develop - -# 删除 my-cool-stuff 分支 -➜ git branch -D my-cool-stuff -``` - -至此,我们就完成了一次代码贡献的过程。 - -## 提交代码的一些约定 - -为了使评审人在评审代码时更好地专注于代码本身,请您每次提交代码时,遵守以下约定: - -1. 请保证Travis-CI 中单元测试能顺利通过。如果没过,说明提交的代码存在问题,评审人一般不做评审。 -2. 提交PUll Request前: - - 请注意commit的数量: - - 原因:如果仅仅修改一个文件但提交了十几个commit,每个commit只做了少量的修改,这会给评审人带来很大困扰。评审人需要逐一查看每个commit才能知道做了哪些修改,且不排除commit之间的修改存在相互覆盖的情况。 - - 建议:每次提交时,保持尽量少的commit,可以通过`git commit --amend`补充上次的commit。对已经Push到远程仓库的多个commit,可以参考[squash commits after push](http://stackoverflow.com/questions/5667884/how-to-squash-commits-in-git-after-they-have-been-pushed)。 - - 请注意每个commit的名称:应能反映当前commit的内容,不能太随意。 -3. 如果解决了某个Issue的问题,请在该PUll Request的**第一个**评论框中加上:`fix #issue_number`,这样当该PUll Request被合并后,会自动关闭对应的Issue。关键词包括:close, closes, closed, fix, fixes, fixed, resolve, resolves, resolved,请选择合适的词汇。详细可参考[Closing issues via commit messages](https://help.github.com/articles/closing-issues-via-commit-messages)。 - -此外,在回复评审人意见时,请您遵守以下约定: - -1. 评审人的每个意见都必须回复(这是开源社区的基本礼貌,别人帮了忙,应该说谢谢): - - 对评审意见同意且按其修改完的,给个简单的`Done`即可; - - 对评审意见不同意的,请给出您自己的反驳理由。 -2. 如果评审意见比较多: - - 请给出总体的修改情况。 - - 请采用[start a review](https://help.github.com/articles/reviewing-proposed-changes-in-a-pull-request/)进行回复,而非直接回复的方式。原因是每个回复都会发送一封邮件,会造成邮件灾难。 diff --git a/doc/v2/dev/contribute_to_paddle_en.md b/doc/v2/dev/contribute_to_paddle_en.md deleted file mode 120000 index c97564d93a7f0a753a23cd97d2467d595bd154ff..0000000000000000000000000000000000000000 --- a/doc/v2/dev/contribute_to_paddle_en.md +++ /dev/null @@ -1 +0,0 @@ -../../../CONTRIBUTING.md \ No newline at end of file diff --git a/doc/v2/dev/index_cn.rst b/doc/v2/dev/index_cn.rst deleted file mode 100644 index aee3c68de05de26df3cd79170fa7f4ecad4bf386..0000000000000000000000000000000000000000 --- a/doc/v2/dev/index_cn.rst +++ /dev/null @@ -1,24 +0,0 @@ -开发标准 -======== -PaddlePaddle遵守如下三个部分的代码和文档规范。 - -PaddlePaddle使用git做版本管理,docker作为构建和测试环境。代码中包含了Cuda, C++, Python, Shell等多种编程语言。语言规范遵守Google C++ Style, Pep-8, 代码库中包含自动化检查工具做风格检查。代码注释需要遵守Doxygen规范,不满足风格要求的代码会编译失败。关于如何使用git, 构建测试及代码开发, 我们提供了如下指南。 - -.. toctree:: - :maxdepth: 1 - - contribute_to_paddle_cn.md - -PaddlePaddle面向国内外用户,包含了中文和英文两部分的文档。设计文档和issue问题描述都推荐使用英文。对于设计文档,重在问题描述,背景阐述,然后才是解决方案。文档由Sphinx生成,因此代码注释也需要符合Sphinx文档标准。推荐本地使用paddlepaddle.org工具编译生成和预览文档,请参阅如下文档。 - -.. toctree:: - :maxdepth: 1 - - write_docs_cn.rst - -PaddlePaddle V2 使用新增Layer方式定义新的操作。组合基础API可以实现多种复杂Layer, 满足绝大多数应用。如需要定制Layer,请参阅如下文档,欢迎提交patch。 - -.. toctree:: - :maxdepth: 1 - - new_layer_cn.rst diff --git a/doc/v2/dev/index_en.rst b/doc/v2/dev/index_en.rst deleted file mode 100644 index cbff313fc5b9468b58159cf2b04e8464f9bebc78..0000000000000000000000000000000000000000 --- a/doc/v2/dev/index_en.rst +++ /dev/null @@ -1,28 +0,0 @@ -Development ------------- - - -PaddlePaddle adheres to the following three sections of code and document specifications. - - -PaddlePaddle uses git for version control and Docker is used for building and testing environment. The code includes Cuda, C++, Python, Shell and other programming languages,which comply with Google C++ Style, Pep-8, and the code base includes style checking by an automatic inspection tool. Code comments need to follow the Doxygen specification. The code that does not meet the style requirements will fail to compile. We provide the following guidelines for the use of Git, build tests and code development. - -.. toctree:: - :maxdepth: 1 - - contribute_to_paddle_en.md - - -PaddlePaddle is well documented in English and Chinese. We recommend using the English version of the documents and problem description. The design documents focus on problem descriptions, backgrounds, and are followed by solutions. As documents are generated by Sphinx, code comments should comply with the Sphinx documentation standard. We recommend to use the paddlepaddle.org tool to compile and generate and preview documents locally. Please refer to: - -.. toctree:: - :maxdepth: 1 - - write_docs_en.rst - -PaddlePaddle V2 defines new operations by adding new Layers. You can implement various complex layers by combining basic APIs to satisfy most applications. If you want to customize layer, please refer to the following, and welcome to propose patch. - -.. toctree:: - :maxdepth: 1 - - new_layer_en.rst diff --git a/doc/v2/dev/new_layer_cn.rst b/doc/v2/dev/new_layer_cn.rst deleted file mode 100644 index e5a14346123d342de0b67757cbbce654bd4180dc..0000000000000000000000000000000000000000 --- a/doc/v2/dev/new_layer_cn.rst +++ /dev/null @@ -1,389 +0,0 @@ -================== -如何实现新的网络层 -================== - -这份教程展示了如何在PaddlePaddle中实现一个自定义的网络层。在这里我们使用全连接层作为例子来展示实现新网络层所需要的四个步骤。 - -1. 推导该层前向和后向传递的方程。 -2. 实现该层的C++类。 -3. 增加梯度检测的单元测试,以保证梯度的正确计算。 -4. 封装该层的Python接口。 - -推导方程 -================ - -首先我们需要推导该网络层的*前向传播*和*后向传播*的方程。前向传播给定输入,计算输出。后向传播给定输出的梯度,计算输入和参数的梯度。 - -下图是一个全连接层的示意图。在全连接层中,每个输出节点都连接到所有的输入节点上。 - -.. image:: src/FullyConnected.jpg - :align: center - :scale: 60 % - -一个网络层的前向传播部分把输入转化为相应的输出。 -全连接层以一个维度为 :math:`D_i` 的稠密向量作为输入,使用一个尺度为 :math:`D_i \times D_o` 的变换矩阵 :math:`W` 把 :math:`x` 映射到一个维度为 :math:`D_o` 的向量,并在乘积结果上再加上维度为 :math:`D_o` 的偏置向量 :math:`b` 。 - -.. math:: - - y = f(W^T x + b) - -其中 :math:`f(.)` 是一个非线性的*激活方程*,例如sigmoid, tanh,以及Relu。 - -变换矩阵 :math:`W` 和偏置向量 :math:`b` 是该网络层的*参数*。一个网络层的参数是在*反向传播*时被训练的。反向传播根据输出的梯度,分别计算每个参数的梯度,以及输入的梯度。优化器则用链式法则来对每个参数计算损失函数的梯度。 - -假设损失函数是 :math:`c(y)` ,那么 - -.. math:: - - \frac{\partial c(y)}{\partial x} = \frac{\partial c(y)}{\partial y} \frac{\partial y}{\partial x} - -假设 :math:`z = W^T x + b` ,那么 - -.. math:: - - \frac{\partial y}{\partial z} = \frac{\partial f(z)}{\partial z} - -PaddlePaddle的base layer类可以自动计算上面的导数。 - -因此,对全连接层来说,我们需要计算: - -.. math:: - - \frac{\partial z}{\partial x} = W, \frac{\partial z_j}{\partial W_{ij}} = x_i, \frac{\partial z}{\partial b} = \mathbf 1 - -其中 :math:`\mathbf 1` 是一个全1的向量, :math:`W_{ij}` 是矩阵 :math:`W` 第i行第j列的数值, :math:`z_j` 是向量 :math:`z` 的第j个值, :math:`x_i` 是向量 :math:`x` 的第i个值。 - -最后我们使用链式法则计算 :math:`\frac{\partial z}{\partial x}` 以及 :math:`\frac{\partial z}{\partial W}` 。计算的细节将在下面的小节给出。 - -实现C++类 -=================== - -一个网络层的C++类需要实现初始化,前向和后向。全连接层的实现位于:code:`paddle/legacy/gserver/layers/FullyConnectedLayer.h`及:code:`paddle/legacy/gserver/layers/FullyConnectedLayer.cpp`。这里我们展示一份简化过的代码。 - -这个类需要继承 :code:`paddle::Layer` 这个基类,并且需要重写基类中的以下几个虚函数: - -- 类的构造函数和析构函数。 -- :code:`init` 函数。用于初始化参数和设置。 -- :code:`forward` 。实现网络层的前向传播。 -- :code:`backward` 。实现网络层的后向传播。 -- :code:`prefetch` 。用来从参数服务器预取参数矩阵相应的行。如果网络层不需要远程稀疏更新,则不需要重写该函数。(大多数网络层不需要支持远程稀疏更新) - - -头文件如下: - -.. code-block:: c++ - - namespace paddle { - /** - * 全连接层的每个输出都连接到上一层的所有的神经元上。 - * 它的输入与经过学习的参数做内积并加上偏置(可选)。 - * - * 配置文件接口是fc_layer。 - */ - - class FullyConnectedLayer : public Layer { - protected: - WeightList weights_; - std::unique_ptr biases_; - - public: - explicit FullyConnectedLayer(const LayerConfig& config) - : Layer(config) {} - ~FullyConnectedLayer() {} - - bool init(const LayerMap& layerMap, const ParameterMap& parameterMap); - - Weight& getWeight(int idx) { return *weights_[idx]; } - - void prefetch(); - void forward(PassType passType); - void backward(const UpdateCallback& callback = nullptr); - }; - } // namespace paddle - -头文件中把参数定义为类的成员变量。我们使用 :code:`Weight` 类作为参数的抽象,它支持多线程更新。该类的实现细节在“实现细节”中详细介绍。 - -- :code:`weights_` 是存有一系列变换矩阵的权重。在当前的实现方式下,网络层可以有多个输入。因此,它可能有不止一个权重。每个权重对应一个输入。 -- :code:`biases_` 是存有偏置向量的权重。 - -全连接层没有网络层配置的超参数。如果一个网络层需要配置的话,通常的做法是将配置存于 :code:`LayerConfig& config` 中,并在类构建函数中把它放入一个类成员变量里。 - -下面的代码片段实现了 :code:`init` 函数。 - -- 首先,所有的 :code:`init` 函数必须先调用基类中的函数 :code:`Layer::init(layerMap, parameterMap);` 。该语句会为每个层初始化其所需要的变量和连接。 -- 之后初始化所有的权重矩阵 :math:`W` 。当前的实现方式下,网络层可以有多个输入。因此,它可能有不止一个权重。 -- 最后,初始化偏置向量。 - - -.. code-block:: c++ - - bool FullyConnectedLayer::init(const LayerMap& layerMap, - const ParameterMap& parameterMap) { - /* 初始化父类 */ - Layer::init(layerMap, parameterMap); - - /* 初始化权重表 */ - CHECK(inputLayers_.size() == parameters_.size()); - for (size_t i = 0; i < inputLayers_.size(); i++) { - // 获得参数尺寸 - size_t height = inputLayers_[i]->getSize(); - size_t width = getSize(); - - // 新建一个权重 - if (parameters_[i]->isSparse()) { - CHECK_LE(parameters_[i]->getSize(), width * height); - } else { - CHECK_EQ(parameters_[i]->getSize(), width * height); - } - Weight* w = new Weight(height, width, parameters_[i]); - - // 将新建的权重加入权重表 - weights_.emplace_back(w); - } - - /* 初始化biases_ */ - if (biasParameter_.get() != NULL) { - biases_ = std::unique_ptr(new Weight(1, getSize(), biasParameter_)); - } - - return true; - } - -实现前向传播的部分有下面几个步骤。 - -- 每个层在其 :code:`forward` 函数的开头必须调用 :code:`Layer::forward(passType);` 。 -- 之后使用 :code:`reserveOutput(batchSize, size);` 为输出分配内存。由于我们支持训练数据有不同的批次大小,所以这一步是必要的。 :code:`reserveOutput` 会相应地改变输出的尺寸。为了保证效率,如果需要扩大矩阵,我们会重新分配内存;如果需要缩减矩阵,我们会继续使用现有的内存块。 -- 之后使用矩阵运算函数来计算 :math:`\sum_i W_i x + b`。:code:`getInput(i).value` 返回第i个输入矩阵。每个输入都是一个 :math:`batchSize \times dim` 的矩阵,每行表示一个批次中的单个输入。对于我们支持的全部矩阵操作,请参考 :code:`paddle/legacy/math/Matrix.h`和:code:`paddle/legacy/math/BaseMatrix.h` 。 -- 最终,使用 :code:`forwardActivation();` 进行激活操作。这会自动进行网络配置中声明的激活操作。 - - -.. code-block:: c++ - - void FullyConnectedLayer::forward(PassType passType) { - Layer::forward(passType); - - /* 若有必要,为output_申请内存 */ - int batchSize = getInput(0).getBatchSize(); - int size = getSize(); - - { - // 设置输出的尺寸 - reserveOutput(batchSize, size); - } - - MatrixPtr outV = getOutputValue(); - - // 对每个输入乘上变换矩阵 - for (size_t i = 0; i != inputLayers_.size(); ++i) { - auto input = getInput(i); - CHECK(input.value) << "The input of 'fc' layer must be matrix"; - i == 0 ? outV->mul(input.value, weights_[i]->getW(), 1, 0) - : outV->mul(input.value, weights_[i]->getW(), 1, 1); - } - - /* 加上偏置向量 */ - if (biases_.get() != NULL) { - outV->addBias(*(biases_->getW()), 1); - } - - /* 激活 */ { - forwardActivation(); - } - } - -实现后向传播的部分有下面几个步骤。 - -- :code:`backwardActivation()` 计算激活函数的梯度。通过 :code:`getOutputGrad()` 来获得输出的梯度,调用该函数后,梯度会就地(不使用额外空间)乘上输出的梯度。 -- 计算偏置的梯度。注意,我们使用 :code:`biases_->getWGrad()` 来得到某个特定参数的梯度矩阵。在一个参数的梯度被更新后,**必须**要调用 :code:`getParameterPtr()->incUpdate(callback);` 。这用于在多线程和多机上更新参数。 -- 最后,计算转换矩阵和输入的梯度,并对相应的参数调用 :code:`incUpdate` 。PaddlePaddle可以通过该机制判断是否已经收集齐所有的梯度,从而可以做一些与计算重叠的工作(例如,网络通信)。 - - -.. code-block:: c++ - - void FullyConnectedLayer::backward(const UpdateCallback& callback) { - /* 对激活求导 */ { - backwardActivation(); - } - - if (biases_ && biases_->getWGrad()) { - biases_->getWGrad()->collectBias(*getOutputGrad(), 1); - - biases_->getParameterPtr()->incUpdate(callback); - } - - bool syncFlag = hl_get_sync_flag(); - - for (size_t i = 0; i != inputLayers_.size(); ++i) { - /* 计算当前层权重的梯度 */ - if (weights_[i]->getWGrad()) { - MatrixPtr input_T = getInputValue(i)->getTranspose(); - MatrixPtr oGrad = getOutputGrad(); - { - weights_[i]->getWGrad()->mul(input_T, oGrad, 1, 1); - } - } - - - /* 计算输入层的偏差 */ - MatrixPtr preGrad = getInputGrad(i); - if (NULL != preGrad) { - MatrixPtr weights_T = weights_[i]->getW()->getTranspose(); - preGrad->mul(getOutputGrad(), weights_T, 1, 1); - } - - { - weights_[i]->getParameterPtr()->incUpdate(callback); - } - } - } - - :code:`prefetch` 函数指出了在训练时需要从参数服务器取出的行。仅在远程稀疏训练时有效。使用远程稀疏方式训练时,完整的参数矩阵被分布在不同的参数服务器上。当网络层用一个批次做训练时,该批次的输入中仅有一个子集是非零的。因此,该层仅需要这些非零样本位置所对应的变换矩阵的那些行。 :code:`prefetch` 表明了这些行的标号。 - -大多数层不需要远程稀疏训练函数。这种情况下不需要重写该函数。 - -.. code-block:: c++ - - void FullyConnectedLayer::prefetch() { - for (size_t i = 0; i != inputLayers_.size(); ++i) { - auto* sparseParam = - dynamic_cast(weights_[i]->getW().get()); - if (sparseParam) { - MatrixPtr input = getInputValue(i); - sparseParam->addRows(input); - } - } - } - -最后,使用 :code:`REGISTER_LAYER(fc, FullyConnectedLayer);` 来注册该层。 :code:`fc` 是该层的标识符, :code:`FullyConnectedLayer` 是该层的类名。 - -.. code-block:: c++ - - namespace paddle { - REGISTER_LAYER(fc, FullyConnectedLayer); - } - -若 :code:`cpp` 被放在 :code:`paddle/legacy/gserver/layers` 目录下,其会自动被加入编译列表。 - - -写梯度检查单元测试 -=============================== - -写梯度检查单元测试是一个验证新实现的层是否正确的相对简单的办法。梯度检查单元测试通过有限差分法来验证一个层的梯度。首先对输入做一个小的扰动 :math:`\Delta x` ,然后观察到输出的变化为 :math:`\Delta y` ,那么,梯度就可以通过这个方程计算得到 :math:`\frac{\Delta y}{\Delta x }` 。之后,再用这个梯度去和 :code:`backward` 函数得到的梯度去对比,以保证梯度计算的正确性。需要注意的是梯度检查仅仅验证了梯度的计算,并不保证 :code:`forward` 和 :code:`backward` 函数的实现是正确的。你需要一些更复杂的单元测试来保证你实现的网络层是正确的。 - -所有网络层的梯度检查单测都位于 :code:`paddle/legacy/gserver/tests/test_LayerGrad.cpp` 。我们建议你在写新网络层时把测试代码放入新的文件中。下面列出了全连接层的梯度检查单元测试。它包含以下几步: - -+ 生成网络层配置。网络层配置包含以下几项: - - 偏置参数的大小。(例子中是4096) - - 层的类型。(例子中是fc) - - 层的大小。(例子中是4096) - - 激活的类型。(例子中是softmax) - - dropout的比例。(例子中是0.1) -+ 配置网络层的输入。在这个例子里,我们仅有一个输入。 - - 输入的类型( :code:`INPUT_DATA` ),可以是以下几种: - - :code:`INPUT_DATA` :稠密向量。 - - :code:`INPUT_LABEL` :整数。 - - :code:`INPUT_DATA_TARGET` :稠密向量,但不用于计算梯度。 - - :code:`INPUT_SEQUENCE_DATA` :含有序列信息的稠密向量。 - - :code:`INPUT_HASSUB_SEQUENCE_DATA` :含有序列信息和子序列信息的稠密向量。 - - :code:`INPUT_SEQUENCE_LABEL` :含有序列信息的整数。 - - :code:`INPUT_SPARSE_NON_VALUE_DATA` :0-1稀疏数据。 - - :code:`INPUT_SPARSE_FLOAT_VALUE_DATA` :浮点稀疏数据。 - - 输入的名字。(例子中是 :code:`layer_0` ) - - 输入的大小。(例子中是8192) - - 非零数字的个数,仅对稀疏数据有效。 - - 稀疏数据的格式,仅对稀疏数据有效。 -+ 对每个输入,都需要调用一次 :code:`config.layerConfig.add_inputs();` 。 -+ 调用 :code:`testLayerGrad` 来做梯度检查。它包含以下参数。 - - 层和输入的配置。(例子中是 :code:`config` ) - - 网络层的类型。(例子中是 :code:`fc` ) - - 梯度检查的输入数据的批次大小。(例子中是100) - - 输入是否是转置的。大多数层需要设置为 :code:`false` 。(例子中是 :code:`false` ) - - 是否使用权重。有些层或者激活需要做归一化以保证它们的输出的和是一个常数。例如,softmax激活的输出的和总是1。在这种情况下,我们不能通过常规的梯度检查的方式来计算梯度。因此我们采用输出的加权和(非常数)来计算梯度。(例子中是 :code:`true` ,因为全连接层的激活可以是softmax) - -.. code-block:: c++ - - void testFcLayer(string format, size_t nnz) { - // Create layer configuration. - TestConfig config; - config.biasSize = 4096; - config.layerConfig.set_type("fc"); - config.layerConfig.set_size(4096); - config.layerConfig.set_active_type("softmax"); - config.layerConfig.set_drop_rate(0.1); - // Setup inputs. - config.inputDefs.push_back( - {INPUT_DATA, "layer_0", 8192, nnz, ParaSparse(format)}); - config.layerConfig.add_inputs(); - LOG(INFO) << config.inputDefs[0].sparse.sparse << " " - << config.inputDefs[0].sparse.format; - for (auto useGpu : {false, true}) { - testLayerGrad(config, "fc", 100, /* trans */ false, useGpu, - /* weight */ true); - } - } - -如果你要为了测试而增加新的文件,例如 :code:`paddle/legacy/gserver/tests/testFCGrad.cpp` ,你需要把该文件加入 :code:`paddle/legacy/gserver/tests/CMakeLists.txt` 中。下面给出了一个例子。当你执行命令 :code:`make tests` 时,所有的单测都会被执行一次。注意,有些层可能需要高精度来保证梯度检查单测正确执行。你需要在配置cmake时将 :code:`WITH_DOUBLE` 设置为 `ON` 。 - -.. code-block:: bash - - add_unittest_without_exec(test_FCGrad - test_FCGrad.cpp - LayerGradUtil.cpp - TestUtil.cpp) - - add_test(NAME test_FCGrad - COMMAND test_FCGrad) - - -实现python封装 -======================== - -python封装的实现使得我们可以在配置文件中使用新实现的网络层。所有的python封装都在 :code:`python/paddle/trainer/config_parser.py` 中。全连接层python封装的例子中包含下面几步: - -- 所有的Python封装都使用 :code:`@config_layer('fc')` 这样的装饰器。网络层的标识符为 :code:`fc` 。 -- 实现构造函数 :code:`__init__` 。 - - 它首先调用基构造函数 :code:`super(FCLayer, self).__init__(name, 'fc', size, inputs=inputs, **xargs)` 。 :code:`FCLayer` 是Python封装的类名。 :code:`fc` 是网络层的标识符。为了封装能够正确工作,这些名字必须要写对。 - - 之后,计算变换矩阵的大小和格式(是否稀疏)。 - -.. code-block:: python - - @config_layer('fc') - class FCLayer(LayerBase): - def __init__( - self, - name, - size, - inputs, - bias=True, - **xargs): - super(FCLayer, self).__init__(name, 'fc', size, inputs=inputs, **xargs) - for input_index in xrange(len(self.inputs)): - input_layer = self.get_input_layer(input_index) - psize = self.config.size * input_layer.size - dims = [input_layer.size, self.config.size] - format = self.inputs[input_index].format - sparse = format == "csr" or format == "csc" - if sparse: - psize = self.inputs[input_index].nnz - self.create_input_parameter(input_index, psize, dims, sparse, format) - self.create_bias_parameter(bias, self.config.size) - -在网络配置中,网络层的细节可以通过下面这些代码片段来指定。这个类的参数包括: - -- :code:`name` 是网络层实例的名字标识符。 -- :code:`type` 是网络层的类型,通过网络层的标识符来指定。 -- :code:`size` 是网络层输出的大小。 -- :code:`bias` 表明这个层的一个实例是否需要偏置。 -- :code:`inputs` 说明这个层的输入,输入是由一个list中的网络层实例的名字组成的。 - -.. code-block:: python - - Layer( - name = "fc1", - type = "fc", - size = 64, - bias = True, - inputs = [Input("pool3")] - ) - -我们建议你为你的Python封装实现一个“助手”,使得搭模型时更方便。具体可以参考 :code:`python/paddle/trainer_config_helpers/layers.py` 。 diff --git a/doc/v2/dev/new_layer_en.rst b/doc/v2/dev/new_layer_en.rst deleted file mode 100644 index ad723738801908a5f48343574c204bdbfc97ee08..0000000000000000000000000000000000000000 --- a/doc/v2/dev/new_layer_en.rst +++ /dev/null @@ -1,390 +0,0 @@ -================ -Write New Layers -================ - -This tutorial will guide you to write customized layers in PaddlePaddle. We will utilize fully connected layer as an example to guide you through the following steps for writing a new layer. - -- Derive equations for the forward and backward part of the layer. -- Implement C++ class for the layer. -- Write gradient check unit test to make sure the gradients are correctly computed. -- Implement Python wrapper for the layer. - -Derive Equations -================ - -First we need to derive equations of the *forward* and *backward* part of the layer. The forward part computes the output given an input. The backward part computes the gradients of the input and the parameters given the the gradients of the output. - -The illustration of a fully connected layer is shown in the following figure. In a fully connected layer, all output nodes are connected to all the input nodes. - -.. image:: src/FullyConnected.jpg - :align: center - :scale: 60 % - -The *forward part* of a layer transforms an input into the corresponding output. -Fully connected layer takes a dense input vector with dimension :math:`D_i`. It uses a transformation matrix :math:`W` with size :math:`D_i \times D_o` to project :math:`x` into a :math:`D_o` dimensional vector, and add a bias vector :math:`b` with dimension :math:`D_o` to the vector. - -.. math:: - - y = f(W^T x + b) - -where :math:`f(.)` is an nonlinear *activation* function, such as sigmoid, tanh, and Relu. - -The transformation matrix :math:`W` and bias vector :math:`b` are the *parameters* of the layer. The *parameters* of a layer are learned during training in the *backward pass*. The backward pass computes the gradients of the output function with respect to all parameters and inputs. The optimizer can use chain rule to compute the gradients of the loss function with respect to each parameter. - -Suppose our loss function is :math:`c(y)`, then - -.. math:: - - \frac{\partial c(y)}{\partial x} = \frac{\partial c(y)}{\partial y} \frac{\partial y}{\partial x} - -Suppose :math:`z = W^T x + b`, then - -.. math:: - - \frac{\partial y}{\partial z} = \frac{\partial f(z)}{\partial z} - -This derivative can be automatically computed by our base layer class. - -Then, for fully connected layer, we need to compute: - -.. math:: - - \frac{\partial z}{\partial x} = W, \frac{\partial z_j}{\partial W_{ij}} = x_i, \frac{\partial z}{\partial b} = \mathbf 1 - -where :math:`\mathbf 1` is an all one vector, :math:`W_{ij}` is the number at the i-th row and j-th column of the matrix :math:`W`, :math:`z_j` is the j-th component of the vector :math:`z`, and :math:`x_i` is the i-th component of the vector :math:`x`. - -Finally we can use chain rule to calculate :math:`\frac{\partial z}{\partial x}`, and :math:`\frac{\partial z}{\partial W}`. The details of the computation will be given in the next section. - -Implement C++ Class -=================== - -The C++ class of the layer implements the initialization, forward, and backward part of the layer. The fully connected layer is at :code:`paddle/legacy/gserver/layers/FullyConnectedLayer.h` and :code:`paddle/legacy/gserver/layers/FullyConnectedLayer.cpp`. We list simplified version of the code below. - -It needs to derive the base class :code:`paddle::Layer`, and it needs to override the following functions: - -- constructor and destructor. -- :code:`init` function. It is used to initialize the parameters and settings. -- :code:`forward`. It implements the forward part of the layer. -- :code:`backward`. It implements the backward part of the layer. -- :code:`prefetch`. It is utilized to determine the rows corresponding parameter matrix to prefetch from parameter server. You do not need to override this function if your layer does not need remote sparse update. (most layers do not need to support remote sparse update) - - -The header file is listed below: - -.. code-block:: c++ - - namespace paddle { - /** - * A layer has full connections to all neurons in the previous layer. - * It computes an inner product with a set of learned weights, and - * (optionally) adds biases. - * - * The config file api is fc_layer. - */ - - class FullyConnectedLayer : public Layer { - protected: - WeightList weights_; - std::unique_ptr biases_; - - public: - explicit FullyConnectedLayer(const LayerConfig& config) - : Layer(config) {} - ~FullyConnectedLayer() {} - - bool init(const LayerMap& layerMap, const ParameterMap& parameterMap); - - Weight& getWeight(int idx) { return *weights_[idx]; } - - void prefetch(); - void forward(PassType passType); - void backward(const UpdateCallback& callback = nullptr); - }; - } // namespace paddle - -It defines the parameters as class variables. We use :code:`Weight` class as abstraction of parameters. It supports multi-thread update. The details of this class will be described in details in the implementations. - -- :code:`weights_` is a list of weights for the transformation matrices. The current implementation can have more than one inputs. Thus, it has a list of weights. One weight corresponds to an input. -- :code:`biases_` is a weight for the bias vector. - -The fully connected layer does not have layer configuration hyper-parameters. If there are some layer hyper-parameters, a common practice is to store it in :code:`LayerConfig& config`, and put it into a class variable in the constructor. - -The following code snippet implements the :code:`init` function. - -- First, every :code:`init` function must call the :code:`init` function of the base class :code:`Layer::init(layerMap, parameterMap);`. This statement will initialize the required variables and connections for each layer. -- The it initializes all the weights matrices :math:`W`. The current implementation can have more than one inputs. Thus, it has a list of weights. -- Finally, it initializes the bias. - - -.. code-block:: c++ - - bool FullyConnectedLayer::init(const LayerMap& layerMap, - const ParameterMap& parameterMap) { - /* Initialize the basic parent class */ - Layer::init(layerMap, parameterMap); - - /* initialize the weightList */ - CHECK(inputLayers_.size() == parameters_.size()); - for (size_t i = 0; i < inputLayers_.size(); i++) { - // Option the parameters - size_t height = inputLayers_[i]->getSize(); - size_t width = getSize(); - - // create a new weight - if (parameters_[i]->isSparse()) { - CHECK_LE(parameters_[i]->getSize(), width * height); - } else { - CHECK_EQ(parameters_[i]->getSize(), width * height); - } - Weight* w = new Weight(height, width, parameters_[i]); - - // append the new weight to the list - weights_.emplace_back(w); - } - - /* initialize biases_ */ - if (biasParameter_.get() != NULL) { - biases_ = std::unique_ptr(new Weight(1, getSize(), biasParameter_)); - } - - return true; - } - -The implementation of the forward part has the following steps. - -- Every layer must call :code:`Layer::forward(passType);` at the beginning of its :code:`forward` function. -- Then it allocates memory for the output using :code:`reserveOutput(batchSize, size);`. This step is necessary because we support the batches to have different batch sizes. :code:`reserveOutput` will change the size of the output accordingly. For the sake of efficiency, we will allocate new memory if we want to expand the matrix, but we will reuse the existing memory block if we want to shrink the matrix. -- Then it computes :math:`\sum_i W_i x + b` using Matrix operations. :code:`getInput(i).value` retrieve the matrix of the i-th input. Each input is a :math:`batchSize \times dim` matrix, where each row represents an single input in a batch. For a complete lists of supported matrix operations, please refer to :code:`paddle/legacy/math/Matrix.h` and :code:`paddle/legacy/math/BaseMatrix.h`. -- Finally it applies the activation function using :code:`forwardActivation();`. It will automatically applies the corresponding activation function specifies in the network configuration. - - -.. code-block:: c++ - - void FullyConnectedLayer::forward(PassType passType) { - Layer::forward(passType); - - /* malloc memory for the output_ if necessary */ - int batchSize = getInput(0).getBatchSize(); - int size = getSize(); - - { - // Settup the size of the output. - reserveOutput(batchSize, size); - } - - MatrixPtr outV = getOutputValue(); - - // Apply the the transformation matrix to each input. - for (size_t i = 0; i != inputLayers_.size(); ++i) { - auto input = getInput(i); - CHECK(input.value) << "The input of 'fc' layer must be matrix"; - i == 0 ? outV->mul(input.value, weights_[i]->getW(), 1, 0) - : outV->mul(input.value, weights_[i]->getW(), 1, 1); - } - - /* add the bias-vector */ - if (biases_.get() != NULL) { - outV->addBias(*(biases_->getW()), 1); - } - - /* activation */ { - forwardActivation(); - } - } - -The implementation of the backward part has the following steps. - -- :code:`backwardActivation()` computes the gradients of the activation. The gradients will be multiplies in place to the gradients of the output, which can be retrieved using :code:`getOutputGrad()`. -- Compute the gradients of bias. Notice that we an use :code:`biases_->getWGrad()` to get the gradient matrix of the corresponding parameter. After the gradient of one parameter is updated, it **MUST** call :code:`getParameterPtr()->incUpdate(callback);`. This is utilize for parameter update over multiple threads or multiple machines. -- Then it computes the gradients of the transformation matrices and inputs, and it calls :code:`incUpdate` for the corresponding parameter. This gives the framework the chance to know whether it has gathered all the gradient to one parameter so that it can do some overlapping work (e.g., network communication) - - -.. code-block:: c++ - - void FullyConnectedLayer::backward(const UpdateCallback& callback) { - /* Do derivation for activations.*/ { - backwardActivation(); - } - - if (biases_ && biases_->getWGrad()) { - biases_->getWGrad()->collectBias(*getOutputGrad(), 1); - - biases_->getParameterPtr()->incUpdate(callback); - } - - bool syncFlag = hl_get_sync_flag(); - - for (size_t i = 0; i != inputLayers_.size(); ++i) { - /* Calculate the W-gradient for the current layer */ - if (weights_[i]->getWGrad()) { - MatrixPtr input_T = getInputValue(i)->getTranspose(); - MatrixPtr oGrad = getOutputGrad(); - { - weights_[i]->getWGrad()->mul(input_T, oGrad, 1, 1); - } - } - - - /* Calculate the input layers error */ - MatrixPtr preGrad = getInputGrad(i); - if (NULL != preGrad) { - MatrixPtr weights_T = weights_[i]->getW()->getTranspose(); - preGrad->mul(getOutputGrad(), weights_T, 1, 1); - } - - { - weights_[i]->getParameterPtr()->incUpdate(callback); - } - } - } - -The :code:`prefetch` function specifies the rows that need to be fetched from parameter server during training. It is only useful for remote sparse training. In remote sparse training, the full parameter matrix is stored distributedly at the parameter server. When the layer uses a batch for training, only a subset of locations of the input is non-zero in this batch. Thus, this layer only needs the rows of the transformation matrix corresponding to the locations of these non-zero entries. The :code:`prefetch` function specifies the ids of these rows. - -Most of the layers do not need remote sparse training function. You do not need to override this function in this case. - -.. code-block:: c++ - - void FullyConnectedLayer::prefetch() { - for (size_t i = 0; i != inputLayers_.size(); ++i) { - auto* sparseParam = - dynamic_cast(weights_[i]->getW().get()); - if (sparseParam) { - MatrixPtr input = getInputValue(i); - sparseParam->addRows(input); - } - } - } - -Finally, you can use :code:`REGISTER_LAYER(fc, FullyConnectedLayer);` to register the layer. :code:`fc` is the identifier of the layer, and :code:`FullyConnectedLayer` is the class name of the layer. - -.. code-block:: c++ - - namespace paddle { - REGISTER_LAYER(fc, FullyConnectedLayer); - } - -If the :code:`cpp` file is put into :code:`paddle/legacy/gserver/layers`, it will be automatically added to the compilation list. - - -Write Gradient Check Unit Test -=============================== - -An easy way to verify the correctness of new layer's implementation is to write a gradient check unit test. Gradient check unit test utilizes finite difference method to verify the gradient of a layer. It modifies the input with a small perturbation :math:`\Delta x` and observes the changes of output :math:`\Delta y`, the gradient can be computed as :math:`\frac{\Delta y}{\Delta x }`. This gradient can be compared with the gradient computed by the :code:`backward` function of the layer to ensure the correctness of the gradient computation. Notice that the gradient check only tests the correctness of the gradient computation, it does not necessarily guarantee the correctness of the implementation of the :code:`forward` and :code:`backward` function. You need to write more sophisticated unit tests to make sure your layer is implemented correctly. - -All the gradient check unit tests are located in :code:`paddle/legacy/gserver/tests/test_LayerGrad.cpp`. You are recommended to put your test into a new test file if you are planning to write a new layer. The gradient test of the gradient check unit test of the fully connected layer is listed below. It has the following steps. - -+ Create layer configuration. A layer configuration can include the following attributes: - - size of the bias parameter. (4096 in our example) - - type of the layer. (fc in our example) - - size of the layer. (4096 in our example) - - activation type. (softmax in our example) - - dropout rate. (0.1 in our example) -+ configure the input of the layer. In our example, we have only one input. - - type of the input (:code:`INPUT_DATA`) in our example. It can be one of the following types - - :code:`INPUT_DATA`: dense vector. - - :code:`INPUT_LABEL`: integer. - - :code:`INPUT_DATA_TARGET`: dense vector, but it does not used to compute gradient. - - :code:`INPUT_SEQUENCE_DATA`: dense vector with sequence information. - - :code:`INPUT_HASSUB_SEQUENCE_DATA`: dense vector with both sequence and sub-sequence information. - - :code:`INPUT_SEQUENCE_LABEL`: integer with sequence information. - - :code:`INPUT_SPARSE_NON_VALUE_DATA`: 0-1 sparse data. - - :code:`INPUT_SPARSE_FLOAT_VALUE_DATA`: float sparse data. - - name of the input. (:code:`layer_0` in our example) - - size of the input. (8192 in our example) - - number of non-zeros, only useful for sparse inputs. - - format of sparse data, only useful for sparse inputs. -+ each inputs needs to call :code:`config.layerConfig.add_inputs();` once. -+ call :code:`testLayerGrad` to perform gradient checks. It has the following arguments. - - layer and input configurations. (:code:`config` in our example) - - type of the layer. (:code:`fc` in our example) - - batch size of the gradient check. (100 in our example) - - whether the input is transpose. Most layers need to set it to :code:`false`. (:code:`false` in our example) - - whether to use weights. Some layers or activations perform normalization so that the sum of their output is a constant. For example, the sum of output of a softmax activation is one. In this case, we cannot correctly compute the gradients using regular gradient check techniques. A weighted sum of the output, which is not a constant, is utilized to compute the gradients. (:code:`true` in our example, because the activation of a fully connected layer can be softmax) - -.. code-block:: c++ - - void testFcLayer(string format, size_t nnz) { - // Create layer configuration. - TestConfig config; - config.biasSize = 4096; - config.layerConfig.set_type("fc"); - config.layerConfig.set_size(4096); - config.layerConfig.set_active_type("softmax"); - config.layerConfig.set_drop_rate(0.1); - // Setup inputs. - config.inputDefs.push_back( - {INPUT_DATA, "layer_0", 8192, nnz, ParaSparse(format)}); - config.layerConfig.add_inputs(); - LOG(INFO) << config.inputDefs[0].sparse.sparse << " " - << config.inputDefs[0].sparse.format; - for (auto useGpu : {false, true}) { - testLayerGrad(config, "fc", 100, /* trans */ false, useGpu, - /* weight */ true); - } - } - -If you are creating a new file for the test, such as :code:`paddle/legacy/gserver/tests/testFCGrad.cpp`, you need to add the file to :code:`paddle/legacy/gserver/tests/CMakeLists.txt`. An example is given below. All the unit tests will run when you execute the command :code:`make tests`. Notice that some layers might need high accuracy for the gradient check unit tests to work well. You need to configure :code:`WITH_DOUBLE` to `ON` when configuring cmake. - -.. code-block:: bash - - add_unittest_without_exec(test_FCGrad - test_FCGrad.cpp - LayerGradUtil.cpp - TestUtil.cpp) - - add_test(NAME test_FCGrad - COMMAND test_FCGrad) - - -Implement Python Wrapper -======================== - -Implementing Python wrapper allows us to use the added layer in configuration files. All the Python wrappers are in file :code:`python/paddle/legacy/trainer/config_parser.py`. An example of the Python wrapper for fully connected layer is listed below. It has the following steps: - -- Use :code:`@config_layer('fc')` at the decorator for all the Python wrapper class. :code:`fc` is the identifier of the layer. -- Implements :code:`__init__` constructor function. - - It first call :code:`super(FCLayer, self).__init__(name, 'fc', size, inputs=inputs, **xargs)` base constructor function. :code:`FCLayer` is the Python wrapper class name, and :code:`fc` is the layer identifier name. They must be correct in order for the wrapper to work. - - Then it computes the size and format (whether sparse) of each transformation matrix as well as the size. - -.. code-block:: python - - @config_layer('fc') - class FCLayer(LayerBase): - def __init__( - self, - name, - size, - inputs, - bias=True, - **xargs): - super(FCLayer, self).__init__(name, 'fc', size, inputs=inputs, **xargs) - for input_index in xrange(len(self.inputs)): - input_layer = self.get_input_layer(input_index) - psize = self.config.size * input_layer.size - dims = [input_layer.size, self.config.size] - format = self.inputs[input_index].format - sparse = format == "csr" or format == "csc" - if sparse: - psize = self.inputs[input_index].nnz - self.create_input_parameter(input_index, psize, dims, sparse, format) - self.create_bias_parameter(bias, self.config.size) - -In network configuration, the layer can be specifies using the following code snippets. The arguments of this class are: - -- :code:`name` is the name identifier of the layer instance. -- :code:`type` is the type of the layer, specified using layer identifier. -- :code:`size` is the output size of the layer. -- :code:`bias` specifies whether this layer instance has bias. -- :code:`inputs` specifies a list of layer instance names as inputs. - -.. code-block:: python - - Layer( - name = "fc1", - type = "fc", - size = 64, - bias = True, - inputs = [Input("pool3")] - ) - -You are also recommended to implement a helper for the Python wrapper, which makes it easier to write models. You can refer to :code:`python/paddle/trainer_config_helpers/layers.py` for examples. diff --git a/doc/v2/dev/src/FullyConnected.jpg b/doc/v2/dev/src/FullyConnected.jpg deleted file mode 100644 index b2241f401434e527f95ee4e0e541a3f2ff78fd1e..0000000000000000000000000000000000000000 Binary files a/doc/v2/dev/src/FullyConnected.jpg and /dev/null differ diff --git a/doc/v2/dev/src/doc_en.png b/doc/v2/dev/src/doc_en.png deleted file mode 100644 index ed6b9178fba91a3bdf45ae797a9924f84146fbc8..0000000000000000000000000000000000000000 Binary files a/doc/v2/dev/src/doc_en.png and /dev/null differ diff --git a/doc/v2/dev/write_docs_cn.rst b/doc/v2/dev/write_docs_cn.rst deleted file mode 100644 index 4231f2bb5cd800c0cd86835b5d07e491fcde4989..0000000000000000000000000000000000000000 --- a/doc/v2/dev/write_docs_cn.rst +++ /dev/null @@ -1,136 +0,0 @@ -############# -如何贡献文档 -############# - -PaddlePaddle的文档包括中英文两个部分。文档都是通过 ``cmake`` 驱动 ``sphinx`` 编译生成的,PaddlePaddle.org工具可以帮助我们实现这一编译过程,并提供更好的预览效果。 - -如何构建文档 -============ - -PaddlePaddle的文档构建有两种方式,分别为使用paddlepaddle.org工具和不使用paddlepaddle.org工具,两种方式都有各自的优点,前者方便预览,后者方便开发者进行调试。这两种方式中又分别有使用docker和不使用docker的两种构建方法。 - -我们建议使用PaddlePaddle.org工具来构建文档。 - -使用PaddlePaddle.org工具 ------------------------- -这个是目前推荐的使用方法。除了可以自动编译文档,还可以直接在网页中预览文档,需要注意的是,采用后续说明的其它方式虽然也可以预览文档,但是文档的样式与官网文档是不一致的,使用PaddlePaddle.org工具进行编译才能产生与官网文档样式一致的预览效果。 - -PaddlePaddle.org工具可以配合Docker使用,需要在系统里先安装好Docker工具包。Docker安装请参考 `Docker的官网 `_ 。安装好Docker之后即可用以下命令启动工具 - -.. code-block:: bash - - mkdir paddlepaddle # Create paddlepaddle working directory - cd paddlepaddle - - # Clone the content repositories - git clone https://github.com/PaddlePaddle/Paddle.git - git clone https://github.com/PaddlePaddle/book.git - git clone https://github.com/PaddlePaddle/models.git - git clone https://github.com/PaddlePaddle/Mobile.git - - # Please specify the working directory through -v - docker run -it -p 8000:8000 -v `pwd`:/var/content paddlepaddle/paddlepaddle.org:latest - -注意: PaddlePaddle.org 会在 -v (volume) 指定的内容存储库运行命令 -之后再用网页连到 http://localhost:8000 就可以在网页上生成需要的文档 -编译后的文件将被存储在工作目录 /.ppo_workspace/content。 - -如果不想使用Docker,你还可以通过运行Django框架直接激活工具的服务器。使用下面的命令来运行它。 - -.. code-block:: bash - - mkdir paddlepaddle # Create paddlepaddle working directory - cd paddlepaddle - - # Clone the content repositories and PaddlePaddle.org - git clone https://github.com/PaddlePaddle/Paddle.git - git clone https://github.com/PaddlePaddle/book.git - git clone https://github.com/PaddlePaddle/models.git - git clone https://github.com/PaddlePaddle/Mobile.git - git clone https://github.com/PaddlePaddle/PaddlePaddle.org.git - - # Please specify the PaddlePaddle working directory. In the current setting, it should be pwd - export CONTENT_DIR= - export ENV='' - cd PaddlePaddle.org/portal/ - pip install -r requirements.txt - python manage.py runserver - -工具服务器将读取环境变量 CONTENT_DIR 搜索代码库。请指定的PaddlePaddle工作目录给环境变量 CONTENT_DIR。 -之后再用网页连到 http://localhost:8000 就可以在网页上生成需要的文档。 -编译后的文件将被存储在工作目录 /.ppo_workspace/content。 - -想了解更多PaddlePaddle.org工具的详细信息,可以 `点击这里 `_ 。 - -不使用PaddlePaddle.org工具 --------------------------- - -使用Docker构建PaddlePaddle的文档,需要在系统里先安装好Docker工具包。Docker安装请参考 `Docker的官网 `_ 。该方法与 `从源码编译PaddlePaddle `_ 相似,通过从源码中构建可用于编译PaddlePaddle文档的Docker镜像并运行,在进入Docker容器后使用源码中的脚本构建PaddlePaddle文档,具体步骤如下: - -.. code-block:: bash - - git clone https://github.com/PaddlePaddle/Paddle.git - cd Paddle - - # 从源码中构建可用于编译PaddlePaddle文档的Docker镜像 - docker build -t paddle:dev . - docker run -it -v $PWD:/paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=OFF" -e "WITH_DOC=ON" paddle:dev /bin/bash - - # 进入Docker容器后使用build.sh脚本构建PaddlePaddle文档 - bash -x /paddle/paddle/scripts/docker/build.sh - -注:上述命令把当前目录(源码根目录)映射为 container 里的 :code:`/paddle` 目录。 - -编译完成后,会产生 ``doc/v2`` 和 ``doc/fluid`` 两个目录,在这两个目录下分别都生成 ``cn/html/`` 、 ``en/html`` 、 ``api/en/html`` 共三个子目录,分别进入这些目录下,执行以下命令: - -.. code-block:: bash - - python -m SimpleHTTPServer 8088 - -在浏览器中输入 http://localhost:8088 就可以看到编译生成的 ``v2`` 和 ``fluid`` 两种版本的中/英文的文档页面和英文的API页面。 - -如果不想使用Docker,也可以使用以下命令直接构建PaddlePaddle文档,即 - -.. code-block:: bash - - git clone https://github.com/PaddlePaddle/Paddle.git - cd Paddle - mkdir -p build - cd build - cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_GPU=OFF -DWITH_MKL=OFF -DWITH_DOC=ON - - # 如果只需要构建使用文档,则执行以下命令 - make -j $processors paddle_docs - - # 如果只需要构建API,则执行以下命令 - make -j $processors paddle_apis - -其中$processors代表启动和CPU核一样多的进程来并行编译,可以根据本机的CPU核数设置相应的值。 - -编译完成后,同样会产生 ``doc/v2`` 和 ``doc/fluid`` 两个目录,如果选择构建文档则会在这两个目录下分别都生成 ``cn/html/`` 、 ``en/html`` 两个子目录,选择构建API则会在这两个目录下分别生成 ``api/en/html`` 目录,分别进入这些子目录下,执行以下命令: - -.. code-block:: bash - - python -m SimpleHTTPServer 8088 - -在浏览器中输入 http://localhost:8088 就可以看到编译生成的 ``v2`` 和 ``fluid`` 两种版本的中/英文的文档页面和英文的API页面。下图为生成的 ``v2`` 英文文档首页示例。注意,示例中由于使用了sphinx的原始主题,所以页面的风格与官网并不一致,但这并不影响开发者进行调试。 - -.. image:: src/doc_en.png - :align: center - :scale: 60 % - -如何书写文档 -============ - -PaddlePaddle文档使用 `sphinx`_ 自动生成,用户可以参考sphinx教程进行书写。 - -如何更新www.paddlepaddle.org -============================ - -更新的文档以PR的形式提交到github中,提交方式参见 `如何贡献文档 `_ 。 -目前PaddlePaddle的develop分支的文档是自动触发更新的,用户可以分别查看最新的 `中文文档 `_ 和 -`英文文档 `_ 。 - - -.. _cmake: https://cmake.org/ -.. _sphinx: http://www.sphinx-doc.org/en/1.4.8/ diff --git a/doc/v2/dev/write_docs_en.rst b/doc/v2/dev/write_docs_en.rst deleted file mode 100644 index 6105455e202e4704aa25f0fd9916b9b61a569702..0000000000000000000000000000000000000000 --- a/doc/v2/dev/write_docs_en.rst +++ /dev/null @@ -1,139 +0,0 @@ -######################## -Contribute Documentation -######################## - -PaddlePaddle's documentation includes both Chinese and English versions. The documentation is built using the ``cmake`` command to drive the ``sphinx`` compiler. The PaddlePaddle.org tool helps us to implement this compilation process and provides better preview results. - -How to build Documentation -=========================== - -PaddlePaddle's documentation is built in two ways: using the PaddlePaddle.org tool and without using it. Both methods have their own advantages. The former facilitates previewing, while the latter facilitates debugging by the developer. We could choose to build the documentation with Docker or without it in each of the above ways. - -We recommend using PaddlePaddle.org tool to build documentation. - -Using PaddlePaddle.org tool ------------------------------ -This is the recommended method to build documentation, because it can automatically compile the documentation and preview the documentation directly in a web page. Note that, although you can preview the documentation in other ways, its style may not be consistent with the official website. Compiling with the PaddlePaddle.org tool produces a preview that will be consistent with the official website documentation style. - -The PaddlePaddle.org tool can be used with Docker and Docker needs to be installed first. Please refer to `Docker's official website `_ on how to install Docker. After installing Docker, you may use the following commands to activate the tool - -.. code-block:: bash - - mkdir paddlepaddle # Create paddlepaddle working directory - cd paddlepaddle - - # Clone the content repositories. You may only clone the contents you need - git clone https://github.com/PaddlePaddle/Paddle.git - git clone https://github.com/PaddlePaddle/book.git - git clone https://github.com/PaddlePaddle/models.git - git clone https://github.com/PaddlePaddle/Mobile.git - - # Please specify the working directory through -v - docker run -it -p 8000:8000 -v `pwd`:/var/content paddlepaddle/paddlepaddle.org:latest - -Note: PaddlePaddle.org will read the content repos specified in the -v (volume) flag of the docker run commands -Use a web browser and navigate to http://localhost:8000. Click the buttons to compile the documentation. -The compiled documentations will be stored in /.ppo_workspace/content - - -If you don't wish to use Docker, you can also activate the tool through Django. Use the following the commands to set up - -.. code-block:: bash - - mkdir paddlepaddle # Create paddlepaddle working directory - cd paddlepaddle - - # Clone the content repositories and PaddlePaddle.org - git clone https://github.com/PaddlePaddle/Paddle.git - git clone https://github.com/PaddlePaddle/book.git - git clone https://github.com/PaddlePaddle/models.git - git clone https://github.com/PaddlePaddle/Mobile.git - git clone https://github.com/PaddlePaddle/PaddlePaddle.org.git - - # Please specify the PaddlePaddle working directory. In the current setting, it should be pwd - export CONTENT_DIR= - export ENV='' - cd PaddlePaddle.org/portal/ - pip install -r requirements.txt - python manage.py runserver - -Specify the PaddlePaddle working directory for the environment variable CONTENT_DIR so that the tool could find where the working directory is. - -Use a web browser and navigate to http://localhost:8000. Click the buttons to compile the documentation -The compiled documentations will be stored in /.ppo_workspace/content - -Please `click here `_ for more information about the PaddlePaddle.org tool. - - -Manually Building the Documentation -------------------------------------- - -Build PaddlePaddle's documentation with Docker,you need to install Docker first. Please refer to `Docker's official website `_ on how to install Docker. This method is quite similar to ` Build From Sources `_ , by constructing, from source code, a docker image that can be used to build PaddlePaddle documentation. Enter the Docker container and use the script ``build.sh`` in the source directory to build the PaddlePaddle documentation. The specific steps are as follows: - -.. code-block:: bash - - git clone https://github.com/PaddlePaddle/Paddle.git - cd Paddle - - # Construct a docker image from source code - docker build -t paddle:dev . - docker run -it -v $PWD:/paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=OFF" -e "WITH_DOC=ON" paddle:dev /bin/bash - - # Use build.sh to build PaddlePaddle documentation - bash -x /paddle/paddle/scripts/docker/build.sh - -Note: The above commands maps the current directory (source root directory) to the :code:`/paddle` directory in the container. - -After compiling, there should be two generated directories: ``doc/v2`` and ``doc/fluid``, where three subdirectories ``cn/html/``, ``en/html`` and ``api/en/html`` are generated. Please enter these directories respectively and execute the following commands: - -.. code-block:: bash - - python -m SimpleHTTPServer 8088 - -Use a web browser and navigate to http://localhost:8000, you could see the compiled ``v2`` 's and ``fluid`` 's Chinese/English documents page and English APIs page. - -If you do not wish to use Docker, you can also use the following commands to directly build the PaddlePaddle documentation. - -.. code-block:: bash - - - git clone https://github.com/PaddlePaddle/Paddle.git - cd Paddle - mkdir -p build - cd build - cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_GPU=OFF -DWITH_MKL=OFF -DWITH_DOC=ON - - # If you only need to build documents, use the following commands - make -j $processors paddle_docs - - # If you only need to build APIs, use the following commands - make -j $processors paddle_apis - -$processors indicates that as many processes as the CPU cores are started to compile in parallel. It should be set according to the number of CPU cores of your machine. - -After compiling, there also should be two generated directories: ``doc/v2`` and ``doc/fluid`` . If you chose to build documents, two subdirectories ``cn/html/`` and ``en/html`` will be generated in both two directories. If you chose to build APIs,a subdirectory ``api/en/html`` will be generated. Please enter these directories respectively and execute the following commands: - -.. code-block:: bash - - python -m SimpleHTTPServer 8088 - -Use a web browser and navigate to http://localhost:8000, you could see the compiled ``v2`` 's and ``fluid`` 's Chinese/English documents page and English APIs page. The following figure is an example of the built ``v2`` 's English documents home page. Note that due to the sphinx's original theme used in the example, the style of the page is not consistent with the official website, but this does not affect the developer's debugging. - -.. image:: src/doc_en.png - :align: center - :scale: 60 % - -How to write Documentation -=========================== - -PaddlePaddle uses `sphinx`_ to compile documentation,Please check sphinx official website for more detail. - -How to update www.paddlepaddle.org -=================================== - -Please create PRs and submit them to github, please check `Contribute Code `_ 。 -PaddlePaddle develop branch will update the documentation once the PR is merged. User may check latest `Chinese Docs `_ and -`English Docs `_ 。 - -.. _cmake: https://cmake.org/ -.. _sphinx: http://www.sphinx-doc.org/en/1.4.8/ diff --git a/doc/v2/faq/build_and_install/index_cn.rst b/doc/v2/faq/build_and_install/index_cn.rst deleted file mode 100644 index 0d644777287aea0a572adb6fa40f498f9c147af7..0000000000000000000000000000000000000000 --- a/doc/v2/faq/build_and_install/index_cn.rst +++ /dev/null @@ -1,224 +0,0 @@ -.. _install_faq: - -################### -编译安装与单元测试 -################### - -.. contents:: - -1. 运行Docker GPU镜像出现 "CUDA driver version is insufficient" ----------------------------------------------------------------- - -用户在使用PaddlePaddle GPU的Docker镜像的时候,常常出现 `Cuda Error: CUDA driver version is insufficient for CUDA runtime version`, 原因在于没有把机器上CUDA相关的驱动和库映射到容器内部。 -具体的解决方法是: - -.. code-block:: bash - - $ export CUDA_SO="$(\ls usr/lib64/libcuda* | xargs -I{} echo '-v {}:{}') $(\ls /usr/lib64/libnvidia* | xargs -I{} echo '-v {}:{}')" - $ export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}') - $ docker run ${CUDA_SO} ${DEVICES} -it paddlepaddle/paddle:latest-gpu - -更多关于Docker的安装与使用, 请参考 `PaddlePaddle Docker 文档 `_ 。 - - -2. CMake源码编译, 找到的PythonLibs和PythonInterp版本不一致 ----------------------------------------------------------------- - -这是目前CMake寻找Python的逻辑存在缺陷,如果系统安装了多个Python版本,CMake找到的Python库和Python解释器版本可能有不一致现象,导致编译PaddlePaddle失败。正确的解决方法是, -用户强制指定特定的Python版本,具体操作如下: - - .. code-block:: bash - - cmake .. -DPYTHON_EXECUTABLE= -DPYTHON_LIBRARY= -DPYTHON_INCLUDE_DIR= - -用户需要指定本机上Python的路径:````, ````, ```` - -3. CMake源码编译,Paddle版本号为0.0.0 --------------------------------------- - -如果运行 :code:`paddle version`, 出现 :code:`PaddlePaddle 0.0.0`;或者运行 :code:`cmake ..`,出现 - -.. code-block:: bash - - CMake Warning at cmake/version.cmake:20 (message): - Cannot add paddle version from git tag - -那么用户需要拉取所有的远程分支到本机,命令为 :code:`git fetch upstream`,然后重新cmake即可。 - -4. paddlepaddle\*.whl is not a supported wheel on this platform. ------------------------------------------------------------------------- - -出现这个问题的主要原因是,没有找到和当前系统匹配的paddlepaddle安装包。最新的paddlepaddle python安装包支持Linux x86_64和MacOS 10.12操作系统,并安装了python 2.7和pip 9.0.1。 - -更新 :code:`pip` 包的方法是\: - -.. code-block:: bash - - pip install --upgrade pip - -如果还不行,可以执行 :code:`python -c "import pip; print(pip.pep425tags.get_supported())"` 获取当前系统支持的python包的后缀, -并对比是否和正在安装的后缀一致。 - -如果系统支持的是 :code:`linux_x86_64` 而安装包是 :code:`manylinux1_x86_64` ,需要升级pip版本到最新; -如果系统支持 :code:`manylinux1_x86_64` 而安装包(本地)是 :code:`linux_x86_64` ,可以重命名这个whl包为 :code:`manylinux1_x86_64` 再安装。 - -5. 编译安装后执行 import paddle.v2 as paddle 报ImportError: No module named v2 ------------------------------------------------------------------------------------------- -先查看一下是否曾经安装过paddle v1版本,有的话需要先卸载: - -pip uninstall py_paddle paddle - -然后安装paddle的python环境, 在build目录下执行 - -pip install python/dist/paddle*.whl && pip install ../paddle/dist/py_paddle*.whl - -6. 遇到“非法指令”或者是“illegal instruction” --------------------------------------------- - -PaddlePaddle使用avx SIMD指令提高cpu执行效率,因此错误的使用二进制发行版可能会导致这种错误,请选择正确的版本。 - -7. python相关的单元测试都过不了 --------------------------------- - -如果出现以下python相关的单元测试都过不了的情况: - -.. code-block:: bash - - 24 - test_PyDataProvider (Failed) - 26 - test_RecurrentGradientMachine (Failed) - 27 - test_NetworkCompare (Failed) - 28 - test_PyDataProvider2 (Failed) - 32 - test_Prediction (Failed) - 33 - test_Compare (Failed) - 34 - test_Trainer (Failed) - 35 - test_TrainerOnePass (Failed) - 36 - test_CompareTwoNets (Failed) - 37 - test_CompareTwoOpts (Failed) - 38 - test_CompareSparse (Failed) - 39 - test_recurrent_machine_generation (Failed) - 40 - test_PyDataProviderWrapper (Failed) - 41 - test_config_parser (Failed) - 42 - test_swig_api (Failed) - 43 - layers_test (Failed) - -并且查询PaddlePaddle单元测试的日志,提示: - -.. code-block:: bash - - paddle package is already in your PYTHONPATH. But unittest need a clean environment. - Please uninstall paddle package before start unittest. Try to 'pip uninstall paddle'. - -解决办法是: - -* 卸载PaddlePaddle包 :code:`pip uninstall paddle`, 清理掉老旧的PaddlePaddle安装包,使得单元测试有一个干净的环境。如果PaddlePaddle包已经在python的site-packages里面,单元测试会引用site-packages里面的python包,而不是源码目录里 :code:`/python` 目录下的python包。同时,即便设置 :code:`PYTHONPATH` 到 :code:`/python` 也没用,因为python的搜索路径是优先已经安装的python包。 - -8. 下载MKLML库失败 ------------------- - -.. code-block:: bash - - make[2]: *** [third_party/mklml/src/extern_mklml-stamp/extern_mklml-download] 错误 4 - make[1]: *** [CMakeFiles/extern_mklml.dir/all] 错误 2 - make[1]: *** 正在等待未完成的任务.... - -原因:网速或SSL链接原因,导致MKLML库下载不成功。 - -解决办法是:手动下载并安装,具体步骤如下。 - -.. code-block:: bash - - // 1. 进入对应的目录 - cd build/third_party/mklml/src/extern_mklml - - // 2. 查看包的大小, 正常情况下是75M,如果小于75M,即下载失败: - du -sh mklml_lnx_2018.0.1.20171007.tgz - - // 3. 手动下载且解压缩,并手动生成download成功标签: - wget --no-check-certificate https://github.com/01org/mkl-dnn/releases/download/v0.11/mklml_lnx_2018.0.1.20171007.tgz -c -O mklml_lnx_2018.0.1.20171007.tgz - tar zxf mklml_lnx_2018.0.1.20171007.tgz - touch ../extern_mklml-stamp/extern_mklml-download - - // 4. 接着编译即可 - -9. 在Mac上无法安装numpy等Python包,权限错误 ------------------- - -Mac上对自带的Python和包有严格的权限保护,最好不要在自带的Python上安装。建议用virtualenv建立一个新的Python环境来操作。 - -virtualenv的基本原理是将机器上的Python运行所需的运行环境完整地拷贝一份。我们可以在一台机器上制造多份拷贝,并在这多个拷贝之间自由切换,这样就相当于在一台机器上拥有了多个相互隔离、互不干扰的Python环境。 - -下面简单介绍下如何用virtualenv为Paddle生成一个专用的Python环境: - -安装virtualenv: -:::::::::::::::: - -virtualenv本身也是Python的一个包,可以用pip进行安装: - -.. code-block:: bash - - sudo -H pip install virtualenv - -由于virtualenv需要安装给系统自带的Python,因此需要使用sudo权限。 - -创建一个新的Python运行环境: -::::::::::::::::::: - -.. code-block:: bash - - virtualenv --no-site-packages paddle - ---no-site-packages 参数表示不拷贝已有的任何第三方包,创造一个完全干净的新Python环境。后面的paddle是我们为这个新创建的环境取的名字。 - -执行完这一步后,当前目录下应该会出现一个名为paddle(或者你取的其他名字)的目录。这个目录里保存了运行一个Python环境所需要的各种文件。 - -启动运行环境: -:::::::::::::::: - -.. code-block:: bash - - source paddle/bin/activate - -执行后会发现命令提示符前面增加了(paddle)字样,说明已经成功启动了名为‘paddle’的Python环境。执行which python,可以发现使用的已经是刚刚创建的paddle目录下的Python。 - -在这个环境中,我们可以自由地进行Paddle的安装、使用和开发工作,无需担心对系统自带Python的影响。 - -退出运行环境: -::::::::::::::: - -直接执行: - -.. code-block:: bash - - deactivate - -可以看到命令提示符前面的(paddle)字样消失。 - -自动启动某一Python环境: -:::::::::::::::: - -如果我们经常使用Paddle,我们每次打开终端后都需要执行一下source paddle/bin/activate来启动环境,比较繁琐。为了简便,可以修改终端的配置文件,来让终端每次启动后自动启动特定的Python环境。 - -执行: - -.. code-block:: bash - - vi ~/.bash_profile - -打开终端配置文件,并在文件的最后添加一行: - -.. code-block:: bash - - source paddle/bin/activate - -保存并关闭文件。 - -这样,每次打开终端时就会自动启动名为‘paddle’的Python环境了。 - -10. 通过pip安装的PaddlePaddle在 :code:`import paddle.fluid` 报找不到 :code:`libmkldnn.so` 或 :code:`libmklml_intel.so` ------------------------------------------------------------------------------------------- -出现这种问题的原因是在导入 :code:`paddle.fluid` 时需要加载 :code:`libmkldnn.so` 和 :code:`libmklml_intel.so`, -但是系统没有找到该文件。一般通过pip安装PaddlePaddle时会将 :code:`libmkldnn.so` 和 :code:`libmklml_intel.so` -拷贝到 :code:`/usr/local/lib` 路径下,所以解决办法是将该路径加到 :code:`LD_LIBRARY_PATH` 环境变量下, -即: :code:`export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH` 。 - -**注意**:如果是在虚拟环境中安装PaddlePaddle, :code:`libmkldnn.so` 和 :code:`libmklml_intel.so` 可能不在 :code:`/usr/local/lib` 路径下。 \ No newline at end of file diff --git a/doc/v2/faq/build_and_install/index_en.rst b/doc/v2/faq/build_and_install/index_en.rst deleted file mode 100644 index 7488ed8137d57785f36b9f1e1ed1269f864960bc..0000000000000000000000000000000000000000 --- a/doc/v2/faq/build_and_install/index_en.rst +++ /dev/null @@ -1,143 +0,0 @@ -.. _install_faq: - -############################### -Compile, Install, and Unit Test -############################### - -.. contents:: - -1. Insufficient CUDA driver version ----------------------------------------------------------------- - -Many users usually face issues like `Cuda Error: CUDA driver version is insufficient for CUDA runtime version` when running the PaddlePaddle GPU Docker image. The cause is that you may not map the local CUDA driver to a container directory. -You can solve the issue by running the following commands: - -.. code-block:: bash - - $ export CUDA_SO="$(\ls usr/lib64/libcuda* | xargs -I{} echo '-v {}:{}') $(\ls /usr/lib64/libnvidia* | xargs -I{} echo '-v {}:{}')" - $ export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}') - $ docker run ${CUDA_SO} ${DEVICES} -it paddlepaddle/paddle:latest-gpu - -For more infomation about Docker's installation and usage, please refer to `PaddlePaddle Docker documentation `_ . - - -2. Version mismatch between PythonLibs and PythonInterpreter ----------------------------------------------------------------- - -It is a common bug when CMake looks up Python. If you install multiple versions of Python, Cmake may find the version mismatch between PythonLibs and PythonInterpreter . You are forced to specify a Python version, as follows. - - .. code-block:: bash - - cmake .. -DPYTHON_EXECUTABLE= -DPYTHON_LIBRARY= -DPYTHON_INCLUDE_DIR= - -You should specify ````, ````, ```` to your local paths. - -3. PaddlePaddle version is 0.0.0 ------------------------------------------------- -This issue would happen when you run the code `paddle version` or `cmake ..` - -.. code-block:: bash - - CMake Warning at cmake/version.cmake:20 (message): - Cannot add paddle version from git tag - -You should pull all remote branches to your local machine with the command :code:`git fetch upstream` and then run :code:`cmake` - -4. paddlepaddle\*.whl is not a supported wheel on this platform. ------------------------------------------------------------------------- - -The primary cause for this issue is that it can not find the correct PaddlePaddle installation package that matches your current system.The latest PaddlePaddle Python installation package supports Linux x86_64 and MacOS 10.12 os including Python2.7 and Pip 9.0.1. - -You can upgrade Pip with the following command\: - -.. code-block:: bash - - pip install --upgrade pip - -If it does not work for you, you can run the command :code:`python -c "import pip; print(pip.pep425tags.get_supported())"` to get the suffix of Python package which your system may support and then compare it with the suffix of your installation. - -If the system supports :code:`linux_x86_64` and the installation package is :code:`manylinux1_x86_64`, you should upgrade pip to the latest - -if the system supports :code:`manylinux_x86_64` and the local installation package is :code:`linux1_x86_64`, you can rename the whl package to :code:`manylinux1_x86_64` and then try again. - - -5. ImportError: No module named v2 ----------------------------------- -Please uninstall Paddle V1 if you have installed it before. - -.. code-block:: bash - - pip uninstall py_paddle paddle - -Then install Python for PaddlePaddle , enter the build directory and run the following commands - -pip install python/dist/paddle*.whl && pip install ../paddle/dist/py_paddle*.whl - -6. Illegal instruction ------------------------ -This issue may be caused by the wrong usage of PaddlePaddle binary version which uses avx SIMD instructions to increase the performance of cpu. Please choose the correct version. - -7. Python unittest fails --------------------------------- - -If the following python unittest testcases fail: - -.. code-block:: bash - - 24 - test_PyDataProvider (Failed) - 26 - test_RecurrentGradientMachine (Failed) - 27 - test_NetworkCompare (Failed) - 28 - test_PyDataProvider2 (Failed) - 32 - test_Prediction (Failed) - 33 - test_Compare (Failed) - 34 - test_Trainer (Failed) - 35 - test_TrainerOnePass (Failed) - 36 - test_CompareTwoNets (Failed) - 37 - test_CompareTwoOpts (Failed) - 38 - test_CompareSparse (Failed) - 39 - test_recurrent_machine_generation (Failed) - 40 - test_PyDataProviderWrapper (Failed) - 41 - test_config_parser (Failed) - 42 - test_swig_api (Failed) - 43 - layers_test (Failed) - -Please check the PaddlePaddle unittest logs which may suggest the following: - -.. code-block:: bash - - paddle package is already in your PYTHONPATH. But unittest need a clean environment. - Please uninstall paddle package before start unittest. Try to 'pip uninstall paddle'. - -The solution is: - -* Remove old PaddlePaddle to make a clean environment for the unit tests. If PaddlePaddle package is already in Python's site-packages, unit tests would refer Python package in site-packages instead of Python package in the :code:`/python` directory of the source directory. Setting :code:`PYTHONPATH` to :code:`/python` is also useless because Python's search path would give the priority to the installed Python package. - - -8. Failed to download the MKLML library ----------------------------------------------- - -.. code-block:: bash - - make[2]: *** [third_party/mklml/src/extern_mklml-stamp/extern_mklml-download] error 4 - make[1]: *** [CMakeFiles/extern_mklml.dir/all] error 2 - make[1]: *** waiting for the unfinished jobs.... - -Cause: The network speed or SSL link causes the MKLML library to download unsuccessfully. - -The solution is: manually download and install, the specific steps are as follows. - -.. code-block:: bash - - // 1. enter the directory - cd build/third_party/mklml/src/extern_mklml - - // 2. check the size of the package, normally 75M, if less than 75M, the download fails - du -sh mklml_lnx_2018.0.1.20171007.tgz - - // 3. manually download and unzip and make the download success tag: - wget --no-check-certificate https://github.com/01org/mkl-dnn/releases/download/v0.11/mklml_lnx_2018.0.1.20171007.tgz -c -O mklml_lnx_2018.0.1.20171007.tgz - tar zxf mklml_lnx_2018.0.1.20171007.tgz - touch ../extern_mklml-stamp/extern_mklml-download - - // 4. then compile - diff --git a/doc/v2/faq/cluster/index_cn.rst b/doc/v2/faq/cluster/index_cn.rst deleted file mode 100644 index e59c1e1a54a0c876d1e6e89f88030de59fb9fc1a..0000000000000000000000000000000000000000 --- a/doc/v2/faq/cluster/index_cn.rst +++ /dev/null @@ -1,17 +0,0 @@ -############### -集群训练与预测 -############### - -.. contents:: - -1. 集群多节点训练,日志中保存均为网络通信类错误 ------------------------------------------------- - -集群多节点训练,日志报错为网络通信类错误,比如 :code:`Connection reset by peer` 等。 -此类报错通常是由于某一个节点的错误导致这个节点的训练进程退出,从而引发其他节点无法连接导致,可以参考下面的步骤排查: - -* 从 :code:`train.log` , :code:`server.log` 找到最早报错的地方,查看是否是其他错误引发的报错(比如FPE,内存不足,磁盘空间不足等)。 - -* 如果发现最早的报错就是网络通信的问题,很有可能是非独占方式执行导致的端口冲突,可以联系OP,看当前MPI集群是否支持resource=full参数提交,如果支持增加此参数提交,并更换job 端口。 - -* 如果当前MPI集群并不支持任务独占模式,可以联系OP是否可以更换集群或升级当前集群。 diff --git a/doc/v2/faq/cluster/index_en.rst b/doc/v2/faq/cluster/index_en.rst deleted file mode 100644 index fa942a09625bef78b28456beeb735272b686e061..0000000000000000000000000000000000000000 --- a/doc/v2/faq/cluster/index_en.rst +++ /dev/null @@ -1,16 +0,0 @@ -############################### -Cluster Training and Prediction -############################### - -.. contents:: - -1. Network connection errors in the log during multi-node cluster training ------------------------------------------------- -There are maybe some errors in the log belonging to network connection problem during multi-node cluster training, for example, :code:`Connection reset by peer`. -This kind of error is usually caused by the abnormal exit of a training process in some node, and the other nodes cannot connect with this node any longer. Steps to troubleshoot the problem are as follows: - -* Find the first error in the :code:`train.log`, :code:`server.log`, check whether other fault casued the problem, such as FPE, lacking of memory or disk. - -* If the first error in server.log says "Address already used", this may be caused by the port conflict of the non-exclusive execution. Connect the sys-admin to check if the current MPI cluster supports jobs submitted with parameter :code:`resource=full`. If the current MPI cluster does not support this parameter, change the server port and try agian. - -* If the current MPI cluster does not support exclusive pattern which allows a process to occupy the whole node, ask the administrator to replace or update the this cluster. diff --git a/doc/v2/faq/index_cn.rst b/doc/v2/faq/index_cn.rst deleted file mode 100644 index 4537c7a481e2efbcfed5fa7be2c81c36e13cd108..0000000000000000000000000000000000000000 --- a/doc/v2/faq/index_cn.rst +++ /dev/null @@ -1,13 +0,0 @@ -FAQ -==== - -本文档对关于PaddlePaddle的一些常见问题提供了解答。如果您的问题未在此处,请您到 `PaddlePaddle社区 `_ 查找答案或直接提 `issue `_ ,我们会及时进行回复。 - -.. toctree:: - :maxdepth: 1 - - build_and_install/index_cn.rst - model/index_cn.rst - parameter/index_cn.rst - local/index_cn.rst - cluster/index_cn.rst diff --git a/doc/v2/faq/index_en.rst b/doc/v2/faq/index_en.rst deleted file mode 100644 index 3fa220792b252617848a1c76bc2be49928e35f64..0000000000000000000000000000000000000000 --- a/doc/v2/faq/index_en.rst +++ /dev/null @@ -1,13 +0,0 @@ -FAQ -==== - -This document provides answers to some of the frequently asked questions about PaddlePaddle. If you have a question that is not covered here, please go to `PaddlePaddle Community `_ , to find an answer or submit new `issue `_ , we will reply in time. - -.. toctree:: - :maxdepth: 1 - - build_and_install/index_en.rst - model/index_en.rst - parameter/index_en.rst - local/index_en.rst - cluster/index_en.rst diff --git a/doc/v2/faq/local/index_cn.rst b/doc/v2/faq/local/index_cn.rst deleted file mode 100644 index c6d3c5bfac5a276e253c248ffd415c7789b20b29..0000000000000000000000000000000000000000 --- a/doc/v2/faq/local/index_cn.rst +++ /dev/null @@ -1,259 +0,0 @@ -############### -本地训练与预测 -############### - -.. contents:: - -1. 如何减少内存占用 -------------------- - -神经网络的训练本身是一个非常消耗内存和显存的工作,经常会消耗数10GB的内存和数GB的显存。 -PaddlePaddle的内存占用主要分为如下几个方面\: - -* DataProvider缓冲池内存(只针对内存) -* 神经元激活内存(针对内存和显存) -* 参数内存 (针对内存和显存) -* 其他内存杂项 - -其中,其他内存杂项是指PaddlePaddle本身所用的一些内存,包括字符串分配,临时变量等等,暂不考虑在内。 - -减少DataProvider缓冲池内存 -++++++++++++++++++++++++++ - -PyDataProvider使用的是异步加载,同时在内存里直接随即选取数据来做Shuffle。即 - -.. graphviz:: - - digraph { - rankdir=LR; - 数据文件 -> 内存池 -> PaddlePaddle训练 - } - -所以,减小这个内存池即可减小内存占用,同时也可以加速开始训练前数据载入的过程。但是,这 -个内存池实际上决定了shuffle的粒度。所以,如果将这个内存池减小,又要保证数据是随机的, -那么最好将数据文件在每次读取之前做一次shuffle。可能的代码为 - -.. literalinclude:: src/reduce_min_pool_size.py - -这样做可以极大的减少内存占用,并且可能会加速训练过程,详细文档参考 :ref:`api_pydataprovider2` 。 - -神经元激活内存 -++++++++++++++ - -神经网络在训练的时候,会对每一个激活暂存一些数据,如神经元激活值等。 -在反向传递的时候,这些数据会被用来更新参数。这些数据使用的内存主要和两个参数有关系, -一是batch size,另一个是每条序列(Sequence)长度。所以,其实也是和每个mini-batch中包含 -的时间步信息成正比。 - -所以做法可以有两种: - -* 减小batch size。 即在网络配置中 :code:`settings(batch_size=1000)` 设置成一个小一些的值。但是batch size本身是神经网络的超参数,减小batch size可能会对训练结果产生影响。 -* 减小序列的长度,或者直接扔掉非常长的序列。比如,一个数据集大部分序列长度是100-200, - 但是突然有一个10000长的序列,就很容易导致内存超限,特别是在LSTM等RNN中。 - -参数内存 -++++++++ - -PaddlePaddle支持非常多的优化算法(Optimizer),不同的优化算法需要使用不同大小的内存。 -例如使用 :code:`adadelta` 算法,则需要使用等于权重参数规模大约5倍的内存。举例,如果参数保存下来的模型目录 -文件为 :code:`100M`, 那么该优化算法至少需要 :code:`500M` 的内存。 - -可以考虑使用一些优化算法,例如 :code:`momentum`。 - -2. 如何加速训练速度 -------------------- - -加速PaddlePaddle训练可以考虑从以下几个方面\: - -* 减少数据载入的耗时 -* 加速训练速度 -* 利用分布式训练驾驭更多的计算资源 - -减少数据载入的耗时 -++++++++++++++++++ - -使用\ :code:`pydataprovider`\ 时,可以减少缓存池的大小,同时设置内存缓存功能,即可以极大的加速数据载入流程。 -:code:`DataProvider` 缓存池的减小,和之前减小通过减小缓存池来减小内存占用的原理一致。 - -.. literalinclude:: src/reduce_min_pool_size.py - -同时 :code:`@provider` 接口有一个 :code:`cache` 参数来控制缓存方法,将其设置成 :code:`CacheType.CACHE_PASS_IN_MEM` 的话,会将第一个 :code:`pass` (过完所有训练数据即为一个pass)生成的数据缓存在内存里,在之后的 :code:`pass` 中,不会再从 :code:`python` 端读取数据,而是直接从内存的缓存里读取数据。这也会极大减少数据读入的耗时。 - - -加速训练速度 -++++++++++++ - -PaddlePaddle支持Sparse的训练,sparse训练需要训练特征是 :code:`sparse_binary_vector` 、 :code:`sparse_vector` 、或者 :code:`integer_value` 的任一一种。同时,与这个训练数据交互的Layer,需要将其Parameter设置成 sparse 更新模式,即设置 :code:`sparse_update=True` - -这里使用简单的 :code:`word2vec` 训练语言模型距离,具体使用方法为\: - -使用一个词前两个词和后两个词,来预测这个中间的词。这个任务的DataProvider为\: - -.. literalinclude:: src/word2vec_dataprovider.py - -这个任务的配置为\: - -.. literalinclude:: src/word2vec_config.py - - -利用更多的计算资源 -++++++++++++++++++ - -利用更多的计算资源可以分为以下几个方式来进行\: - -* 单机CPU训练 - - * 使用多线程训练。设置命令行参数 :code:`trainer_count`。 - -* 单机GPU训练 - - * 使用显卡训练。设置命令行参数 :code:`use_gpu`。 - * 使用多块显卡训练。设置命令行参数 :code:`use_gpu` 和 :code:`trainer_count` 。 - -* 多机训练 - - * 请参考 :ref:`cluster_train` 。 - -3. 如何指定GPU设备 ------------------- - -例如机器上有4块GPU,编号从0开始,指定使用2、3号GPU: - -* 方式1:通过 `CUDA_VISIBLE_DEVICES `_ 环境变量来指定特定的GPU。 - -.. code-block:: bash - - env CUDA_VISIBLE_DEVICES=2,3 paddle train --use_gpu=true --trainer_count=2 - -* 方式2:通过命令行参数 ``--gpu_id`` 指定。 - -.. code-block:: bash - - paddle train --use_gpu=true --trainer_count=2 --gpu_id=2 - - -4. 训练过程中出现 :code:`Floating point exception`, 训练因此退出怎么办? ------------------------------------------------------------------------- - -Paddle二进制在运行时捕获了浮点数异常,只要出现浮点数异常(即训练过程中出现NaN或者Inf),立刻退出。浮点异常通常的原因是浮点数溢出、除零等问题。 -主要原因包括两个方面: - -* 训练过程中参数或者训练过程中的梯度尺度过大,导致参数累加,乘除等时候,导致了浮点数溢出。 -* 模型一直不收敛,发散到了一个数值特别大的地方。 -* 训练数据有问题,导致参数收敛到了一些奇异的情况。或者输入数据尺度过大,有些特征的取值达到数百万,这时进行矩阵乘法运算就可能导致浮点数溢出。 - -这里有两种有效的解决方法: - -1. 设置 :code:`gradient_clipping_threshold` 参数,示例代码如下: - -.. code-block:: python - - optimizer = paddle.optimizer.RMSProp( - learning_rate=1e-3, - gradient_clipping_threshold=10.0, - regularization=paddle.optimizer.L2Regularization(rate=8e-4)) - -具体可以参考 `nmt_without_attention `_ 示例。 - -2. 设置 :code:`error_clipping_threshold` 参数,示例代码如下: - -.. code-block:: python - - decoder_inputs = paddle.layer.fc( - act=paddle.activation.Linear(), - size=decoder_size * 3, - bias_attr=False, - input=[context, current_word], - layer_attr=paddle.attr.ExtraLayerAttribute( - error_clipping_threshold=100.0)) - -完整代码可以参考示例 `machine translation `_ 。 - -两种方法的区别: - -1. 两者都是对梯度的截断,但截断时机不同,前者在 :code:`optimzier` 更新网络参数时应用;后者在激活函数反向计算时被调用; -2. 截断对象不同:前者截断可学习参数的梯度,后者截断回传给前层的梯度; - -除此之外,还可以通过减小学习率或者对数据进行归一化处理来解决这类问题。 - -5. 如何调用 infer 接口输出多个layer的预测结果 ------------------------------------------------ - -* 将需要输出的层作为 :code:`paddle.inference.Inference()` 接口的 :code:`output_layer` 参数输入,代码如下: - -.. code-block:: python - - inferer = paddle.inference.Inference(output_layer=[layer1, layer2], parameters=parameters) - -* 指定要输出的字段进行输出。以输出 :code:`value` 字段为例,代码如下: - -.. code-block:: python - - out = inferer.infer(input=data_batch, field=["value"]) - -需要注意的是: - -* 如果指定了2个layer作为输出层,实际上需要的输出结果是两个矩阵; -* 假设第一个layer的输出A是一个 N1 * M1 的矩阵,第二个 Layer 的输出B是一个 N2 * M2 的矩阵; -* paddle.v2 默认会将A和B 横向拼接,当N1 和 N2 大小不一样时,会报如下的错误: - -.. code-block:: python - - ValueError: all the input array dimensions except for the concatenation axis must match exactly - -多个层的输出矩阵的高度不一致导致拼接失败,这种情况常常发生在: - -* 同时输出序列层和非序列层; -* 多个输出层处理多个不同长度的序列; - -此时可以在调用infer接口时通过设置 :code:`flatten_result=False` , 跳过“拼接”步骤,来解决上面的问题。这时,infer接口的返回值是一个python list: - -* list 中元素的个数等于网络中输出层的个数; -* list 中每个元素是一个layer的输出结果矩阵,类型是numpy的ndarray; -* 每一个layer输出矩阵的高度,在非序列输入时:等于样本数;序列输入时等于:输入序列中元素的总数;宽度等于配置中layer的size; - -6. 如何在训练过程中获得某一个layer的output ------------------------------------------------ - -可以在event_handler中,通过 :code:`event.gm.getLayerOutputs("layer_name")` 获得在模型配置中某一层的name :code:`layer_name` 在当前 -mini-batch forward的output的值。获得的值类型均为 :code:`numpy.ndarray` ,可以通过这个输出来完成自定义的评估指标计算等功能。例如下面代码: - -.. code-block:: python - - def score_diff(right_score, left_score): - return np.average(np.abs(right_score - left_score)) - - def event_handler(event): - if isinstance(event, paddle.event.EndIteration): - if event.batch_id % 25 == 0: - diff = score_diff( - event.gm.getLayerOutputs("right_score")["right_score"][ - "value"], - event.gm.getLayerOutputs("left_score")["left_score"][ - "value"]) - logger.info(("Pass %d Batch %d : Cost %.6f, " - "average absolute diff scores: %.6f") % - (event.pass_id, event.batch_id, event.cost, diff)) - -注意:此方法不能获取 :code:`paddle.layer.recurrent_group` 里step的内容,但可以获取 :code:`paddle.layer.recurrent_group` 的输出。 - -7. 如何在训练过程中获得参数的权重和梯度 ------------------------------------------------ - -在某些情况下,获得当前mini-batch的权重(或称作weights, parameters)有助于在训练时观察具体数值,方便排查以及快速定位问题。 -可以通过在 :code:`event_handler` 中打印其值(注意,需要使用 :code:`paddle.event.EndForwardBackward` 保证使用GPU训练时也可以获得), -示例代码如下: - -.. code-block:: python - - ... - parameters = paddle.parameters.create(cost) - ... - def event_handler(event): - if isinstance(event, paddle.event.EndForwardBackward): - if event.batch_id % 25 == 0: - for p in parameters.keys(): - logger.info("Param %s, Grad %s", - parameters.get(p), parameters.get_grad(p)) - -注意:“在训练过程中获得某一个layer的output”和“在训练过程中获得参数的权重和梯度”都会造成训练中的数据从C++拷贝到numpy,会对训练性能造成影响。不要在注重性能的训练场景下使用。 \ No newline at end of file diff --git a/doc/v2/faq/local/index_en.rst b/doc/v2/faq/local/index_en.rst deleted file mode 100644 index fa95b1753dbe293811d7a8601497ad521fa3ecda..0000000000000000000000000000000000000000 --- a/doc/v2/faq/local/index_en.rst +++ /dev/null @@ -1,248 +0,0 @@ -############################# -Parameter Setting -############################# - -.. contents:: - -1. Reduce Memory Consumption -------------------- - -The training procedure of neural networks demands dozens of gigabytes of host memory or serval gigabytes of device memory, which is a rather memory consuming work. The memory consumed by PaddlePaddle framework mainly includes: -\: - -* Cache memory for DataProvider (only on host memory), -* Memory for neurons' activation information (on both host memory and device memory), -* Memory for parameters (on both host memory and device memory), -* Other memory demands. - -Other memory demands is mainly used to support the running demand of PaddlePaddle framework itself, such as string allocation,temporary variables, which are not considered currently. - -Reduce DataProvider Cache Memory -++++++++++++++++++++++++++ - -PyDataProvider works under asynchronous mechanism, it loads together with the data fetch and shuffle procedure in host memory: - -.. graphviz:: - - digraph { - rankdir=LR; - Data Files -> Host Memory Pool -> PaddlePaddle Training - } - -Thus the reduction of the DataProvider cache memory can reduce memory occupancy, meanwhile speed up the data loading procedure before training. However, the size of the memory pool can actually affect the granularity of shuffle,which means a shuffle operation is needed before each data file reading process to ensure the randomness of data when try to reduce the size of the memory pool. - -.. literalinclude:: src/reduce_min_pool_size.py - -In this way, the memory consumption can be significantly reduced and hence the training procedure can be accelerated. More details are demonstrated in :ref:`api_pydataprovider2`. - -The Neurons Activation Memory -++++++++++++++ - -Each neuron activation operating in a neural network training process contains certain amount of temporary data such as the activation data (like the output value of a neuron). These data will be used to update parameters in back propagation period. The scale of memory consumed by these data is mainly related with two parameters, which are batch size and the length of each Sequence. Therefore, the neurons activation memory consuming is actually in proportion to the information contains in each mini-batch training. - -Two practical ways: - -* Reduce batch size. Set a smaller value in network configuration settings(batch_size=1000) can be helpful. But setting batch size to a smaller value may affect the training result due to it is a super parameter of the neural network itself. -* Shorten the sequence length or cut off those excessively long sequences. For example, if the length of sequences in a dataset are mostly varies between 100 and 200, but there is sequence lengthen out to 10,000, then it’s quite potentially leads to OOM (out of memory), especially in RNN models such as LSTM. - -The Parameters Memory -++++++++ - -The PaddlePaddle framework supports almost all popular optimizers. Different optimizers have different memory requirement. For example, the :code:`adadelta` consumes approximately 5 times memory - -space than the weights parameter’s scale, which means the :code:`adadelta` needs at least :code:`500M` memory if the model file contains all - -parameters needs :code:`100M`. - -Some optimization algorithms such as :code:`momentum` are worth giving a shot. - -2. Tricks To Speed Up Training -------------------- - -The training procedure of PaddlePaddle may be speed up when considering following aspects:\: - -* Reduce the time consumption of data loading -* Speed up training epochs -* Introduce more computing resources with the utilization of distribute training frameworks - -Reduce The Time Consumption of Data Loading -++++++++++++++++++ - - -The \ :code:`pydataprovider`\ holds big potential to speed up the data loading procedure if the cache pool and enable memory cache when use it. The principle of the reduction of :code:`DataProvider` cache pool is basically the same with the method which reduct the memory occupation with the set of a smaller cache pool. - -.. literalinclude:: src/reduce_min_pool_size.py - -Beside, the interface :code:`@provider` provides a parameter :code:`cache` to control cache. If set it to :code:`CacheType.CACHE_PASS_IN_MEM`, the data after the first :code:`pass` ( a pass means all data have be fed into the network for training) will be cached in memory and no new data will be read from the :code:`python` side in following :code:`pass` , instead from the cached data in memory. This strategy can also drop the time consuming in data loading process. - - -Accelerating Training Epochs -++++++++++++ - -Sparse training is supported in PaddlePaddle. The features needs to be trained is any of :code:`sparse_binary_vector`, :code:`sparse_vector` and :code:`integer_value` . Meanwhile, the Layer interacts with the training data need to turn the Parameter to sparse updating mode by setting :code:`sparse_update=True`. -Take :code:`word2vec` as an example, to train a language distance, one needs to predict the middle word with two words prior to it and next to it. The DataProvider of this task is: - -.. literalinclude:: src/word2vec_dataprovider.py - -The configuration of this task is: - -.. literalinclude:: src/word2vec_config.py - -Introduce More Computing Resources -++++++++++++++++++ - -More computing resources can be introduced with following manners: -* Single CPU platform training - - * Use multi-threading by set :code:`trainer_count`。 - -* Single GPU platform training - - * Set :code:`use_gpu` to train on single GPU. - * Set :code:`use_gpu` and :code:`trainer_count` to enable multiple GPU training support. - -* Cluster Training - - * Refer to :ref:`cluster_train` 。 - -3. Assign GPU Devices ------------------- - -Assume a computing platform consists of 4 GPUs which serial number from 0 to 3: - -* Method1: specify a GPU as computing device by set: - `CUDA_VISIBLE_DEVICES `_ - -.. code-block:: bash - - env CUDA_VISIBLE_DEVICES=2,3 paddle train --use_gpu=true --trainer_count=2 - -* Method2: Assign by —gpu_id: - -.. code-block:: bash - - paddle train --use_gpu=true --trainer_count=2 --gpu_id=2 - - -4. How to Fix Training Termination Caused By :code:`Floating point exception` During Training. ------------------------------------------------------------------------- - -Paddle binary catches floating exceptions during runtime, it will be terminated when NaN or Inf occurs. Floating exceptions are mostly caused by float overflow, divide by zero. There are three main reasons may raise such exception: - -* Parameters or gradients during training are oversize, which leads to float overflow during calculation. -* The model failed to converge and diverges to a big value. -* Parameters may converge to a singular value due to bad training data. If the scale of input data is too big and contains millions of parameter values, float overflow error may arise when operating matrix multiplication. - -Two ways to solve this problem: - -1. Set :code:`gradient_clipping_threshold` as: - -.. code-block:: python - - optimizer = paddle.optimizer.RMSProp( - learning_rate=1e-3, - gradient_clipping_threshold=10.0, - regularization=paddle.optimizer.L2Regularization(rate=8e-4)) - -Details can refer to example `nmt_without_attention `_ 示例。 - -2. Set :code:`error_clipping_threshold` as: - -.. code-block:: python - - decoder_inputs = paddle.layer.fc( - act=paddle.activation.Linear(), - size=decoder_size * 3, - bias_attr=False, - input=[context, current_word], - layer_attr=paddle.attr.ExtraLayerAttribute( - error_clipping_threshold=100.0)) - -Details can refer to example `machine translation `_ 。 - -The main difference between these two methods are: - -1. They both block the gradient, but happen in different occasions,the former one happens when then :code:`optimzier` updates the network parameters while the latter happens when the back propagation computing of activation functions. -2. The block target are different, the former blocks the trainable parameters’ gradient while the later blocks the gradient to be propagated to prior layers. - -Moreover, Such problems may be fixed with smaller learning rates or data normalization. - -5. Fetch Multi Layers’ Prediction Result With Infer Interface ------------------------------------------------ - -* Join the layer to be used as :code:`output_layer` layer to the input parameters of :code:`paddle.inference.Inference()` interface with: - -.. code-block:: python - - inferer = paddle.inference.Inference(output_layer=[layer1, layer2], parameters=parameters) - -* Assign certain fields to output. Take :code:`value` as example, it can be down with following code: - -.. code-block:: python - - out = inferer.infer(input=data_batch, field=["value"]) - -It is important to note that: - -* If 2 layers are assigned as output layer, then the output results consists of 2 matrixes. -* Assume the output of first layer A is a matrix sizes N1 * M1, the output of second layer B is a matrix sizes N2 * M2; -* By default, paddle.v2 will transverse join A and B, when N1 not equal to N2, it will raise following error: - -.. code-block:: python - - ValueError: all the input array dimensions except for the concatenation axis must match exactly - -The transverse of different matrixes of multi layers mainly happens when: - -* Output sequence layer and non sequence layer; -* Multiple output layers process multiple sequence with different length; - -Such issue can be avoided by calling infer interface and set :code:`flatten_result=False`. Thus, the infer interface returns a python list, in which - -* The number of elements equals to the number of output layers in the network; -* Each element in list is a result matrix of a layer, which type is numpy.ndarray; -* The height of each matrix outputted by each layer equals to the number of samples under non sequential mode or equals to the number of elements in the input sequence under sequential mode. Their width are both equal to the layer size in configuration. - -6. Fetch the Output of A Certain Layer During Training ------------------------------------------------ - -In event_handler, the interface :code:`event.gm.getLayerOutputs("layer_name")` gives the forward output value organized in :code:`numpy.ndarray` corresponding to :code:`layer_name` in the mini-batch. -The output can be used in custom measurements in following way: - -.. code-block:: python - - def score_diff(right_score, left_score): - return np.average(np.abs(right_score - left_score)) - - def event_handler(event): - if isinstance(event, paddle.event.EndIteration): - if event.batch_id % 25 == 0: - diff = score_diff( - event.gm.getLayerOutputs("right_score")["right_score"][ - "value"], - event.gm.getLayerOutputs("left_score")["left_score"][ - "value"]) - logger.info(("Pass %d Batch %d : Cost %.6f, " - "average absolute diff scores: %.6f") % - (event.pass_id, event.batch_id, event.cost, diff)) - -Note: this function can not get content of :code:`paddle.layer.recurrent_group` step, but output of :code:`paddle.layer.recurrent_group` can be fetched. - -7. Fetch Parameters’ Weight and Gradient During Training ------------------------------------------------ - -Under certain situations, knowing the weights of currently training mini-batch can provide more inceptions of many problems. Their value can be acquired by printing values in :code:`event_handler` (note that to gain such parameters when training on GPU, you should set :code:`paddle.event.EndForwardBackward`). Detailed code is as following: - -.. code-block:: python - - ... - parameters = paddle.parameters.create(cost) - ... - def event_handler(event): - if isinstance(event, paddle.event.EndForwardBackward): - if event.batch_id % 25 == 0: - for p in parameters.keys(): - logger.info("Param %s, Grad %s", - parameters.get(p), parameters.get_grad(p)) - -Note that “acquire the output of a certain layer during training” or “acquire the weights and gradients of parameters during training ” both needs to copy training data from C++ environment to numpy, which have certain degree of influence on training performance. Don’t use these two functions when the training procedure cares about the performance. diff --git a/doc/v2/faq/local/src/reduce_min_pool_size.py b/doc/v2/faq/local/src/reduce_min_pool_size.py deleted file mode 100644 index cba96652f764d26c724ea22697e04572709bf6a4..0000000000000000000000000000000000000000 --- a/doc/v2/faq/local/src/reduce_min_pool_size.py +++ /dev/null @@ -1,21 +0,0 @@ -# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - - -@provider(min_pool_size=0, ...) -def process(settings, filename): - os.system('shuf %s > %s.shuf' % (filename, filename)) # shuffle before. - with open('%s.shuf' % filename, 'r') as f: - for line in f: - yield get_sample_from_line(line) diff --git a/doc/v2/faq/local/src/word2vec_config.py b/doc/v2/faq/local/src/word2vec_config.py deleted file mode 100644 index a5b84e8ed4de5123097026a5c7992b06fd321750..0000000000000000000000000000000000000000 --- a/doc/v2/faq/local/src/word2vec_config.py +++ /dev/null @@ -1,26 +0,0 @@ -# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -... # the settings and define data provider is omitted. -DICT_DIM = 3000 # dictionary dimension. -word_ids = data_layer('word_ids', size=DICT_DIM) - -emb = embedding_layer( - input=word_ids, size=256, param_attr=ParamAttr(sparse_update=True)) -emb_sum = pooling_layer(input=emb, pooling_type=SumPooling()) -predict = fc_layer(input=emb_sum, size=DICT_DIM, act=Softmax()) -outputs( - classification_cost( - input=predict, label=data_layer( - 'label', size=DICT_DIM))) diff --git a/doc/v2/faq/local/src/word2vec_dataprovider.py b/doc/v2/faq/local/src/word2vec_dataprovider.py deleted file mode 100644 index 9fe67b6d6cbbbdc8a98d497f352cf114a882636f..0000000000000000000000000000000000000000 --- a/doc/v2/faq/local/src/word2vec_dataprovider.py +++ /dev/null @@ -1,24 +0,0 @@ -# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -DICT_DIM = 3000 - - -@provider(input_types=[integer_sequence(DICT_DIM), integer_value(DICT_DIM)]) -def process(settings, filename): - with open(filename) as f: - # yield word ids to predict inner word id - # such as [28, 29, 10, 4], 4 - # It means the sentance is 28, 29, 4, 10, 4. - yield read_next_from_file(f) diff --git a/doc/v2/faq/model/index_cn.rst b/doc/v2/faq/model/index_cn.rst deleted file mode 100644 index 6947948bc79f4dba63954c459afb940e3242c405..0000000000000000000000000000000000000000 --- a/doc/v2/faq/model/index_cn.rst +++ /dev/null @@ -1,80 +0,0 @@ -######### -模型配置 -######### - -.. contents:: - -1. 出现 :code:`Duplicated layer name` 错误怎么办 --------------------------------------------------- - -出现该错误的原因一般是用户对不同layer的参数 :code:`name` 设置了相同的取值。遇到该错误时,先找出参数 :code:`name` 取值相同的layer,然后将这些layer的参数 :code:`name` 设置为不同的值。 - -2. :code:`paddle.layer.memory` 的参数 :code:`name` 如何使用 -------------------------------------------------------------- - -* :code:`paddle.layer.memory` 用于获取特定layer上一时间步的输出,该layer是通过参数 :code:`name` 指定,即,:code:`paddle.layer.memory` 会关联参数 :code:`name` 取值相同的layer,并将该layer上一时间步的输出作为自身当前时间步的输出。 - -* PaddlePaddle的所有layer都有唯一的name,用户通过参数 :code:`name` 设定,当用户没有显式设定时,PaddlePaddle会自动设定。而 :code:`paddle.layer.memory` 不是真正的layer,其name由参数 :code:`memory_name` 设定,当用户没有显式设定时,PaddlePaddle会自动设定。:code:`paddle.layer.memory` 的参数 :code:`name` 用于指定其要关联的layer,需要用户显式设定。 - -3. 两种使用 drop_out 的方法有何区别 ------------------------------------- - -* 在PaddlePaddle中使用dropout有两种方式 - - * 在相应layer的 :code:`layer_atter` 设置 :code:`drop_rate`,以 :code:`paddle.layer.fc` 为例,代码如下: - - .. code-block:: python - - fc = paddle.layer.fc(input=input, layer_attr=paddle.attr.ExtraLayerAttribute(drop_rate=0.5)) - - * 使用 :code:`paddle.layer.dropout`,以 :code:`paddle.layer.fc` 为例,代码如下: - - .. code-block:: python - - fc = paddle.layer.fc(input=input) - drop_fc = paddle.layer.dropout(input=fc, dropout_rate=0.5) - -* :code:`paddle.layer.dropout` 实际上使用了 :code:`paddle.layer.add_to`,并在该layer里采用第一种方式设置 :code:`drop_rate` 来使用dropout的。这种方式对内存消耗较大。 - -* PaddlePaddle在激活函数里实现dropout,而不是在layer里实现。 - -* :code:`paddle.layer.lstmemory`、:code:`paddle.layer.grumemory`、:code:`paddle.layer.recurrent` 不是通过一般的方式来实现对输出的激活,所以不能采用第一种方式在这几个layer里设置 :code:`drop_rate` 来使用dropout。若要对这几个layer使用dropout,可采用第二种方式,即使用 :code:`paddle.layer.dropout`。 - -4. 不同的 recurrent layer 的区别 ----------------------------------- -以LSTM为例,在PaddlePaddle中包含以下 recurrent layer: - -* :code:`paddle.layer.lstmemory` -* :code:`paddle.networks.simple_lstm` -* :code:`paddle.networks.lstmemory_group` -* :code:`paddle.networks.bidirectional_lstm` - -按照具体实现方式可以归纳为2类: - -1. 由 recurrent_group 实现的 recurrent layer: - - * 用户在使用这一类recurrent layer时,可以访问由recurrent unit在一个时间步内计算得到的中间值(例如:hidden states, memory cells等); - * 上述的 :code:`paddle.networks.lstmemory_group` 是这一类的 recurrent layer ; - -2. 将recurrent layer作为一个整体来实现: - - * 用户在使用这一类recurrent layer,只能访问它们的输出值; - * 上述的 :code:`paddle.networks.lstmemory_group` 、 :code:`paddle.networks.simple_lstm` 和 :code:`paddle.networks.bidirectional_lstm` 属于这一类的实现; - -将recurrent layer作为一个整体来实现, 能够针对CPU和GPU的计算做更多优化, 所以相比于recurrent group的实现方式, 第二类 recurrent layer 计算效率更高。 在实际应用中,如果用户不需要访问LSTM的中间变量,而只需要获得recurrent layer计算的输出,我们建议使用第二类实现。 - -此外,关于LSTM, PaddlePaddle中还包含 :code:`paddle.networks.lstmemory_unit` 这一计算单元: - - * 不同于上述介绍的recurrent layer , :code:`paddle.networks.lstmemory_unit` 定义了LSTM单元在一个时间步内的计算过程,它并不是一个完整的recurrent layer,也不能接收序列数据作为输入; - * :code:`paddle.networks.lstmemory_unit` 只能在recurrent_group中作为step function使用; - -5. PaddlePaddle的softmax能否指定计算的维度 ------------------------------------------ - -PaddlePaddle的softmax不能指定计算维度,只能按行计算。 -在图像任务中,对于NCHW,如果需要在C维度计算softmax,可以先使用 :code:`paddle.layer.switch_order` 改变维度顺序,即将NCHW转换成NHWC,再做一定的reshape,最后计算softmax。 - -6. PaddlePaddle是否支持维数可变的数据输入 ------------------------------------------- - -PaddlePaddle提供的 :code:`paddle.data_type.dense_array` 支持维数可变的数据输入。在使用时,将对应数据层的维数设置成一个大于输入数据维数的值用于占位即可。 diff --git a/doc/v2/faq/model/index_en.rst b/doc/v2/faq/model/index_en.rst deleted file mode 100644 index 67a33e08e192e5627ac3b0abd76e979f21ed2079..0000000000000000000000000000000000000000 --- a/doc/v2/faq/model/index_en.rst +++ /dev/null @@ -1,81 +0,0 @@ -################### -Model Configuration -################### - -.. contents:: - -1. How to deal with error :code:`Duplicated layer name` ----------------------------------------------------------- - -The general reason for this error is that users may have set the same value for the attribute :code:`name` in different layers. Try to find out the :code:`name` attribute with the same value in diffrent layers and set them differently. - -2. How to use :code:`paddle.layer.memory`'s attribute :code:`name` ----------------------------------------------------------------------- - -* :code:`paddle.layer.memory` is used to get the output of a layer's last timestep and the layer is specified by the attribute :code:`name` . Thus, :code:`paddle.layer.memory` will associate with the layer that has the same value of attribute :code:`name` , and uses the output of the layer's last timestep as the input of its current timestep. - -* All the PaddlePaddle's layers have a unique name, which is set by the attribute :code:`name` . PaddlePaddle will automatically set it for the user when it is not explicitly set. :code:`paddle.layer.memory` is not a real layer, its name is set by the attribute :code:`memory_name` and PaddlePaddle will also automatically set it when the user does not explicitly set. The :code:`paddle.layer.memory` attribute :code:`name` is used to specify the layer it is associated with, and needs to be explicitly set by the user. - - -3. What is the difference between the two ways of using dropout ------------------------------------------------------------------ - -* There are two ways to use dropout in PaddlePaddle - - * Set the :code:`drop_rate` parameter in the layer's :code:`layer_atter` attribute. Take :code:`paddle.layer.fc` as an example: - - .. code-block:: python - - fc = paddle.layer.fc(input=input, layer_attr=paddle.attr.ExtraLayerAttribute(drop_rate=0.5)) - - * Use :code:`paddle.layer.dropout` layer. Take :code:`paddle.layer.fc` as an example: - - .. code-block:: python - - fc = paddle.layer.fc(input=input) - drop_fc = paddle.layer.dropout(input=fc, dropout_rate=0.5) - -* :code:`paddle.layer.dropout` actually uses the :code:`paddle.layer.add_to` layer and sets :code:`drop_rate` as the previous method. This method is very memory intensive. - -* PaddlePaddle implements dropout in the activation function rather than in the layer. - -* :code:`paddle.layer.lstmemory`, :code:`paddle.layer.grumemory`, :code:`paddle.layer.recurrent` implement activation of output in an unusual way, so we cannot use dropout by setting :code:`drop_rate` . To use dropout for these layers, we could use the second method, which is to use :code:`paddle.layer.dropout`. - -4. The differences between different recurrent layers --------------------------------------------------------- -Take LSTM as an example. There are several kinds of recurrent layers in PaddlePaddle: - -* :code:`paddle.layer.lstmemory` -* :code:`paddle.networks.simple_lstm` -* :code:`paddle.networks.lstmemory_group` -* :code:`paddle.networks.bidirectional_lstm` - -According to implementations, recurrent layer can be classified into 2 types: - -1. Recurrent layer implemented by recurrent_group: - - * Using this type of recurrent layers, users can access the intermediate value calculated by the recurrent unit within a timestep (eg: hidden states, memory cells, etc.) - * :code:`paddle.networks.lstmemory_group` belongs to this type of recurrent layers. - -2. Recurrent layer implemented as a complete operation: - - * Users can only access output values when using this type of recurrent layers. - * :code:`paddle.networks.lstmemory_group` , :code:`paddle.networks.simple_lstm` and :code:`paddle.networks.bidirectional_lstm` belong to this type of recurrent layer; - -By implementing recurrent layer as a complete operation, CPU and GPU calculations can be optimized. Therefore, the second type of recurrent layer is more efficient than the first one. In practical applications, we propose to use the second type of recurrent layers if there is no need to access the intermediate variable of LSTM. - -In addition, PaddlePaddle also contains a kind of LSTM calculation unit: :code:`paddle.networks.lstmemory_unit`: - - * Unlike the recurrent layer described above, :code:`paddle.networks.lstmemory_unit` defines the computational process of an LSTM unit in a timestep. It is not a complete recurrent layer, nor can it receive sequence data as input. - * :code:`paddle.networks.lstmemory_unit` can only be used as a step function in recurrent_group. - -5. Can Softmax's calculation dimension be specified? --------------------------------------------------------------------- - -We can't specify calculation dimension for PaddlePaddle's softmax. It can only be calculated by rows. -In image tasks, for NCHW, if you need to calculate softmax in C dimension, you could use :code:`paddle.layer.switch_order` to change the dimension order, that is, convert NCHW to NHWC, then do the reshape operation and calculate softmax. - -6. Does PaddlePaddle support variable-dimensional data inputs ----------------------------------------------------------------- - -PaddlePaddle provides :code:`paddle.data_type.dense_array` to support variable-dimensional data input. Simply set the dimension of the data layer to a value larger than the dimension of the input data for occupancy. diff --git a/doc/v2/faq/parameter/index_cn.rst b/doc/v2/faq/parameter/index_cn.rst deleted file mode 100644 index 987e8cf088be4ee8daa7c28fdc855506cbfd31c7..0000000000000000000000000000000000000000 --- a/doc/v2/faq/parameter/index_cn.rst +++ /dev/null @@ -1,201 +0,0 @@ -######### -参数设置 -######### - -.. contents:: - -1. 如何选择SGD算法的学习率 --------------------------- - -在采用sgd/async_sgd进行训练时,一个重要的问题是选择正确的learning_rate。如果learning_rate太大,那么训练有可能不收敛,如果learning_rate太小,那么收敛可能很慢,导致训练时间过长。 - -通常做法是从一个比较大的learning_rate开始试,如果不收敛,那减少学习率10倍继续试验,直到训练收敛为止。那么如何判断训练不收敛呢?可以估计出如果模型采用不变的输出最小的cost0是多少。 - -如果训练过程的的cost明显高于这个常数输出的cost,那么我们可以判断为训练不收敛。举一个例子,假如我们是三分类问题,采用multi-class-cross-entropy作为cost,数据中0,1,2三类的比例为 :code:`0.2, 0.5, 0.3` , 那么常数输出所能达到的最小cost是 :code:`-(0.2*log(0.2)+0.5*log(0.5)+0.3*log(0.3))=1.03` 。如果训练一个pass(或者更早)后,cost还大于这个数,那么可以认为训练不收敛,应该降低学习率。 - -2. 如何设置学习率退火(learning rate annealing) ------------------------------------------------- - -在相应的优化算法里设置learning_rate_schedule及相关参数,以使用Adam算法为例,代码如下: - -.. code-block:: python - - optimizer = paddle.optimizer.Adam( - learning_rate=1e-3, - learning_rate_decay_a=0.5, - learning_rate_decay_b=0.75, - learning_rate_schedule="poly",) - -PaddlePaddle目前支持8种learning_rate_schedule,这8种learning_rate_schedule及其对应学习率计算方式如下: - -* "constant" - - lr = learning_rate - -* "poly" - - lr = learning_rate * pow(1 + learning_rate_decay_a * num_samples_processed, -learning_rate_decay_b) - - 其中,num_samples_processed为已训练样本数,下同。 - -* "caffe_poly" - - lr = learning_rate * pow(1.0 - num_samples_processed / learning_rate_decay_a, learning_rate_decay_b) - -* "exp" - - lr = learning_rate * pow(learning_rate_decay_a, num_samples_processed / learning_rate_decay_b) - -* "discexp" - - lr = learning_rate * pow(learning_rate_decay_a, floor(num_samples_processed / learning_rate_decay_b)) - -* "linear" - - lr = max(learning_rate - learning_rate_decay_a * num_samples_processed, learning_rate_decay_b) - -* "manual" - - 这是一种按已训练样本数分段取值的学习率退火方法。使用该learning_rate_schedule时,用户通过参数 :code:`learning_rate_args` 设置学习率衰减因子分段函数,当前的学习率为所设置 :code:`learning_rate` 与当前的衰减因子的乘积。以使用Adam算法为例,代码如下: - - .. code-block:: python - - optimizer = paddle.optimizer.Adam( - learning_rate=1e-3, - learning_rate_schedule="manual", - learning_rate_args="1000:1.0,2000:0.9,3000:0.8",) - - 在该示例中,当已训练样本数小于等于1000时,学习率为 :code:`1e-3 * 1.0`;当已训练样本数大于1000小于等于2000时,学习率为 :code:`1e-3 * 0.9`;当已训练样本数大于2000时,学习率为 :code:`1e-3 * 0.8`。 - -* "pass_manual" - - 这是一种按已训练pass数分段取值的学习率退火方法。使用该learning_rate_schedule时,用户通过参数 :code:`learning_rate_args` 设置学习率衰减因子分段函数,当前的学习率为所设置 :code:`learning_rate` 与当前的衰减因子的乘积。以使用Adam算法为例,代码如下: - - .. code-block:: python - - optimizer = paddle.optimizer.Adam( - learning_rate=1e-3, - learning_rate_schedule="pass_manual", - learning_rate_args="1:1.0,2:0.9,3:0.8",) - - 在该示例中,当已训练pass数小于等于1时,学习率为 :code:`1e-3 * 1.0`;当已训练pass数大于1小于等于2时,学习率为 :code:`1e-3 * 0.9`;当已训练pass数大于2时,学习率为 :code:`1e-3 * 0.8`。 - -3. 如何初始化参数 ------------------ - -默认情况下,PaddlePaddle使用均值0,标准差为 :math:`\frac{1}{\sqrt{d}}` 来初始化参数。其中 :math:`d` 为参数矩阵的宽度。这种初始化方式在一般情况下不会产生很差的结果。如果用户想要自定义初始化方式,PaddlePaddle目前提供两种参数初始化的方式\: - -* 高斯分布。将 :code:`param_attr` 设置成 :code:`param_attr=ParamAttr(initial_mean=0.0, initial_std=1.0)` -* 均匀分布。将 :code:`param_attr` 设置成 :code:`param_attr=ParamAttr(initial_max=1.0, initial_min=-1.0)` - -比如设置一个全连接层的参数初始化方式和bias初始化方式,可以使用如下代码。 - -.. code-block:: python - - hidden = fc_layer(input=ipt, param_attr=ParamAttr(initial_max=1.0, initial_min=-1.0), - bias_attr=ParamAttr(initial_mean=1.0, initial_std=0.0)) - -上述代码将bias全部初始化为1.0, 同时将参数初始化为 :code:`[1.0, -1.0]` 的均匀分布。 - -4. 如何共享参数 ---------------- - -PaddlePaddle的参数使用名字 :code:`name` 作为参数的ID,相同名字的参数,会共享参数。设置参数的名字,可以使用 :code:`ParamAttr(name="YOUR_PARAM_NAME")` 来设置。更方便的设置方式,是使得要共享的参数使用同样的 :code:`ParamAttr` 对象。 - -简单的全连接网络,参数共享的配置示例为\: - -.. literalinclude:: ../../python/paddle/trainer_config_helpers/tests/configs/shared_fc.py - -这里 :code:`hidden_a` 和 :code:`hidden_b` 使用了同样的parameter和bias。并且softmax层的两个输入也使用了同样的参数 :code:`softmax_param`。 - -5. 如何加载预训练参数 ------------------------- - -* 对加载预训练参数的层,设置其参数属性 :code:`is_static=True`,使该层的参数在训练过程中保持不变。以embedding层为例,代码如下: - -.. code-block:: python - - emb_para = paddle.attr.Param(name='emb', is_static=True) - paddle.layer.embedding(size=word_dim, input=x, param_attr=emb_para) - - -* 从模型文件将预训练参数载入 :code:`numpy.array`,在创建parameters后,使用 :code:`parameters.set()` 加载预训练参数。PaddlePaddle保存的模型参数文件前16字节为头信息,用户将参数载入 :code:`numpy.array` 时须从第17字节开始。以embedding层为例,代码如下: - -.. code-block:: python - - def load_parameter(file_name, h, w): - with open(file_name, 'rb') as f: - f.read(16) # skip header. - return np.fromfile(f, dtype=np.float32).reshape(h, w) - - parameters = paddle.parameters.create(my_cost) - parameters.set('emb', load_parameter(emb_param_file, 30000, 256)) - -6. 存储的参数格式是什么,如何和明文进行相互转化 --------------------------------------------------- - -PaddlePaddle保存的模型参数文件内容由16字节头信息和网络参数两部分组成。头信息中,1~4字节表示PaddlePaddle版本信息,请直接填充0;5~8字节表示每个参数占用的字节数,当保存的网络参数为float类型时为4,double类型时为8;9~16字节表示保存的参数总个数。 - -将PaddlePaddle保存的模型参数还原回明文时,可以使用相应数据类型的 :code:`numpy.array` 加载具体网络参数,此时可以跳过PaddlePaddle模型参数文件的头信息。若在PaddlePaddle编译时,未指定按照double精度编译,默认情况下按照float精度计算,保存的参数也是float类型。这时在使用 :code:`numpy.array` 时,一般设置 :code:`dtype=float32` 。示例如下: - -.. code-block:: python - - def read_parameter(fname, width): - s = open(fname).read() - # skip header - vec = np.fromstring(s[16:], dtype=np.float32) - # width is the size of the corresponding layer - np.savetxt(fname + ".csv", vec.reshape(width, -1), - fmt="%.6f", delimiter=",") - - -将明文参数转化为PaddlePaddle可加载的模型参数时,首先构造头信息,再写入网络参数。下面的代码将随机生成的矩阵转化为可以被PaddlePaddle加载的模型参数。 - -.. code-block:: python - - def gen_rand_param(param_file, width, height, need_trans): - np.random.seed() - header = struct.pack("iil", 0, 4, height * width) - param = np.float32(np.random.rand(height, width)) - with open(param_file, "w") as fparam: - fparam.write(header + param.tostring()) - -7. A protocol message was rejected because it was too big ------------------------------------------------------------- - -如果在训练NLP相关模型时,出现以下错误: - -.. code-block:: bash - - [libprotobuf ERROR google/protobuf/io/coded_stream.cc:171] A protocol message was rejected because it was too big (more than 67108864 bytes). To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h. - F1205 14:59:50.295174 14703 TrainerConfigHelper.cpp:59] Check failed: m->conf.ParseFromString(configProtoStr) - -可能的原因是:传给dataprovider的某一个args过大,一般是由于直接传递大字典导致的。错误的define_py_data_sources2类似: - -.. code-block:: python - - src_dict = dict() - for line_count, line in enumerate(open(src_dict_path, "r")): - src_dict[line.strip()] = line_count - - define_py_data_sources2( - train_list, - test_list, - module="dataprovider", - obj="process", - args={"src_dict": src_dict}) - -解决方案是:将字典的地址作为args传给dataprovider,然后在dataprovider里面根据该地址加载字典。即define_py_data_sources2应改为: - -.. code-block:: python - - define_py_data_sources2( - train_list, - test_list, - module="dataprovider", - obj="process", - args={"src_dict_path": src_dict_path}) - -完整源码可参考 `sequence_recurrent `_ 示例。 - - diff --git a/doc/v2/faq/parameter/index_en.rst b/doc/v2/faq/parameter/index_en.rst deleted file mode 100644 index 9edb8dd620f972d019db9c0063cefce616de0ebd..0000000000000000000000000000000000000000 --- a/doc/v2/faq/parameter/index_en.rst +++ /dev/null @@ -1,198 +0,0 @@ -################## -Parameter Settings -################## - -.. contents:: - -1. How to Choose the Learning Rate of SGD Algorithm --------------------------- - -An important issue when training with :code:`sgd/async_sgd` is to choose the correct value for :code:`learning_rate`. If it is too large, the training may not converge. If too small, the convergence may be slow, resulting in a long training time. - -Usually, we start with a relatively large learning rate. If the training does not converge, then we need to reduce the learning rate continuously by a factor of 10 until the training converges. We examine the convergence of the training by estimating the minimum cost at a constant output of the model. - -If the cost of the training process is significantly higher than the cost of the output, then we judge that the training does not converge. For example, if we have a three-class problem and use multi-class-cross-entropy as the cost, the ratio of 0, 1, and 2 in the data will be :code:`0.2, 0.5, 0.3`. The minimum cost thus will be :code:`-(0.2*log(0.2)+0.5*log(0.5)+0.3*log(0.3))=1.03`. If the cost is greater than this number after training a pass (or even before), then the training may not be converged and the learning rate should be reduced. - -2. How to Implement Learning Rate Annealing ------------------------------------------------- - -We use the Adam algorithm as an example. Set the parameters of :code:`learning_rate_schedule` in the corresponding optimization algorithm as follows: - -.. code-block:: python - -    Optimizer = paddle.optimizer.Adam( -        Learning_rate=1e-3, -        Learning_rate_decay_a=0.5, -        Learning_rate_decay_b=0.75, -        Learning_rate_schedule="poly",) - -PaddlePaddle currently supports 8 learning rate schedules. The 8 learning rate schedules and their corresponding learning rates are calculated as follows: - -* "constant" -   -  Lr = learning_rate - -* "poly" - -  Lr = learning_rate * pow(1 + learning_rate_decay_a * num_samples_processed, -learning_rate_decay_b) - -  Variable :code:`num_samples_processed` is the number of trained samples. - -* "caffe_poly" - -  Lr = learning_rate * pow(1.0 - num_samples_processed / learning_rate_decay_a, learning_rate_decay_b) - -* "exp" - -  Lr = learning_rate * pow(learning_rate_decay_a, num_samples_processed / learning_rate_decay_b) - -* "discexp" - -  Lr = learning_rate * pow(learning_rate_decay_a, floor(num_samples_processed / learning_rate_decay_b)) - -* "linear" - -  Lr = max(learning_rate - learning_rate_decay_a * num_samples_processed, learning_rate_decay_b) - -* "manual" - -  This is a learning rate annealing method that is segmented by the number of trained samples. When using this learning rate schedule, we modify the learning rate attenuation factor piecewise function by changing the parameter :code:`learning_rate_args`. The current learning rate is the product of :code:`learning_rate` and the current attenuation factor. Take the Adam algorithm as an example: - -  .. code-block:: python - -      Optimizer = paddle.optimizer.Adam( -          Learning_rate=1e-3, -          Learning_rate_schedule="manual", -          Learning_rate_args="1000:1.0,2000:0.9,3000:0.8",) - -  In this example, when the number of trained samples is less than or equal to 1000, the learning rate is: code:`1e-3*1.0`; when the number of trained samples is greater than 1000 or less than or equal to 2000, the learning rate is:code:`1e- 3 * 0.9`; when the number of trained samples is greater than 2,000, the learning rate is: code:`1e-3*0.8`. - -* "pass_manual" - -  This is a learning rate annealing method that piecewisely pick values according to the number of trained passes. When using this learning rate schedule, we set the learning rate attenuation factor piecewise function by the parameter :code:`learning_rate_args`. The current learning rate is the product of :code:`learning_rate` and the current attenuation factor. Take the Adam algorithm as an example: - -  .. code-block:: python - -      Optimizer = paddle.optimizer.Adam( -          Learning_rate=1e-3, -          Learning_rate_schedule="pass_manual", -          Learning_rate_args="1:1.0,2:0.9,3:0.8",) - -  In this example, when the number of trained passes is less than or equal to 1, the learning rate is :code:`1e-3*1.0`; when the number of trained passes is greater than 1 or less than 2, the learning rate is :code:`1e- 3 * 0.9`; when the number of trained passes is greater than 2, the learning rate is :code:`1e-3*0.8`. - -3. How to Initialize Parameters ------------------ - -By default, PaddlePaddle initializes parameters with an average of 0 and a standard deviation of :math:`\frac{1}{\sqrt{d}}`, where :math:`d` is the width of the parameter matrix. This initialization method does not produce bad results under normal circumstances. If users want to customize the initialization method, PaddlePaddle provides two ways to initialize the parameters: - -* Gaussian distribution. Set :code:`param_attr` to :code:`param_attr=ParamAttr(initial_mean=0.0, initial_std=1.0)` -* Uniform distribution. Set :code:`param_attr` to :code:`param_attr=ParamAttr(initial_max=1.0, initial_min=-1.0)` - -For example, to set a full connection layer parameter initialization mode and bias initialization mode, you can use the following code: - -.. code-block:: python - -    Hidden = fc_layer(input=ipt, param_attr=ParamAttr(initial_max=1.0, initial_min=-1.0), -                      Bias_attr=ParamAttr(initial_mean=1.0, initial_std=0.0)) - -The above code initializes the bias to 1.0 and initializes the parameters to a uniform distribution of :code:`[1.0, -1.0]`. - -4. How to Share Parameters ---------------- - -PaddlePaddle's parameters use :code:`name` as the ID. Parameters with the same name will share parameters//. We can set the name of the parameters using :code:`ParamAttr(name="YOUR_PARAM_NAME")`. More conveniently, we can make the parameters to be shared use the same :code:`ParamAttr` object. - -A simple fully connected network has its configuration of parameter sharing as follows \: - -.. literalinclude:: ../../python/paddle/trainer_config_helpers/tests/configs/shared_fc.py - -Here :code:`hidden_a` and :code:`hidden_b` have the same parameter and bias. The two input of the softmax layer also use the same parameter :code:`softmax_param`. - -5. How to Load Pre-training Parameters ------------------------- -* For layers that load pre-training parameters, set :code:`is_static = True` so that the parameters of that layer remain unchanged during the training process. Take the embedding layer as an example, the code is as follows: - -.. code-block:: python - -    Emb_para = paddle.attr.Param(name='emb', is_static=True) -    Paddle.layer.embedding(size=word_dim, input=x, param_attr=emb_para) - - -* Load pre-training parameters from the model file into :code:`numpy.array`. After creating the parameters, load the pre-training parameters using :code:`parameters.set()`. The first 16 bytes of the model parameter file saved by PaddlePaddle is the header information. The user must loads : :code:`numpy.array` starting with the 17th byte. Take the embedding layer as an example, the code is as follows: - -.. code-block:: python - -    Def load_parameter(file_name, h, w): -        With open(file_name, 'rb') as f: -            F.read(16) # skip header. -            Return np.fromfile(f, dtype=np.float32).reshape(h, w) - -    Parameters = paddle.parameters.create(my_cost) -    Parameters.set('emb', load_parameter(emb_param_file, 30000, 256)) - -6. Format of the Stored Parameter and How to Convert the File to Plain Text --------------------------------------------------- - -The model parameter file saved by PaddlePaddle consists of 16 bytes of header information and network parameters. In the header information, the first four bytes show PaddlePaddle's version information. The user should fill in with 0s. The next four bytes represent the number of bytes occupied by each parameter. If the saved network parameter is a float type, the number is four; if it is a double, the number is eight. The third group of four bytes represents the total number of saved parameters. - -When restoring the model parameters saved by PaddlePaddle back to plain text, we use the corresponding data type :code:`numpy.array` to load specific network parameters. At this time, you can skip the header information of the PaddlePaddle model parameter file. If not specified to compile with a precision for double in PaddlePaddle, then the parameter file will be caiculated with a precision for float, and the argument will be stored as a float. In this case, when using :code:`numpy.array`, generally we set :code:`dtype=float32`. An example is as follows: - -.. code-block:: python - -    Def read_parameter(fname, width): -        s = open(fname).read() -        # skip header -        Vec = np.fromstring(s[16:], dtype=np.float32) -        # width is the size of the corresponding layer -        Np.savetxt(fname + ".csv", vec.reshape(width, -1), -                Fmt="%.6f", delimiter=",") - - -When the plaintext parameters are converted into PaddlePaddle loadable model parameters, the header information is constructed first, then the network parameters are written. The following code converts the randomly generated matrix into model parameters that can be loaded by PaddlePaddle: - -.. code-block:: python - -    Def gen_rand_param(param_file, width, height, need_trans): -        Np.random.seed() -        Header = struct.pack("iil", 0, 4, height * width) -        Param = np.float32(np.random.rand(height, width)) -        With open(param_file, "w") as fparam: -            Fparam.write(header + param.tostring()) - -7. A Protocol Message Rejected Because of its Large Size --------------------------------------------------- ---------- - -If you are training NLP related models, and the following error occurs: - -.. code-block:: bash - -    [libprotobuf ERROR google/protobuf/io/coded_stream.cc:171] A protocol message was rejected because it was too big (more than 67108864 bytes). To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit( ) in google/protobuf/io/coded_stream.h. -    F1205 14:59:50.295174 14703 TrainerConfigHelper.cpp:59] Check failed: m->conf.ParseFromString(configProtoStr) - -The possible reason is that one of the args passed to the dataprovider is too large, which is usually caused by directly passing a large dictionary. A wrongly defineed `_py_data_sources2` is similar to: - -.. code-block:: python - -     Src_dict = dict() -     For line_count, line in enumerate(open(src_dict_path, "r")): -        Src_dict[line.strip()] = line_count - -     Define_py_data_sources2( -        Train_list, -        Test_list, -        Module="dataprovider", -        Obj="process", -        Args={"src_dict": src_dict}) - -The solution is to pass the address of the dictionary as args to the dataprovider, and then load the dictionary according to the address in the dataprovider. Change `_py_data_sources2` to: - -.. code-block:: python - -     Define_py_data_sources2( -        Train_list, -        Test_list, -        Module="dataprovider", -        Obj="process", -        Args={"src_dict_path": src_dict_path}) - -The full source code can be found in the `sequence_recurrent `_ example. diff --git a/doc/v2/getstarted/concepts/src/infer.py b/doc/v2/getstarted/concepts/src/infer.py deleted file mode 100644 index afe256f234a1c7d29c33f3b65b8302646df0c45c..0000000000000000000000000000000000000000 --- a/doc/v2/getstarted/concepts/src/infer.py +++ /dev/null @@ -1,32 +0,0 @@ -# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import paddle.v2 as paddle -import numpy as np - -paddle.init(use_gpu=False) -x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(2)) -y_predict = paddle.layer.fc(input=x, size=1, act=paddle.activation.Linear()) - -# loading the model which generated by training -with open('params_pass_90.tar', 'r') as f: - parameters = paddle.parameters.Parameters.from_tar(f) - -# Input multiple sets of data,Output the infer result in a array. -i = [[[1, 2]], [[3, 4]], [[5, 6]]] -print paddle.infer(output_layer=y_predict, parameters=parameters, input=i) -# Will print: -# [[ -3.24491572] -# [ -6.94668722] -# [-10.64845848]] diff --git a/doc/v2/getstarted/concepts/src/train.py b/doc/v2/getstarted/concepts/src/train.py deleted file mode 100644 index a85d5d8a3acee61d11488e5b842831a79072680a..0000000000000000000000000000000000000000 --- a/doc/v2/getstarted/concepts/src/train.py +++ /dev/null @@ -1,71 +0,0 @@ -# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import paddle.v2 as paddle -import numpy as np - -# init paddle -paddle.init(use_gpu=False) - -# network config -x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(2)) -y_predict = paddle.layer.fc(input=x, size=1, act=paddle.activation.Linear()) -y = paddle.layer.data(name='y', type=paddle.data_type.dense_vector(1)) -cost = paddle.layer.square_error_cost(input=y_predict, label=y) - -# create parameters -parameters = paddle.parameters.create(cost) -# create optimizer -optimizer = paddle.optimizer.Momentum(momentum=0) -# create trainer -trainer = paddle.trainer.SGD(cost=cost, - parameters=parameters, - update_equation=optimizer) - - -# event_handler to print training info -def event_handler(event): - if isinstance(event, paddle.event.EndIteration): - if event.batch_id % 1 == 0: - print "Pass %d, Batch %d, Cost %f" % (event.pass_id, event.batch_id, - event.cost) - # product model every 10 pass - if isinstance(event, paddle.event.EndPass): - if event.pass_id % 10 == 0: - with open('params_pass_%d.tar' % event.pass_id, 'w') as f: - trainer.save_parameter_to_tar(f) - - -# define training dataset reader -def train_reader(): - train_x = np.array([[1, 1], [1, 2], [3, 4], [5, 2]]) - train_y = np.array([[-2], [-3], [-7], [-7]]) - - def reader(): - for i in xrange(train_y.shape[0]): - yield train_x[i], train_y[i] - - return reader - - -# define feeding map -feeding = {'x': 0, 'y': 1} - -# training -trainer.train( - reader=paddle.batch( - train_reader(), batch_size=1), - feeding=feeding, - event_handler=event_handler, - num_passes=100) diff --git a/doc/v2/getstarted/concepts/use_concepts_cn.rst b/doc/v2/getstarted/concepts/use_concepts_cn.rst deleted file mode 100644 index 608f49f5a969b3291eb43bf2acf582af74e566a1..0000000000000000000000000000000000000000 --- a/doc/v2/getstarted/concepts/use_concepts_cn.rst +++ /dev/null @@ -1,155 +0,0 @@ -############ -基本使用概念 -############ - -PaddlePaddle是源于百度的一个深度学习平台。PaddlePaddle为深度学习研究人员提供了丰富的API,可以轻松地完成神经网络配置,模型训练等任务。 -这里将介绍PaddlePaddle的基本使用概念,并且展示了如何利用PaddlePaddle来解决一个经典的线性回归问题。 -在使用该文档之前,请参考 `安装文档 <../../build_and_install/index_cn.html>`_ 完成PaddlePaddle的安装。 - - -配置网络 -============ - -加载PaddlePaddle ----------------------- - -在进行网络配置之前,首先需要加载相应的Python库,并进行初始化操作。 - -.. code-block:: bash - - import paddle.v2 as paddle - import numpy as np - paddle.init(use_gpu=False) - - -搭建神经网络 ------------------------ - -搭建神经网络就像使用积木搭建宝塔一样。在PaddlePaddle中,layer是我们的积木,而神经网络是我们要搭建的宝塔。我们使用不同的layer进行组合,来搭建神经网络。 -宝塔的底端需要坚实的基座来支撑,同样,神经网络也需要一些特定的layer作为输入接口,来完成网络的训练。 - -例如,我们可以定义如下layer来描述神经网络的输入: - -.. code-block:: bash - - x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(2)) - y = paddle.layer.data(name='y', type=paddle.data_type.dense_vector(1)) - -其中x表示输入数据是一个维度为2的稠密向量,y表示输入数据是一个维度为1的稠密向量。 - -PaddlePaddle支持不同类型的输入数据,主要包括四种类型,和三种序列模式。 - -四种数据类型: - -* dense_vector:稠密的浮点数向量。 -* sparse_binary_vector:稀疏的01向量,即大部分值为0,但有值的地方必须为1。 -* sparse_float_vector:稀疏的向量,即大部分值为0,但有值的部分可以是任何浮点数。 -* integer:整数标签。 - -三种序列模式: - -* SequenceType.NO_SEQUENCE:不是一条序列 -* SequenceType.SEQUENCE:是一条时间序列 -* SequenceType.SUB_SEQUENCE: 是一条时间序列,且序列的每一个元素还是一个时间序列。 - -不同的数据类型和序列模式返回的格式不同,列表如下: - -+----------------------+---------------------+-----------------------------------+------------------------------------------------+ -| | NO_SEQUENCE | SEQUENCE | SUB_SEQUENCE | -+======================+=====================+===================================+================================================+ -| dense_vector | [f, f, ...] | [[f, ...], [f, ...], ...] | [[[f, ...], ...], [[f, ...], ...],...] | -+----------------------+---------------------+-----------------------------------+------------------------------------------------+ -| sparse_binary_vector | [i, i, ...] | [[i, ...], [i, ...], ...] | [[[i, ...], ...], [[i, ...], ...],...] | -+----------------------+---------------------+-----------------------------------+------------------------------------------------+ -| sparse_float_vector | [(i,f), (i,f), ...] | [[(i,f), ...], [(i,f), ...], ...] | [[[(i,f), ...], ...], [[(i,f), ...], ...],...] | -+----------------------+---------------------+-----------------------------------+------------------------------------------------+ -| integer_value | i | [i, i, ...] | [[i, ...], [i, ...], ...] | -+----------------------+---------------------+-----------------------------------+------------------------------------------------+ - -其中,f代表一个浮点数,i代表一个整数。 - -注意:对sparse_binary_vector和sparse_float_vector,PaddlePaddle存的是有值位置的索引。例如, - -- 对一个5维非序列的稀疏01向量 ``[0, 1, 1, 0, 0]`` ,类型是sparse_binary_vector,返回的是 ``[1, 2]`` 。 -- 对一个5维非序列的稀疏浮点向量 ``[0, 0.5, 0.7, 0, 0]`` ,类型是sparse_float_vector,返回的是 ``[(1, 0.5), (2, 0.7)]`` 。 - - -在定义输入layer之后,我们可以使用其他layer进行组合。在组合时,需要指定layer的输入来源。 - -例如,我们可以定义如下的layer组合: - -.. code-block:: bash - - y_predict = paddle.layer.fc(input=x, size=1, act=paddle.activation.Linear()) - cost = paddle.layer.square_error_cost(input=y_predict, label=y) - -其中,x与y为之前描述的输入层;而y_predict是接收x作为输入,接上一个全连接层;cost接收y_predict与y作为输入,接上平方误差层。 - -最后一层cost中记录了神经网络的所有拓扑结构,通过组合不同的layer,我们即可完成神经网络的搭建。 - - -训练模型 -============ - -在完成神经网络的搭建之后,我们首先需要根据神经网络结构来创建所需要优化的parameters,并创建optimizer。 -之后,我们可以创建trainer来对网络进行训练。 - -.. code-block:: bash - - parameters = paddle.parameters.create(cost) - optimizer = paddle.optimizer.Momentum(momentum=0) - trainer = paddle.trainer.SGD(cost=cost, - parameters=parameters, - update_equation=optimizer) - -其中,trainer接收三个参数,包括神经网络拓扑结构、神经网络参数以及迭代方程。 - -在搭建神经网络的过程中,我们仅仅对神经网络的输入进行了描述。而trainer需要读取训练数据进行训练,PaddlePaddle中通过reader来加载数据。 - -.. code-block:: bash - - # define training dataset reader - def train_reader(): - train_x = np.array([[1, 1], [1, 2], [3, 4], [5, 2]]) - train_y = np.array([[-2], [-3], [-7], [-7]]) - def reader(): - for i in xrange(train_y.shape[0]): - yield train_x[i], train_y[i] - return reader - -最终我们可以调用trainer的train方法启动训练: - -.. code-block:: bash - - # define feeding map - feeding = {'x': 0, 'y': 1} - - # event_handler to print training info - def event_handler(event): - if isinstance(event, paddle.event.EndIteration): - if event.batch_id % 1 == 0: - print "Pass %d, Batch %d, Cost %f" % ( - event.pass_id, event.batch_id, event.cost) - # training - trainer.train( - reader=paddle.batch(train_reader(), batch_size=1), - feeding=feeding, - event_handler=event_handler, - num_passes=100) - -关于PaddlePaddle的更多使用方法请参考 `进阶指南 <../../howto/index_cn.html>`_。 - -线性回归完整示例 -============== - -下面给出在三维空间中使用线性回归拟合一条直线的例子: - -.. literalinclude:: src/train.py - :linenos: - -使用以上训练好的模型进行预测,取其中一个模型params_pass_90.tar,输入需要预测的向量组,然后打印输出: - -.. literalinclude:: src/infer.py - :linenos: - -有关线性回归的实际应用,可以参考PaddlePaddle book的 `第一章节 `_。 diff --git a/doc/v2/getstarted/concepts/use_concepts_en.rst b/doc/v2/getstarted/concepts/use_concepts_en.rst deleted file mode 100644 index 406b0cbb913894dc333d8e4561c207793c33e475..0000000000000000000000000000000000000000 --- a/doc/v2/getstarted/concepts/use_concepts_en.rst +++ /dev/null @@ -1,3 +0,0 @@ -Basic Concept -============= -TBD diff --git a/doc/v2/getstarted/index_cn.rst b/doc/v2/getstarted/index_cn.rst deleted file mode 100644 index 75af7354be93a6eeabfa9ccf86903505402a7ca6..0000000000000000000000000000000000000000 --- a/doc/v2/getstarted/index_cn.rst +++ /dev/null @@ -1,19 +0,0 @@ -新手入门 -============ - - -如果需要快速了解PaddlePaddle的使用,可以参考以下指南。 - -.. toctree:: - :maxdepth: 1 - - quickstart_cn.rst - - -在使用PaddlePaddle构建应用时,需要了解一些基本概念。 -这里以一个线性回归为例子,详细介绍了PaddlePaddle的使用流程,包括数据格式,模型配置与训练等。 - -.. toctree:: - :maxdepth: 1 - - concepts/use_concepts_cn.rst diff --git a/doc/v2/getstarted/index_en.rst b/doc/v2/getstarted/index_en.rst deleted file mode 100644 index 94b306895c9ddf6140cf600131930a6675a583eb..0000000000000000000000000000000000000000 --- a/doc/v2/getstarted/index_en.rst +++ /dev/null @@ -1,19 +0,0 @@ -GET STARTED -============ - -If you want to quickly know how to use PaddlePaddle, please refer to the following guide: - -.. toctree:: - :maxdepth: 1 - - quickstart_en.rst - - -While using PaddlePaddle to build applications, please understand some basic concepts. - -Here is an example of linear regression. It introduces workflow of PaddlePaddle, including data format, model configuration and training, etc. - -.. toctree:: - :maxdepth: 1 - - concepts/use_concepts_en.rst diff --git a/doc/v2/getstarted/quickstart_cn.rst b/doc/v2/getstarted/quickstart_cn.rst deleted file mode 100644 index d511cead262dabafd095f68adb5ffc596a7fe596..0000000000000000000000000000000000000000 --- a/doc/v2/getstarted/quickstart_cn.rst +++ /dev/null @@ -1,47 +0,0 @@ -快速开始 -======== - -快速安装 --------- - -PaddlePaddle支持使用pip快速安装,目前支持CentOS 6以上, Ubuntu 14.04以及MacOS 10.12,并安装有Python2.7。 -执行下面的命令完成快速安装,版本为cpu_avx_openblas: - - .. code-block:: bash - - pip install paddlepaddle - -如果需要安装支持GPU的版本(cuda7.5_cudnn5_avx_openblas),需要执行: - - .. code-block:: bash - - pip install paddlepaddle-gpu - -更详细的安装和编译方法参考::ref:`install_steps` 。 - -快速使用 --------- - -创建一个 housing.py 并粘贴此Python代码: - - .. code-block:: python - - import paddle.v2 as paddle - - # Initialize PaddlePaddle. - paddle.init(use_gpu=False, trainer_count=1) - - # Configure the neural network. - x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(13)) - y_predict = paddle.layer.fc(input=x, size=1, act=paddle.activation.Linear()) - - # Infer using provided test data. - probs = paddle.infer( - output_layer=y_predict, - parameters=paddle.dataset.uci_housing.model(), - input=[item for item in paddle.dataset.uci_housing.test()()]) - - for i in xrange(len(probs)): - print 'Predicted price: ${:,.2f}'.format(probs[i][0] * 1000) - -执行 :code:`python housing.py` 瞧! 它应该打印出预测住房数据的清单。 diff --git a/doc/v2/getstarted/quickstart_en.rst b/doc/v2/getstarted/quickstart_en.rst deleted file mode 100644 index 70f7fe0646068aa79cd72955c6848ac0250c2300..0000000000000000000000000000000000000000 --- a/doc/v2/getstarted/quickstart_en.rst +++ /dev/null @@ -1,51 +0,0 @@ -Quick Start -============ - -Quick Install -------------- - -You can use pip to install PaddlePaddle with a single command, supports -CentOS 6 above, Ubuntu 14.04 above or MacOS 10.12, with Python 2.7 installed. -Simply run the following command to install, the version is cpu_avx_openblas: - - .. code-block:: bash - - pip install paddlepaddle - -If you need to install GPU version (cuda7.5_cudnn5_avx_openblas), run: - - .. code-block:: bash - - pip install paddlepaddle-gpu - -For more details about installation and build: :ref:`install_steps` . - -Quick Use ---------- - -Create a new file called housing.py, and paste this Python -code: - - - .. code-block:: python - - import paddle.v2 as paddle - - # Initialize PaddlePaddle. - paddle.init(use_gpu=False, trainer_count=1) - - # Configure the neural network. - x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(13)) - y_predict = paddle.layer.fc(input=x, size=1, act=paddle.activation.Linear()) - - # Infer using provided test data. - probs = paddle.infer( - output_layer=y_predict, - parameters=paddle.dataset.uci_housing.model(), - input=[item for item in paddle.dataset.uci_housing.test()()]) - - for i in xrange(len(probs)): - print 'Predicted price: ${:,.2f}'.format(probs[i][0] * 1000) - -Run :code:`python housing.py` and voila! It should print out a list of predictions -for the test housing data. diff --git a/doc/v2/howto/capi/compile_paddle_lib_cn.md b/doc/v2/howto/capi/compile_paddle_lib_cn.md deleted file mode 100644 index 8878ee9d85064ba27708ed92790aa9b83ba316e5..0000000000000000000000000000000000000000 --- a/doc/v2/howto/capi/compile_paddle_lib_cn.md +++ /dev/null @@ -1,181 +0,0 @@ -## 安装、编译与链接C-API预测库 - -### 直接下载安装 - -从CI系统中下载最新的C-API开发包进行安装,用户可以从下面的表格中找到需要的版本: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
版本说明C-API
cpu_avx_mklpaddle.tgz
cpu_avx_openblaspaddle.tgz
cpu_noavx_openblaspaddle.tgz
cuda7.5_cudnn5_avx_mklpaddle.tgz
cuda8.0_cudnn5_avx_mklpaddle.tgz
cuda8.0_cudnn7_avx_mklpaddle.tgz
cuda9.0_cudnn7_avx_mklpaddle.tgz
- -### 从源码编译 - -用户也可以从 PaddlePaddle 核心代码编译C-API链接库,只需在编译时配制下面这些编译选项: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
选项
WITH_C_APION
WITH_PYTHONOFF(推荐)
WITH_SWIG_PYOFF(推荐)
WITH_GOLANGOFF(推荐)
WITH_GPUON/OFF
WITH_MKLON/OFF
- -建议按照推荐值设置,以避免链接不必要的库。其它可选编译选项按需进行设定。 - -下面的代码片段从github拉取最新代码,配制编译选项(需要将PADDLE_ROOT替换为PaddlePaddle预测库的安装路径): - -```shell -PADDLE_ROOT=/path/of/capi -git clone https://github.com/PaddlePaddle/Paddle.git -cd Paddle -mkdir build -cd build -cmake -DCMAKE_INSTALL_PREFIX=$PADDLE_ROOT \ - -DCMAKE_BUILD_TYPE=Release \ - -DWITH_C_API=ON \ - -DWITH_SWIG_PY=OFF \ - -DWITH_GOLANG=OFF \ - -DWITH_PYTHON=OFF \ - -DWITH_MKL=OFF \ - -DWITH_GPU=OFF \ - .. -``` - -执行上述代码生成Makefile文件后,执行:`make && make install`。成功编译后,使用C-API所需的依赖(包括:(1)编译出的PaddlePaddle预测库和头文件;(2)第三方链接库和头文件)均会存放于`PADDLE_ROOT`目录中。 - -编译成功后在 `PADDLE_ROOT` 下会看到如下目录结构(包括了编译出的PaddlePaddle头文件和链接库,以及第三方依赖链接库和头文件(如果需要,由链接方式决定)): - -```text -├── include -│   └── paddle -│   ├── arguments.h -│   ├── capi.h -│   ├── capi_private.h -│   ├── config.h -│   ├── error.h -│   ├── gradient_machine.h -│   ├── main.h -│   ├── matrix.h -│   ├── paddle_capi.map -│   └── vector.h -├── lib -│   ├── libpaddle_capi_engine.a -│   ├── libpaddle_capi_layers.a -│   ├── libpaddle_capi_shared.so -│   └── libpaddle_capi_whole.a -└── third_party - ├── gflags - │   ├── include - │   │   └── gflags - │   │   ├── gflags_completions.h - │   │   ├── gflags_declare.h - │   │   ... - │   └── lib - │   └── libgflags.a - ├── glog - │   ├── include - │   │   └── glog - │   │   ├── config.h - │   │   ... - │   └── lib - │   └── libglog.a - ├── openblas - │   ├── include - │   │   ├── cblas.h - │   │   ... - │   └── lib - │   ... - ├── protobuf - │   ├── include - │   │   └── google - │   │   └── protobuf - │   │   ... - │   └── lib - │   └── libprotobuf-lite.a - └── zlib - ├── include - │   ... - └── lib - ... - -``` - -### 链接说明 - -目前提供三种链接方式: - -1. 链接`libpaddle_capi_shared.so` 动态库(这种方式最为简便,链接相对容易,**在无特殊需求情况下,推荐使用此方式**),需注意: - 1. 如果编译时指定编译CPU版本,且使用`OpenBLAS`数学库,在使用C-API开发预测程序时,只需要链接`libpaddle_capi_shared.so`这一个库。 - 1. 如果是用编译时指定CPU版本,且使用`MKL`数学库,由于`MKL`库有自己独立的动态库文件,在使用PaddlePaddle C-API开发预测程序时,需要自己链接MKL链接库。 - 1. 如果编译时指定编译GPU版本,CUDA相关库会在预测程序运行时动态装载,需要将CUDA相关的库设置到`LD_LIBRARY_PATH`环境变量中。 - -2. 链接静态库 `libpaddle_capi_whole.a`,需注意: - 1. 需要指定`-Wl,--whole-archive`链接选项。 - 1. 需要显式地链接 `gflags`、`glog`、`libz`、`protobuf` 等第三方库,可在`PADDLE_ROOT/third_party`下找到。 - 1. 如果在编译 C-API 时使用OpenBLAS数学库,需要显示地链接`libopenblas.a`。 - 1. 如果在编译 C-API 是使用MKL数学库,需要显示地链接MKL的动态库。 - -3. 链接静态库 `libpaddle_capi_layers.a`和`libpaddle_capi_engine.a`,需注意: - 1. 这种链接方式主要用于移动端预测。 - 1. 为了减少生成链接库的大小把`libpaddle_capi_whole.a`拆成以上两个静态链接库。 - 1. 需指定`-Wl,--whole-archive -lpaddle_capi_layers` 和 `-Wl,--no-whole-archive -lpaddle_capi_engine` 进行链接。 - 1. 第三方依赖库需要按照与方式2同样方法显示地进行链接。 diff --git a/doc/v2/howto/capi/compile_paddle_lib_en.md b/doc/v2/howto/capi/compile_paddle_lib_en.md deleted file mode 100644 index 70a6edef27e75af6b38d7d4824c928eba0d29b9a..0000000000000000000000000000000000000000 --- a/doc/v2/howto/capi/compile_paddle_lib_en.md +++ /dev/null @@ -1,180 +0,0 @@ -## Install and Build - -### Download & Install - - Download the latest C-API development package from CI system and install. You can find the required version in the table below: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Version TipsC-API
cpu_avx_mklpaddle.tgz
cpu_avx_openblaspaddle.tgz
cpu_noavx_openblaspaddle.tgz
cuda7.5_cudnn5_avx_mklpaddle.tgz
cuda8.0_cudnn5_avx_mklpaddle.tgz
cuda8.0_cudnn7_avx_mklpaddle.tgz
cuda9.0_cudnn7_avx_mklpaddle.tgz
- -### From source - - Users can also compile the C-API library from PaddlePaddle source code by compiling with the following compilation options: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
OptionsValue
WITH_C_APION
WITH_PYTHONOFF(recommended)
WITH_SWIG_PYOFF(recommended)
WITH_GOLANGOFF(recommended)
WITH_GPUON/OFF
WITH_MKLON/OFF
- -It is best to set up with recommended values to avoid linking with unnecessary libraries. Set other compilation options as you need. - -Pull the latest following code snippet from github, and configure compilation options(replace PADDLE_ROOT with the installation path of the PaddlePaddle C-API inference library): - -```shell -PADDLE_ROOT=/path/of/capi -git clone https://github.com/PaddlePaddle/Paddle.git -cd Paddle -mkdir build -cd build -cmake -DCMAKE_INSTALL_PREFIX=$PADDLE_ROOT \ - -DCMAKE_BUILD_TYPE=Release \ - -DWITH_C_API=ON \ - -DWITH_SWIG_PY=OFF \ - -DWITH_GOLANG=OFF \ - -DWITH_PYTHON=OFF \ - -DWITH_MKL=OFF \ - -DWITH_GPU=OFF \ - .. -``` - -After running the above code to generate Makefile , run: `make && make install`. After successful compilation, the dependencies required by C-API(includes: (1)PaddlePaddle inference library and header files; (2) Third-party libraries and header files) will be stored in the `PADDLE_ROOT` directory. - -If the compilation is successful, see the following directory structure under `PADDLE_ROOT`(includes PaddlePaddle header files and libraries, and third-party libraries and header files(determined by the link methods if necessary)): - -```text -├── include -│   └── paddle -│   ├── arguments.h -│   ├── capi.h -│   ├── capi_private.h -│   ├── config.h -│   ├── error.h -│   ├── gradient_machine.h -│   ├── main.h -│   ├── matrix.h -│   ├── paddle_capi.map -│   └── vector.h -├── lib -│   ├── libpaddle_capi_engine.a -│   ├── libpaddle_capi_layers.a -│   ├── libpaddle_capi_shared.so -│   └── libpaddle_capi_whole.a -└── third_party - ├── gflags - │   ├── include - │   │   └── gflags - │   │   ├── gflags_completions.h - │   │   ├── gflags_declare.h - │   │   ... - │   └── lib - │   └── libgflags.a - ├── glog - │   ├── include - │   │   └── glog - │   │   ├── config.h - │   │   ... - │   └── lib - │   └── libglog.a - ├── openblas - │   ├── include - │   │   ├── cblas.h - │   │   ... - │   └── lib - │   ... - ├── protobuf - │   ├── include - │   │   └── google - │   │   └── protobuf - │   │   ... - │   └── lib - │   └── libprotobuf-lite.a - └── zlib - ├── include - │   ... - └── lib - ... - -``` - -### Linking Description: - -There are three kinds of linking methods: - -1. Linking with dynamic library `libpaddle_capi_shared.so`(This way is much more convenient and easier, **Without special requirements, it is recommended**), refer to the following: - 1. Compiling with CPU version and using `OpenBLAS`; only need to link one library named `libpaddle_capi_shared.so` to develop prediction program through C-API. - 1. Compiling with CPU version and using `MKL` lib, you need to link MKL library directly to develop prediction program through PaddlePaddle C-API, due to `MKL` has its own dynamic library. - 1. Compiling with GPU version, CUDA library will be loaded dynamically on prediction program run-time, and also set CUDA library to  `LD_LIBRARY_PATH` environment variable. - -2. Linking with static library `libpaddle_capi_whole.a`,refer to the following: - 1. Specify `-Wl,--whole-archive` linking options. - 1. Explicitly link third-party libraries such as `gflags`、`glog`、`libz`、`protobuf` .etc, you can find them under `PADDLE_ROOT/third_party` directory. - 1. Use OpenBLAS library if compiling C-API,must explicitly link `libopenblas.a`. - 1. Use MKL when compiling C-API, must explicitly link MKL dynamic library. - -3. Linking with static library `libpaddle_capi_layers.a` and `libpaddle_capi_engine.a`,refer to the following: - 1. This linking methods is mainly used for mobile prediction. - 1. Split `libpaddle_capi_whole.a` into two static linking library at least to reduce the size of linking libraries. - 1. Specify `-Wl,--whole-archive -lpaddle_capi_layers`  and `-Wl,--no-whole-archive -lpaddle_capi_engine` for linking. - 1. The third-party dependencies need explicitly link same as method 2 above. diff --git a/doc/v2/howto/capi/images/csr.png b/doc/v2/howto/capi/images/csr.png deleted file mode 100644 index 3dc10b8de4f6d3f517624956b1694b689405a031..0000000000000000000000000000000000000000 Binary files a/doc/v2/howto/capi/images/csr.png and /dev/null differ diff --git a/doc/v2/howto/capi/images/sequence_data.png b/doc/v2/howto/capi/images/sequence_data.png deleted file mode 100644 index 6e47a46b8955dfe977e85898fe3c9f33ed28de7e..0000000000000000000000000000000000000000 Binary files a/doc/v2/howto/capi/images/sequence_data.png and /dev/null differ diff --git a/doc/v2/howto/capi/images/workflow_of_CAPI.png b/doc/v2/howto/capi/images/workflow_of_CAPI.png deleted file mode 100644 index a4399ade048b3fe10d2d9c714bc34333ca068edb..0000000000000000000000000000000000000000 Binary files a/doc/v2/howto/capi/images/workflow_of_CAPI.png and /dev/null differ diff --git a/doc/v2/howto/capi/index_cn.rst b/doc/v2/howto/capi/index_cn.rst deleted file mode 100644 index 7f100717983f5e950b801e6b05ee48bfff273c62..0000000000000000000000000000000000000000 --- a/doc/v2/howto/capi/index_cn.rst +++ /dev/null @@ -1,26 +0,0 @@ -C-API预测库 -================== - -当我们训练完一个神经网络模型之后,下一步就是用模型来做预测。预测就是准备输入数据,经过模型处理之后,得到预测结果的过程。 - -相比于模型训练,预测有如下特点: - -#. 预测不需要训练过程中反向传播和参数更新的部分。 -#. 预测不需要标签(label)。 -#. 预测很多时候需要和用户系统整合在一起。 - -因为上述特点,模型预测SDK需要单独设计,并具备以下特点: - -#. 预测SDK不包含反向传播和参数更新部分,以减小SDK的体积。 -#. 预测SDK需要提供一个简洁的用户接口,方便使用。 -#. 因为输入数据可能有多种结构,对输入数据的格式做清晰简洁的封装。 -#. 为了和用户系统兼容,SDK的接口需要是满足C标准的接口。 - -PaddlePaddle提供了C-API,用于解决上述问题。关于C-API的使用,我们提供了如下指南: - -.. toctree:: - :maxdepth: 1 - - compile_paddle_lib_cn.md - organization_of_the_inputs_cn.md - workflow_of_capi_cn.md diff --git a/doc/v2/howto/capi/index_en.rst b/doc/v2/howto/capi/index_en.rst deleted file mode 100644 index 4ec39c9d5223442cf6872edaf7befeb5053b538e..0000000000000000000000000000000000000000 --- a/doc/v2/howto/capi/index_en.rst +++ /dev/null @@ -1,26 +0,0 @@ -C-API Inference Library -======================== - -After we train a neural network, we use it to do inference. Inference is the process of preparing input data and propagating it through the model to produce the result. - -Compared with model training, prediction has the following features: - -#. Inference does not require backpropagation and parameter updates, as required during training. -#. Labels are not needed in prediction. -#. Most of the time, predictions need to be integrated with the user system. - -Therefore, the model prediction SDK needs to be designed separately and has the following features: - -#. The predictive SDK does not include backpropagation and parameter updates to reduce the size of the SDK. -#. The predictive SDK needs a simple user interface for ease of use. -#. Since the input data may have a variety of structures, the format of the input data is clearly and compactly packaged. -#. In order to be compatible with user's system, the SDK's interface must conform to the C-standard interface. - -PaddlePaddle provides C-API to solve the above problem. Following are the guidelines to use the C-API: - -.. toctree:: - :maxdepth: 1 - - compile_paddle_lib_en.md - organization_of_the_inputs_en.md - workflow_of_capi_en.md diff --git a/doc/v2/howto/capi/organization_of_the_inputs_cn.md b/doc/v2/howto/capi/organization_of_the_inputs_cn.md deleted file mode 100644 index 343526c213110cb9c6abaf9a12b3d634ad3fabe9..0000000000000000000000000000000000000000 --- a/doc/v2/howto/capi/organization_of_the_inputs_cn.md +++ /dev/null @@ -1,289 +0,0 @@ -## 输入/输出数据组织 - -这篇文档介绍在使用 PaddlePaddle C-API 时如何组织输入数据,以及如何解析神经网络前向计算的输出结果。 - -### 输入/输出数据类型 -在C-API中,按照基本数据类型在PaddlePaddle内部的定义和实现,输入数据可分为: - -1. 一维整型数组 -1. 二维浮点型矩阵 - - - 稠密矩阵 - - 稀疏矩阵 - -说明: - -1. 一维数组**仅支持整型值**; - - 常用于自然语言处理任务,例如:表示词语在词典中的序号; - - 分类任务中类别标签; -1. 逻辑上高于二维的数据(例如含有多个通道的图片,视频等)在程序实现中都会转化为二维矩阵,转化方法在相应的领域都有通用解决方案,需要使用者自己了解并完成转化; -1. 二维矩阵可以表示行向量和列向量,任何时候如果需要浮点型数组(向量),都应使用C-API中的矩阵来表示,而不是C-API中的一维数组。 -1. 不论是一维整型数组还是二维浮点数矩阵,**为它们附加上序列信息将变成序列输入。PaddlePaddle 会通过判数据是否附带有序列信息来判断一个向量/矩阵是否是一个序列**。当非序列输入时,无需关心和处理序列信息。关于什么是“序列信息”,下文会详细进行介绍。 - -### 基本使用概念 - -- 在PaddlePaddle内部,神经网络中一个计算层的输入/输出被组织为一个 `Argument` 结构体,如果神经网络有多个输入或者多个输出,每一个输入/输出都会对应有自己的`Argument`。 -- `Argument` 并不真正“存储”数据,而是将输入/输出信息有机地组织在一起。 -- 在`Argument`内部由`IVector`(对应着上文提到的一维整型数组)和`Matrix`(对应着上文提到的二维浮点型矩阵)来实际存储数据;由 `Sequence Start Positions` (下文详细解释) 来描述输入/输出的序列信息。 - -- **注**: - 1. 这篇文档之后部分将会统一使用`argument`来特指PaddlePaddle中神经网络计算层一个输入/输出数据。 - 1. 使用`paddle_ivector`来特指PaddlePaddle中的一维整型数组。 - 1. 使用`paddle_matrix`来特指PaddlePaddle中的二维浮点型矩阵。 - -### 组织输入数据 -- 一维整型数组 - - 概念上可以将`paddle_ivector`理解为一个一维的整型数组,通常用于表示离散的类别标签,或是在自然语言处理任务中表示词语在字典中的序号。下面的代码片段创建了含有三个元素`1`、`2`、`3`的`paddle_ivector`。 - ```c - int ids[] = {1, 2, 3}; - paddle_ivector ids_array = - paddle_ivector_create(ids, sizeof(ids) / sizeof(int), false, false); - CHECK(paddle_arguments_set_ids(in_args, 0, ids_array)); - ``` - -- **稠密矩阵** - - 一个`m×n`的稠密矩阵是一个由`m`行`n`列元素排列成的矩形阵列,矩阵里的元素是浮点数。对神经网络来说,矩阵的高度`m`是一次预测接受的样本数目,宽度$n$是神经网络定义时,`paddle.layer.data`的`size`。 - - 下面的代码片段创建了一个高度为1,宽度为`layer_size`的稠密矩阵,矩阵中每个元素的值随机生成。 - - ```c - paddle_matrix mat = paddle_matrix_create( - /* height = batch size */ 1, - /* width = dimensionality of the data layer */ layer_size, - /* whether to use GPU */ false); - - paddle_real* array; - // Get the pointer pointing to the start address of the first row of the - // created matrix. - CHECK(paddle_matrix_get_row(mat, 0, &array)); - - // Fill the matrix with a randomly generated test sample. - srand(time(0)); - for (int i = 0; i < layer_size; ++i) { - array[i] = rand() / ((float)RAND_MAX); - } - - // Assign the matrix to the argument. - CHECK(paddle_arguments_set_value(in_args, 0, mat)); - ``` - -- **稀疏矩阵** - - PaddlePaddle C-API 中 稀疏矩阵使用[CSR(Compressed Sparse Row Format)](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_(CSR,_CRS_or_Yale_format))格式存储。下图是CSR存储稀疏矩阵的示意图。 -

-
图1. 稀疏矩阵存储示意图 -

- - CSR存储格式通过:(1)非零元素的值(上图中的`values`);(2)行偏移(上图中的`row offsets`):每一行元素在`values`中的起始偏移,`row offsets`中元素个数总是等于行数 + 1;(3)非零元素的列号(上图中的`column indices`)来确定稀疏矩阵的内容。 - - 在PaddlePaddle C-API中,通过调用以下接口创建稀疏矩阵: - - ```c - PD_API paddle_matrix paddle_matrix_create_sparse( - uint64_t height, uint64_t width, uint64_t nnz, bool isBinary, bool useGpu); - ``` - - 1. 创建稀疏矩阵时需要显示地指定矩阵的(1)高度(`height`,在神经网络中等于一次预测处理的样本数)(2)宽度(`width`,`paddle.layer.data`的`size`)以及(3)非零元个数(`nnz`)。 - 1. 当上述接口第4个参数`isBinary`指定为`true`时,**只需要设置行偏移(`row_offset`)和列号(`colum indices`),不需要提供元素值(`values`)**,这时行偏移和列号指定的元素默认其值为1。 - - 下面的代码片段创建了一个CPU上的二值稀疏矩阵: - - ```c - paddle_matrix mat = paddle_matrix_create_sparse(1, layer_size, nnz, true, false); - int colIndices[] = {9, 93, 109}; // layer_size here is greater than 109. - int rowOffset[] = {0, sizeof(colIndices) / sizeof(int)}; - - CHECK(paddle_matrix_sparse_copy_from(mat, - rowOffset, - sizeof(rowOffset) / sizeof(int), - colIndices, - (colIndices) / sizeof(int), - NULL /*values array is NULL.*/, - 0 /*size of the value arrary is 0.*/)); - CHECK(paddle_arguments_set_value(in_args, 0, mat)); - ``` - 下面的代码片段在创建了一个CPU上的带元素值的稀疏矩阵: - ```c - paddle_matrix mat = paddle_matrix_create_sparse(1, layer_size, nnz, false, false); - int colIndices[] = {9, 93, 109}; // layer_size here is greater than 109. - int rowOffset[] = {0, sizeof(colIndices) / sizeof(int)}; - float values[] = {0.5, 0.5, 0.5}; - - CHECK(paddle_matrix_sparse_copy_from(mat, - rowOffset, - sizeof(rowOffset) / sizeof(int), - colIndices, - sizeof(colIndices) / sizeof(int), - values, - sizeof(values) / sizeof(float))); - ``` - 注意事项: - 1. 移动端预测**不支持**稀疏矩阵及相关的接口。 - -### 组织序列信息 - -多个排成一列的元素(可以是整型、浮点数、浮点数向量等)构成一个序列,元素之间的顺序是序列所携带的重要信息。不同序列可能会含有不同数目个元素。在 PaddlePaddle 中,序列输入/输出数据是在上文介绍的**数据输入(一维整型数组,二维浮点数矩阵)基础上,附加上序列信息**。下面详细解释什么是“序列信息”。 - -我们将神经网络一次计算接受的所有输入样本称之为一个`batch`(可以含有一条或多条样本),每一个序列在整个`batch`中的偏移,就是PaddlePaddle中所指的**序列信息**,称之为“sequence start positions”。PaddlePaddle 支持两种序列类型: - -1. 单层序列 - - 序列中的每一个元素是非序列,是进行计算的基本单位,不可再进行拆分。 - - 例如:自然语言中的句子是一个序列,序列中的元素是词语; -1. 双层序列 - - 序列中的每一个元素又是一个序列。 - - 例如:自然语言中的段落是一个双层序列;段落是由句子构成的序列;句子是由词语构成的序列。 - - 双层序列在处理长序列的任务或是构建层级模型时会发挥作用。 - -这篇文档之后部分会统一使用`sequence_start_positions`来特指:PaddlePaddle中神经网络计算层输入/输出所携带的序列信息。 - -对双层序列来讲,不仅要提供每一个外层序列在整个`batch`中的偏移,每一个外层序列又含有若干个内层序列,需要同时提供每一个内层序列在整个`batch`中的偏移。也就是说:**双层序列需要设置分别为外层序列和内层序列分别设置`sequence_start_positions`信息**。 - -**注:** -1. 不论序列中的元素在内存中占用多少实际存储空间,`sequence_start_positions`表示的偏移是以“序列中的一个元素”作为统计的基本单位,而不是相对`batch`起始存储地址以数据的存储大小为单位的偏移。 -1. 非序列输入不携带`sequence_start_positions`,非序列输入无需构造`sequence_start_positions`。 -1. **不论是单层序列还是双层序列的序列信息,都使用`paddle_ivector`(也就是PaddlePaddle中的一维整型数组)来存储。** - -图2 是PaddlePaddle中单层序列和双层序列存储示意图。 -

-
图2. 序列输入示意图 -

- -- 单层序列 - - 图2 (a) 展示了一个含有4个序列的`batch`输入: - 1. 4个序列的长度分别为:5、3、2、4; - 1. 这时的`sequence_start_positions`为:`[0, 5, 8, 10, 14]`; - 1. 本地训练. 不论数据域是`paddle_ivector`类型还是`paddle_matrix`类型,都可以通过调用下面的接口为原有的数据输入附加上序列信息,使之变为一个单层序列输入,代码片段如下: - - ```c - int seq_pos_array[] = {0, 5, 8, 10, 14}; - paddle_ivector seq_pos = paddle_ivector_create( - seq_pos_array, sizeof(seq_pos_array) / sizeof(int), false, false); - // Suppose the network only has one input data layer. - CHECK(paddle_arguments_set_sequence_start_pos(in_args, 0, 0, seq_pos)); - ``` - -- 双层序列 - - 图2 (b) 展示了一个含有4个序列的`batch`输入; - 1. 4个序列的长度分别为:5、3、2、4;这四个序列又分别含有3、2、1、2个子序列; - 1. 这时的需要同时提供: - - 外层序列在`batch`中的起始偏移`:[0, 5, 8, 10, 14]`; - - 内层序列在`batch`中的起始偏移:`[0, 2, 3, 5, 7, 8, 10, 13, 14]`; - 1. 不论数据域是`paddle_ivector`类型还是`paddle_matrix`类型,这时需要调用创建序列信息和为`argument`设置序列信息的接口**两次**,分别为数据输入添加外层序列和内层序列的序列信息,使之变为一个双层序列输入,代码片段如下: - ```c - // set the sequence start positions for the outter sequences. - int outter_seq_pos_array[] = {0, 5, 8, 10, 14}; - paddle_ivector seq_pos = - paddle_ivector_create(outter_seq_pos_array, - sizeof(outter_pos_array) / sizeof(int), - false, - false); - // The third parameter of this API indicates the sequence level. - // 0 for the outter sequence. 1 for the inner sequence. - // If the input is a sequence not the nested sequence, the third parameter is - // fixed to be 0. - CHECK(paddle_arguments_set_sequence_start_pos(in_args, 0, 0, seq_pos)); - - // set the sequence start positions for the outter sequences. - int inner_seq_pos_array[] = {0, 2, 3, 5, 7, 8, 10, 13, 14}; - paddle_ivector seq_pos = paddle_ivector_create( - inner_pos_array, sizeof(inner_pos_array) / sizeof(int), false, false); - // The third parameter of this API indicates the sequence level. - // 0 for the outter sequence. 1 for the inner sequence. - CHECK(paddle_arguments_set_sequence_start_pos(in_args, 0, 1, seq_pos)); - ``` - -注意事项: -1. 当一个`batch`中含有多个序列,**不支持序列长度为`0`的序列(也就是空输入)** 作为输入。不同计算层对空输入的处理策略有可能不同,潜在会引起未定义行为,或者引起行时错误,请在输入时进行合法性检查。 - -### Python 端数据类型说明 - -下表列出了Python端训练接口暴露的数据类型(`paddle.layer.data`函数`type`字段的取值)对应于调用C-API需要创建的数据类型: - - - -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Python 端数据类型C-API 输入数据类型
paddle.data_type.integer_value整型数组,无需附加序列信息
paddle.data_type.dense_vector浮点型稠密矩阵,无需附加序列信息
paddle.data_type.sparse_binary_vector浮点型稀疏矩阵,无需提供非零元的值,默认为1,无需附加序列信息
paddle.data_type.sparse_vector浮点型稀疏矩阵,需提供非零元的值,无需附加序列信息
paddle.data_type.integer_value_sequence整型数组,需附加序列信息
paddle.data_type.dense_vector_sequence浮点型稠密矩阵,需附加序列信息
paddle.data_type.sparse_binary_vector_sequence浮点型稀疏矩阵,无需提供非零元的值,默认为1,需附加序列信息
paddle.data_type.sparse_vector_sequence浮点型稀疏矩阵,需提供非零元的值,需附加序列信息
paddle.data_type.integer_value_sub_sequence整型数组,需附加双层序列信息
paddle.data_type.dense_vector_sub_sequence浮点型稠密矩阵,需附加双层序列信息
paddle.data_type.sparse_binary_vector_sub_sequence浮点型稀疏矩阵,无需提供非零元的值,默认为1,需附加双层序列信息
paddle.data_type.sparse_vector_sub_sequence浮点型稀疏矩阵,需提供非零元的值,需附加双层序列信息
- -
- - -### 输出数据 - -PaddlePaddle中一个计算层的输出数据组织方式和输入数据组织方式完全相同。一个输出数据同样被组织为一个`argument`,`argument`通过`paddle_matrix`或`paddle_ivector`存数数据,如果输出是一个序列,那么会携带有`sequence_start_positions`信息。调用C-API相关接口,读取需要的结果即可。 - -### 总结 - -- 在PaddlePaddle内部,神经网络中一个计算层的输入/输出被组织为`argument`。 -- `argument`并不真正“存储”数据,而是将输入/输出信息有机地组织在一起。 -- 在`argument`内部由`paddle_ivector`(一维整型数组)和`paddle_matrix`(二维浮点型矩阵)来实际存储数据。 -如果是一个序列输入/输出由 `sequence start positions` 来记录输入/输出的序列信息。 - -于是,在组织神经网络输入时,需要思考完成以下工作: - -1. 为每一个输入/输出创建`argument`。 - - C-API 中操作`argument`的接口请查看[argument.h](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/capi/arguments.h)。 -1. 为每一个`argument`创建`paddle_matrix`或者`paddle_ivector`来存储数据。 - - C-API 中操作`paddle_ivector`的接口请查看 [vector.h](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/capi/vector.h)。 - - C-API 中操作`paddle_matrix`的接口请查看[matrix.h](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/capi/matrix.h)。 -1. 如果输入是序列数据,需要创建并填写`sequence_start_positions`信息。 - - 通过调用 [`paddle_arguments_set_sequence_start_pos`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/capi/arguments.h#L137) 来为一个`argument`添加序列信息。 - - 通过调用 [`paddle_arguments_get_sequence_start_pos`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/capi/arguments.h#L150) 来读取一个`argument`添加序列信息。 - - 接口说明请查看 [argument.h](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/capi/arguments.h) 文件。 diff --git a/doc/v2/howto/capi/organization_of_the_inputs_en.md b/doc/v2/howto/capi/organization_of_the_inputs_en.md deleted file mode 100644 index 250d3b2f749aed018e63527e817899c843dff996..0000000000000000000000000000000000000000 --- a/doc/v2/howto/capi/organization_of_the_inputs_en.md +++ /dev/null @@ -1,3 +0,0 @@ -## Input/Output Data Organization - -TBD diff --git a/doc/v2/howto/capi/workflow_of_capi_cn.md b/doc/v2/howto/capi/workflow_of_capi_cn.md deleted file mode 100644 index db1568a2afbea3cca0d4e1fe053ba9536a60ab3d..0000000000000000000000000000000000000000 --- a/doc/v2/howto/capi/workflow_of_capi_cn.md +++ /dev/null @@ -1,124 +0,0 @@ -## C-API使用流程 - -这篇文档介绍 PaddlePaddle C-API 整体使用流程。 - -### 使用流程 - -使用 C-API 的工作流程如图1所示,分为(1)准备预测模型和(2)预测程序开发两大部分。 - -

-
图1. C-API使用流程示意图 -

- -- 准备预测模型 - - 1. 只将神经网络结构进行序列化。 - - 只对神经网络结构进行序列化,加载模型需同时指定:网络结构的序列化结果和模型参数存储目录。 - 1. 将网络结构定义和训练结束存储下来的模型参数文件(多个)合并入一个文件。 - - 神经网络模型结构和训练好的模型将被序列化合并入一个文件。 - - 预测时只需加载一个文件便于发布。 - - **注意**:以上两种方式只需选择其一即可。 -- 调用 C-API 开发预测序 - - 1. 初始化PaddlePaddle运行环境。 - 1. 加载预测模型。 - 1. 创建神经网络输入,组织输入数据。 - 1. 进行前向计算,获得计算结果。 - 1. 清理和结束。 - -### 准备预测模型 - -准备预测模型部分,我们以手写数字识别任务为例进行介绍。手写数字识别任务定义了一个含有[两个隐层的简单全连接网络](https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/README.cn.md#softmax回归softmax-regression),网络接受一幅图片作为输入,将图片分类到 0 ~ 9 类别标签之一。完整代码可以查看[此目录](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/legacy/capi/examples/model_inference/dense) 中的相关脚本。 - -调用C-API开发预测程序需要一个训练好的模型,运行[MNIST手写数字识别目录](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/legacy/capi/examples/model_inference/dense)下的[mnist_v2.py](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/legacy/capi/examples/model_inference/dense/mnist_v2.py)脚本,在终端执行`python mnist_v2.py`,会使用 PaddlePaddle 内置的 [MNIST 数据集](http://yann.lecun.com/exdb/mnist/)进行训练。训练好的模型默认保存在当前运行目录下的`models`目录中。 - -下面,我们将训练结束后存储下来的模型转换成预测模型。 - -1. 序列化神经网络模型配置 - - PaddlePaddle 使用 protobuf 来传输网络配置文件中定义的网络结构和相关参数,使用 C-API 进行预测时,需要将网络结构使用 protobuf 进行序列化,写入文件中。 - - 调用[`paddle.utils.dump_v2_config`](https://github.com/PaddlePaddle/Paddle/tree/develop/python/paddle/utils/dump_v2_config.py)中的`dump_v2_config`函数能够将使用 PaddlePaddle V2 API 定义的神经网络结构 dump 到指定文件中,示例代码如下: - - ```python - from paddle.utils.dump_v2_config import dump_v2_config - from mnist_v2 import network - - predict = network(is_infer=True) - dump_v2_config(predict, "trainer_config.bin", True) - ``` - - 对[手写数字识别](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/legacy/capi/examples/model_inference/dense)这个示例,[`mnist_v2.py`](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/legacy/capi/examples/model_inference/dense/mnist_v2.py)脚本集成了序列化神经网络结构的过程,可以直接运行 `python mnist_v2.py --task dump_config` 对神经网络结构进行序列化,结果会写入当前运行目录下的`trainer_config.bin`文件中。 - - 使用这种方式,需要**在运行时将神经网络的多个可学习参数放在同一个目录中**,C-API可以通过分别指定序列化后的网络结构文件和参数目录来加载训练好的模型。 - -2. 合并模型文件(可选) - - 一些情况为了便于发布,希望能够将序列化后的神经网络结构和训练好的模型参数打包进一个文件。对于这样的需求,可以使用`paddle.utils.merge_model`中的`merge_v2_model`接口对神经网络结构和训练好的参数进行序列化,将序列化结果写入一个文件内。 - - 代码示例如下: - - ```python - from paddle.utils.merge_model import merge_v2_model - from mnist_v2 import network - - net = network(is_infer=True) - param_file = "models/params_pass_4.tar" - output_file = "output.paddle.model" - merge_v2_model(net, param_file, output_file) - ``` - - 对[手写数字识别](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/legacy/capi/examples/model_inference/dense)这个示例,可直接运行 `python` [merge_v2_model.py](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/legacy/capi/examples/model_inference/dense/merge_v2_model.py)。序列化结果会写入当前运行目录下的`output.paddle.model`文件中。使用这种方式,运行时C-API可以通过指定`output.paddle.model`文件的路径来加载预测模型。 - -#### 注意事项 -1. 为使用C-API,在调用`dump_v2_config`序列化神经网络结构时,参数`binary`必须指定为`True`。 -1. **预测使用的网络结构往往不同于训练**,通常需要去掉网络中的:(1)类别标签层;(2)损失函数层;(3)`evaluator`等,只留下核心计算层,请注意是否需要修改网络结构。 -1. 预测时,可以获取网络中定义的任意多个(大于等于一个)层前向计算的结果,需要哪些层的计算结果作为输出,就将这些层加入一个Python list中,作为调用`dump_v2_config`的第一个参数。 - -### 编写预测代码 - -预测代码更多详细示例代码请参考[C-API使用示例](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/legacy/capi/examples/model_inference) 目录下的代码示例。这一节对图1中预测代码编写的5个步骤进行介绍和说明。 - -#### step 1. 初始化PaddlePaddle运行环境 -第一步需调用[`paddle_init`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/legacy/capi/main.h#L27) 初始化PaddlePaddle运行环境,该接口接受两个参数:参数的个数和参数列表。 - -#### step2. 加载模型 - -这里介绍C-API使用中的一个重要概念:Gradient Machine。 - -概念上,在 PaddlePaddle 内部,一个GradientMachine类的对象管理着一组计算层(PaddlePaddle Layers)来完成前向和反向计算,并处理与之相关的所有细节。在调用C-API预测时,只需进行前向计算而无需调用反向计算。这篇文档之后部分会使用`gradient machine`来特指调用PaddlePaddle C-API创建的GradientMachine类的对象。每一个 `gradient machine` 都会管理维护一份训练好的模型,下面是C-API提供的,两种常用的模型加载方式: - -1. 调用[`paddle_gradient_machine_load_parameter_from_disk`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/legacy/capi/gradient_machine.h#L61)接口,从磁盘加载预测模型。这时`gradient machine`会独立拥有一份训练好的模型; -1. 调用[`paddle_gradient_machine_create_shared_param`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/legacy/capi/gradient_machine.h#L88)接口,与其它`gradient machine`的共享已经加载的预测模型。这种情况多出现在使用多线程预测时,通过多个线程共享同一个模型来减少内存开销。可参考[此示例](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/legacy/capi/examples/model_inference/multi_thread/main.c)。 - -- 注意事项 - - 1. 使用PaddlePaddle V2 API训练,模型中所有可学习参数会被存为一个压缩文件,需要手动进行解压,将它们放在同一目录中,C-API不会直接加载 V2 API 存储的压缩文件。 - 1. 如果使用`merge model`方式将神经网络结构和训练好的参数序列化到一个文件,请参考此[示例](https://github.com/PaddlePaddle/Mobile/blob/develop/Demo/linux/paddle_image_recognizer.cpp#L59)。 - 1. 通过灵活使用以上两个接口,加载模型可其它多种方式,例如也可在程序运行过程中再加载另外一个模型。 - -#### step 3. 创建神经网络输入,组织输入数据 - -基本使用概念: -- 在PaddlePaddle内部,神经网络中一个计算层的输入输出被组织为一个 `Argument` 结构体,如果神经网络有多个输入或者多个输出,每一个输入/输出都会对应有自己的`Argument`。 -- `Argument` 并不真正“存储”数据,而是将输入/输出数据有机地组织在一起。 -- 在`Argument`内部由:1. `Matrix`(二维矩阵,存储浮点类型输入/输出);2. `IVector`(一维数组,**仅用于存储整型值**,多用于自然语言处理任务)来实际存储数据。 - -C-API支持的所有输入数据类型和他们的组织方式,请参考“输入/输出数据组织”一节。 - -这篇文档的之后部分会使用`argument`来特指PaddlePaddle C-API中神经网络的一个输入/输出,使用`paddle_matrix`**特指**`argument`中用于存储数据的`Matrix`类的对象。 - -在组织神经网络输入,获取输出时,需要思考完成以下工作: - -1. 为每一个输入/输出创建`argument`; -1. 为每一个`argument`创建`paddle_matrix`来存储数据; - -与输入不同的是,不需在使用C-API时为输出`argument`的`paddle_matrix`对象分配空间。前向计算之后PaddlePaddle内部已经分配/管理了每个计算层输出的存储空间。 - -#### step 4. 前向计算 - -完成上述准备之后,通过调用 [`paddle_gradient_machine_forward`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/legacy/capi/gradient_machine.h#L73) 接口完成神经网络的前向计算。 - -#### step 5. 清理 - -结束预测之后,对使用的中间变量和资源进行清理和释放。 diff --git a/doc/v2/howto/capi/workflow_of_capi_en.md b/doc/v2/howto/capi/workflow_of_capi_en.md deleted file mode 100644 index 1692ecd56520675f02ad25ef73761330ebd0e740..0000000000000000000000000000000000000000 --- a/doc/v2/howto/capi/workflow_of_capi_en.md +++ /dev/null @@ -1,3 +0,0 @@ -## C-API Workflow - -TBD diff --git a/doc/v2/howto/cluster/cmd_argument_cn.md b/doc/v2/howto/cluster/cmd_argument_cn.md deleted file mode 100644 index c0ba093cbf2eac5c3b60a0b071b31776a11998f3..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/cmd_argument_cn.md +++ /dev/null @@ -1,167 +0,0 @@ -# 启动参数说明 - -下面以`doc/howto/cluster/src/word2vec`中的代码作为实例,介绍使用PaddlePaddle v2 API完成分布式训练。 - -## 启动参数服务器 - -执行以下的命令启动一个参数服务器并等待和计算节点的数据交互 - -```bash -$ paddle pserver --port=7164 --ports_num=1 --ports_num_for_sparse=1 --num_gradient_servers=1 -``` - -如果希望可以在后台运行pserver程序,并保存输出到一个日志文件,可以运行: - -```bash -$ stdbuf -oL /usr/bin/nohup paddle pserver --port=7164 --ports_num=1 --ports_num_for_sparse=1 --num_gradient_servers=1 &> pserver.log -``` - -参数说明 - -- port:**必选,默认7164**,pserver监听的起始端口,根据ports_num决定总端口个数,从起始端口监听多个端口用于通信 -- ports_num:**必选,默认1**,监听的端口个数 -- ports_num_for_sparse:**必选,默认0**,用于稀疏类型参数通信的端口个数 -- num_gradient_servers:**必选,默认1**,当前训练任务pserver总数 - -## 启动计算节点 - -执行以下命令启动使用python编写的trainer程序(文件名为任意文件名,如train.py) - -```bash -$ python train.py -``` - -trainer需要和pserver保持网络联通以完成训练。trainer启动需要传入端口、pserver地址等参数使trainer可以正确连接到pserver。这些参数可以通过[环境变量](https://zh.wikipedia.org/wiki/环境变量)或编写程序时`paddle.init()`中传入参数。如果同时使用`paddle.init()`参数和环境变量,将会优先使用`paddle.init()`中传入的参数。 - -使用环境变量: - -```bash -export PADDLE_INIT_USE_GPU=False -export PADDLE_INIT_TRAINER_COUNT=1 -export PADDLE_INIT_PORT=7164 -export PADDLE_INIT_PORTS_NUM=1 -export PADDLE_INIT_PORTS_NUM_FOR_SPARSE=1 -export PADDLE_INIT_NUM_GRADIENT_SERVERS=1 -export PADDLE_INIT_TRAINER_ID=0 -export PADDLE_INIT_PSERVERS=127.0.0.1 -``` - -使用参数: - -```python -paddle.init( - use_gpu=False, - trainer_count=1, - port=7164, - ports_num=1, - ports_num_for_sparse=1, - num_gradient_servers=1, - trainer_id=0, - pservers="127.0.0.1") -``` - -参数说明 - -- use_gpu: **可选,默认False**,是否启用GPU训练 -- trainer_count:**必选,默认1**,当前trainer的线程数目 -- port:**必选,默认7164**,连接到pserver的端口 -- ports_num:**必选,默认1**,连接到pserver的端口个数 -- ports_num_for_sparse:**必选,默认0**,和pserver之间用于稀疏类型参数通信的端口个数 -- num_gradient_servers:**必选,默认1**,当前训练任务trainer总数 -- trainer_id:**必选,默认0**,每个trainer的唯一ID,从0开始的整数 -- pservers:**必选,默认127.0.0.1**,当前训练任务启动的pserver的IP列表,多个IP使用“,”隔开 - -```python -trainer = paddle.trainer.SGD(..., is_local=False) -``` - -参数说明 - -- is_local: **必选, 默认True**, 是否使用PServer更新参数 - -## 准备数据集 - -参考样例数据准备脚本[prepare.py](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/cluster/src/word2vec/prepare.py),准备训练数据和验证数据集,我们使用paddle.dataset.imikolov数据集,并根据分布式训练并发数(trainer节点个数),在`prepare.py`开头部分指定`SPLIT_COUNT`将数据切分成多份。 - -在线上系统中,通常会使用MapReduce任务的输出结果作为训练结果,这样训练文件的个数会比较多,而且个数并不确定。在trainer中可以使用下面取模的方法为每个trainer分配训练数据文件: - -```python -import os -train_list = [] -flist = os.listdir("/train_data/") -for f in flist: - suffix = int(f.split("-")[1]) - if suffix % TRAINER_COUNT == TRAINER_ID: - train_list.append(f) -``` - -示例程序`prepare.py`会把训练集和测试集分别分割成多个文件(例子中为3个,后缀为`-00000`、`-00001`和`-00002`): - -```bash -train.txt -train.txt-00000 -train.txt-00001 -train.txt-00002 -test.txt -test.txt-00000 -test.txt-00001 -test.txt-00002 -``` - -在进行分布式训练时,每个trainer进程需要能够读取属于自己的一份数据。在一些分布式系统中,系统会提供一个分布式存储服务,这样保存在分布式存储中的数据可以被集群中的每个节点读取到。如果不使用分布式存储,则需要手动拷贝属于每个trainer节点的训练数据到对应的节点上。 - -对于不同的训练任务,训练数据格式和训练程序的`reader()`会大不相同,所以开发者需要根据自己训练任务的实际场景完成训练数据的分割和`reader()`的编写。 - -## 准备训练程序 - -我们会对每个训练任务都会在每个节点上创建一个工作空间(workspace),其中包含了用户的训练程序、程序依赖、挂载或下载的训练数据分片。 - -最后,工作空间应如下所示: - -```bash -. -|-- my_lib.py -|-- word_dict.pickle -|-- train.py -|-- train_data_dir/ -| |-- train.txt-00000 -| |-- train.txt-00001 -| |-- train.txt-00002 -`-- test_data_dir/ - |-- test.txt-00000 - |-- test.txt-00001 - `-- test.txt-00002 -``` - -- `my_lib.py`:会被`train.py`调用的一些用户定义的库函数,比如PIL库等。 -- `word_dict.pickle`:在`train.py`中会使用到的字典数据文件。 -- `train.py`:训练程序,代码参考[api_train_v2_cluster.py](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/cluster/src/word2vec/api_train_v2_cluster.py)。***注意:*** 对于本样例代码,在使用不同的分布式计算平台时,您可能需要修改`train.py`开头的部分(如下),以便获得训练数据的位置和获取环境变量配置: - - ```python - cluster_train_file = "./train_data_dir/train/train.txt" - cluster_test_file = "./test_data_dir/test/test.txt" - node_id = os.getenv("OMPI_COMM_WORLD_RANK") - if not node_id: - raise EnvironmentError("must provied OMPI_COMM_WORLD_RANK") - ``` - -- `train_data_dir`:包含训练数据的目录,可以是从分布式存储挂载过来的,也可以是在任务启动前下载到本地的。 -- `test_data_dir`:包含测试数据集的目录。 - -## 异步 SGD 更新 - -我们可以通过设置 `optimize` 的参数使之支持异步SGD更新。 -例如,设置 `AdaGrad` optimize 的 `is_async` 和 `async_lagged_grad_discard_ratio` 参数: - -```python -adagrad = paddle.optimizer.AdaGrad( - is_async=True, - async_lagged_grad_discard_ratio=1.6, - learning_rate=3e-3, - regularization=paddle.optimizer.L2Regularization(8e-4)) -``` - -- `is_async`: 是否为异步SGD更新模式。 -- `async_lagged_grad_discard_ratio`: 异步SGD更新的步长控制,接收到足够的gradient( - `async_lagged_grad_discard_ratio * num_gradient_servers`)之后,后面的gradient - 将会被抛弃。 diff --git a/doc/v2/howto/cluster/cmd_argument_en.md b/doc/v2/howto/cluster/cmd_argument_en.md deleted file mode 100644 index df1381a00fa0fa129eecffe002164c489a4183aa..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/cmd_argument_en.md +++ /dev/null @@ -1,169 +0,0 @@ -# Command-line arguments - -We'll take `doc/howto/cluster/src/word2vec` as an example to introduce distributed training using PaddlePaddle v2 API. - -## Starting parameter server - -Type the below command to start a parameter server which will wait for trainers to connect: - -```bash -$ paddle pserver --port=7164 --ports_num=1 --ports_num_for_sparse=1 --num_gradient_servers=1 --nics=eth0 -``` - -If you wish to run parameter servers in background, and save a log file, you can type: - -```bash -$ stdbuf -oL /usr/bin/nohup paddle pserver --port=7164 --ports_num=1 --ports_num_for_sparse=1 --num_gradient_servers=1 --nics=eth0 &> pserver.log & -``` - -Parameter Description - -- port: **required, default 7164**, port which parameter server will listen on. If ports_num greater than 1, parameter server will listen on multiple ports for more network throughput. -- ports_num: **required, default 1**, total number of ports will listen on. -- ports_num_for_sparse: **required, default 0**, number of ports which serves sparse parameter update. -- num_gradient_servers: **required, default 1**, total number of gradient servers. -- nics: **optional, default xgbe0,xgbe1**, network device name which paramter server will listen on. - -## Starting trainer - -Type the command below to start the trainer(name the file whatever you want, like "train.py") - -```bash -$ python train.py -``` - -Trainers' network need to be connected with parameter servers' network to finish the job. Trainers need to know port and IPs to locate parameter servers. You can pass arguments to trainers through [environment variables](https://en.wikipedia.org/wiki/Environment_variable) or pass to `paddle.init()` function. Arguments passed to the `paddle.init()` function will overwrite environment variables. - -Use environment viriables: - -```bash -export PADDLE_INIT_USE_GPU=False -export PADDLE_INIT_TRAINER_COUNT=1 -export PADDLE_INIT_PORT=7164 -export PADDLE_INIT_PORTS_NUM=1 -export PADDLE_INIT_PORTS_NUM_FOR_SPARSE=1 -export PADDLE_INIT_NUM_GRADIENT_SERVERS=1 -export PADDLE_INIT_TRAINER_ID=0 -export PADDLE_INIT_PSERVERS=127.0.0.1 -python train.py -``` - -Pass arguments: - -```python -paddle.init( - use_gpu=False, - trainer_count=1, - port=7164, - ports_num=1, - ports_num_for_sparse=1, - num_gradient_servers=1, - trainer_id=0, - pservers="127.0.0.1") -``` - -Parameter Description - -- use_gpu: **optional, default False**, set to "True" to enable GPU training. -- trainer_count: **required, default 1**, number of threads in current trainer. -- port: **required, default 7164**, port to connect to parameter server. -- ports_num: **required, default 1**, number of ports for communication. -- ports_num_for_sparse: **required, default 0**, number of ports for sparse type caculation. -- num_gradient_servers: **required, default 1**, number of trainers in current job. -- trainer_id: **required, default 0**, ID for every trainer, start from 0. -- pservers: **required, default 127.0.0.1**, list of IPs of parameter servers, separated by ",". - -```python -trainer = paddle.trainer.SGD(..., is_local=False) -``` - -Parameter Description - -- is_local: **required, default True**, whether update parameters by PServer. - -## Prepare Training Dataset - -Here's some example code [prepare.py](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/cluster/src/word2vec/prepare.py), it will download public `imikolov` dataset and split it into multiple files according to job parallelism(trainers count). Modify `SPLIT_COUNT` at the begining of `prepare.py` to change the count of output files. - -In the real world, we often use `MapReduce` job's output as training data, so there will be lots of files. You can use `mod` to assign training file to trainers: - -```python -import os -train_list = [] -flist = os.listdir("/train_data/") -for f in flist: - suffix = int(f.split("-")[1]) - if suffix % TRAINER_COUNT == TRAINER_ID: - train_list.append(f) -``` - -Example code `prepare.py` will split training data and testing data into 3 files with digital suffix like `-00000`, `-00001` and`-00002`: - -```bash -train.txt -train.txt-00000 -train.txt-00001 -train.txt-00002 -test.txt -test.txt-00000 -test.txt-00001 -test.txt-00002 -``` - -When job started, every trainer needs to get it's own part of data. In some distributed systems a storage service will be provided, so the date under that path can be accessed by all the trainer nodes. Without the storage service, you must copy the training data to each trainer node. - -Different training jobs may have different data format and `reader()` function, developers may need to write different data prepare scripts and `reader()` functions for their job. - -## Prepare Training program - -We'll create a *workspace* directory on each node, storing your training program, dependencies, mounted or downloaded dataset directory. - -Your workspace may looks like: - -```bash -. -|-- my_lib.py -|-- word_dict.pickle -|-- train.py -|-- train_data_dir/ -| |-- train.txt-00000 -| |-- train.txt-00001 -| |-- train.txt-00002 -`-- test_data_dir/ - |-- test.txt-00000 - |-- test.txt-00001 - `-- test.txt-00002 -``` - -- `my_lib.py`: user defined libraries, like PIL libs. This is optional. -- `word_dict.pickle`: dict file for training word embeding. -- `train.py`: training program. Sample code: [api_train_v2_cluster.py](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/cluster/src/word2vec/api_train_v2_cluster.py). ***NOTE:*** You may need to modify the head part of `train.py` when using different cluster platform to retrive configuration environment variables: - - ```python - cluster_train_file = "./train_data_dir/train/train.txt" - cluster_test_file = "./test_data_dir/test/test.txt" - node_id = os.getenv("OMPI_COMM_WORLD_RANK") - if not node_id: - raise EnvironmentError("must provied OMPI_COMM_WORLD_RANK") - ``` - -- `train_data_dir`: containing training data. Mount from storage service or copy trainning data to here. -- `test_data_dir`: containing testing data. - -## Async SGD Update - -We can set some parameters of the optimizer to make it support async SGD update. -For example, we can set the `is_async` and `async_lagged_grad_discard_ratio` of the `AdaGrad` optimizer: - -```python -adagrad = paddle.optimizer.AdaGrad( - is_async=True, - async_lagged_grad_discard_ratio=1.6, - learning_rate=3e-3, - regularization=paddle.optimizer.L2Regularization(8e-4)) -``` - -- `is_async`: Is Async-SGD or not. -- `async_lagged_grad_discard_ratio`: For async SGD gradient commit control. - when `async_lagged_grad_discard_ratio * num_gradient_servers` commit passed, - current async gradient will be discard silently. diff --git a/doc/v2/howto/cluster/index_cn.rst b/doc/v2/howto/cluster/index_cn.rst deleted file mode 100644 index 2583457c54116b7a1d797d4f7b7c2c4789c6d882..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/index_cn.rst +++ /dev/null @@ -1,36 +0,0 @@ -分布式训练 -========== - -深度学习模型的效果好坏与数据量的大小往往有直接的关系:相同的模型,在增大训练数据集后一般都能取得更好的效果。但是当数据量增大到一定程度后,单台计算机已经难以承受。这时,使用多台计算机进行分布式训练就是一个很自然的解决方案。在分布式训练中,训练数据被分割为多份,参与训练的多台机器分别读取自己的数据进行训练,并协同对整体模型的参数进行更新。 - -分布式训练一般有着如下图所示的架构: - -.. image:: src/ps_cn.png - :width: 500 - -- 数据分片(Data shard): 用于训练神经网络的数据,被切分成多个部分,每个部分分别给每个trainer使用。 -- 计算节点(Trainer): 每个trainer启动后读取切分好的一部分数据,开始神经网络的“前馈”和“后馈”计算,并和参数服务器通信。在完成一定量数据的训练后,上传计算得出的梯度(gradients),然后下载优化更新后的神经网络参数(parameters)。 -- 参数服务器(Parameter server):每个参数服务器只保存整个神经网络所有参数的一部分。参数服务器接收从计算节点上传的梯度,并完成参数优化更新,再将更新后的参数下发到每个计算节点。 - -通过计算节点和参数服务器的分布式协作,可以完成神经网络的同步随机梯度下降(SGD)方法的训练。PaddlePaddle同时支持同步随机梯度下降(SGD)和异步随机梯度下降(ASGD)。 - -在开始集群训练之前,需要先进行集群配置、PaddlePaddle安装等准备工作,了解如何通过这些步骤来配置分布式训练所需的基本环境: - -.. toctree:: - :maxdepth: 1 - - preparations_cn.md - -集群训练有大量可配置的参数,例如使用的机器数量、通信端口等。了解如何通过设置启动参数的方式,对分布式训练的过程进行配置: - -.. toctree:: - :maxdepth: 1 - - cmd_argument_cn.md - -PaddlePaddle可以兼容各种不同的集群。每种集群各有优势,使用的具体方式也有区别: - -.. toctree:: - :maxdepth: 1 - - multi_cluster/index_cn.rst diff --git a/doc/v2/howto/cluster/index_en.rst b/doc/v2/howto/cluster/index_en.rst deleted file mode 100644 index 31eda57c4fb3947d92df45ea8dbb9274c9814140..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/index_en.rst +++ /dev/null @@ -1,38 +0,0 @@ -Distributed Training -==================== - -The effectiveness of the deep learning model is often directly related to the scale of the data: it can generally achieve better results after increasing the size of the dataset on the same model. However, it can not fit in one single computer when the amount of data increases to a certain extent. At this point, using multiple computers for distributed training is a natural solution. In distributed training, the training data is divided into multiple copies (sharding), and multiple machines participating in the training read their own data for training and collaboratively update the parameters of the overall model. - -Distributed training generally has framwork as shown below: - -.. image:: src/ps_en.png - :width: 500 - -- Data shard: training data will be split into multiple partitions, trainers use the partitions of the whole dataset to do the training job. -- Trainer: each trainer reads the data shard, and train the neural network. Then the trainer will upload calculated "gradients" to parameter servers, and wait for parameters to be optimized on the parameter server side. When that finishes, the trainer download optimized parameters and continues its training. -- Parameter server: every parameter server stores part of the whole neural network model data. They will do optimization calculations when gradients are uploaded from trainers, and then send updated parameters to trainers. - -The training of synchronous random gradient descent for neural network can be achieved by cooperation of trainers and parameter servers. - -PaddlePaddle supports both synchronize stochastic gradient descent (SGD) and asynchronous SGD. - -Before starting the cluster training, you need to prepare the cluster configuration, PaddlePaddle installation, and other preparations. To understand how to configure the basic environment for distributed training, check the link below: - -.. toctree:: - :maxdepth: 1 - - preparations_en.md - -Cluster training has a large number of configurable parameters, such as the number of machines used, communication ports, etc. To learn how to configure the distributed training process by setting startup these parameters, check the link below: - -.. toctree:: - :maxdepth: 1 - - cmd_argument_en.md - -PaddlePaddle is compatible with a variety of different clusters. Each cluster has its own advantages, To learn how to run PaddlePaddle in different types of them, check the link below: - -.. toctree:: - :maxdepth: 1 - - multi_cluster/index_en.rst diff --git a/doc/v2/howto/cluster/multi_cluster/fabric_cn.md b/doc/v2/howto/cluster/multi_cluster/fabric_cn.md deleted file mode 100644 index 0385e401b399a51fad112e604dc56cb2f84c0a4b..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/multi_cluster/fabric_cn.md +++ /dev/null @@ -1,42 +0,0 @@ -# 使用fabric启动集群训练 - -## 准备一个Linux集群 -可以在`paddle/scripts/cluster_train_v2/fabric/docker_cluster`目录下,执行`kubectl -f ssh_servers.yaml`启动一个测试集群,并使用`kubectl get po -o wide`获得这些节点的IP地址。 - -## 启动集群作业 - -`paddle.py` 提供了自动化脚本来启动不同节点中的所有 PaddlePaddle 集群进程。默认情况下,所有命令行选项可以设置为 `paddle.py` 命令选项并且 `paddle.py` 将透明、自动地将这些选项应用到 PaddlePaddle 底层进程。 - -`paddle.py` 为方便作业启动提供了两个独特的命令选项。 - -- `job_dispatch_package` 设为本地 `workspace` 目录,它将被分发到 `conf.py` 中设置的所有节点。它有助于帮助频繁修改和访问工作区文件的用户减少负担,否则频繁的多节点工作空间部署可能会很麻烦。 -- `job_workspace` 设为已部署的工作空间目录,`paddle.py` 将跳过分发阶段直接启动所有节点的集群作业。它可以帮助减少分发延迟。 - -`cluster_train/run.sh` 提供了命令样例来运行 `doc/howto/usage/cluster/src/word2vec` 集群任务,只需用您定义的目录修改 `job_dispatch_package` 和 `job_workspace`,然后: -``` -sh run.sh -``` - -集群作业将会在几秒后启动。 - -## 终止集群作业 -`paddle.py`能获取`Ctrl + C` SIGINT 信号来自动终止它启动的所有进程。只需中断 `paddle.py` 任务来终止集群作业。如果程序崩溃你也可以手动终止。 - -## 检查集群训练结果 -详细信息请检查 $workspace/log 里的日志,每一个节点都有相同的日志结构。 - -`paddle_trainer.INFO` -提供几乎所有训练的内部输出日志,与本地训练相同。这里检验运行时间模型的收敛。 - -`paddle_pserver2.INFO` -提供 pserver 运行日志,有助于诊断分布式错误。 - -`server.log` -提供 parameter server 进程的 stderr 和 stdout。训练失败时可以检查错误日志。 - -`train.log` -提供训练过程的 stderr 和 stdout。训练失败时可以检查错误日志。 - -## 检查模型输出 -运行完成后,模型文件将被写入节点 0 的 `output` 目录中。 -工作空间中的 `nodefile` 表示当前集群作业的节点 ID。 diff --git a/doc/v2/howto/cluster/multi_cluster/fabric_en.md b/doc/v2/howto/cluster/multi_cluster/fabric_en.md deleted file mode 100644 index bac9ffe1526a06a3a23b1d8acf33a5fb74b7e50d..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/multi_cluster/fabric_en.md +++ /dev/null @@ -1,43 +0,0 @@ -# Fabric - -## Prepare a Linux cluster - -Run `kubectl -f ssh_servers.yaml` under the directory: `paddle/scripts/cluster_train_v2/fabric/docker_cluster` will launch a demo cluster. Run `kubectl get po -o wide` to get IP addresses of these nodes. - -## Launching Cluster Job -`paddle.py` provides automatical scripts to start all PaddlePaddle cluster processes in different nodes. By default, all command line options can be set as `paddle.py` command options and `paddle.py` will transparently and automatically set these options to PaddlePaddle lower level processes. - -`paddle.py`provides two distinguished command option for easy job launching. - -- `job_dispatch_package` set it with local `workspace` directory, it will be dispatched to all nodes which is set in `conf.py`. It could be helpful for frequently manipulating workspace files. otherwise, frequent multi-nodes workspace deployment is very annoying. -- `job_workspace` set it with already deployed workspace directory, `paddle.py` will skip dispatch stage to directly launch cluster job with all nodes. It could help to reduce heavy -dispatch latency. - -`cluster_train/run.sh` provides command line sample to run `demo/recommendation` cluster job, just modify `job_dispatch_package` and `job_workspace` with your defined directory, then: -``` -sh run.sh -``` - -The cluster Job will start in several seconds. - -## Kill Cluster Job -`paddle.py` can capture `Ctrl + C` SIGINT signal to automatically kill all processes launched by it. So just stop `paddle.py` to kill cluster job. You should manually kill the job if the program crashed. - -## Check Cluster Training Result -Check log in $workspace/log for details, each node owns same log structure. - -`paddle_trainer.INFO` -It provides almost all internal output log for training, same as local training. Check runtime model convergence here. - -`paddle_pserver2.INFO` -It provides parameter server running log, which could help to diagnose distributed error. - -`server.log` -It provides stderr and stdout of parameter server process. Check error log if training crashes. - -`train.log` -It provides stderr and stdout of trainer process. Check error log if training crashes. - -## Check Model Output -After one pass finished, model files will be written in `output` directory in node 0. -`nodefile` in workspace indicates the node id of current cluster job. diff --git a/doc/v2/howto/cluster/multi_cluster/index_cn.rst b/doc/v2/howto/cluster/multi_cluster/index_cn.rst deleted file mode 100644 index eabf95eda0b20f91913201a6b4e5b56fa440597e..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/multi_cluster/index_cn.rst +++ /dev/null @@ -1,35 +0,0 @@ -在不同集群中运行 -================ -用户的集群环境不尽相同,为了方便大家的部署,我们提供了多种的集群部署方式,方便提交集群训练任务,以下将一一介绍: - -`Kubernetes `_ 是Google开源的容器集群的调度框架,支持大规模集群生产环境的完整集群方案。以下指南展示了PaddlePaddle对Kubernetes的支持: - -.. toctree:: - :maxdepth: 1 - - k8s_cn.md - k8s_distributed_cn.md - -`OpenMPI `_ 是成熟的高性能并行计算框架,在HPC领域使用非常的广泛。以下指南介绍了如何使用OpenMPI来搭建PaddlePaddle的集群训练任务: - -.. toctree:: - :maxdepth: 1 - - openmpi_cn.md - -`Fabric `_ 是一个方便的程序部署和管理工具。我们提供了使用Fabric 进行部署、管理的方法,如果想详细了解,请阅读以下指南: - -.. toctree:: - :maxdepth: 1 - - fabric_cn.md - -我们也支持在AWS上部署PaddlePaddle,详细请了解: - -.. toctree:: - :maxdepth: 1 - - k8s_aws_cn.md - -您可以在 `cluster_train_v2 `_ 找到以上相关的例子。 - diff --git a/doc/v2/howto/cluster/multi_cluster/index_en.rst b/doc/v2/howto/cluster/multi_cluster/index_en.rst deleted file mode 100644 index 9bc1eb2e3796d95dd69b165e916e263ea34b87f6..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/multi_cluster/index_en.rst +++ /dev/null @@ -1,35 +0,0 @@ -Use different clusters -====================== - -The user's cluster environment is not the same. To facilitate everyone's deployment, we provide a variety of cluster deployment methods to facilitate the submission of cluster training tasks, which will be introduced as follows: - -`Kubernetes `_ is a scheduling framework of Google open source container cluster, supporting a complete cluster solution for large-scale cluster production environment. The following guidelines show PaddlePaddle's support for Kubernetes: - -.. toctree:: - :maxdepth: 1 - - k8s_en.md - k8s_distributed_en.md - -`OpenMPI `_ is a mature high-performance parallel computing framework, which is widely used in the field of HPC. The following guide describes how to use OpenMPI to build PaddlePaddle's cluster training task: - -.. toctree:: - :maxdepth: 1 - - openmpi_en.md - -`Fabric `_ is a convenient tool for program deployment and management. We provide a way to deploy and manage with Fabric. If you want to know more about it, please read the following guidelines: - -.. toctree:: - :maxdepth: 1 - - fabric_en.md - -We also support the deployment of PaddlePaddle on AWS. Learn more about: - -.. toctree:: - :maxdepth: 1 - - k8s_aws_en.md - -The examples can be found under `cluster_train_v2 `_ . diff --git a/doc/v2/howto/cluster/multi_cluster/k8s_aws_cn.md b/doc/v2/howto/cluster/multi_cluster/k8s_aws_cn.md deleted file mode 100644 index afc753aa42f19631c49a451a797f28365e65ed1d..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/multi_cluster/k8s_aws_cn.md +++ /dev/null @@ -1,672 +0,0 @@ -# Kubernetes on AWS - -我们将向你展示怎么样在AWS的Kubernetes集群上运行分布式PaddlePaddle训练,让我们从核心概念开始 - -## PaddlePaddle分布式训练的核心概念 - -### 分布式训练任务 - -一个分布式训练任务可以看做是一个Kubernetes任务 -每一个Kubernetes任务都有相应的配置文件,此配置文件指定了像任务的pod个数之类的环境变量信息 - -在分布式训练任务中,我们可以如下操作: - -1. 在分布式文件系统中,准备分块数据和配置文件(在此次教学中,我们会用到亚马逊分布式存储服务(EFS)) -2. 创建和提交一个kubernetes任务配置到集群中开始训练 - -### Parameter Server和Trainer - -在paddlepaddle集群中有两个角色:参数服务器(pserver)者和trainer, 每一个参数服务器过程都会保存一部分模型的参数。每一个trainer都保存一份完整的模型参数,并可以利用本地数据更新模型。在这个训练过程中,trainer发送模型更新到参数服务器中,参数服务器职责就是聚合这些更新,以便于trainer可以把全局模型同步到本地。 - -为了能够和pserver通信,trainer需要每一个pserver的IP地址。在Kubernetes中利用服务发现机制(比如:DNS、hostname)要比静态的IP地址要好一些,因为任何一个pod都会被杀掉然后新的pod被重启到另一个不同IP地址的node上。现在我们可以先用静态的IP地址方式,这种方式是可以更改的。 - -参数服务器和trainer一块被打包成一个docker镜像,这个镜像会运行在被Kubernetes集群调度的pod中。 - -### 训练者ID - -每一个训练过程都需要一个训练ID,以0作为基础值,作为命令行参数传递。训练过程因此用这个ID去读取数据分片。 - -### 训练 - -PaddlePaddle容器的入口是一个shell脚本,这个脚本可以读取Kubernetes内预置的环境变量。这里可以定义任务identity,在任务中identity可以用来远程访问包含所有pod的Kubernetes apiserver服务。 - -每一个pod通过ip来排序。每一个pod的序列作为“pod id”。因为我们会在每一个pod中运行训练和参数服务,可以用“pod id”作为训练ID。入口脚本详细工作流程如下: - -1. 查找apiserver得到pod信息,通过ip排序来分配一个trainer_id。 -2. 从EFS持久化卷中复制训练数据到容器中。 -3. 从环境变量中解析paddle pserver和 paddle trainer的启动参数,然后开始启动流程。 -4. 以trainer_id来训练将自动把结果写入到EFS卷中。 - - -## AWS的Kubernetes中的PaddlePaddle - -### 选择AWS服务区域 -这个教程需要多个AWS服务工作在一个区域中。在AWS创建任何东西之前,请检查链接https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/ 选择一个可以提供如下服务的区域:EC2, EFS, VPS, CloudFormation, KMS, VPC, S3。在教程中我们使用“Oregon(us-west-2)”作为例子。 - -### 创建aws账户和IAM账户 - -在每一个aws账户下可以创建多个IAM用户。允许为每一个IAM用户赋予权限,作为IAM用户可以创建/操作aws集群 - -注册aws账户,请遵循用户指南。在AWS账户下创建IAM用户和用户组,请遵循用户指南 - -请注意此教程需要如下的IAM用户权限: - -- AmazonEC2FullAccess -- AmazonS3FullAccess -- AmazonRoute53FullAccess -- AmazonRoute53DomainsFullAccess -- AmazonElasticFileSystemFullAccess -- AmazonVPCFullAccess -- IAMUserSSHKeys -- IAMFullAccess -- NetworkAdministrator -- AWSKeyManagementServicePowerUser - - -### 下载kube-aws and kubectl - -#### kube-aws - -在AWS中[kube-aws](https://github.com/coreos/kube-aws)是一个自动部署集群的CLI工具 - -##### kube-aws完整性验证 -提示:如果你用的是非官方版本(e.g RC release)的kube-aws,可以跳过这一步骤。引入coreos的应用程序签名公钥: - -``` -gpg2 --keyserver pgp.mit.edu --recv-key FC8A365E -``` - -指纹验证: - -``` -gpg2 --fingerprint FC8A365E -``` -正确的指纹是: `18AD 5014 C99E F7E3 BA5F 6CE9 50BD D3E0 FC8A 365E` - -我们可以从发布页面中下载kube-aws,教程使用0.9.1版本 [release page](https://github.com/coreos/kube-aws/releases). - -验证tar包的GPG签名: - -``` -PLATFORM=linux-amd64 - # Or -PLATFORM=darwin-amd64 - -gpg2 --verify kube-aws-${PLATFORM}.tar.gz.sig kube-aws-${PLATFORM}.tar.gz -``` -##### 安装kube-aws -解压: - -``` -tar zxvf kube-aws-${PLATFORM}.tar.gz -``` - -添加到环境变量: - -``` -mv ${PLATFORM}/kube-aws /usr/local/bin -``` - - -#### kubectl - -[kubectl](https://Kubernetes.io/docs/user-guide/kubectl-overview/) 是一个操作Kubernetes集群的命令行接口 - -利用`curl`工具从Kubernetes发布页面中下载`kubectl` - -``` -# OS X -curl -O https://storage.googleapis.com/kubernetes-release/release/"$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)"/bin/darwin/amd64/kubectl - -# Linux -curl -O https://storage.googleapis.com/kubernetes-release/release/"$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)"/bin/linux/amd64/kubectl -``` - -为了能是kubectl运行必须将之添加到环境变量中 (e.g. `/usr/local/bin`): - -``` -chmod +x ./kubectl -sudo mv ./kubectl /usr/local/bin/kubectl -``` - -### 配置AWS证书 - -首先检查这里 [this](http://docs.aws.amazon.com/cli/latest/userguide/installing.html) 安装AWS命令行工具 - -然后配置aws账户信息: - -``` -aws configure -``` - - -添加如下信息: - - -``` -AWS Access Key ID: YOUR_ACCESS_KEY_ID -AWS Secrete Access Key: YOUR_SECRETE_ACCESS_KEY -Default region name: us-west-2 -Default output format: json -``` - -`YOUR_ACCESS_KEY_ID`, and `YOUR_SECRETE_ACCESS_KEY` 是创建aws账户和IAM账户的IAM的key和密码 [Create AWS Account and IAM Account](#create-aws-account-and-iam-account) - -描述任何运行在你账户中的实例来验证凭据是否工作: - -``` -aws ec2 describe-instances -``` - -### 定义集群参数 - -#### EC2秘钥对 - -秘钥对将认证ssh访问你的EC2实例。秘钥对的公钥部分将配置到每一个COREOS节点中。 - -遵循 [EC2 Keypair User Guide](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html) Keypair用户指南来创建EC2秘钥对 - -你可以使用创建好的秘钥对名称来配置集群. - -在同一工作区中秘钥对为EC2实例唯一码。在教程中使用 us-west-2 ,所以请确认在这个区域(Oregon)中创建秘钥对。 - -在浏览器中下载一个`key-name.pem`文件用来访问EC2实例,我们待会会用到. - - -#### KMS秘钥 - -亚马逊的KMS秘钥在TLS秘钥管理服务中用来加密和解密集群。如果你已经有可用的KMS秘钥,你可以跳过创建新秘钥这一步,提供现存秘钥的ARN字符串。 - -利用aws命令行创建kms秘钥: - -``` -aws kms --region=us-west-2 create-key --description="kube-aws assets" -{ - "KeyMetadata": { - "CreationDate": 1458235139.724, - "KeyState": "Enabled", - "Arn": "arn:aws:kms:us-west-2:aaaaaaaaaaaaa:key/xxxxxxxxxxxxxxxxxxx", - "AWSAccountId": "xxxxxxxxxxxxx", - "Enabled": true, - "KeyUsage": "ENCRYPT_DECRYPT", - "KeyId": "xxxxxxxxx", - "Description": "kube-aws assets" - } -} -``` - -我们稍后用到`Arn` 的值. - -在IAM用户许可中添加多个内联策略. - -进入[IAM Console](https://console.aws.amazon.com/iam/home?region=us-west-2#/home)。点击`Users`按钮,点击刚才创建的用户,然后点击`Add inline policy`按钮,选择`Custom Policy` - -粘贴内联策略: - -``` - (Caution: node_0, node_1, node_2 directories represents PaddlePaddle node and train_id, not the Kubernetes node){ - "Version": "2012-10-17", - "Statement": [ - { - "Sid": "Stmt1482205552000", - "Effect": "Allow", - "Action": [ - "kms:Decrypt", - "kms:Encrypt" - ], - "Resource": [ - "arn:aws:kms:*:AWS_ACCOUNT_ID:key/*" - ] - }, - { - "Sid": "Stmt1482205746000", - "Effect": "Allow", - "Action": [ - "cloudformation:CreateStack", - "cloudformation:UpdateStack", - "cloudformation:DeleteStack", - "cloudformation:DescribeStacks", - "cloudformation:DescribeStackResource", - "cloudformation:GetTemplate", - "cloudformation:DescribeStackEvents" - ], - "Resource": [ - "arn:aws:cloudformation:us-west-2:AWS_ACCOUNT_ID:stack/MY_CLUSTER_NAME/*" - ] - } - ] -} -``` -`Version` : 值必须是"2012-10-17". -`AWS_ACCOUNT_ID`: 你可以从命令行中获取: - -``` -aws sts get-caller-identity --output text --query Account -``` - -`MY_CLUSTER_NAME`: 选择一个你喜欢的MY_CLUSTER_NAME,稍后会用到。 -请注意,堆栈名称必须是正则表达式:[a-zA-Z][-a-zA-Z0-9*]*, 在名称中不能有"_"或者"-",否则kube-aws在下面步骤中会抛出异常 - -#### 外部DNS名称 - -当集群被创建后,基于DNS名称控制器将会暴露安全的TLS API. - -DNS名称含有CNAME指向到集群DNS名称或者记录指向集群的IP地址。 - -我们稍后会用到DNS名称,如果没有DNS名称的话,你可以选择一个(比如:`paddle`)还可以修改`/etc/hosts`用本机的DNS名称和集群IP关联。还可以在AWS上增加一个名称服务来关联paddle集群IP,稍后步骤中会查找集群IP. - -#### S3 bucket - -在启动Kubernetes集群前需要创建一个S3 bucket - -在AWS上创建s3 bucket会有许多的bugs,所以使用[s3 console](https://console.aws.amazon.com/s3/home?region=us-west-2)。 - -链接到 `Create Bucket`,确保在us-west-2 (Oregon)上创建一个唯一的BUCKET_NAME。 - -#### 初始化assets - -在本机创建一个目录用来存放产生的assets: - -``` -$ mkdir my-cluster -$ cd my-cluster -``` - -利用KMS Arn、秘钥对名称和前一步产生的DNS名称来初始化集群的CloudFormation栈: - -``` -kube-aws init \ ---cluster-name=MY_CLUSTER_NAME \ ---external-dns-name=MY_EXTERNAL_DNS_NAME \ ---region=us-west-2 \ ---availability-zone=us-west-2a \ ---key-name=KEY_PAIR_NAME \ ---kms-key-arn="arn:aws:kms:us-west-2:xxxxxxxxxx:key/xxxxxxxxxxxxxxxxxxx" -``` - -`MY_CLUSTER_NAME`: the one you picked in [KMS key](#kms-key) - -`MY_EXTERNAL_DNS_NAME`: see [External DNS name](#external-dns-name) - -`KEY_PAIR_NAME`: see [EC2 key pair](#ec2-key-pair) - -`--kms-key-arn`: the "Arn" in [KMS key](#kms-key) - -这里的`us-west-2a`用于参数`--availability-zone`,但必须在AWS账户的有效可用区中 - -如果不能切换到其他的有效可用区(e.g., `us-west-2a`, or `us-west-2b`),请检查`us-west-2a`是支持`aws ec2 --region us-west-2 describe-availability-zones`。 - -现在在asset目录中就有了集群的主配置文件cluster.yaml。 - -默认情况下kube-aws会创建一个工作节点,修改`cluster.yaml`让`workerCount`从1个节点变成3个节点. - -#### 呈现asset目录内容 - -在这个简单的例子中,你可以使用kuber-aws生成TLS身份和证书 - -``` -kube-aws render credentials --generate-ca -``` - -下一步在asset目录中生成一组集群assets. - -``` -kube-aws render stack -``` -asserts(模板和凭证)用于创建、更新和当前目录被创建的Kubernetes集群相关联 - -### 启动Kubernetes集群 - -#### 创建一个在CloudFormation模板上定义好的实例 - -现在让我们创建集群(在命令行中选择任意的 `PREFIX`) - -``` -kube-aws up --s3-uri s3://BUCKET_NAME/PREFIX -``` - -`BUCKET_NAME`: t在[S3 bucket](#s3-bucket)上使用的bucket名称 - - -#### 配置DNS - -你可以执行命令 `kube-aws status`来查看创建后集群的API. - -``` -$ kube-aws status -Cluster Name: paddle-cluster -Controller DNS Name: paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com -``` -如果你用DNS名称,在ip上设置任何记录或是安装CNAME点到`Controller DNS Name` (`paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com`) - -##### 查询IP地址 - -用命令`dig`去检查负载均衡器的域名来获取ip地址. - -``` -$ dig paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com - -;; QUESTION SECTION: -;paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com. IN A - -;; ANSWER SECTION: -paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com. 59 IN A 54.241.164.52 -paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com. 59 IN A 54.67.102.112 -``` - -在上面的例子中,`54.241.164.52`, `54.67.102.112`这两个ip都将是工作状态 - -*如果你有DNS名称*,设置记录到ip上,然后你可以跳过“Access the cluster”这一步 - -*如果没有自己的DNS名称* - -编辑/etc/hosts文件用DNS关联IP - -##### 更新本地的DNS关联 -编辑`/etc/hosts`文件用DNS关联IP -##### 在VPC上添加route53私有名称服务 - - 打开[Route53 Console](https://console.aws.amazon.com/route53/home) - - 根据配置创建域名zone - - domain名称为: "paddle" - - Type: "Private hosted zone for amazon VPC" - - VPC ID: `` - - ![route53 zone setting](src/route53_create_zone.png) - - 添加记录 - - 点击zone中刚创建的“paddle” - - 点击按钮“Create record set” - - Name : leave blank - - type: "A" - - Value: `` - - ![route53 create recordset](src/route53_create_recordset.png) - - 检查名称服务 - - 连接通过kube-aws via ssh创建的任何实例 - - 运行命令"host paddle",看看是否ip为返回的kube-controller的私有IP - -#### 进入集群 - -集群运行后如下命令会看到: - -``` -$ kubectl --kubeconfig=kubeconfig get nodes -NAME STATUS AGE -ip-10-0-0-134.us-west-2.compute.internal Ready 6m -ip-10-0-0-238.us-west-2.compute.internal Ready 6m -ip-10-0-0-50.us-west-2.compute.internal Ready 6m -ip-10-0-0-55.us-west-2.compute.internal Ready 6m -``` - - -### 集群安装弹性文件系统 - -训练数据存放在AWS上的EFS分布式文件系统中. - -1. 在[security group console](https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#SecurityGroups:sort=groupId)为EFS创建一个安全组 - 1. 可以看到`paddle-cluster-sg-worker` (在sg-055ee37d镜像中)安全组id -
![](src/worker_security_group.png)
- - 2. 增加安全组`paddle-efs` ,以`paddle-cluster-sg-worker`的group id作为用户源和`ALL TCP`入栈规则。增加vpc `paddle-cluster-vpc`, 确保可用区是在[Initialize Assets](#initialize-assets)的时候用到的那一个. -
![](src/add_security_group.png)
- -2. 利用`paddle-cluster-vpc`私有网络在[EFS console](https://us-west-2.console.aws.amazon.com/efs/home?region=us-west-2#/wizard/1) 中创建弹性文件系统, 确定子网为`paddle-cluster-Subnet0`和安全区为`paddle-efs`. -
![](src/create_efs.png)
- - -### 开始在AWS上进行paddlepaddle的训练 - -#### 配置Kubernetes卷指向EFS - -首先需要创建一个持久卷[PersistentVolume](https://kubernetes.io/docs/user-guide/persistent-volumes/) 到EFS上 - -用 `pv.yaml`形式来保存 -``` -apiVersion: v1 -kind: PersistentVolume -metadata: - name: efsvol -spec: - capacity: - storage: 100Gi - accessModes: - - ReadWriteMany - nfs: - server: EFS_DNS_NAME - path: "/" -``` - -`EFS_DNS_NAME`: DNS名称最好能描述我们创建的`paddle-efs`,看起来像`fs-2cbf7385.efs.us-west-2.amazonaws.com` - -运行下面的命令来创建持久卷: -``` -kubectl --kubeconfig=kubeconfig create -f pv.yaml -``` -下一步创建 [PersistentVolumeClaim](https://kubernetes.io/docs/user-guide/persistent-volumes/)来声明持久卷 - -用`pvc.yaml`来保存. -``` -kind: PersistentVolumeClaim -apiVersion: v1 -metadata: - name: efsvol -spec: - accessModes: - - ReadWriteMany - resources: - requests: - storage: 50Gi -``` - -行下面命令来创建持久卷声明: -``` -kubectl --kubeconfig=kubeconfig create -f pvc.yaml -``` - -#### 准备训练数据 - -启动Kubernetes job在我们创建的持久层上进行下载、保存并均匀拆分训练数据为3份. - -用`paddle-data-job.yaml`保存 -``` -apiVersion: batch/v1 -kind: Job -metadata: - name: paddle-data -spec: - template: - metadata: - name: pi - spec: - containers: - - name: paddle-data - image: paddlepaddle/paddle-tutorial:k8s_data - imagePullPolicy: Always - volumeMounts: - - mountPath: "/efs" - name: efs - env: - - name: OUT_DIR - value: /efs/paddle-cluster-job - - name: SPLIT_COUNT - value: "3" - volumes: - - name: efs - persistentVolumeClaim: - claimName: efsvol - restartPolicy: Never -``` - -运行下面的命令来启动任务: -``` -kubectl --kubeconfig=kubeconfig create -f paddle-data-job.yaml -``` -任务运行大概需要7分钟,可以使用下面命令查看任务状态,直到`paddle-data`任务的`SUCCESSFUL`状态为`1`时成功,这里here有怎样创建镜像的源码 -``` -$ kubectl --kubeconfig=kubeconfig get jobs -NAME DESIRED SUCCESSFUL AGE -paddle-data 1 1 6m -``` -数据准备完成后的结果是以镜像`paddlepaddle/paddle-tutorial:k8s_data`存放,可以点击这里[here](src/k8s_data/README.md)查看如何创建docker镜像源码 - -#### 开始训练 - -现在可以开始运行paddle的训练任务,用`paddle-cluster-job.yaml`进行保存 -``` -apiVersion: batch/v1 -kind: Job -metadata: - name: paddle-cluster-job -spec: - parallelism: 3 - completions: 3 - template: - metadata: - name: paddle-cluster-job - spec: - volumes: - - name: efs - persistentVolumeClaim: - claimName: efsvol - containers: - - name: trainer - image: paddlepaddle/paddle-tutorial:k8s_train - command: ["bin/bash", "-c", "/root/start.sh"] - env: - - name: JOB_NAME - value: paddle-cluster-job - - name: JOB_PATH - value: /home/jobpath - - name: JOB_NAMESPACE - value: default - - name: TRAIN_CONFIG_DIR - value: quick_start - - name: CONF_PADDLE_NIC - value: eth0 - - name: CONF_PADDLE_PORT - value: "7164" - - name: CONF_PADDLE_PORTS_NUM - value: "2" - - name: CONF_PADDLE_PORTS_NUM_SPARSE - value: "2" - - name: CONF_PADDLE_GRADIENT_NUM - value: "3" - - name: TRAINER_COUNT - value: "3" - volumeMounts: - - mountPath: "/home/jobpath" - name: efs - ports: - - name: jobport0 - hostPort: 7164 - containerPort: 7164 - - name: jobport1 - hostPort: 7165 - containerPort: 7165 - - name: jobport2 - hostPort: 7166 - containerPort: 7166 - - name: jobport3 - hostPort: 7167 - containerPort: 7167 - restartPolicy: Never -``` - -`parallelism: 3, completions: 3` 意思是这个任务会同时开启3个paddlepaddle的pod,当pod启动后3个任务将被完成。 - -`env` 参数代表容器的环境变量,在这里指定paddlepaddle的参数. - -`ports` 指定TCP端口7164 - 7167和`pserver`进行连接,port从`CONF_PADDLE_PORT`(7164)到`CONF_PADDLE_PORT + CONF_PADDLE_PORTS_NUM + CONF_PADDLE_PORTS_NUM_SPARSE - 1`(7167)。我们使用多个端口密集和稀疏参数的更新来提高延迟 - -运行下面命令来启动任务. -``` -kubectl --kubeconfig=kubeconfig create -f paddle-claster-job.yaml -``` - -检查pods信息 - -``` -$ kubectl --kubeconfig=kubeconfig get pods -NAME READY STATUS RESTARTS AGE -paddle-cluster-job-cm469 1/1 Running 0 9m -paddle-cluster-job-fnt03 1/1 Running 0 9m -paddle-cluster-job-jx4xr 1/1 Running 0 9m -``` - -检查指定pod的控制台输出 -``` -kubectl --kubeconfig=kubeconfig log -f POD_NAME -``` - -`POD_NAME`: 任何一个pod的名称 (e.g., `paddle-cluster-job-cm469`). - -运行`kubectl --kubeconfig=kubeconfig describe job paddle-cluster-job`来检查训练任务的状态,将会在大约20分钟完成 - -`pserver`和`trainer`的细节都隐藏在docker镜像`paddlepaddle/paddle-tutorial:k8s_train`中,这里[here](src/k8s_train/README.md) 有创建docker镜像的源码. - -#### 检查训练输出 - -训练输出(模型快照和日志)将被保存在EFS上。我们可以用ssh登录到EC2的工作节点上,查看mount过的EFS和训练输出. - -1. ssh登录EC2工作节点 -``` -chmod 400 key-name.pem -ssh -i key-name.pem core@INSTANCE_IP -``` - -`INSTANCE_IP`: EC2上Kubernetes工作节点的公共IP地址,进入[EC2 console](https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#Instances:sort=instanceId) 中检查任何`paddle-cluster-kube-aws-worker`实例的 `public IP` - -2. 挂载EFS -``` -mkdir efs -sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 EFS_DNS_NAME:/ efs -``` - -`EFS_DNS_NAME`: DNS名称最好能描述我们创建的`paddle-efs`,看起来像`fs-2cbf7385.efs.us-west-2.amazonaws.com`. - -文件夹`efs`上有这结构相似的node信息: -``` --- paddle-cluster-job - |-- ... - |-- output - | |-- node_0 - | | |-- server.log - | | `-- train.log - | |-- node_1 - | | |-- server.log - | | `-- train.log - | |-- node_2 - | | |-- server.log - | | `-- train.log - | |-- pass-00000 - | | |-- ___fc_layer_0__.w0 - | | |-- ___fc_layer_0__.wbias - | | |-- done - | | |-- path.txt - | | `-- trainer_config.lr.py - | |-- pass-00001... -``` -`server.log` 是`pserver`的log日志,`train.log`是`trainer`的log日志,模型快照和描述存放在`pass-0000*`. - -### Kubernetes集群卸载或删除 - -#### 删除EFS - -到[EFS Console](https://us-west-2.console.aws.amazon.com/efs/home?region=us-west-2) 中删除创建的EFS卷 - -#### 删除安全组 - -去[Security Group Console](https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#SecurityGroups:sort=groupId) 删除安全组`paddle-efs`. - -#### 删除S3 bucket - -进入 [S3 Console](https://console.aws.amazon.com/s3/home?region=us-west-2#)删除S3 bucket - -#### 销毁集群 - -``` -kube-aws destroy -``` - -命令会立刻返回,但需要大约5分钟来销毁集群 - -可以进入 [CludFormation Console](https://us-west-2.console.aws.amazon.com/cloudformation/home?region=us-west-2#/stacks?filter=active)检查销毁的过程。 diff --git a/doc/v2/howto/cluster/multi_cluster/k8s_aws_en.md b/doc/v2/howto/cluster/multi_cluster/k8s_aws_en.md deleted file mode 100644 index 8e8e87be711bd45177ed77c81c531606e801d1f0..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/multi_cluster/k8s_aws_en.md +++ /dev/null @@ -1,688 +0,0 @@ -# Kubernetes on AWS - -We will show you step by step on how to run distributed PaddlePaddle training on AWS cluster with Kubernetes. Let's start from core concepts. - -## Distributed PaddlePaddle Training Core Concepts - -### Distributed Training Job - -A distributed training job is represented by a [Kubernetes job](https://kubernetes.io/docs/user-guide/jobs/#what-is-a-job). - -Each Kuberentes job is described by a job config file, which specifies the information like the number of [pods](https://kubernetes.io/docs/user-guide/pods/#what-is-a-pod) in the job and environment variables. - -In a distributed training job, we would: - -1. prepare partitioned training data and configuration file on a distributed file system (in this tutorial we use Amazon Elastic File System), and -1. create and submit the Kubernetes job config to the Kubernetes cluster to start the training job. - -### Parameter Servers and Trainers - -There are two roles in a PaddlePaddle cluster: *parameter server (pserver)* and *trainer*. Each parameter server process maintains a shard of the global model. Each trainer has its local copy of the model, and uses its local data to update the model. During the training process, trainers send model updates to parameter servers, parameter servers are responsible for aggregating these updates, so that trainers can synchronize their local copy with the global model. - -
![Model is partitioned into two shards. Managed by two parameter servers respectively.](src/pserver_and_trainer.png)
- -In order to communicate with pserver, trainer needs to know the ip address of each pserver. In kubernetes it's better to use a service discovery mechanism (e.g., DNS hostname) rather than static ip address, since any pserver's pod may be killed and a new pod could be schduled onto another node of different ip address. However, now we are using static ip. This will be improved. - -Parameter server and trainer are packaged into a same docker image. They will run once pod is scheduled by kubernetes job. - -### Trainer ID - -Each trainer process requires a trainer ID, a zero-based index value, passed in as a command-line parameter. The trainer process thus reads the data partition indexed by this ID. - -### Training - -The entry-point of a container is a shell script. It can see some environment variables pre-defined by Kubernetes. This includes one that gives the job's identity, which can be used in a remote call to the Kubernetes apiserver that lists all pods in the job. - -We rank each pod by sorting them by their ips. The rank of each pod could be the "pod ID". Because we run one trainer and one parameter server in each pod, we can use this "pod ID" as the trainer ID. A detailed workflow of the entry-point script is as follows: - -1. Query the api server to get pod information, and assign the `trainer_id` by sorting the ip. -1. Copy the training data from EFS persistent volume into container. -1. Parse the `paddle pserver` and `paddle trainer` startup parameters from environment variables, and then start up the processes. -1. Trainer with `train_id` 0 will automatically write results onto EFS volume. - - -## PaddlePaddle on AWS with Kubernetes - -### Choose AWS Service Region -This tutorial requires several AWS services work in the same region. Before we create anything in AWS, please check the following link -https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/ -Choose a region which has the following services available: EC2, EFS, VPS, CloudFormation, KMS, VPC, S3. -In this tutorial, we use "Oregon(us-west-2)" as example. - -### Create AWS Account and IAM Account - -Under each AWS account, we can create multiple [IAM](http://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html) users. This allows us to grant some privileges to each IAM user and to create/operate AWS clusters as an IAM user. - -To sign up an AWS account, please -follow -[this guide](http://docs.aws.amazon.com/lambda/latest/dg/setting-up.html). -To create IAM users and user groups under an AWS account, please -follow -[this guide](http://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html). - -Please be aware that this tutorial needs the following privileges for the user in IAM: - -- AmazonEC2FullAccess -- AmazonS3FullAccess -- AmazonRoute53FullAccess -- AmazonRoute53DomainsFullAccess -- AmazonElasticFileSystemFullAccess -- AmazonVPCFullAccess -- IAMUserSSHKeys -- IAMFullAccess -- NetworkAdministrator -- AWSKeyManagementServicePowerUser - - -### Download kube-aws and kubectl - -#### kube-aws - -[kube-aws](https://github.com/coreos/kube-aws) is a CLI tool to automate cluster deployment to AWS. -##### Verify kube-aws integrity -Note: if you are using a non-official release (e.g RC release) kube-aws, you can skip this setp. -Import the CoreOS Application Signing Public Key: - -``` -gpg2 --keyserver pgp.mit.edu --recv-key FC8A365E -``` - -Validate the key fingerprint: - -``` -gpg2 --fingerprint FC8A365E -``` -The correct key fingerprint is `18AD 5014 C99E F7E3 BA5F 6CE9 50BD D3E0 FC8A 365E` - -We can download `kube-aws` from its [release page](https://github.com/coreos/kube-aws/releases). In this tutorial, we use version 0.9.1 - -Validate the tarball's GPG signature: - -``` -PLATFORM=linux-amd64 - # Or -PLATFORM=darwin-amd64 - -gpg2 --verify kube-aws-${PLATFORM}.tar.gz.sig kube-aws-${PLATFORM}.tar.gz -``` -##### Install kube-aws -Extract the binary: - -``` -tar zxvf kube-aws-${PLATFORM}.tar.gz -``` - -Add kube-aws to your path: - -``` -mv ${PLATFORM}/kube-aws /usr/local/bin -``` - - -#### kubectl - -[kubectl](https://kubernetes.io/docs/user-guide/kubectl-overview/) is a command line interface for running commands against Kubernetes clusters. - -Download `kubectl` from the Kubernetes release artifact site with the `curl` tool. - -``` -# OS X -curl -O https://storage.googleapis.com/kubernetes-release/release/"$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)"/bin/darwin/amd64/kubectl - -# Linux -curl -O https://storage.googleapis.com/kubernetes-release/release/"$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)"/bin/linux/amd64/kubectl -``` - -Make the kubectl binary executable and move it to your PATH (e.g. `/usr/local/bin`): - -``` -chmod +x ./kubectl -sudo mv ./kubectl /usr/local/bin/kubectl -``` - -### Configure AWS Credentials - -First check out [this](http://docs.aws.amazon.com/cli/latest/userguide/installing.html) for installing the AWS command line interface. - -And then configure your AWS account information: - -``` -aws configure -``` - - -Fill in the required fields: - - -``` -AWS Access Key ID: YOUR_ACCESS_KEY_ID -AWS Secrete Access Key: YOUR_SECRETE_ACCESS_KEY -Default region name: us-west-2 -Default output format: json -``` - -`YOUR_ACCESS_KEY_ID`, and `YOUR_SECRETE_ACCESS_KEY` is the IAM key and secret from [Create AWS Account and IAM Account](#create-aws-account-and-iam-account) - -Verify that your credentials work by describing any instances you may already have running on your account: - -``` -aws ec2 describe-instances -``` - -### Define Cluster Parameters - -#### EC2 key pair - -The keypair that will authenticate SSH access to your EC2 instances. The public half of this key pair will be configured on each CoreOS node. - -Follow [EC2 Keypair User Guide](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html) to create a EC2 key pair - -After creating a key pair, you will use the key pair name to configure the cluster. - -Key pairs are only available to EC2 instances in the same region. We are using us-west-2 in our tutorial, so make sure to creat key pairs in that region (Oregon). - -Your browser will download a `key-name.pem` file which is the key to access the EC2 instances. We will use it later. - - -#### KMS key - -Amazon KMS keys are used to encrypt and decrypt cluster TLS assets. If you already have a KMS Key that you would like to use, you can skip creating a new key and provide the Arn string for your existing key. - -You can create a KMS key with the aws command line tool: - -``` -aws kms --region=us-west-2 create-key --description="kube-aws assets" -{ - "KeyMetadata": { - "CreationDate": 1458235139.724, - "KeyState": "Enabled", - "Arn": "arn:aws:kms:us-west-2:aaaaaaaaaaaaa:key/xxxxxxxxxxxxxxxxxxx", - "AWSAccountId": "xxxxxxxxxxxxx", - "Enabled": true, - "KeyUsage": "ENCRYPT_DECRYPT", - "KeyId": "xxxxxxxxx", - "Description": "kube-aws assets" - } -} -``` - -We will need to use the value of `Arn` later. - -And then let's add several inline policies in your IAM user permission. - -Go to [IAM Console](https://console.aws.amazon.com/iam/home?region=us-west-2#/home). Click on button `Users`, click user that we just created, and then click on `Add inline policy` button, and select `Custom Policy`. - -Paste into following inline policies: - -``` - (Caution: node_0, node_1, node_2 directories represents PaddlePaddle node and train_id, not the Kubernetes node){ - "Version": "2012-10-17", - "Statement": [ - { - "Sid": "Stmt1482205552000", - "Effect": "Allow", - "Action": [ - "kms:Decrypt", - "kms:Encrypt" - ], - "Resource": [ - "arn:aws:kms:*:AWS_ACCOUNT_ID:key/*" - ] - }, - { - "Sid": "Stmt1482205746000", - "Effect": "Allow", - "Action": [ - "cloudformation:CreateStack", - "cloudformation:UpdateStack", - "cloudformation:DeleteStack", - "cloudformation:DescribeStacks", - "cloudformation:DescribeStackResource", - "cloudformation:GetTemplate", - "cloudformation:DescribeStackEvents" - ], - "Resource": [ - "arn:aws:cloudformation:us-west-2:AWS_ACCOUNT_ID:stack/MY_CLUSTER_NAME/*" - ] - } - ] -} -``` -`Version` : Its value has to be exactly "2012-10-17". -`AWS_ACCOUNT_ID`: You can get it from following command line: - -``` -aws sts get-caller-identity --output text --query Account -``` - -`MY_CLUSTER_NAME`: Pick a MY_CLUSTER_NAME that you like, you will use it later as well. -Please note, stack name must satisfy regular expression pattern: [a-zA-Z][-a-zA-Z0-9*]*, which means no "_" or "-" in stack name, or kube-aws will throw error in later steps. - -#### External DNS name - -When the cluster is created, the controller will expose the TLS-secured API on a DNS name. - -DNS name should have a CNAME points to cluster DNS name or an A record points to the cluster IP address. - -We will need to use DNS name later in tutorial. If you don't already own one, you can choose any DNS name (e.g., `paddle`) and modify `/etc/hosts` to associate cluster IP with that DNS name for your local machine. And add name service (route53) in aws to associate the IP to paddle for cluster. We will find the cluster IP in later steps. - -#### S3 bucket - -You need to create an S3 bucket before startup the Kubernetes cluster. - -There are some bugs in aws cli in creating S3 bucket, so let's use the [S3 Console](https://console.aws.amazon.com/s3/home?region=us-west-2). - -Click on `Create Bucket`, fill in a unique BUCKET_NAME, and make sure region is us-west-2 (Oregon). - - -#### Initialize Assets - -Create a directory on your local machine to hold the generated assets: - -``` -$ mkdir my-cluster -$ cd my-cluster -``` - -Initialize the cluster CloudFormation stack with the KMS Arn, key pair name, and DNS name from the previous step: - -``` -kube-aws init \ ---cluster-name=MY_CLUSTER_NAME \ ---external-dns-name=MY_EXTERNAL_DNS_NAME \ ---region=us-west-2 \ ---availability-zone=us-west-2a \ ---key-name=KEY_PAIR_NAME \ ---kms-key-arn="arn:aws:kms:us-west-2:xxxxxxxxxx:key/xxxxxxxxxxxxxxxxxxx" -``` - -`MY_CLUSTER_NAME`: the one you picked in [KMS key](#kms-key) - -`MY_EXTERNAL_DNS_NAME`: see [External DNS name](#external-dns-name) - -`KEY_PAIR_NAME`: see [EC2 key pair](#ec2-key-pair) - -`--kms-key-arn`: the "Arn" in [KMS key](#kms-key) - -Here `us-west-2a` is used for parameter `--availability-zone`, but supported availability zone varies among AWS accounts. - -Please check if `us-west-2a` is supported by `aws ec2 --region us-west-2 describe-availability-zones`, if not switch to other supported availability zone. (e.g., `us-west-2a`, or `us-west-2b`) - - -There will now be a cluster.yaml file in the asset directory. This is the main configuration file for your cluster. - -By default `kube-aws` will only create one worker node. Let's edit `cluster.yaml` and change `workerCount` from 1 to 3. - - -#### Render contents of the asset directory - -In the simplest case, you can have kube-aws generate both your TLS identities and certificate authority for you. - -``` -kube-aws render credentials --generate-ca -``` - -The next command generates the default set of cluster assets in your asset directory. - -``` -kube-aws render stack -``` -Assets (templates and credentials) that are used to create, update and interact with your Kubernetes cluster will be created under your current folder. - - -### Kubernetes Cluster Start Up - -#### Create the instances defined in the CloudFormation template - -Now let's create your cluster (choose any `PREFIX` for the command below): - -``` -kube-aws up --s3-uri s3://BUCKET_NAME/PREFIX -``` - -`BUCKET_NAME`: the bucket name that you used in [S3 bucket](#s3-bucket) - - -#### Configure DNS - -You can invoke `kube-aws status` to get the cluster API endpoint after cluster creation. - -``` -$ kube-aws status -Cluster Name: paddle-cluster -Controller DNS Name: paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com -``` - -If you own a DNS name, set the A record to any of the above ip. __Or__ you can set up CNAME point to `Controller DNS Name` (`paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com`) - -##### Find IP address - -Use command `dig` to check the load balancer hostname to get the ip address. - -``` -$ dig paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com - -;; QUESTION SECTION: -;paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com. IN A - -;; ANSWER SECTION: -paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com. 59 IN A 54.241.164.52 -paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com. 59 IN A 54.67.102.112 -``` - -In the above output, both ip `54.241.164.52`, `54.67.102.112` will work. - -*If you own a DNS name*, set the A record to any of the above ip. Then you can skip to the step "Access the cluster". - -*If you do not own a DNS name*: -##### Update local DNS association -Edit `/etc/hosts` to associate above ip with the DNS name. -##### Add Route53 private name service in VPC - - Open [Route53 Console](https://console.aws.amazon.com/route53/home) - - Create hosted zone with following config - - Domain name: "paddle" - - Type: "Private hosted zone for amazon VPC" - - VPC ID: `` - - ![route53 zone setting](src/route53_create_zone.png) - - Add A record - - Click on the zone "paddle" just created - - Click the button "Create record set" - - Name : leave blank - - type: "A" - - Value: `` - - ![route53 create recordset](src/route53_create_recordset.png) - - Verify name service - - Connect to any instance created by kube-aws via ssh - - Run command "host paddle", see if the ip returned is the private ip of kube-controller - -#### Access the cluster - -Once the API server is running, you should see: - -``` -$ kubectl --kubeconfig=kubeconfig get nodes -NAME STATUS AGE -ip-10-0-0-134.us-west-2.compute.internal Ready 6m -ip-10-0-0-238.us-west-2.compute.internal Ready 6m -ip-10-0-0-50.us-west-2.compute.internal Ready 6m -ip-10-0-0-55.us-west-2.compute.internal Ready 6m -``` - - -### Setup Elastic File System for Cluster - -Training data is usually served on a distributed filesystem, we use Elastic File System (EFS) on AWS. - -1. Create security group for EFS in [security group console](https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#SecurityGroups:sort=groupId) - 1. Look up security group id for `paddle-cluster-sg-worker` (`sg-055ee37d` in the image below) -
![](src/worker_security_group.png)
- 2. Add security group `paddle-efs` with `ALL TCP` inbound rule and custom source as group id of `paddle-cluster-sg-worker`. And VPC of `paddle-cluster-vpc`. Make sure availability zone is same as the one you used in [Initialize Assets](#initialize-assets). -
![](src/add_security_group.png)
- -2. Create the Elastic File System in [EFS console](https://us-west-2.console.aws.amazon.com/efs/home?region=us-west-2#/wizard/1) with `paddle-cluster-vpc` VPC. Make sure subnet is `paddle-cluster-Subnet0` andd security group is `paddle-efs`. -
![](src/create_efs.png)
- - -### Start PaddlePaddle Training Demo on AWS - -#### Configure Kubernetes Volume that Points to EFS - -First we need to create a [PersistentVolume](https://kubernetes.io/docs/user-guide/persistent-volumes/) to provision EFS volumn. - -Save following snippet as `pv.yaml` -``` -apiVersion: v1 -kind: PersistentVolume -metadata: - name: efsvol -spec: - capacity: - storage: 100Gi - accessModes: - - ReadWriteMany - nfs: - server: EFS_DNS_NAME - path: "/" -``` - -`EFS_DNS_NAME`: DNS name as shown in description of `paddle-efs` that we created. Looks similar to `fs-2cbf7385.efs.us-west-2.amazonaws.com` - -Run following command to create a persistent volumn: -``` -kubectl --kubeconfig=kubeconfig create -f pv.yaml -``` - -Next let's create a [PersistentVolumeClaim](https://kubernetes.io/docs/user-guide/persistent-volumes/) to claim the persistent volume. - -Save following snippet as `pvc.yaml`. -``` -kind: PersistentVolumeClaim -apiVersion: v1 -metadata: - name: efsvol -spec: - accessModes: - - ReadWriteMany - resources: - requests: - storage: 50Gi -``` - -Run following command to create a persistent volumn claim: -``` -kubectl --kubeconfig=kubeconfig create -f pvc.yaml -``` - -#### Prepare Training Data - -We will now launch a kubernetes job that downloads, saves and evenly splits training data into 3 shards on the persistent volumn that we just created. - -save following snippet as `paddle-data-job.yaml` -``` -apiVersion: batch/v1 -kind: Job -metadata: - name: paddle-data -spec: - template: - metadata: - name: pi - spec: - containers: - - name: paddle-data - image: paddlepaddle/paddle-tutorial:k8s_data - imagePullPolicy: Always - volumeMounts: - - mountPath: "/efs" - name: efs - env: - - name: OUT_DIR - value: /efs/paddle-cluster-job - - name: SPLIT_COUNT - value: "3" - volumes: - - name: efs - persistentVolumeClaim: - claimName: efsvol - restartPolicy: Never -``` - -Run following command to launch the job: -``` -kubectl --kubeconfig=kubeconfig create -f paddle-data-job.yaml -``` - -Job may take 7 min to finish, use following command to check job status. Do not proceed until `SUCCESSFUL` for `paddle-data` job is `1` -``` -$ kubectl --kubeconfig=kubeconfig get jobs -NAME DESIRED SUCCESSFUL AGE -paddle-data 1 1 6m -``` - -Data preparation is done by docker image `paddlepaddle/paddle-tutorial:k8s_data`, see [here](src/k8s_data/README.md) for how to build this docker image and source code. - -#### Start Training - -Now we are ready to start paddle training job. Save following snippet as `paddle-cluster-job.yaml` -``` -apiVersion: batch/v1 -kind: Job -metadata: - name: paddle-cluster-job -spec: - parallelism: 3 - completions: 3 - template: - metadata: - name: paddle-cluster-job - spec: - volumes: - - name: efs - persistentVolumeClaim: - claimName: efsvol - containers: - - name: trainer - image: paddlepaddle/paddle-tutorial:k8s_train - command: ["bin/bash", "-c", "/root/start.sh"] - env: - - name: JOB_NAME - value: paddle-cluster-job - - name: JOB_PATH - value: /home/jobpath - - name: JOB_NAMESPACE - value: default - - name: TRAIN_CONFIG_DIR - value: quick_start - - name: CONF_PADDLE_NIC - value: eth0 - - name: CONF_PADDLE_PORT - value: "7164" - - name: CONF_PADDLE_PORTS_NUM - value: "2" - - name: CONF_PADDLE_PORTS_NUM_SPARSE - value: "2" - - name: CONF_PADDLE_GRADIENT_NUM - value: "3" - - name: TRAINER_COUNT - value: "3" - volumeMounts: - - mountPath: "/home/jobpath" - name: efs - ports: - - name: jobport0 - hostPort: 7164 - containerPort: 7164 - - name: jobport1 - hostPort: 7165 - containerPort: 7165 - - name: jobport2 - hostPort: 7166 - containerPort: 7166 - - name: jobport3 - hostPort: 7167 - containerPort: 7167 - restartPolicy: Never -``` - -`parallelism: 3, completions: 3` means this job will simultaneously start 3 PaddlePaddle pods, and this job will be finished when there are 3 finished pods. - -`env` field represents container's environment variables, we specify PaddlePaddle parameters by environment variables. - -`ports` indicates that TCP port 7164 - 7167 are exposed for communication between `pserver` ans trainer. port starts continously from `CONF_PADDLE_PORT` (7164) to `CONF_PADDLE_PORT + CONF_PADDLE_PORTS_NUM + CONF_PADDLE_PORTS_NUM_SPARSE - 1` (7167). We use multiple ports for dense and sparse paramter updates to improve latency. - -Run following command to launch the job. -``` -kubectl --kubeconfig=kubeconfig create -f paddle-claster-job.yaml -``` - -Inspect individual pods - -``` -$ kubectl --kubeconfig=kubeconfig get pods -NAME READY STATUS RESTARTS AGE -paddle-cluster-job-cm469 1/1 Running 0 9m -paddle-cluster-job-fnt03 1/1 Running 0 9m -paddle-cluster-job-jx4xr 1/1 Running 0 9m -``` - -Inspect individual console output -``` -kubectl --kubeconfig=kubeconfig log -f POD_NAME -``` - -`POD_NAME`: name of any pod (e.g., `paddle-cluster-job-cm469`). - -Run `kubectl --kubeconfig=kubeconfig describe job paddle-cluster-job` to check training job status. It will complete in around 20 minutes. - -The details for start `pserver` and `trainer` are hidden inside docker image `paddlepaddle/paddle-tutorial:k8s_train`, see [here](src/k8s_train/README.md) for how to build the docker image and source code. - -#### Inspect Training Output - -Training output (model snapshot and logs) will be saved in EFS. We can ssh into worker EC2 instance, mount EFS and check training output. - -1. ssh Into Worker EC2 instance -``` -chmod 400 key-name.pem -ssh -i key-name.pem core@INSTANCE_IP -``` - -`INSTANCE_IP`: public IP address of EC2 kubernetes worker node. Go to [EC2 console](https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#Instances:sort=instanceId) and check `public IP` of any `paddle-cluster-kube-aws-worker` instance. - -2. Mount EFS -``` -mkdir efs -sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 EFS_DNS_NAME:/ efs -``` - -`EFS_DNS_NAME`: DNS name as shown in description of `paddle-efs` that we created. Look similar to `fs-2cbf7385.efs.us-west-2.amazonaws.com`. - -Now folder `efs` will have structure similar to: -``` --- paddle-cluster-job - |-- ... - |-- output - | |-- node_0 - | | |-- server.log - | | `-- train.log - | |-- node_1 - | | |-- server.log - | | `-- train.log - | |-- node_2 - | | |-- server.log - | | `-- train.log - | |-- pass-00000 - | | |-- ___fc_layer_0__.w0 - | | |-- ___fc_layer_0__.wbias - | | |-- done - | | |-- path.txt - | | `-- trainer_config.lr.py - | |-- pass-00001... -``` -`server.log` contains log for `pserver`. `train.log` contains log for `trainer`. Model description and snapshot is stored in `pass-0000*`. - -### Kubernetes Cluster Tear Down - -#### Delete EFS - -Go to [EFS Console](https://us-west-2.console.aws.amazon.com/efs/home?region=us-west-2) and delete the EFS volumn that we created. - -#### Delete security group - -Go to [Security Group Console](https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#SecurityGroups:sort=groupId) and delete security group `paddle-efs`. - - -#### Delete S3 Bucket - -Go to [S3 Console](https://console.aws.amazon.com/s3/home?region=us-west-2#) and delete the S3 bucket that we created. - -#### Destroy Cluster - -``` -kube-aws destroy -``` - -The command will return immediately, but it might take 5 min to tear down the whole cluster. - -You can go to [CludFormation Console](https://us-west-2.console.aws.amazon.com/cloudformation/home?region=us-west-2#/stacks?filter=active) to check destroy process. diff --git a/doc/v2/howto/cluster/multi_cluster/k8s_cn.md b/doc/v2/howto/cluster/multi_cluster/k8s_cn.md deleted file mode 100644 index c1a11f7165a2f9da9dd044641274447e7943a597..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/multi_cluster/k8s_cn.md +++ /dev/null @@ -1,206 +0,0 @@ -# Kubernetes单机训练 - -在这篇文档里,我们介绍如何在 Kubernetes 集群上启动一个单机使用CPU的PaddlePaddle训练作业。在下一篇中,我们将介绍如何启动分布式训练作业。 - -## 制作Docker镜像 - -在一个功能齐全的Kubernetes机群里,通常我们会安装Ceph等分布式文件系统来存储训练数据。这样的话,一个分布式PaddlePaddle训练任务中 -的每个进程都可以从Ceph读取数据。在这个例子里,我们只演示一个单机作业,所以可以简化对环境的要求,把训练数据直接放在 -PaddlePaddle的Docker Image里。为此,我们需要制作一个包含训练数据的PaddlePaddle镜像。 - -PaddlePaddle的 `paddlepaddle/paddle:cpu-demo-latest` 镜像里有PaddlePaddle的源码与demo, -(请注意,默认的PaddlePaddle生产环境镜像 `paddlepaddle/paddle:latest` 是不包括源码的,PaddlePaddle的各版本镜像可以参考 -[Docker Installation Guide](http://paddlepaddle.org/docs/develop/documentation/zh/getstarted/build_and_install/docker_install_cn.html)), -下面我们使用这个镜像来下载数据到Docker Container中,并把这个包含了训练数据的Container保存为一个新的镜像。 - -### 运行容器 - -``` -$ docker run --name quick_start_data -it paddlepaddle/paddle:cpu-demo-latest -``` - -### 下载数据 - -进入容器`/root/paddle/demo/quick_start/data`目录,使用`get_data.sh`下载数据 - -``` -$ root@fbd1f2bb71f4:~/paddle/demo/quick_start/data# ./get_data.sh - -Downloading Amazon Electronics reviews data... ---2016-10-31 01:33:43-- http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz -Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80 -Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected. -HTTP request sent, awaiting response... 200 OK -Length: 495854086 (473M) [application/x-gzip] -Saving to: 'reviews_Electronics_5.json.gz' - - 10% [=======> ] 874,279 64.7KB/s eta 2h 13m - -``` - -### 修改启动脚本 - -下载完数据后,修改`/root/paddle/demo/quick_start/train.sh`文件,内容如下(增加了一条cd命令) -``` -set -e -cd /root/paddle/demo/quick_start -cfg=trainer_config.lr.py -#cfg=trainer_config.emb.py -#cfg=trainer_config.cnn.py -#cfg=trainer_config.lstm.py -#cfg=trainer_config.bidi-lstm.py -#cfg=trainer_config.db-lstm.py -paddle train \ - --config=$cfg \ - --save_dir=./output \ - --trainer_count=4 \ - --log_period=20 \ - --num_passes=15 \ - --use_gpu=false \ - --show_parameter_stats_period=100 \ - --test_all_data_in_one_period=1 \ - 2>&1 | tee 'train.log' -``` - -### 提交镜像 - -修改启动脚本后,退出容器,使用`docker commit`命令创建新镜像。 - -``` -$ docker commit quick_start_data mypaddle/paddle:quickstart -``` - -## 使用 Kubernetes 进行训练 - ->针对任务运行完成后容器自动退出的场景,Kubernetes有Job类型的资源来支持。下文就是用Job类型的资源来进行训练。 - -### 编写yaml文件 - -在训练时,输出结果可能会随着容器的消耗而被删除,需要在创建容器前挂载卷以便我们保存训练结果。使用我们之前构造的镜像,可以创建一个 [Kubernetes Job](http://kubernetes.io/docs/user-guide/jobs/#what-is-a-job),简单的yaml文件如下: - -``` -apiVersion: batch/v1 -kind: Job -metadata: - name: quickstart -spec: - parallelism: 1 - completions: 1 - template: - metadata: - name: quickstart - spec: - volumes: - - name: output - hostPath: - path: /home/work/paddle_output - containers: - - name: pi - image: mypaddle/paddle:quickstart - command: ["bin/bash", "-c", "/root/paddle/demo/quick_start/train.sh"] - volumeMounts: - - name: output - mountPath: /root/paddle/demo/quick_start/output - restartPolicy: Never -``` - -### 创建PaddlePaddle Job - -使用上文创建的yaml文件创建Kubernetes Job,命令为: - -``` -$ kubectl create -f paddle.yaml -``` - -查看job的详细情况: - -``` -$ kubectl get job -NAME DESIRED SUCCESSFUL AGE -quickstart 1 0 58s - -$ kubectl describe job quickstart -Name: quickstart -Namespace: default -Image(s): registry.baidu.com/public/paddle:cpu-demo-latest -Selector: controller-uid=f120da72-9f18-11e6-b363-448a5b355b84 -Parallelism: 1 -Completions: 1 -Start Time: Mon, 31 Oct 2016 11:20:16 +0800 -Labels: controller-uid=f120da72-9f18-11e6-b363-448a5b355b84,job-name=quickstart -Pods Statuses: 0 Running / 1 Succeeded / 0 Failed -Volumes: - output: - Type: HostPath (bare host directory volume) - Path: /home/work/paddle_output -Events: - FirstSeen LastSeen Count From SubobjectPath Type Reason Message - --------- -------- ----- ---- ------------- -------- ------ ------- - 1m 1m 1 {job-controller } Normal SuccessfulCreate Created pod: quickstart-fa0wx -``` - -### 查看训练结果 - -根据Job对应的Pod信息,可以查看此Pod运行的宿主机。 - -``` -kubectl describe pod quickstart-fa0wx -Name: quickstart-fa0wx -Namespace: default -Node: paddle-demo-let02/10.206.202.44 -Start Time: Mon, 31 Oct 2016 11:20:17 +0800 -Labels: controller-uid=f120da72-9f18-11e6-b363-448a5b355b84,job-name=quickstart -Status: Succeeded -IP: 10.0.0.9 -Controllers: Job/quickstart -Containers: - quickstart: - Container ID: docker://b8561f5c79193550d64fa47418a9e67ebdd71546186e840f88de5026b8097465 - Image: registry.baidu.com/public/paddle:cpu-demo-latest - Image ID: docker://18e457ce3d362ff5f3febf8e7f85ffec852f70f3b629add10aed84f930a68750 - Port: - Command: - bin/bash - -c - /root/paddle/demo/quick_start/train.sh - QoS Tier: - cpu: BestEffort - memory: BestEffort - State: Terminated - Reason: Completed - Exit Code: 0 - Started: Mon, 31 Oct 2016 11:20:20 +0800 - Finished: Mon, 31 Oct 2016 11:21:46 +0800 - Ready: False - Restart Count: 0 - Environment Variables: -Conditions: - Type Status - Ready False -Volumes: - output: - Type: HostPath (bare host directory volume) - Path: /home/work/paddle_output -``` - -我们还可以登录到宿主机上查看训练结果。 - -``` -[root@paddle-demo-let02 paddle_output]# ll -total 60 -drwxr-xr-x 2 root root 4096 Oct 31 11:20 pass-00000 -drwxr-xr-x 2 root root 4096 Oct 31 11:20 pass-00001 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00002 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00003 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00004 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00005 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00006 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00007 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00008 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00009 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00010 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00011 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00012 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00013 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00014 -``` diff --git a/doc/v2/howto/cluster/multi_cluster/k8s_distributed_cn.md b/doc/v2/howto/cluster/multi_cluster/k8s_distributed_cn.md deleted file mode 100644 index 167089b8074b33e3b094fa3ec8e377630cec42ac..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/multi_cluster/k8s_distributed_cn.md +++ /dev/null @@ -1,312 +0,0 @@ -# Kubernetes分布式训练 - -前一篇文章介绍了如何在Kubernetes集群上启动一个单机PaddlePaddle训练作业 (Job)。在这篇文章里,我们介绍如何在Kubernetes集群上进行分布式PaddlePaddle训练作业。关于PaddlePaddle的分布式训练,文章 [Cluster Training](http://www.paddlepaddle.org/docs/develop/documentation/zh/howto/usage/cluster/cluster_train_cn.html)介绍了一种通过SSH远程分发任务,进行分布式训练的方法,与此不同的是,本文将介绍在Kubernetes容器管理平台上快速构建PaddlePaddle容器集群,进行分布式训练的方案。 - -## 整体方案 - -在训练之前,用户将配置与训练数据切分好放在分布式文件系统预先分配好的目录中(不同的分布式文件系统,需要使用其制定的方式挂载后并导入数据),训练时,程序从此目录拷贝文件到容器内进行训练,将结果保存到此目录里。整体的结构图如下: - -![paddle on kubernetes结构图](src/k8s-paddle-arch.png) - -上图描述了一个3节点的分布式训练场景,在每个Pod上都通过volume方式挂载分布式文件系统的一个目录用于保存训练数据和输出结果。Kubernetes为这次训练创建了3个pod并且调度到了3个node上运行,每个pod包含一个PaddlePaddle容器。在容器创建后,会启动pserver与trainer进程,读取volume中的数据进行这次分布式训练。 - -根据前文的描述,要在已有的Kubernetes集群上进行PaddlePaddle的分布式训练,按照下面步骤即可: - -1. [制作PaddlePaddle镜像](#制作镜像) -1. [将训练文件与切分好的数据上传到共享存储](#上传训练文件) -1. [编写本次训练的YAML文件,创建一个Kubernetes job](#创建Job) -1. [训练结束后查看输出结果](#查看输出) - -下面就根据这几个步骤分别介绍。 - -### 制作镜像 - -PaddlePaddle镜像需要提供`paddle pserver`与`paddle train`进程的运行环境,用这个镜像创建的容器需要有以下两个功能: - -- 拷贝训练文件到容器内 -- 生成`paddle pserver`与`paddle train`进程的启动参数,并且启动训练 - -因为官方镜像 `paddlepaddle/paddle:latest` 内已经包含PaddlePaddle的执行程序但是还没上述功能,所以我们可以在这个基础上,添加启动脚本,制作新镜像来完成以上的工作。参考镜像的[*Dockerfile*](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/usage/cluster/src/k8s_train/Dockerfile)。 - -```bash -$ cd doc/howto/usage/k8s/src/k8s_train -$ docker build -t [YOUR_REPO]/paddle:mypaddle . -``` - -然后将构建成功的镜像上传到镜像仓库。 - -```bash -docker push [YOUR_REPO]/paddle:mypaddle -``` - -注意上述命令中`[YOUR_REPO]`表示读者所使用的Docker镜像仓库地址,读者需要替换成自己使用的仓库地址。下文使用`[YOUR_REPO]/paddle:mypaddle`这个地址来表示此步骤所构建出的镜像。 - -### 准备训练数据 - -这里我们通过在Kubernetes集群上启动一个Job来下载并切割数据,也可以通过修改[k8s_train](./src/k8s_train/README.md)的内容来定制image. - -在启动Job之前,需要根据不同的分布式存储来绑定一个[persistentVolumeClaim](https://kubernetes.io/docs/user-guide/persistent-volumes/),生成的数据将会存储在这个volume下. - -```yaml -apiVersion: batch/v1 -kind: Job -metadata: - name: paddle-data -spec: - template: - metadata: - name: pi - spec: - hostNetwork: true - containers: - - name: paddle-data - image: paddlepaddle/paddle-tutorial:k8s_data - imagePullPolicy: Always - volumeMounts: - - mountPath: "/mnt" - name: nfs - env: - - name: OUT_DIR - value: /home/work/mfs/paddle-cluster-job - - name: SPLIT_COUNT - value: "3" - volumes: - - name: nfs - persistentVolumeClaim: - claimName: mfs - restartPolicy: Never -``` - -完成后volume中的文件内容大致如下: -```base -[root@paddle-kubernetes-node0 nfsdir]$ tree -d -. -`-- paddle-cluster-job - |-- 0 - | `-- data - |-- 1 - | `-- data - |-- 2 - | `-- data - |-- output - |-- quick_start -``` - -目录中paddle-cluster-job是本次训练对应的job name,本次训练要求有3个PaddlePaddle节点,在paddle-cluster-job/data目录中存放切分好的数据,文件夹0,1,2分别代表3个节点的trainer_id。recommendation文件夹内存放训练文件,output文件夹存放训练结果与日志。 - -### 创建Job - -Kubernetes可以通过YAML文件来创建相关对象,然后可以使用命令行工具创建job。 - -Job YAML文件描述了这次训练使用的Docker镜像,需要启动的节点个数以及 `paddle pserver`与 `paddle train`进程启动的必要参数,也描述了容器需要使用的存储卷挂载的情况。YAML文件中各个字段的具体含义,可以查看[Kubernetes Job API](http://kubernetes.io/docs/api-reference/batch/v1/definitions/#_v1_job)。例如,本次训练的YAML文件可以写成: - -```yaml -apiVersion: batch/v1 -kind: Job -metadata: - name: paddle-cluster-job -spec: - parallelism: 3 - completions: 3 - template: - metadata: - name: paddle-cluster-job - spec: - volumes: - - name: jobpath - hostPath: - path: /home/work/mfs - containers: - - name: trainer - image: [YOUR_REPO]/paddle:mypaddle - command: ["bin/bash", "-c", "/root/start.sh"] - env: - - name: JOB_NAME - value: paddle-cluster-job - - name: JOB_PATH - value: /home/jobpath - - name: JOB_NAMESPACE - value: default - - name: TRAIN_CONFIG_DIR - value: recommendation - - name: CONF_PADDLE_NIC - value: eth0 - - name: CONF_PADDLE_PORT - value: "7164" - - name: CONF_PADDLE_PORTS_NUM - value: "2" - - name: CONF_PADDLE_PORTS_NUM_SPARSE - value: "2" - - name: CONF_PADDLE_GRADIENT_NUM - value: "3" - volumeMounts: - - name: jobpath - mountPath: /home/jobpath - restartPolicy: Never -``` - -文件中,`metadata`下的`name`表示这个job的名字。`parallelism,completions`字段表示这个job会同时开启3个PaddlePaddle节点,成功训练且退出的pod数目为3时,这个job才算成功结束。然后申明一个存储卷`jobpath`,代表宿主机目录`/home/work/mfs`,在对容器的描述`containers`字段中,将此目录挂载为容器的`/home/jobpath`目录,这样容器的`/home/jobpath`目录就成为了共享存储,放在这个目录里的文件其实是保存到了MFS上。 - -`env`字段表示容器的环境变量,我们将`paddle`运行的一些参数通过这种方式传递到容器内: - - -- JOB_PATH:共享存储挂在的路径 -- JOB_NAME:Job的名字 -- TRAIN_CONFIG_DIR:本次训练文件所在目录,与JOB_PATH,JOB_NAME组合可以找到本次训练需要的文件路径 -- CONF_PADDLE_NIC:`paddle pserver`进程需要的`--nics`参数,即网卡名 -- CONF_PADDLE_PORT:`paddle paserver`的`--port`参数 -- CONF_PADDLE_PORTS_NUM:稠密更新的端口数量,即`--ports_num`参数 -- CONF_PADDLE_PORTS_NUM_SPARSE:稀疏更新的端口数量,即`--ports_num_for_sparse`参数 -- CONF_PADDLE_GRADIENT_NUM:训练节点数量,即`--num_gradient_servers参数` - -这些参数的具体描述,读者可以查看[这里](http://www.paddlepaddle.org/docs/develop/documentation/zh/howto/usage/cmd_parameter/detail_introduction_cn.html)。 - -编写完YAML文件后,可以使用Kubernetes的命令行工具创建job。 - -```bash -kubectl create -f job.yaml -``` - -创建成功后,Kubernetes就会创建3个pod作为PaddlePaddle节点然后拉取镜像,启动容器开始训练。 - - -### 查看输出 - -在训练过程中,可以在共享存储上查看输出的日志和模型,例如output目录下就存放了输出结果。注意node_0,node_1,node_2这几个目录表示PaddlePaddle节点与trainer_id,并不是Kubernetes中的node概念。 - -```bash -[root@paddle-kubernetes-node0 output]# tree -d -. -├── node_0 -│   ├── server.log -│   └── train.log -├── node_1 -│   ├── server.log -│   └── train.log -├── node_2 -...... -├── pass-00002 -│   ├── done -│   ├── ___embedding_0__.w0 -│   ├── ___embedding_1__.w0 -...... -``` - -我们可以通过日志查看容器训练的情况,例如: - -```bash -[root@paddle-kubernetes-node0 node_0]# cat train.log -I1116 09:10:17.123121 50 Util.cpp:155] commandline: - /usr/local/bin/../opt/paddle/bin/paddle_trainer - --nics=eth0 --port=7164 - --ports_num=2 --comment=paddle_process_by_paddle - --pservers=192.168.129.66,192.168.223.143,192.168.129.71 - --ports_num_for_sparse=2 --config=./trainer_config.py - --trainer_count=4 --num_passes=10 --use_gpu=0 - --log_period=50 --dot_period=10 --saving_period=1 - --local=0 --trainer_id=0 - --save_dir=/home/jobpath/paddle-cluster-job/output -I1116 09:10:17.123440 50 Util.cpp:130] Calling runInitFunctions -I1116 09:10:17.123764 50 Util.cpp:143] Call runInitFunctions done. -[WARNING 2016-11-16 09:10:17,227 default_decorators.py:40] please use keyword arguments in paddle config. -[INFO 2016-11-16 09:10:17,239 networks.py:1282] The input order is [movie_id, title, genres, user_id, gender, age, occupation, rating] -[INFO 2016-11-16 09:10:17,239 networks.py:1289] The output order is [__square_error_cost_0__] -I1116 09:10:17.392917 50 Trainer.cpp:170] trainer mode: Normal -I1116 09:10:17.613910 50 PyDataProvider2.cpp:257] loading dataprovider dataprovider::process -I1116 09:10:17.680917 50 PyDataProvider2.cpp:257] loading dataprovider dataprovider::process -I1116 09:10:17.681543 50 GradientMachine.cpp:134] Initing parameters.. -I1116 09:10:18.012390 50 GradientMachine.cpp:141] Init parameters done. -I1116 09:10:18.018641 50 ParameterClient2.cpp:122] pserver 0 192.168.129.66:7164 -I1116 09:10:18.018950 50 ParameterClient2.cpp:122] pserver 1 192.168.129.66:7165 -I1116 09:10:18.019069 50 ParameterClient2.cpp:122] pserver 2 192.168.223.143:7164 -I1116 09:10:18.019492 50 ParameterClient2.cpp:122] pserver 3 192.168.223.143:7165 -I1116 09:10:18.019716 50 ParameterClient2.cpp:122] pserver 4 192.168.129.71:7164 -I1116 09:10:18.019836 50 ParameterClient2.cpp:122] pserver 5 192.168.129.71:7165 -``` - - -## 一些细节的补充 - -### 使用环境变量 - -使用容器方式运行训练任务的Kubernetes Job,通常会使用环境变量配置Job的配置信息`start_paddle.py`提供了一个启动脚本,将环境变量转换成paddle的命令行参数: -``` -API = "/api/v1/namespaces/" -JOBSELECTOR = "labelSelector=job-name=" -JOB_PATH = os.getenv("JOB_PATH") + "/" + os.getenv("JOB_NAME") -JOB_PATH_OUTPUT = JOB_PATH + "/output" -JOBNAME = os.getenv("JOB_NAME") -NAMESPACE = os.getenv("JOB_NAMESPACE") -PADDLE_NIC = os.getenv("CONF_PADDLE_NIC") -PADDLE_PORT = os.getenv("CONF_PADDLE_PORT") -PADDLE_PORTS_NUM = os.getenv("CONF_PADDLE_PORTS_NUM") -PADDLE_PORTS_NUM_SPARSE = os.getenv("CONF_PADDLE_PORTS_NUM_SPARSE") -PADDLE_SERVER_NUM = os.getenv("CONF_PADDLE_GRADIENT_NUM") -``` - -### Pod间通信 -`start_paddle.py`脚本开始时,会先进行参数的初始化与解析。 - -```python -parser = argparse.ArgumentParser(prog="start_paddle.py", - description='simple tool for k8s') - args, train_args_list = parser.parse_known_args() - train_args = refine_unknown_args(train_args_list) - train_args_dict = dict(zip(train_args[:-1:2], train_args[1::2])) - podlist = getPodList() -``` - -然后通过函数`getPodList()`访问Kubernetes的接口来查询此job对应的所有pod信息。当所有pod都处于running状态(容器运行都运行)时,再通过函数`getIdMap(podlist)`获取trainer_id。 - -```python - podlist = getPodList() - # need to wait until all pods are running - while not isPodAllRunning(podlist): - time.sleep(10) - podlist = getPodList() - idMap = getIdMap(podlist) -``` -* *注意*: `getPodList()`会获取当前namespace下的所有pod,如果已经有pod运行,可能会导致出错。这种集群节点管理方式会在将来使用[statfulsets](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/)代替。 - -在函数`getIdMap(podlist)`内部,我们通过读取`podlist`中每个pod的IP地址,将IP排序生成的序号作为trainer_id。 - -```python -def getIdMap(podlist): - ''' - generate tainer_id by ip - ''' - ips = [] - for pod in podlist["items"]: - ips.append(pod["status"]["podIP"]) - ips.sort() - idMap = {} - for i in range(len(ips)): - idMap[ips[i]] = i - return idMap -``` - -在得到`idMap`后,通过函数`startPaddle(idMap, train_args_dict)`构造`paddle pserver`与`paddle train`的启动参数并执行进程。 - -### 启动任务 - -在函数`startPaddle`中,最主要的工作就是解析出`paddle pserver`与`paddle train`的启动参数。例如`paddle train`参数的解析,解析环境变量得到`PADDLE_NIC`,`PADDLE_PORT`,`PADDLE_PORTS_NUM`等参数,然后通过自身的IP地址在`idMap`中获取`trainerId`。 - -```python - program = 'paddle train' - args = " --nics=" + PADDLE_NIC - args += " --port=" + str(PADDLE_PORT) - args += " --ports_num=" + str(PADDLE_PORTS_NUM) - args += " --comment=" + "paddle_process_by_paddle" - ip_string = "" - for ip in idMap.keys(): - ip_string += (ip + ",") - ip_string = ip_string.rstrip(",") - args += " --pservers=" + ip_string - args_ext = "" - for key, value in train_args_dict.items(): - args_ext += (' --' + key + '=' + value) - localIP = socket.gethostbyname(socket.gethostname()) - trainerId = idMap[localIP] - args += " " + args_ext + " --trainer_id=" + \ - str(trainerId) + " --save_dir=" + JOB_PATH_OUTPUT -``` diff --git a/doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md b/doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md deleted file mode 100644 index b2dc4da8451af317df76c5b3df328b6f58429610..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md +++ /dev/null @@ -1,372 +0,0 @@ -# Distributed Training on Kubernetes - -We introduced how to create a PaddlePaddle Job with a single node on Kuberentes in the -previous document. -In this article, we will introduce how to create a PaddlePaddle job with multiple nodes -on Kubernetes cluster. - -## Overall Architecture - -Before creating a training job, the users need to slice the training data and deploy -the Python scripts along with it into the distributed file system -(We can use the different type of Kuberentes Volumes to mount different distributed -file systems). Before training starts, The program will copy the training data into the -Container and also save the models at the same path during training. The global architecture -is as follows: - -![PaddlePaddle on Kubernetes Architecture](src/k8s-paddle-arch.png) - -The above figure describes a distributed training architecture which contains 3 nodes, each -Pod mounts a folder of the distributed file system to save training data and models -by Kubernetes Volume. Kubernetes created 3 Pods for this training phase and scheduled these on -3 nodes, each Pod has a PaddlePaddle container. After the containers car created, -PaddlePaddle starts up the communication between PServer and Trainer and read training -data for this training job. - -As the description above, we can start up a PaddlePaddle distributed training job on a -Kubernetes ready cluster with the following steps: - -1. [Build PaddlePaddle Docker Image](#Build a Docker Image) -1. [Split training data and upload to the distributed file system](#Upload Training Data) -1. [Edit a YAML file and create a Kubernetes Job](#Create a Job) -1. [Check the output](#Check The Output) - -We will introduce these steps as follows: - -### Build a Docker Image - -Training docker image needs to package the paddle pserver and paddle trainer runtimes, as well as two more processes before we can kick off the training: - -- Copying the training data into container. -- Generating the initialization arguments for `Paddle PServer` and `Paddle Training` processes. - -Since the paddlepaddle official docker image already has the runtimes we need, we'll take it as the base image and pack some additional scripts for the processes mentioned above to build our training image. for more detail, please find from the following link: -- https://github.com/PaddlePaddle/Paddle/tree/develop/doc/v2/howto/cluster/multi_cluster/src/k8s_train/Dockerfile - - -```bash -$ cd doc/howto/usage/k8s/src/k8s_train -$ docker build -t [YOUR_REPO]/paddle:mypaddle . -``` - -And then upload the new Docker Image to a Docker hub: - -```bash -docker push [YOUR_REPO]/paddle:mypaddle -``` - -**[NOTE]**, in the above command arguments, `[YOUR_REPO]` represents your Docker repository, -you need to use your repository instead of it. We will replace it with your respository name to -represent the Docker Image which built in this step. - -### Prepare Training Data - -We can download and split the training job by creating a Kubernetes Job, or custom your image -by editing [k8s_train](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/v2/howto/cluster/multi_cluster/src/k8s_train). - -Before creating a Job, we need to bind a [persistenVolumeClaim](https://kubernetes.io/docs/user-guide/persistent-volumes) by the different type of -the different file system, the generated dataset would be saved on this volume. - -```yaml -apiVersion: batch/v1 -kind: Job -metadata: - name: paddle-data -spec: - template: - metadata: - name: pi - spec: - hostNetwork: true - containers: - - name: paddle-data - image: paddlepaddle/paddle-tutorial:k8s_data - imagePullPolicy: Always - volumeMounts: - - mountPath: "/mnt" - name: nfs - env: - - name: OUT_DIR - value: /home/work/mfs/paddle-cluster-job - - name: SPLIT_COUNT - value: "3" - volumes: - - name: nfs - persistentVolumeClaim: - claimName: mfs - restartPolicy: Never -``` - -Create the Job with the following command: - -```bash -> kubectl create -f xxx.yaml -``` - -If created successfully, you can see some information like this: - -```base -[root@paddle-kubernetes-node0 nfsdir]$ tree -d -. -`-- paddle-cluster-job - |-- 0 - | `-- data - |-- 1 - | `-- data - |-- 2 - | `-- data - |-- output - |-- quick_start -``` - -The `paddle-cluster-job` above is the job name for this training job; we need 3 -PaddlePaddle training nodes and save the split training data in `paddle-cluster-job` path, -the folder `0`, `1` and `2` represents the `training_id` on each node, `quick_start` folder is used to store training data, `output` folder is used to store the models and logs. - - -### Create a Job - -Kubernetes allow users to create objects with YAML files, and we can use a command-line tool -to create it. - -The Job YAML file describes that which Docker Image would be used in this training job, how much nodes would be created, what's the startup arguments of `Paddle PServer/Trainer` process and what's the type of Volumes. You can find the details of the YAML filed in -[Kubernetes Job API](http://kubernetes.io/docs/api-reference/batch/v1/definitions/#_v1_job). -The following is an example for this training job: - -```yaml -apiVersion: batch/v1 -kind: Job -metadata: - name: paddle-cluster-job -spec: - parallelism: 3 - completions: 3 - template: - metadata: - name: paddle-cluster-job - spec: - volumes: - - name: jobpath - hostPath: - path: /home/work/mfs - containers: - - name: trainer - image: [YOUR_REPO]/paddle:mypaddle - command: ["bin/bash", "-c", "/root/start.sh"] - env: - - name: JOB_NAME - value: paddle-cluster-job - - name: JOB_PATH - value: /home/jobpath - - name: JOB_NAMESPACE - value: default - - name: TRAIN_CONFIG_DIR - value: recommendation - - name: CONF_PADDLE_NIC - value: eth0 - - name: CONF_PADDLE_PORT - value: "7164" - - name: CONF_PADDLE_PORTS_NUM - value: "2" - - name: CONF_PADDLE_PORTS_NUM_SPARSE - value: "2" - - name: CONF_PADDLE_GRADIENT_NUM - value: "3" - volumeMounts: - - name: jobpath - mountPath: /home/jobpath - restartPolicy: Never -``` - -In the above YAML file: -- `metadata.name`, The job name. -- `parallelism`, Whether the Kubernetes Job would create `parallelism` Pods at the same time. -- `completions`, The Job would become the success status only when the number of successful Pod(the exit code is 0) - is equal to `completions`. -- `volumeMounts`, the name field `jobpath` is a key, the `mountPath` field represents - the path in the container, and we can define the `jobpath` in `volumes` filed, use `hostPath` - to configure the host path we want to mount. -- `env`, the environment variables in the Container, we pass some startup arguments by - this approach, some details are as following: - - JOB_PATH:the mount path in the container - - JOB_NAME:the job name - - TRAIN_CONFIG_DIR:the job path in the container, we can find the training data path by - combine with JOB_NAME. - - CONF_PADDLE_NIC: the argument `--nics` of `Paddle PServer` process, the network - device name. - - CONF_PADDLE_PORT: the argument `--port` of `Paddle PServer` process. - - CONF_PADDLE_PORTS_NUM: the argument `--ports_num` of `Paddle PServer`, the port number - for dense prameter update. - - CONF_PADDLE_PORTS_NUM_SPARSE:the argument `--ports_num_for_sparse` of `Paddle PServer`, - the port number for sparse parameter update. - - CONF_PADDLE_GRADIENT_NUM:the number of training node, the argument - `--num_gradient_servers` of `Paddle PServer` and `Paddle Trainer`. - -You can find some details information at [here] -(http://www.paddlepaddle.org/docs/develop/documentation/zh/howto/usage/cmd_parameter/detail_introduction_cn.html)。 - -We can use the command-line tool of Kubernetes to create a Job when we finish the YAML file: - -```bash -kubectl create -f job.yaml -``` - -Upon successful creation, Kubernetes would create 3 Pods as PaddlePaddle training node, -pull the Docker image and begin to train. - - -### Checkout the Output - -At the process of training, we can check the logs and the output models which is stored in -the `output` folder. - -**NOTE**, `node_0`, `node_1` and `node_2` represent the -`trainer_id` of the PaddlePaddle training job rather than the node id of Kubernetes. - -```bash -[root@paddle-kubernetes-node0 output]# tree -d -. -├── node_0 -│   ├── server.log -│   └── train.log -├── node_1 -│   ├── server.log -│   └── train.log -├── node_2 -...... -├── pass-00002 -│   ├── done -│   ├── ___embedding_0__.w0 -│   ├── ___embedding_1__.w0 -...... -``` - -We can checkout the status of each training Pod by viewing the logs: - -```bash -[root@paddle-kubernetes-node0 node_0]# cat train.log -I1116 09:10:17.123121 50 Util.cpp:155] commandline: - /usr/local/bin/../opt/paddle/bin/paddle_trainer - --nics=eth0 --port=7164 - --ports_num=2 --comment=paddle_process_by_paddle - --pservers=192.168.129.66,192.168.223.143,192.168.129.71 - --ports_num_for_sparse=2 --config=./trainer_config.py - --trainer_count=4 --num_passes=10 --use_gpu=0 - --log_period=50 --dot_period=10 --saving_period=1 - --local=0 --trainer_id=0 - --save_dir=/home/jobpath/paddle-cluster-job/output -I1116 09:10:17.123440 50 Util.cpp:130] Calling runInitFunctions -I1116 09:10:17.123764 50 Util.cpp:143] Call runInitFunctions done. -[WARNING 2016-11-16 09:10:17,227 default_decorators.py:40] please use keyword arguments in paddle config. -[INFO 2016-11-16 09:10:17,239 networks.py:1282] The input order is [movie_id, title, genres, user_id, gender, age, occupation, rating] -[INFO 2016-11-16 09:10:17,239 networks.py:1289] The output order is [__square_error_cost_0__] -I1116 09:10:17.392917 50 Trainer.cpp:170] trainer mode: Normal -I1116 09:10:17.613910 50 PyDataProvider2.cpp:257] loading dataprovider dataprovider::process -I1116 09:10:17.680917 50 PyDataProvider2.cpp:257] loading dataprovider dataprovider::process -I1116 09:10:17.681543 50 GradientMachine.cpp:134] Initing parameters.. -I1116 09:10:18.012390 50 GradientMachine.cpp:141] Init parameters done. -I1116 09:10:18.018641 50 ParameterClient2.cpp:122] pserver 0 192.168.129.66:7164 -I1116 09:10:18.018950 50 ParameterClient2.cpp:122] pserver 1 192.168.129.66:7165 -I1116 09:10:18.019069 50 ParameterClient2.cpp:122] pserver 2 192.168.223.143:7164 -I1116 09:10:18.019492 50 ParameterClient2.cpp:122] pserver 3 192.168.223.143:7165 -I1116 09:10:18.019716 50 ParameterClient2.cpp:122] pserver 4 192.168.129.71:7164 -I1116 09:10:18.019836 50 ParameterClient2.cpp:122] pserver 5 192.168.129.71:7165 -``` - -## Some Additional Details - -### Using Environment Variables - -Usually we use the environment varialbes to configurate the PaddlePaddle Job which runs in -Kubernetes, `start_paddle.py` provides a start up script to convert the environment variable -to the start up arguments of PaddlePaddle process: - -```bash -API = "/api/v1/namespaces/" -JOBSELECTOR = "labelSelector=job-name=" -JOB_PATH = os.getenv("JOB_PATH") + "/" + os.getenv("JOB_NAME") -JOB_PATH_OUTPUT = JOB_PATH + "/output" -JOBNAME = os.getenv("JOB_NAME") -NAMESPACE = os.getenv("JOB_NAMESPACE") -PADDLE_NIC = os.getenv("CONF_PADDLE_NIC") -PADDLE_PORT = os.getenv("CONF_PADDLE_PORT") -PADDLE_PORTS_NUM = os.getenv("CONF_PADDLE_PORTS_NUM") -PADDLE_PORTS_NUM_SPARSE = os.getenv("CONF_PADDLE_PORTS_NUM_SPARSE") -PADDLE_SERVER_NUM = os.getenv("CONF_PADDLE_GRADIENT_NUM") -``` - -### Communication between Pods - -At the begin of `start_paddle.py`, it would initializes and parses the arguments. - -```python -parser = argparse.ArgumentParser(prog="start_paddle.py", - description='simple tool for k8s') - args, train_args_list = parser.parse_known_args() - train_args = refine_unknown_args(train_args_list) - train_args_dict = dict(zip(train_args[:-1:2], train_args[1::2])) - podlist = getPodList() -``` - -And then query the status of all the other Pods of this Job by the function `getPodList()`, and fetch `triner_id` by the function `getIdMap(podlist)` if all the Pods status is `RUNNING`. - -```python - podlist = getPodList() - # need to wait until all pods are running - while not isPodAllRunning(podlist): - time.sleep(10) - podlist = getPodList() - idMap = getIdMap(podlist) -``` - -**NOTE**: `getPodList()` would prefetch all the Pods in the current namespace, if some -Pods are alreay running, it may cause some error. We will use [statfulesets](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets) instead of -Kubernetes Pod or Replicaset in the future. - -The function `getIdMap(podlist)` fetches IPs addresses of `podlist` and then sort them -to generate `trainer_id`. - -```python -def getIdMap(podlist): - ''' - generate tainer_id by ip - ''' - ips = [] - for pod in podlist["items"]: - ips.append(pod["status"]["podIP"]) - ips.sort() - idMap = {} - for i in range(len(ips)): - idMap[ips[i]] = i - return idMap -``` - -After getting the `idMap`, we can generate the arguments of `Paddle PServer` and `Paddle Trainer` -so that we can start up them by `startPaddle(idMap, train_args_dict)`. - -### Create Job - -The main goal of `startPaddle` is generating the arguments of `Paddle PServer` and -`Paddle Trainer` processes. Take `Paddle Trainer` as an example, we parse the -environment variable and then get `PADDLE_NIC`, `PADDLE_PORT`, `PADDLE_PORTS_NUM` and etc..., -finally find `trainerId` from `idMap` according to its IP address. - -```python - program = 'paddle train' - args = " --nics=" + PADDLE_NIC - args += " --port=" + str(PADDLE_PORT) - args += " --ports_num=" + str(PADDLE_PORTS_NUM) - args += " --comment=" + "paddle_process_by_paddle" - ip_string = "" - for ip in idMap.keys(): - ip_string += (ip + ",") - ip_string = ip_string.rstrip(",") - args += " --pservers=" + ip_string - args_ext = "" - for key, value in train_args_dict.items(): - args_ext += (' --' + key + '=' + value) - localIP = socket.gethostbyname(socket.gethostname()) - trainerId = idMap[localIP] - args += " " + args_ext + " --trainer_id=" + \ - str(trainerId) + " --save_dir=" + JOB_PATH_OUTPUT -``` diff --git a/doc/v2/howto/cluster/multi_cluster/k8s_en.md b/doc/v2/howto/cluster/multi_cluster/k8s_en.md deleted file mode 100644 index 96ff652705726fc56fa0078593cd2a695fcdb5e2..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/multi_cluster/k8s_en.md +++ /dev/null @@ -1,210 +0,0 @@ -# Kubernetes - -In this article, we will introduce how to run PaddlePaddle training job on single CPU machine using Kubernetes. In next article, we will introduce how to run PaddlePaddle training job on distributed cluster. - -## Build Docker Image - -In distributed Kubernetes cluster, we will use Ceph or other distributed -storage system for storing training related data so that all processes in -PaddlePaddle training can retrieve data from Ceph. In this example, we will -only demo training job on single machine. In order to simplify the requirement -of the environment, we will directly put training data into the PaddlePaddle Docker Image, -so we need to create a PaddlePaddle Docker image that includes the training data. - -The production Docker Image `paddlepaddle/paddle:cpu-demo-latest` has the PaddlePaddle -source code and demo. (Caution: Default PaddlePaddle Docker Image `paddlepaddle/paddle:latest` doesn't include -the source code, PaddlePaddle's different versions of Docker Image can be referred here: -[Docker Installation Guide](http://paddlepaddle.org/docs/develop/documentation/zh/getstarted/build_and_install/docker_install_en.html)), -so we run this Docker Image and download the training data, and then commit the whole -Container to be a new Docker Image. - -### Run Docker Container - -``` -$ docker run --name quick_start_data -it paddlepaddle/paddle:cpu-demo-latest -``` - -### Download Training Data - -Getting into `/root/paddle/demo/quick_start/data` Directory,using `get_data.sh` to download training data. -Then getting into `/root/paddle/demo/quick_start` Directory, using `preprocess.sh` to pre-process training data. - -``` -$ root@fbd1f2bb71f4:~/paddle/demo/quick_start/data# ./get_data.sh - -Downloading Amazon Electronics reviews data... ---2016-10-31 01:33:43-- http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz -Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80 -Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected. -HTTP request sent, awaiting response... 200 OK -Length: 495854086 (473M) [application/x-gzip] -Saving to: 'reviews_Electronics_5.json.gz' - - 10% [=======> ] 874,279 64.7KB/s eta 2h 13m - -``` - -### Modify Startup Script - -After downloading the data,modify `/root/paddle/demo/quick_start/train.sh` file contents are as follows (one more cd cmd): -``` -set -e -cd /root/paddle/demo/quick_start -cfg=trainer_config.lr.py -#cfg=trainer_config.emb.py -#cfg=trainer_config.cnn.py -#cfg=trainer_config.lstm.py -#cfg=trainer_config.bidi-lstm.py -#cfg=trainer_config.db-lstm.py -paddle train \ - --config=$cfg \ - --save_dir=./output \ - --trainer_count=4 \ - --log_period=20 \ - --num_passes=15 \ - --use_gpu=false \ - --show_parameter_stats_period=100 \ - --test_all_data_in_one_period=1 \ - 2>&1 | tee 'train.log' -``` - -### Commit Docker Image - -``` -$ docker commit quick_start_data mypaddle/paddle:quickstart -``` - -## Use Kubernetes For Training - -We will use Kubernetes job for training process, following steps shows how to do the training with Kubernetes. - -### Create Yaml Files - -The output result in container will be demolished when job finished (container stopped running), so we need to mount the volume out to the local disk when creating the container to store the training result. Using our previously created image, we can create a [Kubernetes Job](http://kubernetes.io/docs/user-guide/jobs/#what-is-a-job), the yaml contents are as follows: - -``` -apiVersion: batch/v1 -kind: Job -metadata: - name: quickstart -spec: - parallelism: 1 - completions: 1 - template: - metadata: - name: quickstart - spec: - volumes: - - name: output - hostPath: - path: /home/work/paddle_output - containers: - - name: pi - image: mypaddle/paddle:quickstart - command: ["bin/bash", "-c", "/root/paddle/demo/quick_start/train.sh"] - volumeMounts: - - name: output - mountPath: /root/paddle/demo/quick_start/output - restartPolicy: Never -``` - -### Start PaddlePaddle Job - -Using the above yaml file to start the Kubernetes job. - -``` -$ kubectl create -f paddle.yaml -``` - -Get the detailed status of the job: - -``` -$ kubectl get job -NAME DESIRED SUCCESSFUL AGE -quickstart 1 0 58s - -$ kubectl describe job quickstart -Name: quickstart -Namespace: default -Image(s): registry.baidu.com/public/paddle:cpu-demo-latest -Selector: controller-uid=f120da72-9f18-11e6-b363-448a5b355b84 -Parallelism: 1 -Completions: 1 -Start Time: Mon, 31 Oct 2016 11:20:16 +0800 -Labels: controller-uid=f120da72-9f18-11e6-b363-448a5b355b84,job-name=quickstart -Pods Statuses: 0 Running / 1 Succeeded / 0 Failed -Volumes: - output: - Type: HostPath (bare host directory volume) - Path: /home/work/paddle_output -Events: - FirstSeen LastSeen Count From SubobjectPath Type Reason Message - --------- -------- ----- ---- ------------- -------- ------ ------- - 1m 1m 1 {job-controller } Normal SuccessfulCreate Created pod: quickstart-fa0wx -``` - -### Get Training Result - -We can use kubectl command to take a look at the status of related pod. - -``` -$ kubectl describe pod quickstart-fa0wx -Name: quickstart-fa0wx -Namespace: default -Node: paddle-demo-let02/10.206.202.44 -Start Time: Mon, 31 Oct 2016 11:20:17 +0800 -Labels: controller-uid=f120da72-9f18-11e6-b363-448a5b355b84,job-name=quickstart -Status: Succeeded -IP: 10.0.0.9 -Controllers: Job/quickstart -Containers: - quickstart: - Container ID: docker://b8561f5c79193550d64fa47418a9e67ebdd71546186e840f88de5026b8097465 - Image: registry.baidu.com/public/paddle:cpu-demo-latest - Image ID: docker://18e457ce3d362ff5f3febf8e7f85ffec852f70f3b629add10aed84f930a68750 - Port: - Command: - bin/bash - -c - /root/paddle/demo/quick_start/train.sh - QoS Tier: - cpu: BestEffort - memory: BestEffort - State: Terminated - Reason: Completed - Exit Code: 0 - Started: Mon, 31 Oct 2016 11:20:20 +0800 - Finished: Mon, 31 Oct 2016 11:21:46 +0800 - Ready: False - Restart Count: 0 - Environment Variables: -Conditions: - Type Status - Ready False -Volumes: - output: - Type: HostPath (bare host directory volume) - Path: /home/work/paddle_output -``` - -We can also ssh to Kubernetes node to take a look at the training result. - -``` -[root@paddle-demo-let02 paddle_output]# ll -total 60 -drwxr-xr-x 2 root root 4096 Oct 31 11:20 pass-00000 -drwxr-xr-x 2 root root 4096 Oct 31 11:20 pass-00001 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00002 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00003 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00004 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00005 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00006 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00007 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00008 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00009 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00010 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00011 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00012 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00013 -drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00014 -``` diff --git a/doc/v2/howto/cluster/multi_cluster/openmpi_cn.md b/doc/v2/howto/cluster/multi_cluster/openmpi_cn.md deleted file mode 100644 index 954b2215cc3136ec5b3e1cdc2f6d3f508f814516..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/multi_cluster/openmpi_cn.md +++ /dev/null @@ -1,41 +0,0 @@ -# 在OpenMPI集群中启动训练 - -## 准备OpenMPI集群 - -执行下面的命令以启动3个节点的OpenMPI集群和一个"head"节点: - -```bash -paddle/scripts/cluster_train_v2/openmpi/docker_cluster -kubectl create -f head.yaml -kubectl create -f mpi-nodes.yaml -``` - -然后可以从head节点ssh无密码登录到OpenMPI的每个节点上。 - -## 启动集群作业 - -您可以按照下面的步骤在OpenMPI集群中提交paddle训练任务: - -```bash -# 获得head和node节点的IP地址 -kubectl get po -o wide -# 将node节点的IP地址保存到machines文件中 -kubectl get po -o wide | grep nodes | awk '{print $6}' > machines -# 拷贝必要的文件到head节点 -scp -i ssh/id_rsa.mpi.pub machines prepare.py train.py start_mpi_train.sh tutorial@[headIP]:~ -# ssh 登录到head节点 -ssh -i ssh/id_rsa.mpi.pub tutorial@[headIP] -# --------------- 以下操作均在head节点中执行 --------------- -# 准备训练数据 -python prepare.py -# 拷贝训练程序和字典文件到每台MPI节点 -cat machines | xargs -i scp word_dict.pickle train.py start_mpi_train.sh machines {}:/home/tutorial -# 创建日志目录 -mpirun -hostfile machines -n 3 mkdir /home/tutorial/logs -# 拷贝训练数据到各自的节点 -scp train.txt-00000 test.txt-00000 [node1IP]:/home/tutorial -scp train.txt-00001 test.txt-00001 [node2IP]:/home/tutorial -scp train.txt-00002 test.txt-00002 [node3IP]:/home/tutorial -# 启动训练任务 -mpirun -hostfile machines -n 3 /home/tutorial/start_mpi_train.sh -``` diff --git a/doc/v2/howto/cluster/multi_cluster/openmpi_en.md b/doc/v2/howto/cluster/multi_cluster/openmpi_en.md deleted file mode 100644 index a5c02b336b8a974f546499acae32edac24219be9..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/multi_cluster/openmpi_en.md +++ /dev/null @@ -1,41 +0,0 @@ -# OpenMPI - -## Prepare an OpenMPI cluster - -Run the following command to start a 3-node MPI cluster and one "head" node. - -```bash -cd paddle/scripts/cluster_train_v2/openmpi/docker_cluster -kubectl create -f head.yaml -kubectl create -f mpi-nodes.yaml -``` - -Then you can log in to every OpenMPI node using ssh without input any passwords. - -## Launching Cluster Job - -Follow the steps to launch a PaddlePaddle training job in OpenMPI cluster:\ - -```bash -# find out node IP addresses -kubectl get po -o wide -# generate a "machines" file containing node IP addresses -kubectl get po -o wide | grep nodes | awk '{print $6}' > machines -# copy necessary files onto "head" node -scp -i ssh/id_rsa.mpi.pub machines prepare.py train.py start_mpi_train.sh tutorial@[headIP]:~ -# login to head node using ssh -ssh -i ssh/id_rsa.mpi.pub tutorial@[headIP] -# --------------- in head node --------------- -# prepare training data -python prepare.py -# copy training data and dict file to MPI nodes -cat machines | xargs -i scp word_dict.pickle train.py start_mpi_train.sh machines {}:/home/tutorial -# creat a directory for storing log files -mpirun -hostfile machines -n 3 mkdir /home/tutorial/logs -# copy training data to every node -scp train.txt-00000 test.txt-00000 [node1IP]:/home/tutorial -scp train.txt-00001 test.txt-00001 [node2IP]:/home/tutorial -scp train.txt-00002 test.txt-00002 [node3IP]:/home/tutorial -# start the job -mpirun -hostfile machines -n 3 /home/tutorial/start_mpi_train.sh -``` diff --git a/doc/v2/howto/cluster/multi_cluster/src/add_security_group.png b/doc/v2/howto/cluster/multi_cluster/src/add_security_group.png deleted file mode 100644 index bd34f46c9b0ada7027fd53e553e7d033255d25fc..0000000000000000000000000000000000000000 Binary files a/doc/v2/howto/cluster/multi_cluster/src/add_security_group.png and /dev/null differ diff --git a/doc/v2/howto/cluster/multi_cluster/src/create_efs.png b/doc/v2/howto/cluster/multi_cluster/src/create_efs.png deleted file mode 100644 index e5f1526033d1daf401700989af1d25919bcb7675..0000000000000000000000000000000000000000 Binary files a/doc/v2/howto/cluster/multi_cluster/src/create_efs.png and /dev/null differ diff --git a/doc/v2/howto/cluster/multi_cluster/src/k8s-paddle-arch.png b/doc/v2/howto/cluster/multi_cluster/src/k8s-paddle-arch.png deleted file mode 100644 index b3800c4fe81302d35e49f7dbacb9221c4dfa5cde..0000000000000000000000000000000000000000 Binary files a/doc/v2/howto/cluster/multi_cluster/src/k8s-paddle-arch.png and /dev/null differ diff --git a/doc/v2/howto/cluster/multi_cluster/src/k8s_data/Dockerfile b/doc/v2/howto/cluster/multi_cluster/src/k8s_data/Dockerfile deleted file mode 100644 index 6d3a12ae393aa594b8e6e9a5f726109426937284..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/multi_cluster/src/k8s_data/Dockerfile +++ /dev/null @@ -1,7 +0,0 @@ -FROM alpine - -RUN apk update && apk upgrade && apk add coreutils -ADD quick_start /quick_start -ADD get_data.sh /bin/ -RUN chmod +x /bin/get_data.sh -ENTRYPOINT ["/bin/get_data.sh"] diff --git a/doc/v2/howto/cluster/multi_cluster/src/k8s_data/README.md b/doc/v2/howto/cluster/multi_cluster/src/k8s_data/README.md deleted file mode 100644 index 83cef7affd0ac4d3a1ca08ea5b046fa81e1bc630..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/multi_cluster/src/k8s_data/README.md +++ /dev/null @@ -1,6 +0,0 @@ -To build PaddlePaddle data preparation image in tutorial [Distributed PaddlePaddle Training on AWS with Kubernetes](../../k8s_aws_en.md), run following commands: - -``` -cp -r ../../../../../../demo/quick_start . -docker build . -t prepare-data-image-name -``` diff --git a/doc/v2/howto/cluster/multi_cluster/src/k8s_data/get_data.sh b/doc/v2/howto/cluster/multi_cluster/src/k8s_data/get_data.sh deleted file mode 100755 index d187ba5ac8d03f69dfdefd4f63610ed7921575be..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/multi_cluster/src/k8s_data/get_data.sh +++ /dev/null @@ -1,26 +0,0 @@ -#!/bin/sh - -out_dir=$OUT_DIR -split_count=$SPLIT_COUNT - -set -e - -mkdir -p $out_dir -cp -r /quick_start $out_dir/ - -mkdir -p $out_dir/0/data -cd $out_dir/0/data -wget http://paddlepaddle.bj.bcebos.com/demo/quick_start_preprocessed_data/preprocessed_data.tar.gz -tar zxvf preprocessed_data.tar.gz -rm preprocessed_data.tar.gz - -split -d --number=l/$split_count -a 5 train.txt train. -mv train.00000 train.txt - -cd $out_dir -end=$(expr $split_count - 1) -for i in $(seq 1 $end); do - mkdir -p $i/data - cp -r 0/data/* $i/data - mv $i/data/train.`printf %05d $i` $i/data/train.txt -done; diff --git a/doc/v2/howto/cluster/multi_cluster/src/k8s_train/Dockerfile b/doc/v2/howto/cluster/multi_cluster/src/k8s_train/Dockerfile deleted file mode 100644 index 77f021a89a70d934bf70424eaa3c6dc3f7c93a28..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/multi_cluster/src/k8s_train/Dockerfile +++ /dev/null @@ -1,6 +0,0 @@ -FROM paddlepaddle/paddle:latest - -COPY start.sh /root/ -COPY start_paddle.py /root/ -RUN chmod +x /root/start.sh -CMD ["bash"," -c","/root/start.sh"] diff --git a/doc/v2/howto/cluster/multi_cluster/src/k8s_train/README.md b/doc/v2/howto/cluster/multi_cluster/src/k8s_train/README.md deleted file mode 100644 index 96bf65497ffa23e90c4c9350504f86367b48daf2..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/multi_cluster/src/k8s_train/README.md +++ /dev/null @@ -1,5 +0,0 @@ -To build PaddlePaddle training image in tutorial [Distributed PaddlePaddle Training on AWS with Kubernetes](../../k8s_aws_en.md), run following command: - -``` -docker build . -t train-image-name -``` diff --git a/doc/v2/howto/cluster/multi_cluster/src/k8s_train/start.sh b/doc/v2/howto/cluster/multi_cluster/src/k8s_train/start.sh deleted file mode 100755 index 12dfe1e6386885a6989d3887f21c6922f137a9ae..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/multi_cluster/src/k8s_train/start.sh +++ /dev/null @@ -1,19 +0,0 @@ -#!/bin/sh - -set -eu - -jobconfig=${JOB_PATH}"/"${JOB_NAME}"/"${TRAIN_CONFIG_DIR} -cd /root -cp -rf $jobconfig/* . - -python /root/start_paddle.py \ - --dot_period=10 \ - --ports_num=$CONF_PADDLE_PORTS_NUM \ - --ports_num_for_sparse=$CONF_PADDLE_PORTS_NUM_SPARSE \ - --log_period=50 \ - --num_passes=10 \ - --trainer_count=$TRAINER_COUNT \ - --saving_period=1 \ - --local=0 \ - --config=trainer_config.lr.py \ - --use_gpu=0 diff --git a/doc/v2/howto/cluster/multi_cluster/src/k8s_train/start_paddle.py b/doc/v2/howto/cluster/multi_cluster/src/k8s_train/start_paddle.py deleted file mode 100755 index 935c12bb67e1fe08bc135a7a2220fcd43c548482..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/multi_cluster/src/k8s_train/start_paddle.py +++ /dev/null @@ -1,170 +0,0 @@ -#!/usr/bin/python -# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import requests -import time -import socket -import os -import argparse - -# configuration for cluster -API = "/api/v1/namespaces/" -JOBSELECTOR = "labelSelector=job-name=" -JOB_PATH = os.getenv("JOB_PATH") + "/" + os.getenv("JOB_NAME") -JOB_PATH_OUTPUT = JOB_PATH + "/output" -JOBNAME = os.getenv("JOB_NAME") -NAMESPACE = os.getenv("JOB_NAMESPACE") -PADDLE_NIC = os.getenv("CONF_PADDLE_NIC") -PADDLE_PORT = os.getenv("CONF_PADDLE_PORT") -PADDLE_PORTS_NUM = os.getenv("CONF_PADDLE_PORTS_NUM") -PADDLE_PORTS_NUM_SPARSE = os.getenv("CONF_PADDLE_PORTS_NUM_SPARSE") -PADDLE_SERVER_NUM = os.getenv("CONF_PADDLE_GRADIENT_NUM") - -tokenpath = '/var/run/secrets/kubernetes.io/serviceaccount/token' - - -def refine_unknown_args(cmd_args): - ''' - refine unknown parameters to handle some special parameters - ''' - new_args = [] - for arg in cmd_args: - if arg.startswith("--") and arg.find("=") != -1: - equal_pos = arg.find("=") # find first = pos - arglist = list(arg) - arglist[equal_pos] = " " - arg = "".join(arglist) - arg = arg.lstrip("-") - new_args += arg.split(" ") - elif arg.startswith("--") and arg.find("=") == -1: - arg = arg.lstrip("-") - new_args.append(arg) - else: - new_args.append(arg) - return new_args - - -def isPodAllRunning(podlist): - ''' - check all pod is running - ''' - require = len(podlist["items"]) - running = 0 - for pod in podlist["items"]: - if pod["status"]["phase"] == "Running": - running += 1 - print "waiting for pods running, require:", require, "running:", running - if require == running: - return True - return False - - -def getPodList(): - ''' - get all container status of the job - ''' - apiserver = "https://" + \ - os.getenv("KUBERNETES_SERVICE_HOST") + ":" + \ - os.getenv("KUBERNETES_SERVICE_PORT_HTTPS") - - pod = API + NAMESPACE + "/pods?" - job = JOBNAME - if os.path.isfile(tokenpath): - tokenfile = open(tokenpath, mode='r') - token = tokenfile.read() - Bearer = "Bearer " + token - headers = {"Authorization": Bearer} - return requests.get(apiserver + pod + JOBSELECTOR + job, - headers=headers, - verify=False).json() - else: - return requests.get(apiserver + pod + JOBSELECTOR + job, - verify=False).json() - - -def getIdMap(podlist): - ''' - generate tainer_id by ip - ''' - ips = [] - for pod in podlist["items"]: - ips.append(pod["status"]["podIP"]) - ips.sort() - idMap = {} - for i in range(len(ips)): - idMap[ips[i]] = i - return idMap - - -def startPaddle(idMap={}, train_args_dict=None): - ''' - start paddle pserver and trainer - ''' - program = 'paddle train' - args = " --nics=" + PADDLE_NIC - args += " --port=" + str(PADDLE_PORT) - args += " --ports_num=" + str(PADDLE_PORTS_NUM) - args += " --comment=" + "paddle_process_by_paddle" - ip_string = "" - for ip in idMap.keys(): - ip_string += (ip + ",") - ip_string = ip_string.rstrip(",") - args += " --pservers=" + ip_string - args_ext = "" - for key, value in train_args_dict.items(): - args_ext += (' --' + key + '=' + value) - localIP = socket.gethostbyname(socket.gethostname()) - trainerId = idMap[localIP] - args += " " + args_ext + " --trainer_id=" + \ - str(trainerId) + " --save_dir=" + JOB_PATH_OUTPUT - logDir = JOB_PATH_OUTPUT + "/node_" + str(trainerId) - if not os.path.exists(JOB_PATH_OUTPUT): - os.makedirs(JOB_PATH_OUTPUT) - if not os.path.exists(logDir): - os.mkdir(logDir) - copyCommand = 'cp -rf ' + JOB_PATH + \ - "/" + str(trainerId) + "/data/*" + " ./data/" - os.system(copyCommand) - startPserver = 'nohup paddle pserver' + \ - " --port=" + str(PADDLE_PORT) + \ - " --ports_num=" + str(PADDLE_PORTS_NUM) + \ - " --ports_num_for_sparse=" + str(PADDLE_PORTS_NUM_SPARSE) + \ - " --nics=" + PADDLE_NIC + \ - " --comment=" + "paddle_process_by_paddle" + \ - " --num_gradient_servers=" + str(PADDLE_SERVER_NUM) +\ - " > " + logDir + "/server.log 2>&1 &" - print startPserver - os.system(startPserver) - # wait until pservers completely start - time.sleep(20) - startTrainer = program + args + " 2>&1 | tee " + \ - logDir + "/train.log" - print startTrainer - os.system(startTrainer) - - -if __name__ == '__main__': - parser = argparse.ArgumentParser( - prog="start_paddle.py", description='simple tool for k8s') - args, train_args_list = parser.parse_known_args() - train_args = refine_unknown_args(train_args_list) - train_args_dict = dict(zip(train_args[:-1:2], train_args[1::2])) - podlist = getPodList() - # need to wait until all pods are running - while not isPodAllRunning(podlist): - time.sleep(20) - podlist = getPodList() - idMap = getIdMap(podlist) - startPaddle(idMap, train_args_dict) diff --git a/doc/v2/howto/cluster/multi_cluster/src/pserver_and_trainer.png b/doc/v2/howto/cluster/multi_cluster/src/pserver_and_trainer.png deleted file mode 100644 index f41fe48920590333ad332bb51eb18e03dc251541..0000000000000000000000000000000000000000 Binary files a/doc/v2/howto/cluster/multi_cluster/src/pserver_and_trainer.png and /dev/null differ diff --git a/doc/v2/howto/cluster/multi_cluster/src/route53_create_recordset.png b/doc/v2/howto/cluster/multi_cluster/src/route53_create_recordset.png deleted file mode 100644 index 34e476c7beac30fcdde13fccc4cc8d08b4be3d35..0000000000000000000000000000000000000000 Binary files a/doc/v2/howto/cluster/multi_cluster/src/route53_create_recordset.png and /dev/null differ diff --git a/doc/v2/howto/cluster/multi_cluster/src/route53_create_zone.png b/doc/v2/howto/cluster/multi_cluster/src/route53_create_zone.png deleted file mode 100644 index 25b7ddb831c5cba97f4b2edddd27da3234d621af..0000000000000000000000000000000000000000 Binary files a/doc/v2/howto/cluster/multi_cluster/src/route53_create_zone.png and /dev/null differ diff --git a/doc/v2/howto/cluster/multi_cluster/src/worker_security_group.png b/doc/v2/howto/cluster/multi_cluster/src/worker_security_group.png deleted file mode 100644 index 57eb0265a34ad4223b69600d2a3dd355482e0bf5..0000000000000000000000000000000000000000 Binary files a/doc/v2/howto/cluster/multi_cluster/src/worker_security_group.png and /dev/null differ diff --git a/doc/v2/howto/cluster/preparations_cn.md b/doc/v2/howto/cluster/preparations_cn.md deleted file mode 100644 index ce40697e703503b66f6306e15ebdb0ce1329991d..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/preparations_cn.md +++ /dev/null @@ -1,16 +0,0 @@ -## 环境准备 - -1. 准备您的计算集群。计算集群通常由一组(几台到几千台规模)的Linux服务器组成。服务器之间可以通过局域网(LAN)联通,每台服务器具有集群中唯一的IP地址(或者可被DNS解析的主机名)。集群中的每台计算机通常被成为一个“节点”。 -1. 我们需要在集群的所有节点上安装 PaddlePaddle。 如果要启用GPU,还需要在节点上安装对应的GPU驱动以及CUDA。PaddlePaddle的安装可以参考[build_and_install](http://www.paddlepaddle.org/docs/develop/documentation/zh/getstarted/build_and_install/index_cn.html)的多种安装方式。我们推荐使用[Docker](http://www.paddlepaddle.org/docs/develop/documentation/zh/getstarted/build_and_install/docker_install_cn.html)安装方式来快速安装PaddlePaddle。 - -安装完成之后,执行下面的命令可以查看已经安装的版本(docker安装方式可以进入docker容器执行:`docker run -it paddlepaddle/paddle:[tag] /bin/bash`): -```bash -$ paddle version -PaddlePaddle 0.10.0, compiled with - with_avx: ON - with_gpu: OFF - with_double: OFF - with_python: ON - with_rdma: OFF - with_timer: OFF -``` diff --git a/doc/v2/howto/cluster/preparations_en.md b/doc/v2/howto/cluster/preparations_en.md deleted file mode 100644 index 4b77b293907ae0548134fc65ceed3aa0ed0b845d..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/preparations_en.md +++ /dev/null @@ -1,17 +0,0 @@ -## Preparations - -1. Prepare your computer cluster. It's normally a bunch of Linux servers connected by LAN. Each server will be assigned a unique IP address. The computers in the cluster can be called "nodes". -2. Install PaddlePaddle on every node. If you are going to take advantage of GPU cards, you'll also need to install proper driver and CUDA libraries. To install PaddlePaddle please read [this build and install](http://www.paddlepaddle.org/docs/develop/documentation/en/getstarted/build_and_install/index_en.html) document. We strongly recommend using [Docker installation](http://www.paddlepaddle.org/docs/develop/documentation/en/getstarted/build_and_install/docker_install_en.html). - -After installation, you can check the version by typing the below command (run a docker container if using docker: `docker run -it paddlepaddle/paddle:[tag] /bin/bash`): - -```bash -$ paddle version -PaddlePaddle 0.10.0rc, compiled with - with_avx: ON - with_gpu: OFF - with_double: OFF - with_python: ON - with_rdma: OFF - with_timer: OFF -``` diff --git a/doc/v2/howto/cluster/src/Dockerfile b/doc/v2/howto/cluster/src/Dockerfile deleted file mode 100644 index e178bf4da0f32fca9586b5b69a2c7419de5d9cb1..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/src/Dockerfile +++ /dev/null @@ -1,7 +0,0 @@ -FROM paddlepaddle/paddle:latest - -MAINTAINER zjsxzong89@gmail.com - -COPY start.sh /root/ -COPY start_paddle.py /root/ -CMD ["bash"," -c","/root/start.sh"] \ No newline at end of file diff --git a/doc/v2/howto/cluster/src/efs_mount.png b/doc/v2/howto/cluster/src/efs_mount.png deleted file mode 100644 index 0f9e3cab98445707e5e9baa18ddabe15cdf04576..0000000000000000000000000000000000000000 Binary files a/doc/v2/howto/cluster/src/efs_mount.png and /dev/null differ diff --git a/doc/v2/howto/cluster/src/managed_policy.png b/doc/v2/howto/cluster/src/managed_policy.png deleted file mode 100644 index c7ecda555b81d7750e9292a9ab72d2f517f76a2a..0000000000000000000000000000000000000000 Binary files a/doc/v2/howto/cluster/src/managed_policy.png and /dev/null differ diff --git a/doc/v2/howto/cluster/src/ps_cn.png b/doc/v2/howto/cluster/src/ps_cn.png deleted file mode 100644 index f9525739cc8bc6506adde642aafa0a85ae3ebebc..0000000000000000000000000000000000000000 Binary files a/doc/v2/howto/cluster/src/ps_cn.png and /dev/null differ diff --git a/doc/v2/howto/cluster/src/ps_en.png b/doc/v2/howto/cluster/src/ps_en.png deleted file mode 100644 index 6537d3d56589ca9f19a77a50a970e4b5275e6ce0..0000000000000000000000000000000000000000 Binary files a/doc/v2/howto/cluster/src/ps_en.png and /dev/null differ diff --git a/doc/v2/howto/cluster/src/trainer.png b/doc/v2/howto/cluster/src/trainer.png deleted file mode 100644 index 6537d3d56589ca9f19a77a50a970e4b5275e6ce0..0000000000000000000000000000000000000000 Binary files a/doc/v2/howto/cluster/src/trainer.png and /dev/null differ diff --git a/doc/v2/howto/cluster/src/trainer_cn.png b/doc/v2/howto/cluster/src/trainer_cn.png deleted file mode 100644 index f9525739cc8bc6506adde642aafa0a85ae3ebebc..0000000000000000000000000000000000000000 Binary files a/doc/v2/howto/cluster/src/trainer_cn.png and /dev/null differ diff --git a/doc/v2/howto/cluster/src/word2vec/api_train_v2.py b/doc/v2/howto/cluster/src/word2vec/api_train_v2.py deleted file mode 100644 index 9107e24c175f1fbf29d86e222e4b66031a5b505e..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/src/word2vec/api_train_v2.py +++ /dev/null @@ -1,114 +0,0 @@ -# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import gzip -import math - -import paddle.v2 as paddle - -embsize = 32 -hiddensize = 256 -N = 5 - - -def wordemb(inlayer): - wordemb = paddle.layer.embedding( - input=inlayer, - size=embsize, - param_attr=paddle.attr.Param( - name="_proj", - initial_std=0.001, - learning_rate=1, - l2_rate=0, - sparse_update=True)) - return wordemb - - -def main(): - # for local training - cluster_train = False - - if not cluster_train: - paddle.init(use_gpu=False, trainer_count=1) - else: - paddle.init( - use_gpu=False, - trainer_count=2, - port=7164, - ports_num=1, - ports_num_for_sparse=1, - num_gradient_servers=1) - word_dict = paddle.dataset.imikolov.build_dict() - dict_size = len(word_dict) - firstword = paddle.layer.data( - name="firstw", type=paddle.data_type.integer_value(dict_size)) - secondword = paddle.layer.data( - name="secondw", type=paddle.data_type.integer_value(dict_size)) - thirdword = paddle.layer.data( - name="thirdw", type=paddle.data_type.integer_value(dict_size)) - fourthword = paddle.layer.data( - name="fourthw", type=paddle.data_type.integer_value(dict_size)) - nextword = paddle.layer.data( - name="fifthw", type=paddle.data_type.integer_value(dict_size)) - - Efirst = wordemb(firstword) - Esecond = wordemb(secondword) - Ethird = wordemb(thirdword) - Efourth = wordemb(fourthword) - - contextemb = paddle.layer.concat(input=[Efirst, Esecond, Ethird, Efourth]) - hidden1 = paddle.layer.fc(input=contextemb, - size=hiddensize, - act=paddle.activation.Sigmoid(), - layer_attr=paddle.attr.Extra(drop_rate=0.5), - bias_attr=paddle.attr.Param(learning_rate=2), - param_attr=paddle.attr.Param( - initial_std=1. / math.sqrt(embsize * 8), - learning_rate=1)) - predictword = paddle.layer.fc(input=hidden1, - size=dict_size, - bias_attr=paddle.attr.Param(learning_rate=2), - act=paddle.activation.Softmax()) - - def event_handler(event): - if isinstance(event, paddle.event.EndIteration): - if event.batch_id % 100 == 0: - with gzip.open("batch-" + str(event.batch_id) + ".tar.gz", - 'w') as f: - trainer.save_parameter_to_tar(f) - result = trainer.test( - paddle.batch( - paddle.dataset.imikolov.test(word_dict, N), 32)) - print "Pass %d, Batch %d, Cost %f, %s, Testing metrics %s" % ( - event.pass_id, event.batch_id, event.cost, event.metrics, - result.metrics) - - cost = paddle.layer.classification_cost(input=predictword, label=nextword) - - parameters = paddle.parameters.create(cost) - adagrad = paddle.optimizer.AdaGrad( - learning_rate=3e-3, - regularization=paddle.optimizer.L2Regularization(8e-4)) - trainer = paddle.trainer.SGD(cost, - parameters, - adagrad, - is_local=not cluster_train) - trainer.train( - paddle.batch(paddle.dataset.imikolov.train(word_dict, N), 32), - num_passes=30, - event_handler=event_handler) - - -if __name__ == '__main__': - main() diff --git a/doc/v2/howto/cluster/src/word2vec/api_train_v2_cluster.py b/doc/v2/howto/cluster/src/word2vec/api_train_v2_cluster.py deleted file mode 100644 index 791504094f3ecae925226ff1d90f20f91d4c018d..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/src/word2vec/api_train_v2_cluster.py +++ /dev/null @@ -1,137 +0,0 @@ -# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import math -import os -import paddle.v2 as paddle -import pickle - -embsize = 32 -hiddensize = 256 -N = 5 -cluster_train_file = "./train_data_dir/train/train.txt" -cluster_test_file = "./test_data_dir/test/test.txt" -node_id = os.getenv("OMPI_COMM_WORLD_RANK") -if not node_id: - raise EnvironmentError("must provied OMPI_COMM_WORLD_RANK") - - -def wordemb(inlayer): - wordemb = paddle.layer.embedding( - input=inlayer, - size=embsize, - param_attr=paddle.attr.Param( - name="_proj", - initial_std=0.001, - learning_rate=1, - l2_rate=0, - sparse_update=True)) - return wordemb - - -def cluster_reader_cluster(filename, node_id): - def cluster_reader(): - with open("-".join([filename, "%05d" % int(node_id)]), "r") as f: - for l in f: - csv_data = [int(cell) for cell in l.split(",")] - yield tuple(csv_data) - - return cluster_reader - - -def main(): - # get arguments from env - - # for local training - TRUTH = ["true", "True", "TRUE", "1", "yes", "Yes", "YES"] - cluster_train = os.getenv('PADDLE_CLUSTER_TRAIN', "False") in TRUTH - use_gpu = os.getenv('PADDLE_INIT_USE_GPU', "False") - - if not cluster_train: - paddle.init( - use_gpu=use_gpu, - trainer_count=int(os.getenv("PADDLE_INIT_TRAINER_COUNT", "1"))) - else: - paddle.init( - use_gpu=use_gpu, - trainer_count=int(os.getenv("PADDLE_INIT_TRAINER_COUNT", "1")), - port=int(os.getenv("PADDLE_INIT_PORT", "7164")), - ports_num=int(os.getenv("PADDLE_INIT_PORTS_NUM", "1")), - ports_num_for_sparse=int( - os.getenv("PADDLE_INIT_PORTS_NUM_FOR_SPARSE", "1")), - num_gradient_servers=int( - os.getenv("PADDLE_INIT_NUM_GRADIENT_SERVERS", "1")), - trainer_id=int(os.getenv("PADDLE_INIT_TRAINER_ID", "0")), - pservers=os.getenv("PADDLE_INIT_PSERVERS", "127.0.0.1")) - fn = open("thirdparty/wuyi_train_thdpty/word_dict.pickle", "r") - word_dict = pickle.load(fn) - fn.close() - dict_size = len(word_dict) - firstword = paddle.layer.data( - name="firstw", type=paddle.data_type.integer_value(dict_size)) - secondword = paddle.layer.data( - name="secondw", type=paddle.data_type.integer_value(dict_size)) - thirdword = paddle.layer.data( - name="thirdw", type=paddle.data_type.integer_value(dict_size)) - fourthword = paddle.layer.data( - name="fourthw", type=paddle.data_type.integer_value(dict_size)) - nextword = paddle.layer.data( - name="fifthw", type=paddle.data_type.integer_value(dict_size)) - - Efirst = wordemb(firstword) - Esecond = wordemb(secondword) - Ethird = wordemb(thirdword) - Efourth = wordemb(fourthword) - - contextemb = paddle.layer.concat(input=[Efirst, Esecond, Ethird, Efourth]) - hidden1 = paddle.layer.fc(input=contextemb, - size=hiddensize, - act=paddle.activation.Sigmoid(), - layer_attr=paddle.attr.Extra(drop_rate=0.5), - bias_attr=paddle.attr.Param(learning_rate=2), - param_attr=paddle.attr.Param( - initial_std=1. / math.sqrt(embsize * 8), - learning_rate=1)) - predictword = paddle.layer.fc(input=hidden1, - size=dict_size, - bias_attr=paddle.attr.Param(learning_rate=2), - act=paddle.activation.Softmax()) - - def event_handler(event): - if isinstance(event, paddle.event.EndIteration): - if event.batch_id % 100 == 0: - result = trainer.test( - paddle.batch( - cluster_reader_cluster(cluster_test_file, node_id), 32)) - print "Pass %d, Batch %d, Cost %f, %s, Testing metrics %s" % ( - event.pass_id, event.batch_id, event.cost, event.metrics, - result.metrics) - - cost = paddle.layer.classification_cost(input=predictword, label=nextword) - parameters = paddle.parameters.create(cost) - adagrad = paddle.optimizer.AdaGrad( - learning_rate=3e-3, - regularization=paddle.optimizer.L2Regularization(8e-4)) - trainer = paddle.trainer.SGD(cost, - parameters, - adagrad, - is_local=not cluster_train) - trainer.train( - paddle.batch(cluster_reader_cluster(cluster_train_file, node_id), 32), - num_passes=30, - event_handler=event_handler) - - -if __name__ == '__main__': - main() diff --git a/doc/v2/howto/cluster/src/word2vec/prepare.py b/doc/v2/howto/cluster/src/word2vec/prepare.py deleted file mode 100644 index a42548fbf03a0298e1e397c868e4d531801ec89a..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cluster/src/word2vec/prepare.py +++ /dev/null @@ -1,55 +0,0 @@ -# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import paddle.v2 as paddle -import tarfile -import os -import pickle - -SPLIT_COUNT = 3 -N = 5 - - -def file_len(fd): - for i, l in enumerate(fd): - pass - return i + 1 - - -def split_from_reader_by_line(filename, reader, split_count): - fn = open(filename, "w") - for batch_id, batch_data in enumerate(reader()): - batch_data_str = [str(d) for d in batch_data] - fn.write(",".join(batch_data_str)) - fn.write("\n") - fn.close() - - fn = open(filename, "r") - total_line_count = file_len(fn) - fn.close() - per_file_lines = total_line_count / split_count + 1 - cmd = "split -d -a 5 -l %d %s %s-" % (per_file_lines, filename, filename) - os.system(cmd) - - -word_dict = paddle.dataset.imikolov.build_dict() -with open("word_dict.pickle", "w") as dict_f: - pickle.dump(word_dict, dict_f) - -split_from_reader_by_line("train.txt", - paddle.dataset.imikolov.train(word_dict, N), - SPLIT_COUNT) -split_from_reader_by_line("test.txt", - paddle.dataset.imikolov.test(word_dict, N), - SPLIT_COUNT) diff --git a/doc/v2/howto/cmd_parameter/arguments_cn.md b/doc/v2/howto/cmd_parameter/arguments_cn.md deleted file mode 100644 index 2dea231ca5487978d59a4d0a570431722ed6b3bf..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cmd_parameter/arguments_cn.md +++ /dev/null @@ -1,394 +0,0 @@ -# 参数概述 - -虽然Paddle看起来包含了众多参数,但是大部分参数是为开发者提供的,或者已经在集群提交环境中自动设置,因此用户并不需要关心它们。在此,根据这些参数的使用场合,我们将它们划分为不同的类别。例如,`通用`类别中的参数可用于所有场合。某些参数只可用于特定的层中,而有些参数需要在集群多机训练中使用等。 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -√ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
参数本地训练集群训练本地测试集群测试
通用job
use_gpu
local
config
config_args
num_passes
trainer_count
version
show_layer_stat
训练dot_period
test_period
saving_period
show_parameter_stats_period
init_model_path
load_missing_parameter_strategy
saving_period_by_batches
use_old_updater
enable_grad_share
grad_share_block_num
log_error_clipping
log_clipping
save_only_one
start_pass
训练/测试save_dir
训练过程中测试test_period
average_test_period
测试model_list
test_wait
test_pass
predict_output_dir
distribute_test
Auc/正负对验证(PnpairValidation)predict_file
GPUgpu_id
parallel_nn
allow_only_one_model_on_one_gpu
cudnn_dir
cuda_dir
cudnn_conv_workspace_limit_in_mb
递归神经网络(RNN)beam_size
rnn_use_batch
prev_batch_state
diy_beam_search_prob_so
参数服务器(PServer)start_pserver
pservers
port
port_num
ports_num_for_sparse
nics
rdma_tcp
small_messages
loadsave_parameters_in_pserver
log_period_server
pserver_num_threads
sock_send_buf_size
sock_recv_buf_size
num_gradient_servers
parameter_block_size
parameter_block_size_for_sparse
异步随机梯度下降(Async SGD)async_count
async_lagged_ratio_min
async_lagged_ratio_default
性能调优(Performance Tuning)log_barrier_abstract
log_barrier_lowest_nodes
log_barrier_show_log
check_sparse_distribution_batches
check_sparse_distribution_ratio
check_sparse_distribution_unbalance_degree
check_sparse_distribution_in_pserver
show_check_sparse_distribution_log
数据提供器(Data Provider)memory_threshold_on_load_data
随机数seed
thread_local_rand_use_global_seed
单元测试checkgrad_eps
矩阵/向量enable_parallel_vector
- diff --git a/doc/v2/howto/cmd_parameter/arguments_en.md b/doc/v2/howto/cmd_parameter/arguments_en.md deleted file mode 100644 index d1963067bda949b11ececefed3db7db1432c6223..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cmd_parameter/arguments_en.md +++ /dev/null @@ -1,394 +0,0 @@ -# Argument Outline - -It looks like there are a lot of arguments. However, most of them are for developers or alrealy set automatically in cluster submitting environment and users do not need to care about them. Here, we divide these arguments into serveral classes according to the scenario that they are used in. For example, the arguments in `common` can be used in all scenes. Some arguments can be only used in certain layers. Some are needed by multi machines training in cluster, etc. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -√ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
argslocal traincluster trainlocal testcluster test
commonjob
use_gpu
local
config
config_args
num_passes
trainer_count
version
show_layer_stat
traindot_period
test_period
saving_period
show_parameter_stats_period
init_model_path
load_missing_parameter_strategy
saving_period_by_batches
use_old_updater
enable_grad_share
grad_share_block_num
log_error_clipping
log_clipping
save_only_one
start_pass
train/testsave_dir
testing during trainingtest_period
average_test_period
testmodel_list
test_wait
test_pass
predict_output_dir
distribute_test
Auc/PnpairValidationpredict_file
GPUgpu_id
parallel_nn
allow_only_one_model_on_one_gpu
cudnn_dir
cuda_dir
cudnn_conv_workspace_limit_in_mb
RNNbeam_size
rnn_use_batch
prev_batch_state
diy_beam_search_prob_so
PServerstart_pserver
pservers
port
port_num
ports_num_for_sparse
nics
rdma_tcp
small_messages
loadsave_parameters_in_pserver
log_period_server
pserver_num_threads
sock_send_buf_size
sock_recv_buf_size
num_gradient_servers
parameter_block_size
parameter_block_size_for_sparse
Async SGDasync_count
async_lagged_ratio_min
async_lagged_ratio_default
Performance Tuninglog_barrier_abstract
log_barrier_lowest_nodes
log_barrier_show_log
check_sparse_distribution_batches
check_sparse_distribution_ratio
check_sparse_distribution_unbalance_degree
check_sparse_distribution_in_pserver
show_check_sparse_distribution_log
Data Providermemory_threshold_on_load_data
RandomNumberseed
thread_local_rand_use_global_seed
UnitTestcheckgrad_eps
Matrix/Vectorenable_parallel_vector
- diff --git a/doc/v2/howto/cmd_parameter/detail_introduction_cn.md b/doc/v2/howto/cmd_parameter/detail_introduction_cn.md deleted file mode 100644 index b4625ba68cf23e5697554ba94efaf0b873f2c1de..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cmd_parameter/detail_introduction_cn.md +++ /dev/null @@ -1,323 +0,0 @@ -# 细节描述 - -## 通用 - -* `--job` - - 工作模式,包括: **train, test, checkgrad**,其中checkgrad主要为开发者使用,使用者不需要关心。 - - 类型: string (默认: train) - -* `--config` - - 用于指定网络配置文件。 - - 类型: string (默认: null). - -* `--use_gpu` - - 训练过程是否使用GPU,设置为true使用GPU模式,否则使用CPU模式。 - - 类型: bool (默认: 1). - -* `--local` -  - 训练过程是否为本地模式,设置为true使用本地训练或者使用集群上的一个节点,否则使用多机训练。 - - 类型: bool (默认: 1). - -* `--trainer_count` - - 指定一台机器上使用的线程数。例如,trainer_count = 4, 意思是在GPU模式下使用4个GPU,或者在CPU模式下使用4个线程。每个线程(或GPU)分配到当前数据块样本数的四分之一。也就是说,如果在训练配置中设置batch_size为512,每个线程分配到128个样本用于训练。 - - 类型: int32 (默认: 1). - -* `--num_passes` - - 当模式为`--job=train`时, 该参数的意思是训练num_passes轮。每轮会将数据集中的所有训练样本使用一次。当模式为`--job=test`时,意思是使用第test_pass个模型到第 num_passes-1 个模型测试数据。 - - 类型: int32 (默认: 100). - -* `--config_args` - - 传递给配置文件的参数。格式: key1=value1,key2=value2. - - 类型: string (默认: null). - -* `--version` - - 是否打印版本信息。 - - 类型: bool (默认: 0). - -* `--show_layer_stat` - - 是否显示**每个批次数据**中每层的数值统计. - - 类型: bool (默认: 0). - -## 训练 - -* `--log_period` - - 每log_period个批次打印日志进度. - - 类型: int32 (默认: 100). - -* `--dot_period` - - 每dot_period个批次输出符号'.'. - - 类型: int32 (默认: 1). - -* `--saving_period` - - 每saving_period轮保存训练参数. - - 类型: int32 (默认: 1). - -* `--save_dir` - - 保存模型参数的目录,需要明确指定,但不需要提前创建。 - - 类型: string (默认: null). - -* `--start_pass` - - 从start_pass轮开始训练,会加载上一轮的参数。 - - 类型: int32 (默认: 0). - -* `--show_parameter_stats_period` - - 在训练过程中每show_parameter_stats_period个批次输出参数统计。默认不显示。 - - 类型: int32 (默认: 0). - -* `--save_only_one` - - 只保存最后一轮的参数,而之前的参数将会被删除。 - - 类型: bool (默认: 0). - -* `--load_missing_parameter_strategy` - - 当模型参数不存在时,指定加载的方式。目前支持fail/rand/zero三种操作. - - `fail`: 程序直接退出. - - `rand`: 根据网络配置中的**initial\_strategy**采用均匀分布或者高斯分布初始化。均匀分布的范围是: **[mean - std, mean + std]**, 其中mean和std是训练配置中的参数. - - `zero`: 所有参数置为零. - - 类型: string (默认: fail). - -* `--init_model_path` - - 初始化模型的路径。如果设置该参数,start\_pass将不起作用。同样也可以在测试模式中指定模型路径。 - - 类型: string (默认: null). - -* `--saving_period_by_batches` - - 在一轮中每saving_period_by_batches个批次保存一次参数。 - - 类型: int32 (默认: 0). - -* `--log_error_clipping` - - 当在网络层配置中设置**error_clipping_threshold**时,该参数指示是否打印错误截断日志。如果为true,**每批次**的反向传播将会打印日志信息。该截断会影响**输出的梯度**. - - 类型: bool (默认: 0). - -* `--log_clipping` - - 当在训练配置中设置**gradient_clipping_threshold**时,该参数指示是否打印日志截断信息。该截断会影响**权重更新的梯度**. - - 类型: bool (默认: 0). - -* `--use_old_updater` - - 是否使用旧的RemoteParameterUpdater。 默认使用ConcurrentRemoteParameterUpdater,主要为开发者使用,使用者通常无需关心. - - 类型: bool (默认: 0). - -* `--enable_grad_share` - - 启用梯度参数的阈值,在多CPU训练时共享该参数. - - 类型: int32 (默认: 100 \* 1024 \* 1024). - -* `--grad_share_block_num` - - 梯度参数的分块数目,在多CPU训练时共享该参数. - - 类型: int32 (默认: 64). - -## 测试 - -* `--test_pass` - - 加载test_pass轮的模型用于测试. - - 类型: int32 (默认: -1). - -* `--test_period` - - 如果为0,每轮结束时对所有测试数据进行测试;如果不为0,每test_period个批次对所有测试数据进行测试. - - 类型: int32 (默认: 0). - -* `--test_wait` - - 指示当指定轮的测试模型不存在时,是否需要等待该轮模型参数。如果在训练期间同时发起另外一个进程进行测试,可以使用该参数. - - 类型: bool (默认: 0). - -* `--model_list` - - 测试时指定的存储模型列表的文件. - - 类型: string (默认: "", null). - -* `--predict_output_dir` - - 保存网络层输出结果的目录。该参数在网络配置的Outputs()中指定,默认为null,意思是不保存结果。在测试阶段,如果你想要保存某些层的特征图,请指定该目录。需要注意的是,网络层的输出是经过激活函数之后的值. - - 类型: string (默认: "", null). - -* `--average_test_period` - - 使用`average_test_period`个批次的参数平均值进行测试。该参数必须能被FLAGS_log_period整除,默认为0,意思是不使用平均参数执行测试. - - 类型: int32 (默认: 0). - -* `--distribute_test` - - 在分布式环境中测试,将多台机器的测试结果合并. - - 类型: bool (默认: 0). - -* `--predict_file` - - 保存预测结果的文件名。该参数默认为null,意思是不保存结果。目前该参数仅用于AucValidationLayer和PnpairValidationLayer层,每轮都会保存预测结果. - - 类型: string (默认: "", null). - -## GPU - -* `--gpu_id` - - 指示使用哪个GPU核. - - 类型: int32 (默认: 0). - -* `--allow_only_one_model_on_one_gpu` - - 如果为true,一个GPU设备上不允许配置多个模型. - - 类型: bool (默认: 1). - -* `--parallel_nn` - - 指示是否使用多线程来计算一个神经网络。如果为false,设置gpu_id指定使用哪个GPU核(训练配置中的设备属性将会无效)。如果为true,GPU核在训练配置中指定(gpu_id无效). - - 类型: bool (默认: 0). - -* `--cudnn_dir` - - 选择路径来动态加载NVIDIA CuDNN库,例如,/usr/local/cuda/lib64. [默认]: LD_LIBRARY_PATH - - 类型: string (默认: "", null) - -* `--cuda_dir` - - 选择路径来动态加载NVIDIA CUDA库,例如,/usr/local/cuda/lib64. [默认]: LD_LIBRARY_PATH - - 类型: string (默认: "", null) - -* `--cudnn_conv_workspace_limit_in_mb` - - 指定cuDNN的最大工作空间容限,单位是MB,默认为4096MB=4GB. - - 类型: int32 (默认: 4096MB=4GB) - -## 自然语言处理(NLP): RNN/LSTM/GRU -* `--rnn_use_batch` - - 指示在简单的RecurrentLayer层的计算中是否使用批处理方法. - - 类型: bool (默认: 0). - -* `--prev_batch_state` - - 标识是否为连续的batch计算. - - 类型: bool (默认: 0). - -* `--beam_size` - - 集束搜索使用广度优先搜索的方式构建查找树。在树的每一层上,都会产生当前层状态的所有继承结果,按启发式损失的大小递增排序。然而,每层上只能保存固定数目个最好的状态,该数目是提前定义好的,称之为集束大小. - - 类型: int32 (默认: 1). - -* `--diy_beam_search_prob_so` -  - 用户可以自定义beam search的方法,编译成动态库,供PaddlePaddle加载。 该参数用于指定动态库路径. - - 类型: string (默认: "", null). - -## 数据支持(DataProvider) - -* `--memory_threshold_on_load_data` - - 内存容限阈值,当超过该阈值时,停止加载数据. - - 类型: double (默认: 1.0). - -## 单元测试 - -* `--checkgrad_eps` - - 使用checkgrad模式时的参数变化大小. - - 类型: double (默认: 1e-05). - -## 参数服务器和分布式通信 - -* `--start_pserver` - - 指示是否开启参数服务器(parameter server). - - 类型: bool (默认: 0). - -* `--pservers` - - 参数服务器的IP地址,以逗号间隔. - - 类型: string (默认: "127.0.0.1"). - -* `--port` - - 参数服务器的监听端口. - - 类型: int32 (默认: 20134). - -* `--ports_num` - - 发送参数的端口号,根据默认端口号递增. - - 类型: int32 (默认: 1). - -* `--trainer_id` -  - 在分布式训练中,每个训练节点必须指定一个唯一的id号,从0到num_trainers-1。0号训练节点是主训练节点。使用者无需关心这个参数. - - 类型: int32 (默认: 0). - -* `--num_gradient_servers` - - 梯度服务器的数量,该参数在集群提交环境中自动设置. - - 类型: int32 (默认: 1). - -* `--small_messages` - - 如果消息数据太小,建议将该参数设为true,启动快速应答,无延迟. - - 类型: bool (默认: 0). - -* `--sock_send_buf_size` - - 限制套接字发送缓冲区的大小。如果仔细设置的话,可以有效减小网络的阻塞. - - 类型: int32 (默认: 1024 \* 1024 \* 40). - -* `--sock_recv_buf_size` - - 限制套接字接收缓冲区的大小. - - 类型: int32 (默认: 1024 \* 1024 \* 40). - -* `--parameter_block_size` - - 参数服务器的参数分块大小。如果未设置,将会自动计算出一个合适的值. - - 类型: int32 (默认: 0). - -* `--parameter_block_size_for_sparse` - - 参数服务器稀疏更新的参数分块大小。如果未设置,将会自动计算出一个合适的值. - - 类型: int32 (默认: 0). - -* `--log_period_server` - - 在参数服务器终端每log_period_server个批次打印日志进度. - - 类型: int32 (默认: 500). - -* `--loadsave_parameters_in_pserver` - - 在参数服务器上加载和保存参数,只有当设置了sparse_remote_update参数时才有效. - - 类型: bool (默认: 0). - -* `--pserver_num_threads` - - 同步执行操作的线程数. - - 类型: bool (默认: 1). - -* `--ports_num_for_sparse` - - 发送参数的端口号,根据默认值递增(port + ports_num),用于稀疏训练中. - - 类型: int32 (默认: 0). - -* `--nics` - - 参数服务器的网络设备名称,已经在集群提交环境中完成设置. - - 类型: string (默认: "xgbe0,xgbe1"). - -* `--rdma_tcp` - - 使用rdma还是tcp传输协议,该参数已经在集群提交环境中完成设置. - - 类型: string (默认: "tcp"). - -## 异步随机梯度下降(Async SGD) -* `--async_count` - - 定义异步训练的长度,如果为0,则使用同步训练. - - 类型: int32 (默认: 0). - -* `--async_lagged_ratio_min` - - 控制`config_.async_lagged_grad_discard_ratio()`的最小值. - - 类型: double (默认: 1.0). - -* `--async_lagged_ratio_default` - - 如果在网络配置中未设置async_lagged_grad_discard_ratio,则使用该参数作为默认值. - - 类型: double (默认: 1.5). - -## 性能调优(Performance Tuning) - -* `--log_barrier_abstract` - - 如果为true,则显示阻隔性能的摘要信息. - - 类型: bool (默认: 1). - -* `--log_barrier_show_log` - - 如果为true,则总会显示阻隔摘要信息,即使间隔很小. - - 类型: bool (默认: 0). - -* `--log_barrier_lowest_nodes` - - 最少显示多少个节点. - - 类型: int32 (默认: 5). - -* `--check_sparse_distribution_in_pserver` - - 指示是否检查所有参数服务器上的稀疏参数的分布是均匀的. - - 类型: bool (默认: 0). - -* `--show_check_sparse_distribution_log` - - 指示是否显示参数服务器上的稀疏参数分布的日志细节. - - 类型: bool (默认: 0). - -* `--check_sparse_distribution_batches` - - 每运行多少个批次执行一次稀疏参数分布的检查. - - 类型: int32 (默认: 100). - -* `--check_sparse_distribution_ratio` - - 如果检查到分配在不同参数服务器上的参数的分布不均匀次数大于check_sparse_distribution_ratio * check_sparse_distribution_batches次,程序停止. - - 类型: double (默认: 0.6). - -* `--check_sparse_distribution_unbalance_degree` - - 不同参数服务器上数据大小的最大值与最小值的比率. - - 类型: double (默认: 2). - -## 矩阵/向量/随机数 -* `--enable_parallel_vector` - - 启动并行向量的阈值. - - 类型: int32 (默认: 0). - -* `--seed` - - 随机数的种子。srand(time)的为0. - - 类型: int32 (默认: 1) - -* `--thread_local_rand_use_global_seed` - - 是否将全局种子应用于本地线程的随机数. - - 类型: bool (默认: 0). diff --git a/doc/v2/howto/cmd_parameter/detail_introduction_en.md b/doc/v2/howto/cmd_parameter/detail_introduction_en.md deleted file mode 100644 index b681ebc81a355dfc1a7638a4463dff6979929a45..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cmd_parameter/detail_introduction_en.md +++ /dev/null @@ -1,327 +0,0 @@ -```eval_rst -.. _cmd_detail_introduction: -``` - -# Detail Description - -## Common - -* `--job` - - Job mode, including: **train, test, checkgrad**, where checkgrad is mainly for developers and users do not need to care about. - - type: string (default: train) - -* `--config` - - Use to specfiy network configure file. - - type: string (default: null). - -* `--use_gpu` - - Whether to use GPU for training, false is cpu mode and true is gpu mode. - - type: bool (default: 1). - -* `--local` - - Whether the training is in local mode or not. True when training locally or using one node in cluster. False when using multiple machines in cluster. - - type: bool (default: 1). - -* `--trainer_count` - - Define the number of threads used in one machine. For example, trainer_count = 4, means use 4 GPU in GPU mode and 4 threads in CPU mode. Each thread (or GPU) is assigned to 1/4 samples in current batch. That is to say, if setting batch_size of 512 in trainer config, each thread train 128 samples. - - type: int32 (default: 1). - -* `--num_passes` - - When `--job=train`, means training for num_passes passes. One pass means training all samples in dataset one time. When `--job=test`, means testing data from model of test_pass to model of (num_passes - 1). - - type: int32 (default: 100). - -* `--config_args` - - arguments passed to config file. Format: key1=value1,key2=value2. - - type: string (default: null). - -* `--version` - - Whether to print version information. - - type: bool (default: 0). - -* `--show_layer_stat` - - Whether to show the statistics of each layer **per batch**. - - type: bool (default: 0). - -## Train - -* `--log_period` - - Log progress every log_period batches. - - type: int32 (default: 100). - -* `--dot_period` - - Print '.' every dot_period batches. - - type: int32 (default: 1). - -* `--saving_period` - - Save parameters every saving_period passes - - type: int32 (default: 1). - -* `--save_dir` - - Directory for saving model parameters. It needs to be specified, but no need to be created in advance. - - type: string (default: null). - -* `--start_pass` - - Start training from this pass. It will load parameters from the previous pass. - - type: int32 (default: 0). - -* `--show_parameter_stats_period` - - Show parameter statistic during training every show_parameter_stats_period batches. It will not show by default. - - type: int32 (default: 0). - -* `--save_only_one` - - Save the parameters only in last pass, while the previous parameters will be removed. - - type: bool (default: 0). - -* `--load_missing_parameter_strategy` - - Specify the loading operation when model file is missing. Now support fail/rand/zero three operations. - - `fail`: program will exit. - - `rand`: uniform or normal distribution according to **initial\_strategy** in network config. Uniform range is: **[mean - std, mean + std]**, where mean and std are configures in trainer config. - - `zero`: all parameters are zero. - - type: string (default: fail). - -* `--init_model_path` - - Path of the initialization model. If it was set, start\_pass will be ignored. It can be used to specify model path in testing mode as well. - - type: string (default: null). - -* `--saving_period_by_batches` - - Save parameters every saving_period_by_batches batches in one pass. - - type: int32 (default: 0). - -* `--log_error_clipping` - - Whether to print error clipping log when setting **error_clipping_threshold** in layer config. If it is true, log will be printed in backward propagation **per batch**. This clipping effects on **gradient of output**. - - type: bool (default: 0). - -* `--log_clipping` - - Enable print log clipping or not when setting **gradient_clipping_threshold** in trainer config. This clipping effects on **gradient w.r.t. (with respect to) weight**. - - type: bool (default: 0). - -* `--use_old_updater` - - Whether to use the old RemoteParameterUpdater. Default use ConcurrentRemoteParameterUpdater. It is mainly for deverlopers and users usually do not need to care about. - - type: bool (default: 0). - -* `--enable_grad_share` - - threshold for enable gradient parameter, which is shared for batch multi-cpu training. - - type: int32 (default: 100 \* 1024 \* 1024). - -* `--grad_share_block_num` - - block number of gradient parameter, which is shared for batch multi-cpu training. - - type: int32 (default: 64). - -## Test - -* `--test_pass` - - Load parameter from this pass to test. - - type: int32 (default: -1). - -* `--test_period` - - if equal 0, do test on all test data at the end of each pass. While if equal non-zero, do test on all test data every test_period batches. - - type: int32 (default: 0). - -* `--test_wait` -  - Whether to wait for parameter per pass if not exist. It can be used when user launch another process to perfom testing during the training process. - - type: bool (default: 0). - -* `--model_list` - - File that saves the model list when testing. - - type: string (default: "", null). - -* `--predict_output_dir` - - Directory that saves the layer output. It is configured in Outputs() in network config. Default, this argument is null, meaning save nothing. Specify this directory if you want to save feature map of some layers in testing mode. Note that, layer outputs are values after activation function. - - type: string (default: "", null). - -* `--average_test_period` - - Do test on average parameter every `average_test_period` batches. It MUST be devided by FLAGS_log_period. Default 0 means do not test on average parameter. - - type: int32 (default: 0). - -* `--distribute_test` - - Testing in distribute environment will merge results from multiple machines. - - type: bool (default: 0). - -* `--predict_file` - - File name for saving predicted result. Default, this argument is null, meaning save nothing. Now, this argument is only used in AucValidationLayer and PnpairValidationLayer, and saves predicted result every pass. - - type: string (default: "", null). - -## GPU - -* `--gpu_id` - - Which gpu core to use. - - type: int32 (default: 0). - -* `--allow_only_one_model_on_one_gpu` - - If true, do not allow multiple models on one GPU device. - - type: bool (default: 1). - -* `--parallel_nn` - - Whether to use multi-thread to calculate one neural network or not. If false, use gpu_id specify which gpu core to use (the device property in trainer config will be ingored). If true, the gpu core is specified in trainer config (gpu_id will be ignored). - - type: bool (default: 0). - -* `--cudnn_dir` - - Choose path to dynamic load NVIDIA CuDNN library, for instance, /usr/local/cuda/lib64. [Default]: LD_LIBRARY_PATH - - type: string (default: "", null) - -* `--cuda_dir` - - Choose path to dynamic load NVIDIA CUDA library, for instance, /usr/local/cuda/lib64. [Default]: LD_LIBRARY_PATH - - type: string (default: "", null) - -* `--cudnn_conv_workspace_limit_in_mb` - - Specify cuDNN max workspace limit, in units MB, 4096MB=4GB by default. - - type: int32 (default: 4096MB=4GB) - -## NLP: RNN/LSTM/GRU -* `--rnn_use_batch` - - Whether to use batch method for calculation in simple RecurrentLayer. - - type: bool (default: 0). - -* `--prev_batch_state` - - batch is continue with next batch. - - type: bool (default: 0). - -* `--beam_size` - - Beam search uses breadth-first search to build its search tree. At each level of the tree, it generates all successors of the states at the current level, sorting them in increasing order of heuristic cost. However, it only stores a predetermined number of best states at each level (called the beam size). - - type: int32 (default: 1). - -* `--diy_beam_search_prob_so` - - Specify shared dynamic library. It can be defined out of paddle by user. - - type: string (default: "", null). - -## DataProvider - -* `--memory_threshold_on_load_data` - - Stop loading data when memory is not sufficient. - - type: double (default: 1.0). - -## Unit Test - -* `--checkgrad_eps` - - parameter change size for checkgrad. - - type: double (default: 1e-05). - -## Parameter Server and Distributed Communication - -* `--start_pserver` - - Whether to start pserver (parameter server). - - type: bool (default: 0). - -* `--pservers` - - Comma separated IP addresses of pservers. - - type: string (default: "127.0.0.1"). - -* `--port` - - Listening port for pserver. - - type: int32 (default: 20134). - -* `--ports_num` - - The ports number for parameter send, increment based on default port number. - - type: int32 (default: 1). - -* `--trainer_id` - - In distributed training, each trainer must be given an unique id ranging from 0 to num_trainers-1. Trainer 0 is the master trainer. User do not need to care this flag. - - type: int32 (default: 0). - -* `--num_gradient_servers` - - Numbers of gradient servers. This arguments is set automatically in cluster submitting environment. - - type: int32 (default: 1). - -* `--small_messages` - - If message size is small, recommend set it True to enable quick ACK and no delay - - type: bool (default: 0). - -* `--sock_send_buf_size` - - Restrict socket send buffer size. It can reduce network congestion if set carefully. - - type: int32 (default: 1024 \* 1024 \* 40). - -* `--sock_recv_buf_size` - - Restrict socket recieve buffer size. - - type: int32 (default: 1024 \* 1024 \* 40). - -* `--parameter_block_size` - - Parameter block size for pserver, will automatically calculate a suitable value if it's not set. - - type: int32 (default: 0). - -* `--parameter_block_size_for_sparse` - - Parameter block size for sparse update pserver, will automatically calculate a suitable value if it's not set. - - type: int32 (default: 0). - -* `--log_period_server` - - Log progress every log_period_server batches at pserver end. - - type: int32 (default: 500). - -* `--loadsave_parameters_in_pserver` - - Load and save parameters in pserver. Only work when parameter set sparse_remote_update. - - type: bool (default: 0). - -* `--pserver_num_threads` - - number of threads for sync op exec. - - type: bool (default: 1). - -* `--ports_num_for_sparse` - - The ports number for parameter send, increment based on default (port + ports_num). It is used by sparse Tranning. - - type: int32 (default: 0). - -* `--nics` - - Network device name for pservers, already set in cluster submitting environment. - - type: string (default: "xgbe0,xgbe1"). - -* `--rdma_tcp` - - Use rdma or tcp transport protocol, already set in cluster submitting environment. - - type: string (default: "tcp"). - -## Async SGD -* `--async_count` - - Defined the asynchronous training length, if 0, then use synchronized training. - - type: int32 (default: 0). - -* `--async_lagged_ratio_min` - - Control the minimize value of `config_.async_lagged_grad_discard_ratio()`. - - type: double (default: 1.0). - -* `--async_lagged_ratio_default` - - If async_lagged_grad_discard_ratio is not set in network config, use it as defalut value. - - type: double (default: 1.5). - -## Performance Tuning - -* `--log_barrier_abstract` - - If true, show abstract barrier performance information. - - type: bool (default: 1). - -* `--log_barrier_show_log` - - If true, always show barrier abstract even with little gap. - - type: bool (default: 0). - -* `--log_barrier_lowest_nodes` - - How many lowest node will be logged. - - type: int32 (default: 5). - -* `--check_sparse_distribution_in_pserver` - - Whether to check that the distribution of sparse parameter on all pservers is balanced. - - type: bool (default: 0). - -* `--show_check_sparse_distribution_log` - - show log details for sparse parameter distribution in pserver. - - type: bool (default: 0). - -* `--check_sparse_distribution_batches` - - Running sparse parameter distribution check every so many batches. - - type: int32 (default: 100). - -* `--check_sparse_distribution_ratio` - - If parameters dispatched to different pservers have an unbalanced distribution for check_sparse_distribution_ratio * check_sparse_distribution_batches times, crash program. - - type: double (default: 0.6). - -* `--check_sparse_distribution_unbalance_degree` - - The ratio of maximum data size / minimun data size for different pserver. - - type: double (default: 2). - -## Matrix/Vector/RandomNumber -* `--enable_parallel_vector` - - threshold for enable parallel vector. - - type: int32 (default: 0). - -* `--seed` - - random number seed. 0 for srand(time) - - type: int32 (default: 1) - -* `--thread_local_rand_use_global_seed` - - Whether to use global seed in rand of thread local. - - type: bool (default: 0). diff --git a/doc/v2/howto/cmd_parameter/index_cn.rst b/doc/v2/howto/cmd_parameter/index_cn.rst deleted file mode 100644 index 6900bb1443e611d326e8d5640e794ac2b9079beb..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cmd_parameter/index_cn.rst +++ /dev/null @@ -1,26 +0,0 @@ -.. _cmd_line_index: - -命令行参数设置 -=============== -深度学习算法的实现有着多样化的特点,运行环境、运行阶段、模型结构、训练策略等等这些都是常见的变化因素。PaddlePaddle支持用户灵活地设置各种命令行参数,以实现对模型训练或预测流程的控制。 - -在这一部分,首先以几个实际场景为例,展示了部分命令行参数的使用: - -.. toctree:: - :maxdepth: 1 - - use_case_cn.md - -接着对所有参数的使用场合进行概述和分类: - -.. toctree:: - :maxdepth: 1 - - arguments_cn.md - -最后给出细节描述,详细解释这些参数的属性和意义: - -.. toctree:: - :maxdepth: 1 - - detail_introduction_cn.md diff --git a/doc/v2/howto/cmd_parameter/index_en.rst b/doc/v2/howto/cmd_parameter/index_en.rst deleted file mode 100644 index f49683948ef78f363e2439cc25332431830eeb24..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cmd_parameter/index_en.rst +++ /dev/null @@ -1,26 +0,0 @@ -.. _cmd_line_index: - -Set Command-line Parameters -=========================== -The implementation of deep learning algorithms has a variety of characteristics, such as running environment, running stage, structure of the model and the traning strategy. PaddlePaddle supports the user to set various command-line parameters flexibly, which helps to achieve control of the model training or prediction process. - -In this part, we take several actual scenarios as an example, and the use of some command-line parameters is displayed: - -.. toctree:: - :maxdepth: 1 - - use_case_en.md - -Then, we summarize and classify the use of all command-line parameters: - -.. toctree:: - :maxdepth: 1 - - arguments_en.md - -Finally, the detailed descriptions are given, and we try to explain the propeties and significance of these command-line parameters in detail: - -.. toctree:: - :maxdepth: 1 - - detail_introduction_en.md diff --git a/doc/v2/howto/cmd_parameter/use_case_cn.md b/doc/v2/howto/cmd_parameter/use_case_cn.md deleted file mode 100644 index db8c39d950771726346ff9c9481990abc13036cf..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cmd_parameter/use_case_cn.md +++ /dev/null @@ -1,182 +0,0 @@ -# 使用案例 - -## 本地训练 - -本地训练的实验,诸如图像分类,自然语言处理等,通常都会使用下面这些命令行参数。 - -``` -paddle train \ - --use_gpu=1/0 \ #1:GPU,0:CPU(默认为1) - --config=network_config \ - --save_dir=output \ - --trainer_count=COUNT \ #(默认为1) - --test_period=M \ #(默认为0) - --num_passes=N \ #(默认为100) - --log_period=K \ #(默认为100) - --dot_period=1000 \ #(默认为1) - #[--show_parameter_stats_period=100] \ #(默认为0) - #[--saving_period_by_batches=200] \ #(默认为0) -``` -根据你的任务,可以选择是否使用参数`show_parameter_stats_period`和`saving_period_by_batches`。 - -### 1) 将命令参数传给网络配置 - -`config_args`是一个很有用的参数,用于将参数传递给网络配置。 - -``` ---config_args=generating=1,beam_size=5,layer_num=10 \ -``` -`get_config_arg`可用于在网络配置中解析这些参数,如下所示: - -``` -generating = get_config_arg('generating', bool, False) -beam_size = get_config_arg('beam_size', int, 3) -layer_num = get_config_arg('layer_num', int, 8) -``` - -`get_config_arg`: - -``` -get_config_arg(name, type, default_value) -``` -- name: `--config_args`中指定的名字 -- type: 值类型,包括bool, int, str, float等 -- default_value: 默认值 - -### 2) 使用模型初始化网络 - -增加如下参数: - -``` ---init_model_path=model_path ---load_missing_parameter_strategy=rand -``` - -## 本地测试 - -方法一: - -``` -paddle train --job=test \ - --use_gpu=1/0 \ - --config=network_config \ - --trainer_count=COUNT \ - --init_model_path=model_path \ -``` -- 使用init\_model\_path指定测试的模型 -- 只能测试单个模型 - -方法二: - -``` -paddle train --job=test \ - --use_gpu=1/0 \ - --config=network_config \ - --trainer_count=COUNT \ - --model_list=model.list \ -``` -- 使用model_list指定测试的模型列表 -- 可以测试多个模型,文件model.list如下所示: - -``` -./alexnet_pass1 -./alexnet_pass2 -``` - -方法三: - -``` -paddle train --job=test \ - --use_gpu=1/0 \ - --config=network_config \ - --trainer_count=COUNT \ - --save_dir=model \ - --test_pass=M \ - --num_passes=N \ -``` -这种方式必须使用Paddle存储的模型路径格式,如:`model/pass-%5d`。测试的模型包括从第M轮到第N-1轮存储的所有模型。例如,M=12,N=14这种写法将会测试模型`model/pass-00012`和`model/pass-00013`。 - -## 稀疏训练 - -当输入是维度很高的稀疏数据时,通常使用稀疏训练来加速计算过程。例如,输入数据的字典维数是1百万,但是每个样本仅包含几个词。在Paddle中,稀疏矩阵的乘积应用于前向传播过程,而稀疏更新在反向传播之后的权重更新时进行。 - -### 1) 本地训练 - -用户需要在网络配置中指定**sparse\_update=True**。请参照网络配置的文档了解更详细的信息。 - -### 2) 集群训练 - -在集群上训练一个稀疏模型需要加上下面的参数。同时用户需要在网络配置中指定**sparse\_remote\_update=True**。请参照网络配置的文档了解更详细的信息。 - -``` ---ports_num_for_sparse=1 #(默认为0) -``` - -## parallel_nn -用户可以设置`parallel_nn`来混合使用GPU和CPU计算网络层的参数。也就是说,你可以将网络配置成某些层使用GPU计算,而其他层使用CPU计算。另一种方式是将网络层划分到不同的GPU上去计算,这样可以减小GPU内存,或者采用并行计算来加速某些层的更新。 - -如果你想使用这些特性,你需要在网络配置中指定设备的ID号(表示为deviceId),并且加上下面的命令行参数: - -``` ---parallel_nn=true -``` -### 案例一:GPU和CPU混合使用 -请看下面的例子: - -``` -#command line: -paddle train --use_gpu=true --parallel_nn=true trainer_count=COUNT - -default_device(0) - -fc1=fc_layer(...) -fc2=fc_layer(...) -fc3=fc_layer(...,layer_attr=ExtraAttr(device=-1)) - -``` -- default_device(0): 设置默认设备号为0。这意味着除了指定device=-1的层之外,其他所有层都会使用GPU计算,每层使用的GPU号依赖于参数trainer\_count和gpu\_id(默认为0)。在此,fc1和fc2层在GPU上计算。 - -- device=-1: fc3层使用CPU计算。 - -- trainer_count: - - trainer_count=1: 如果未设置gpu\_id,那么fc1和fc2层将会使用第1个GPU来计算。否则使用gpu\_id指定的GPU。 - - - trainer_count>1: 在trainer\_count个GPU上使用数据并行来计算某一层。例如,trainer\_count=2意味着0号和1号GPU将会使用数据并行来计算fc1和fc2层。 - -### 案例二:在不同设备上指定层 - -``` -#command line: -paddle train --use_gpu=true --parallel_nn=true --trainer_count=COUNT - -#network: -fc2=fc_layer(input=l1, layer_attr=ExtraAttr(device=0), ...) -fc3=fc_layer(input=l1, layer_attr=ExtraAttr(device=1), ...) -fc4=fc_layer(input=fc2, layer_attr=ExtraAttr(device=-1), ...) -``` -在本例中,我们假设一台机器上有4个GPU。 - -- trainer_count=1: - - 使用0号GPU计算fc2层。 - - 使用1号GPU计算fc3层。 - - 使用CPU计算fc4层。 - -- trainer_count=2: - - 使用0号和1号GPU计算fc2层。 - - 使用2号和3号GPU计算fc3层。 - - 使用CPU两线程计算fc4层。 - -- trainer_count=4: - - 运行失败(注意到我们已经假设机器上有4个GPU),因为参数`allow_only_one_model_on_one_gpu`默认设置为真。 - -**当`device!=-1`时设备ID号的分配:** - -``` -(deviceId + gpu_id + threadId * numLogicalDevices_) % numDevices_ - -deviceId: 在层中指定 -gpu_id: 默认为0 -threadId: 线程ID号,范围: 0,1,..., trainer_count-1 -numDevices_: 机器的设备(GPU)数目 -numLogicalDevices_: min(max(deviceId + 1), numDevices_) -``` diff --git a/doc/v2/howto/cmd_parameter/use_case_en.md b/doc/v2/howto/cmd_parameter/use_case_en.md deleted file mode 100644 index e287f0c4b9617cbc6504596512bf408c56dc10f9..0000000000000000000000000000000000000000 --- a/doc/v2/howto/cmd_parameter/use_case_en.md +++ /dev/null @@ -1,182 +0,0 @@ -# Use Case - -## Local Training - -These command line arguments are commonly used by local training experiments, such as image classification, natural language processing, et al. - -``` -paddle train \ - --use_gpu=1/0 \ #1:GPU,0:CPU(default:true) - --config=network_config \ - --save_dir=output \ - --trainer_count=COUNT \ #(default:1) - --test_period=M \ #(default:0) - --num_passes=N \ #(defalut:100) - --log_period=K \ #(default:100) - --dot_period=1000 \ #(default:1) - #[--show_parameter_stats_period=100] \ #(default:0) - #[--saving_period_by_batches=200] \ #(default:0) -``` -`show_parameter_stats_period` and `saving_period_by_batches` are optional according to your task. - -### 1) Pass Command Argument to Network config - -`config_args` is a useful parameter to pass arguments to network config. - -``` ---config_args=generating=1,beam_size=5,layer_num=10 \ -``` -And `get_config_arg` can be used to parse these arguments in network config as follows: - -``` -generating = get_config_arg('generating', bool, False) -beam_size = get_config_arg('beam_size', int, 3) -layer_num = get_config_arg('layer_num', int, 8) -``` - -`get_config_arg`: - -``` -get_config_arg(name, type, default_value) -``` -- name: the name specified in the `--config_args` -- type: value type, bool, int, str, float etc. -- default_value: default value if not set. - -### 2) Use Model to Initialize Network - -add argument: - -``` ---init_model_path=model_path ---load_missing_parameter_strategy=rand -``` - -## Local Testing - -Method 1: - -``` -paddle train --job=test \ - --use_gpu=1/0 \ - --config=network_config \ - --trainer_count=COUNT \ - --init_model_path=model_path \ -``` -- use init\_model\_path to specify test model. -- only can test one model. - -Method 2: - -``` -paddle train --job=test \ - --use_gpu=1/0 \ - --config=network_config \ - --trainer_count=COUNT \ - --model_list=model.list \ -``` -- use model_list to specify test models -- can test several models, where model.list likes: - -``` -./alexnet_pass1 -./alexnet_pass2 -``` - -Method 3: - -``` -paddle train --job=test \ - --use_gpu=1/0 \ - --config=network_config \ - --trainer_count=COUNT \ - --save_dir=model \ - --test_pass=M \ - --num_passes=N \ -``` -This way must use model path saved by Paddle like this: `model/pass-%5d`. Testing model is from M-th pass to (N-1)-th pass. For example: M=12 and N=14 will test `model/pass-00012` and `model/pass-00013`. - -## Sparse Training - -Sparse training is usually used to accelerate calculation when input is sparse data with highly dimension. For example, dictionary dimension of input data is 1 million, but one sample just have several words. In paddle, sparse matrix multiplication is used in forward propagation and sparse updating is perfomed on weight updating after backward propagation. - -### 1) Local training - -You need to set **sparse\_update=True** in network config. Check the network config documentation for more details. - -### 2) cluster training - -Add the following argument for cluster training of a sparse model. At the same time you need to set **sparse\_remote\_update=True** in network config. Check the network config documentation for more details. - -``` ---ports_num_for_sparse=1 #(default: 0) -``` - -## parallel_nn -`parallel_nn` can be set to mixed use of GPUs and CPUs to compute layers. That is to say, you can deploy network to use a GPU to compute some layers and use a CPU to compute other layers. The other way is to split layers into different GPUs, which can **reduce GPU memory** or **use parallel computation to accelerate some layers**. - -If you want to use these characteristics, you need to specify device ID in network config (denote it as deviceId) and add command line argument: - -``` ---parallel_nn=true -``` -### case 1: Mixed Use of GPU and CPU -Consider the following example: - -``` -#command line: -paddle train --use_gpu=true --parallel_nn=true trainer_count=COUNT - -default_device(0) - -fc1=fc_layer(...) -fc2=fc_layer(...) -fc3=fc_layer(...,layer_attr=ExtraAttr(device=-1)) - -``` -- default_device(0): set default device ID to 0. This means that except the layers with device=-1, all layers will use a GPU, and the specific GPU used for each layer depends on trainer\_count and gpu\_id (0 by default). Here, layer fc1 and fc2 are computed on the GPU. - -- device=-1: use the CPU for layer fc3. - -- trainer_count: - - trainer_count=1: if gpu\_id is not set, then use the first GPU to compute layers fc1 and fc2. Otherwise use the GPU with gpu\_id. - - - trainer_count>1: use trainer\_count GPUs to compute one layer using data parallelism. For example, trainer\_count=2 means that GPUs 0 and 1 will use data parallelism to compute layer fc1 and fc2. - -### Case 2: Specify Layers in Different Devices - -``` -#command line: -paddle train --use_gpu=true --parallel_nn=true --trainer_count=COUNT - -#network: -fc2=fc_layer(input=l1, layer_attr=ExtraAttr(device=0), ...) -fc3=fc_layer(input=l1, layer_attr=ExtraAttr(device=1), ...) -fc4=fc_layer(input=fc2, layer_attr=ExtraAttr(device=-1), ...) -``` -In this case, we assume that there are 4 GPUs in one machine. - -- trainer_count=1: - - Use GPU 0 to compute layer fc2. - - Use GPU 1 to compute layer fc3. - - Use CPU to compute layer fc4. - -- trainer_count=2: - - Use GPU 0 and 1 to compute layer fc2. - - Use GPU 2 and 3 to compute layer fc3. - - Use CPU to compute fc4 in two threads. - -- trainer_count=4: - - It will fail (note, we have assumed that there are 4 GPUs in machine), because argument `allow_only_one_model_on_one_gpu` is true by default. - -**Allocation of device ID when `device!=-1`**: - -``` -(deviceId + gpu_id + threadId * numLogicalDevices_) % numDevices_ - -deviceId: specified in layer. -gpu_id: 0 by default. -threadId: thread ID, range: 0,1,..., trainer_count-1 -numDevices_: device (GPU) count in machine. -numLogicalDevices_: min(max(deviceId + 1), numDevices_) -``` diff --git a/doc/v2/howto/index_cn.rst b/doc/v2/howto/index_cn.rst deleted file mode 100644 index b0268907bceb11cd53a4630c3f8b8e0424abe247..0000000000000000000000000000000000000000 --- a/doc/v2/howto/index_cn.rst +++ /dev/null @@ -1,37 +0,0 @@ -进阶使用 -======== - -PaddlePaddle支持用户灵活地设置各种命令行参数,以实现对模型训练或预测流程的控制。使用方式请参考: - -.. toctree:: - :maxdepth: 1 - - cmd_parameter/index_cn.rst - -PaddlePaddle支持在fabric集群、MPI集群、kubernetes集群上分布式训练任务,具体环境配置和使用说明请参考: - -.. toctree:: - :maxdepth: 1 - - cluster/index_cn.rst - -PaddlePaddle提供了用于预测的C-API,关于C-API的使用,我们提供了如下指南: - -.. toctree:: - :maxdepth: 1 - - capi/index_cn.rst - -PaddlePaddle支持多种灵活和高效的循环神经网络,具体配置使用方式请参考: - -.. toctree:: - :maxdepth: 1 - - rnn/index_cn.rst - -关于如何使用内置的定时工具、nvprof 或 nvvp 来运行性能分析和调优,请参考: - -.. toctree:: - :maxdepth: 1 - - optimization/gpu_profiling_cn.rst diff --git a/doc/v2/howto/index_en.rst b/doc/v2/howto/index_en.rst deleted file mode 100644 index 35ef197f58f1f865e2cdbdebb567d5637284637a..0000000000000000000000000000000000000000 --- a/doc/v2/howto/index_en.rst +++ /dev/null @@ -1,37 +0,0 @@ -HOW TO -======== - -PaddlePaddle provides the users the ability to flexibly set various command line parameters to control the model training and inference process. Please refer to the following instructions on using PaddlePaddle: - -.. toctree:: - :maxdepth: 1 - - cmd_parameter/index_en.rst - -PaddlePaddle supports distributed training tasks on fabric clusters, MPI clusters, and Kubernetes clusters. For detailed configuration and usage instructions, refer to: - -.. toctree:: - :maxdepth: 1 - - cluster/index_en.rst - -PaddlePaddle provides a C-API for inference. We provide the following guidelines for using the C-API: - -.. toctree:: - :maxdepth: 1 - - capi/index_en.rst - -PaddlePaddle supports a variety of flexible and efficient recurrent neural networks. For details, please refer to: - -.. toctree:: - :maxdepth: 1 - - rnn/index_en.rst - -How to use the built-in timing tool, nvprof, or nvvp to run performance analysis and tuning, please refer to: - -.. toctree:: - :maxdepth: 1 - - optimization/gpu_profiling_en.rst diff --git a/doc/v2/howto/optimization/gpu_profiling_cn.rst b/doc/v2/howto/optimization/gpu_profiling_cn.rst deleted file mode 100644 index f2396716bddd4810fa77c738d41f5482aa6d6055..0000000000000000000000000000000000000000 --- a/doc/v2/howto/optimization/gpu_profiling_cn.rst +++ /dev/null @@ -1,242 +0,0 @@ -============ -GPU性能调优 -============ - -.. contents:: - -此教程将向您分步介绍如何使用内置的定时工具、 **nvprof** 或 **nvvp** 来运行性能分析和调优。 - -- 什么是性能分析? -- 为什么需要性能分析? -- 如何进行性能分析? -- 性能分析工具介绍 -- 详细教程 -- 性能分析小技巧 - -什么是性能分析? -================ -在软件工程的范畴里,性能分析(Profiling)是一个动态程序分析的术语,它可以指测量一个程序的空间(内存)复杂度或时间复杂度, -也可以说是某些特定指令的使用情况,或者是函数调用的频率和耗时等。通常情况下,分析得到的信息用于协助进行程序的优化。 - -简单来说,性能分析工具是用于给应用程序的性能做定量分析的。如果想很好的理解程序的行为,那程序分析工具是必不可少的利器。简单的性能分析,可以告诉您某个操作到底花了多长时间?而更深入的分析,甚至能解释为什么某个操作花了很长时间? - -为什么需要性能分析? -============================ -训练好一个深层神经网络通常要耗费非常长的时间,所以性能也就逐步变成了深度学习领域最重要的指标。 -而优化性能的首要任务,是需要了解哪些步骤拖慢了整体。 -如果某一块根本就不怎么耗时,那也就不需要急着优化性能啦! - -如何进行性能分析? -======================== -为了达到性能最优,您可以采用下面五个步骤: - -- 对代码进行性能分析 -- 找到运行慢的部分 -- 找到运行慢的原因 -- 修改成更快的版本 -- 再次对代码进行性能分析 - -Usually, processor has two key performance limits include float point throughput and -memory throughput. For GPU, it also need more parallelism to fulfill its potential. -This is why they can be so fast. - -通常情况下,处理器有两个关键性能限制:一个是浮点计算量,另一个是内存操作量。 -GPU则还需要高并行性,才能发挥其全部能力。这正是它们速度快的原因。 - -性能分析工具介绍 -====================== -就通常的GPU性能分析来说,市面上已经有NVIDIA或第三方提供的众多工具。 - -**nvprof** 是Nvidia性能分析工具, **nvvp** 则是带GUI的Nvidia可视化性能分析工具。 -在这个教程中,我们主要会介绍nvprof和nvvp。 - -:code:`test_GpuProfiler` from :code:`paddle/legacy/math/tests` directory will be used to evaluate -above profilers. - -:code:`paddle/legacy/math/test` 目录中的 :code:`test_GpuProfiler` 就是用于展示上述分析工具的用法。 - -.. literalinclude:: ../../../../paddle/legacy/math/tests/test_GpuProfiler.cpp - :language: c++ - :lines: 137-151 - :linenos: - -上述的代码片段包含了两种方法,您可以任意使用一个或两个来对感兴趣的代码段做性能分析。 - -1. :code:`REGISTER_TIMER_INFO` 是一个内置的定时器封装,可以用来计算CPU函数或cuda内核的时间消耗。 - -2. :code:`REGISTER_GPU_PROFILER` is a general purpose wrapper object of :code:`cudaProfilerStart` and :code:`cudaProfilerStop` to avoid -program crashes when CPU version of PaddlePaddle invokes them. - -3. :code:`REGISTER_GPU_PROFILER` 是一个封装对象,封装了 :code:`cudaProfilerStart` 和 :code:`cudaProfileStop` 两个操作;同时其内部实现可以避免纯CPU版本PaddlePaddle在执行本语句时发生崩溃。 - -您会在接下来的部分中获得更多的细节介绍。 - -详细教程 -============ - -内置定时器 ------------- - -如果想要启用PaddlePaddle的内置定时器,您首先需要在相关代码段中加入 :code:`REGISTER_TIMER_INFO`。 -接下来就可以使用 :code:`printStatus` 或者 :code:`printAllStatus` 函数来将信息输出到界面中。 -下面举个简单的例子: - -1. 加入 :code:`REGISTER_TIMER_INFO` 和 :code:`printAllStatus` 函数(如高亮部分)。 - - .. literalinclude:: ../../../../paddle/legacy/math/tests/test_GpuProfiler.cpp - :language: c++ - :lines: 137-151 - :emphasize-lines: 8-12,14 - :linenos: - -2. cmake配置中将 **WITH_TIMER** 打开,重新编译PaddlePaddle。 - - .. code-block:: bash - - cmake .. -DWITH_TIMER=ON - make - -3. 执行您的代码,并观察结果(如高亮部分)。 - - .. code-block:: bash - :emphasize-lines: 1,12-15 - - > ./paddle/legacy/math/tests/test_GpuProfiler - I1117 11:13:42.313065 2522362816 Util.cpp:155] commandline: ./paddle/legacy/math/tests/test_GpuProfiler - I1117 11:13:42.845065 2522362816 Util.cpp:130] Calling runInitFunctions - I1117 11:13:42.845208 2522362816 Util.cpp:143] Call runInitFunctions done. - [==========] Running 1 test from 1 test case. - [----------] Global test environment set-up. - [----------] 1 test from Profiler - [ RUN ] Profiler.BilinearFwdBwd - I1117 11:13:42.845310 2522362816 test_GpuProfiler.cpp:114] Enable GPU Profiler Stat: [testBilinearFwdBwd] "numSamples = 10, channels = 16, im - gSizeX = 64, imgSizeY = 64" - I1117 11:13:42.850154 2522362816 ThreadLocal.cpp:37] thread use undeterministic rand seed:20659751 - I1117 11:13:42.981501 2522362816 Stat.cpp:130] ======= StatSet: [GlobalStatInfo] status ====== - I1117 11:13:42.981539 2522362816 Stat.cpp:133] Stat=testBilinearFwdBwd total=136.141 avg=136.141 max=136.141 min=136.141 count=1 - I1117 11:13:42.981572 2522362816 Stat.cpp:141] ======= BarrierStatSet status ====== - I1117 11:13:42.981575 2522362816 Stat.cpp:154] -------------------------------------------------- - [ OK ] Profiler.BilinearFwdBwd (136 ms) - [----------] 1 test from Profiler (136 ms total) - - [----------] Global test environment tear-down - [==========] 1 test from 1 test case ran. (136 ms total) - [ PASSED ] 1 test. - -nvprof 工具 ----------------- - -要使用命令行分析工具 **nvprof**,您按如下步骤操作即可: - -1. 将 :code:`REGISTER_GPU_PROFILER` 函数加到代码中(参考强调部分)。 - - .. literalinclude:: ../../../../paddle/legacy/math/tests/test_GpuProfiler.cpp - :language: c++ - :lines: 137-151 - :emphasize-lines: 6-7 - :linenos: - -2. cmake中将 **WITH_PROFILER** 配置打开,重新编译PaddlePaddle。 - - .. code-block:: bash - - cmake .. -DWITH_PROFILER=ON - make - -3. 使用 **nvprof** 来分析执行文件。 - - .. code-block:: bash - - nvprof ./paddle/legacy/math/tests/test_GpuProfiler - -然后,您就能获得如下的分析结果: - -.. code-block:: bash - - ==78544== Profiling application: ./paddle/legacy/math/tests/test_GpuProfiler - ==78544== Profiling result: - Time(%) Time Calls Avg Min Max Name - 27.60% 9.6305ms 5 1.9261ms 3.4560us 6.4035ms [CUDA memcpy HtoD] - 26.07% 9.0957ms 1 9.0957ms 9.0957ms 9.0957ms KeBilinearInterpBw - 23.78% 8.2977ms 1 8.2977ms 8.2977ms 8.2977ms KeBilinearInterpFw - 22.55% 7.8661ms 2 3.9330ms 1.5798ms 6.2863ms [CUDA memcpy DtoH] - - ==78544== API calls: - Time(%) Time Calls Avg Min Max Name - 46.85% 682.28ms 8 85.285ms 12.639us 682.03ms cudaStreamCreateWithFlags - 39.83% 580.00ms 4 145.00ms 302ns 550.27ms cudaFree - 9.82% 143.03ms 9 15.892ms 8.7090us 142.78ms cudaStreamCreate - 1.23% 17.983ms 7 2.5690ms 23.210us 6.4563ms cudaMemcpy - 1.23% 17.849ms 2 8.9247ms 8.4726ms 9.3768ms cudaStreamSynchronize - 0.66% 9.5969ms 7 1.3710ms 288.43us 2.4279ms cudaHostAlloc - 0.13% 1.9530ms 11 177.54us 7.6810us 591.06us cudaMalloc - 0.07% 1.0424ms 8 130.30us 1.6970us 453.72us cudaGetDevice - 0.04% 527.90us 40 13.197us 525ns 253.99us cudaEventCreateWithFlags - 0.03% 435.73us 348 1.2520us 124ns 42.704us cuDeviceGetAttribute - 0.03% 419.36us 1 419.36us 419.36us 419.36us cudaGetDeviceCount - 0.02% 260.75us 2 130.38us 129.32us 131.43us cudaGetDeviceProperties - 0.02% 222.32us 2 111.16us 106.94us 115.39us cudaLaunch - 0.01% 214.06us 4 53.514us 28.586us 77.655us cuDeviceGetName - 0.01% 115.45us 4 28.861us 9.8250us 44.526us cuDeviceTotalMem - 0.01% 83.988us 4 20.997us 578ns 77.760us cudaSetDevice - 0.00% 38.918us 1 38.918us 38.918us 38.918us cudaEventCreate - 0.00% 34.573us 31 1.1150us 279ns 12.784us cudaDeviceGetAttribute - 0.00% 17.767us 1 17.767us 17.767us 17.767us cudaProfilerStart - 0.00% 15.228us 2 7.6140us 3.5460us 11.682us cudaConfigureCall - 0.00% 14.536us 2 7.2680us 1.1490us 13.387us cudaGetLastError - 0.00% 8.6080us 26 331ns 173ns 783ns cudaSetupArgument - 0.00% 5.5470us 6 924ns 215ns 2.6780us cuDeviceGet - 0.00% 5.4090us 6 901ns 328ns 3.3320us cuDeviceGetCount - 0.00% 4.1770us 3 1.3920us 1.0630us 1.8300us cuDriverGetVersion - 0.00% 3.4650us 3 1.1550us 1.0810us 1.2680us cuInit - 0.00% 830ns 1 830ns 830ns 830ns cudaRuntimeGetVersion - - -nvvp 工具 --------------- - -如果想使用可视化的分析器 **nvvp**,您可以导入 :code:`nvprof -o ...` 的输出,或者从工具的界面里运行您的应用。 - -**备注: nvvp 也支持CPU的性能分析** (需在nvvp界面中选上才能开启) - -.. image:: nvvp1.png - :align: center - :scale: 33% - -从内核函数的角度, **nvvp** 可以精确说明一个长耗时操作的具体原因。 -同时,如下图所示, **nvvp** 的内核block使用情况、寄存器使用情况和共享内存使用情况能让我们对GPU的整体使用有更好的理解。 - - -.. image:: nvvp2.png - :align: center - :scale: 33% - -而从应用的角度, **nvvp** 可以帮您提供一些定位性能瓶颈的建议。 -例如,下图中就展示了一些关于内存数据迁徙和计算资源利用率的建议,为您做性能调优提供了方向。 - -.. image:: nvvp3.png - :align: center - :scale: 33% - -.. image:: nvvp4.png - :align: center - :scale: 33% - -性能分析小技巧 -================== - -- 开始阶段,从 **nvprof** 和 **nvvp** 的输出信息入手是个不错的选择。 -- 接下来可以考虑下时间线的分析。 -- 如果真想挖掘内核深处的某个秘密,您最好先确认:这一块的耗时比例真的太高,值得深入分析。 -- 可能的情况下,试着让输出的分析数据和理论值对应。 - - 1) 例如,如果我知道内核花了10ms来移动1GB数据,那我会期望分析工具统计到速度是100GB/s。 - 2) 若有不一致之处,很有可能实际应用就是没有按照您的预期情况运行。 -- 了解您的硬件:如果您的GPU理论可以达到6 TFLOPs(6万亿次浮点运算每秒),而当前已经有5.5 TFLOPs了,那估计这里的潜力就没啥好挖的了…… - -性能分析是性能优化的关键一步。有的时候简简单单的改变就能在性能上产生明显的优化效果! -当然,具体情况因人而异。 - -参考资料 -=========== -Jeremy Appleyard, `GPU Profiling for Deep Learning `_, 2015 diff --git a/doc/v2/howto/optimization/gpu_profiling_en.rst b/doc/v2/howto/optimization/gpu_profiling_en.rst deleted file mode 100644 index 6e439be9bba8935cdd65f1c131cfd3725530ec0e..0000000000000000000000000000000000000000 --- a/doc/v2/howto/optimization/gpu_profiling_en.rst +++ /dev/null @@ -1,240 +0,0 @@ -==================== -Tune GPU Performance -==================== - -.. contents:: - -This tutorial will guide you step-by-step through how to conduct profiling and performance tuning using built-in timer, **nvprof** and **nvvp**. - -- What is profiling? -- Why we need profiling? -- How to do profiling? -- Profile tools -- Hands-on Tutorial -- Profiling tips - -What's profiling? -================= -In software engineering, profiling is a form of dynamic program analysis that measures the space (memory) or time -complexity of a program, the usage of particular instructions, or the frequency and duration of function calls. -Most commonly, profiling information serves to aid program optimization. - -Briefly, profiler is used to measure application performance. Program analysis tools are extremely important for -understanding program behavior. Simple profiling can tell you that how long does an operation take? For advanced -profiling, it can interpret why does an operation take a long time? - -Why we need profiling? -====================== -Since training deep neural network typically take a very long time to get over, performance is gradually becoming -the most important thing in deep learning field. The first step to improve performance is to understand what parts -are slow. There is no point in improving performance of a region which doesn’t take much time! - - -How to do profiling? -==================== -To achieve maximum performance, there are five steps you can take to reach your goals. - -- Profile the code -- Find the slow parts -- Work out why they’re slow -- Make them fast -- Profile the code again - -Usually, processor has two key performance limits include float point throughput and -memory throughput. For GPU, it also need more parallelism to fulfill its potential. -This is why they can be so fast. - -Profiler Tools -============== -For general GPU profiling, a bunch of tools are provided from both NVIDIA and third party. - -**nvprof** is Nvidia profiler and **nvvp** is (GUI based) Nvidia visual profiler. -In this tutorial, we will focus on nvprof and nvvp. - -:code:`test_GpuProfiler` from :code:`paddle/legacy/math/tests` directory will be used to evaluate -above profilers. - -.. literalinclude:: ../../../../paddle/legacy/math/tests/test_GpuProfiler.cpp - :language: c++ - :lines: 137-151 - :linenos: - -The above code snippet includes two methods, you can use any of them to profile the regions of interest. - -1. :code:`REGISTER_TIMER_INFO` is a built-in timer wrapper which can calculate the time overhead of both cpu functions and cuda kernels. - -2. :code:`REGISTER_GPU_PROFILER` is a general purpose wrapper object of :code:`cudaProfilerStart` and :code:`cudaProfilerStop` to avoid -program crashes when CPU version of PaddlePaddle invokes them. - -You can find more details about how to use both of them in the next session. - -Hands-on Approach -================= - -Built-in Timer --------------- - -To enable built-in timer in PaddlePaddle, first you have to add :code:`REGISTER_TIMER_INFO` into the regions of you interest. -Then, all information could be stamped in the console via :code:`printStatus` or :code:`printAllStatus` function. -As a simple example, consider the following: - -1. Add :code:`REGISTER_TIMER_INFO` and :code:`printAllStatus` functions (see the emphasize-lines). - - .. literalinclude:: ../../../../paddle/legacy/math/tests/test_GpuProfiler.cpp - :language: c++ - :lines: 137-151 - :emphasize-lines: 8-12,14 - :linenos: - -2. Configure cmake with **WITH_TIMER** and recompile PaddlePaddle. - - .. code-block:: bash - - cmake .. -DWITH_TIMER=ON - make - -3. Execute your code and observe the results (see the emphasize-lines). - - .. code-block:: bash - :emphasize-lines: 1,12-15 - - > ./paddle/legacy/math/tests/test_GpuProfiler - I1117 11:13:42.313065 2522362816 Util.cpp:155] commandline: ./paddle/legacy/math/tests/test_GpuProfiler - I1117 11:13:42.845065 2522362816 Util.cpp:130] Calling runInitFunctions - I1117 11:13:42.845208 2522362816 Util.cpp:143] Call runInitFunctions done. - [==========] Running 1 test from 1 test case. - [----------] Global test environment set-up. - [----------] 1 test from Profiler - [ RUN ] Profiler.BilinearFwdBwd - I1117 11:13:42.845310 2522362816 test_GpuProfiler.cpp:114] Enable GPU Profiler Stat: [testBilinearFwdBwd] "numSamples = 10, channels = 16, im - gSizeX = 64, imgSizeY = 64" - I1117 11:13:42.850154 2522362816 ThreadLocal.cpp:37] thread use undeterministic rand seed:20659751 - I1117 11:13:42.981501 2522362816 Stat.cpp:130] ======= StatSet: [GlobalStatInfo] status ====== - I1117 11:13:42.981539 2522362816 Stat.cpp:133] Stat=testBilinearFwdBwd total=136.141 avg=136.141 max=136.141 min=136.141 count=1 - I1117 11:13:42.981572 2522362816 Stat.cpp:141] ======= BarrierStatSet status ====== - I1117 11:13:42.981575 2522362816 Stat.cpp:154] -------------------------------------------------- - [ OK ] Profiler.BilinearFwdBwd (136 ms) - [----------] 1 test from Profiler (136 ms total) - - [----------] Global test environment tear-down - [==========] 1 test from 1 test case ran. (136 ms total) - [ PASSED ] 1 test. - -nvprof profiler ---------------- - -To use this command line profiler **nvprof**, you can simply issue the following command: - -1. Add :code:`REGISTER_GPU_PROFILER` function (see the emphasize-lines). - - .. literalinclude:: ../../../../paddle/legacy/math/tests/test_GpuProfiler.cpp - :language: c++ - :lines: 137-151 - :emphasize-lines: 6-7 - :linenos: - -2. Configure cmake with **WITH_PROFILER** and recompile PaddlePaddle. - - .. code-block:: bash - - cmake .. -DWITH_PROFILER=ON - make - -3. Use Nvidia profiler **nvprof** to profile the binary. - - .. code-block:: bash - - nvprof ./paddle/legacy/math/tests/test_GpuProfiler - -Then, you can get the following profiling result: - -.. code-block:: bash - - ==78544== Profiling application: ./paddle/legacy/math/tests/test_GpuProfiler - ==78544== Profiling result: - Time(%) Time Calls Avg Min Max Name - 27.60% 9.6305ms 5 1.9261ms 3.4560us 6.4035ms [CUDA memcpy HtoD] - 26.07% 9.0957ms 1 9.0957ms 9.0957ms 9.0957ms KeBilinearInterpBw - 23.78% 8.2977ms 1 8.2977ms 8.2977ms 8.2977ms KeBilinearInterpFw - 22.55% 7.8661ms 2 3.9330ms 1.5798ms 6.2863ms [CUDA memcpy DtoH] - - ==78544== API calls: - Time(%) Time Calls Avg Min Max Name - 46.85% 682.28ms 8 85.285ms 12.639us 682.03ms cudaStreamCreateWithFlags - 39.83% 580.00ms 4 145.00ms 302ns 550.27ms cudaFree - 9.82% 143.03ms 9 15.892ms 8.7090us 142.78ms cudaStreamCreate - 1.23% 17.983ms 7 2.5690ms 23.210us 6.4563ms cudaMemcpy - 1.23% 17.849ms 2 8.9247ms 8.4726ms 9.3768ms cudaStreamSynchronize - 0.66% 9.5969ms 7 1.3710ms 288.43us 2.4279ms cudaHostAlloc - 0.13% 1.9530ms 11 177.54us 7.6810us 591.06us cudaMalloc - 0.07% 1.0424ms 8 130.30us 1.6970us 453.72us cudaGetDevice - 0.04% 527.90us 40 13.197us 525ns 253.99us cudaEventCreateWithFlags - 0.03% 435.73us 348 1.2520us 124ns 42.704us cuDeviceGetAttribute - 0.03% 419.36us 1 419.36us 419.36us 419.36us cudaGetDeviceCount - 0.02% 260.75us 2 130.38us 129.32us 131.43us cudaGetDeviceProperties - 0.02% 222.32us 2 111.16us 106.94us 115.39us cudaLaunch - 0.01% 214.06us 4 53.514us 28.586us 77.655us cuDeviceGetName - 0.01% 115.45us 4 28.861us 9.8250us 44.526us cuDeviceTotalMem - 0.01% 83.988us 4 20.997us 578ns 77.760us cudaSetDevice - 0.00% 38.918us 1 38.918us 38.918us 38.918us cudaEventCreate - 0.00% 34.573us 31 1.1150us 279ns 12.784us cudaDeviceGetAttribute - 0.00% 17.767us 1 17.767us 17.767us 17.767us cudaProfilerStart - 0.00% 15.228us 2 7.6140us 3.5460us 11.682us cudaConfigureCall - 0.00% 14.536us 2 7.2680us 1.1490us 13.387us cudaGetLastError - 0.00% 8.6080us 26 331ns 173ns 783ns cudaSetupArgument - 0.00% 5.5470us 6 924ns 215ns 2.6780us cuDeviceGet - 0.00% 5.4090us 6 901ns 328ns 3.3320us cuDeviceGetCount - 0.00% 4.1770us 3 1.3920us 1.0630us 1.8300us cuDriverGetVersion - 0.00% 3.4650us 3 1.1550us 1.0810us 1.2680us cuInit - 0.00% 830ns 1 830ns 830ns 830ns cudaRuntimeGetVersion - - -nvvp profiler -------------- - -For visual profiler **nvvp**, you can either import the output of :code:`nvprof –o ...` or -run application through GUI. - -**Note: nvvp also support CPU profiling** (Click the box in nvvp to enable profile execution on CPU). - -.. image:: nvvp1.png - :align: center - :scale: 33% - -From the perspective of kernel functions, **nvvp** can even illustrate why does an operation take a long time? -As shown in the following figure, kernel's block usage, register usage and shared memory usage from :code:`nvvp` -allow us to fully utilize all warps on the GPU. - -.. image:: nvvp2.png - :align: center - :scale: 33% - -From the perspective of application, **nvvp** can give you some suggestions to address performance bottleneck. -For instance, some advice in data movement and compute utilization from the below figure can guide you to tune performance. - -.. image:: nvvp3.png - :align: center - :scale: 33% - -.. image:: nvvp4.png - :align: center - :scale: 33% - -Profiling tips -============== - -- The **nvprof** and **nvvp** output is a very good place to start. -- The timeline is a good place to go next. -- Only dig deep into a kernel if it’s taking a significant amount of your time. -- Where possible, try to match profiler output with theory. - 1) For example, if I know I’m moving 1GB, and my kernel takes 10ms, I expect the profiler to report 100GB/s. - 2) Discrepancies are likely to mean your application isn’t doing what you thought it was. -- Know your hardware: If your GPU can do 6 TFLOPs, and you’re already doing 5.5 TFLOPs, you won’t go much faster! - - -Profiling is a key step in optimization. Sometimes quite simple changes can lead to big improvements in performance. -Your mileage may vary! - -Reference -========= -Jeremy Appleyard, `GPU Profiling for Deep Learning `_, 2015 diff --git a/doc/v2/howto/optimization/nvvp1.png b/doc/v2/howto/optimization/nvvp1.png deleted file mode 100644 index 1af23ac3c52929b2b0645d2f9fa4d4c6db1f6e77..0000000000000000000000000000000000000000 Binary files a/doc/v2/howto/optimization/nvvp1.png and /dev/null differ diff --git a/doc/v2/howto/optimization/nvvp2.png b/doc/v2/howto/optimization/nvvp2.png deleted file mode 100644 index 177c9db708da6863d1075f3e615f5962dbe18b29..0000000000000000000000000000000000000000 Binary files a/doc/v2/howto/optimization/nvvp2.png and /dev/null differ diff --git a/doc/v2/howto/optimization/nvvp3.png b/doc/v2/howto/optimization/nvvp3.png deleted file mode 100644 index d8f393667d6569b6f1e61ffccac43fae5888b6db..0000000000000000000000000000000000000000 Binary files a/doc/v2/howto/optimization/nvvp3.png and /dev/null differ diff --git a/doc/v2/howto/optimization/nvvp4.png b/doc/v2/howto/optimization/nvvp4.png deleted file mode 100644 index 51f2f3e183295de6cf8ddaf2b3b8a0862aa35f01..0000000000000000000000000000000000000000 Binary files a/doc/v2/howto/optimization/nvvp4.png and /dev/null differ diff --git a/doc/v2/howto/rnn/hierarchical_layer_cn.rst b/doc/v2/howto/rnn/hierarchical_layer_cn.rst deleted file mode 100644 index 2f8f408b40299890da694862a7b9418cf9ff07f2..0000000000000000000000000000000000000000 --- a/doc/v2/howto/rnn/hierarchical_layer_cn.rst +++ /dev/null @@ -1,89 +0,0 @@ -########################### -支持双层序列作为输入的Layer -########################### - -.. contents:: - -概述 -==== - -在自然语言处理任务中,序列是一种常见的数据类型。一个独立的词语,可以看作是一个非序列输入,或者,我们称之为一个0层的序列;由词语构成的句子,是一个单层序列;若干个句子构成一个段落,是一个双层的序列。 - -双层序列是一个嵌套的序列,它的每一个元素,又是一个单层的序列。这是一种非常灵活的数据组织方式,帮助我们构造一些复杂的输入信息。 - -我们可以按照如下层次定义非序列,单层序列,以及双层序列。 - -+ 0层序列:一个独立的元素,类型可以是PaddlePaddle支持的任意输入数据类型 -+ 单层序列:排成一列的多个元素,每个元素是一个0层序列,元素之间的顺序是重要的输入信息 -+ 双层序列:排成一列的多个元素,每个元素是一个单层序列,称之为双层序列的一个子序列(subseq),subseq的每个元素是一个0层序列 - -在 PaddlePaddle中,下面这些Layer能够接受双层序列作为输入,完成相应的计算。 - -pooling -======== - -pooling 的使用示例如下。 - -.. code-block:: bash - - seq_pool = pooling(input=layer, - pooling_type=pooling.Max(), - agg_level=AggregateLevel.TO_SEQUENCE) - -- `pooling_type` 目前支持两种,分别是:pooling.Max()和pooling.Avg()。 - -- `agg_level=AggregateLevel.TO_NO_SEQUENCE` 时(默认值): - - - 作用:双层序列经过运算变成一个0层序列,或单层序列经过运算变成一个0层序列 - - 输入:一个双层序列,或一个单层序列 - - 输出:一个0层序列,即整个输入序列(单层或双层)的平均值(或最大值) - -- `agg_level=AggregateLevel.TO_SEQUENCE` 时: - - - 作用:一个双层序列经过运算变成一个单层序列 - - 输入:必须是一个双层序列 - - 输出:一个单层序列,序列的每个元素是原来双层序列每个subseq元素的平均值(或最大值) - -last_seq 和 first_seq -===================== - -last_seq 的使用示例如下(first_seq 类似)。 - -.. code-block:: bash - - last = last_seq(input=layer, - agg_level=AggregateLevel.TO_SEQUENCE) - -- `agg_level=AggregateLevel.TO_NO_SEQUENCE` 时(默认值): - - - 作用:一个双层序列经过运算变成一个0层序列,或一个单层序列经过运算变成一个0层序列 - - 输入:一个双层序列或一个单层序列 - - 输出:一个0层序列,即整个输入序列(双层或者单层)最后一个,或第一个元素。 - -- `agg_level=AggregateLevel.TO_SEQUENCE` 时: - - 作用:一个双层序列经过运算变成一个单层序列 - - 输入:必须是一个双层序列 - - 输出:一个单层序列,其中每个元素是双层序列中每个subseq最后一个(或第一个)元素。 - -expand -====== - -expand 的使用示例如下。 - -.. code-block:: bash - - ex = expand(input=layer1, - expand_as=layer2, - expand_level=ExpandLevel.FROM_NO_SEQUENCE) - -- `expand_level=ExpandLevel.FROM_NO_SEQUENCE` 时(默认值): - - - 作用:一个0层序列经过运算扩展成一个单层序列,或者一个双层序列 - - 输入:layer1必须是一个0层序列,是待扩展的数据;layer2 可以是一个单层序列,或者是一个双层序列,提供扩展的长度信息 - - 输出:一个单层序列或一个双层序列,输出序列的类型(双层序列或单层序列)和序列中含有元素的数目同 layer2 一致。若输出是单层序列,单层序列的每个元素(0层序列),都是对layer1元素的拷贝;若输出是双层序列,双层序列每个subseq中每个元素(0层序列),都是对layer1元素的拷贝 - -- `expand_level=ExpandLevel.FROM_SEQUENCE` 时: - - - 作用:一个单层序列经过运算扩展成一个双层序列 - - 输入:layer1必须是一个单层序列,是待扩展的数据;layer2 必须是一个双层序列,提供扩展的长度信息 - - 输出:一个双层序列,序列中含有元素的数目同 layer2 一致。要求单层序列含有元素的数目(0层序列)和双层序列含有subseq 的数目一致。单层序列第i个元素(0层序列),被扩展为一个单层序列,构成了输出双层序列的第i个 subseq 。 diff --git a/doc/v2/howto/rnn/hierarchical_layer_en.rst b/doc/v2/howto/rnn/hierarchical_layer_en.rst deleted file mode 100644 index fb668f1babb47f49b2dab6d2411565e99599d8b0..0000000000000000000000000000000000000000 --- a/doc/v2/howto/rnn/hierarchical_layer_en.rst +++ /dev/null @@ -1,89 +0,0 @@ -########################### -Layers that Support Hierarchical Sequences as Input -########################### -  -.. contents:: -  -Overview -==== -  -A sequence is a common data type in natural language processing tasks. An independent word can be regarded as a non-sequential input or a 0-level sequence. A sentence made up of words is a single-level sequence; a number of sentences make up a paragraph, which is a double-level sequence. -  -A double-level sequence is a nested sequence where each element is a single-level sequence. This is a very flexible way of organizing data that helps us construct some complex input information. -  -We can define non-sequences, single-level sequences, and double-level sequences at the following levels. -  -+ 0-level sequence: an independent element. Its type can be any input data type supported by PaddlePaddle; -+ Single-level sequence: multiple elements arranged in a row; each element is a 0-level sequence. The order of elements is an important input information; -+ Double-level sequence: multiple elements arranged in a row; each element is a single-layer sequence called a subseq of a double-level sequence, and each element of the subseq is a 0-level sequence. -  -In PaddlePaddle, the following layers accept double-layer sequences as input and perform corresponding calculations. -  -`pooling` -======== -  -The use of pooling is as follows: -  -.. code-block:: bash -  -        Seq_pool = pooling(input=layer, -                           Pooling_type=pooling.Max(), -                           Agg_level=AggregateLevel.TO_SEQUENCE) -         -- `pooling_type` currently supports two types: pooling.Max() and pooling.Avg(). -  -- When ʻagg_level=AggregateLevel.TO_NO_SEQUENCE` (default): -  -  - Effect: a double-level sequence input will be converted into a 0-level sequence, and a single-level sequence will be converted into a 0-level sequence  -  - Input: a double-level sequence or a single-level sequence -  - Output: a 0-level sequence which is the average (or maximum) of the entire input sequence (single or double) -  -- When ʻagg_level=AggregateLevel.TO_SEQUENCE`: -  -  - Effect: a double-level sequence will be transformed into a single-level sequence -  - Input: a double-level sequence -  - Output: a single-level sequence where each element of the sequence is the average (or maximum) value of each subseq element of the original double-level sequence. -  -`last_seq` and `first_seq` -===================== -  -An example of using `last_seq` is as follows (usage of `first_seq` is similar). -  -.. code-block:: bash -  -        Last = last_seq(input=layer, -                        Agg_level=AggregateLevel.TO_SEQUENCE) -         -- When ʻagg_level=AggregateLevel.TO_NO_SEQUENCE` (default): -  -  - Effect: a double-level sequence input will be converted into a 0-level sequence, and a single-level sequence will be converted into a 0-level sequence -  - Input: a double-level sequence or a single-level sequence -  - Output: a 0-level sequence, which is the last or the first element of the input sequence (double or single level). -  -- When ʻagg_level=AggregateLevel.TO_SEQUENCE`: -  - Effect: a double-level sequence will be transformed into a single-level sequence -  - Input: a double-level sequence -  - Output: a single-layer sequence in which each element is the last (or first) element of each subseq in a double-level sequence. -  -`expand` -====== -  -The use of expand is as follows. -  -.. code-block:: bash -  -        Ex = expand(input=layer1, -                    Expand_as=layer2, -                    Expand_level=ExpandLevel.FROM_NO_SEQUENCE) -         -- When `expand_level=ExpandLevel.FROM_NO_SEQUENCE` (default): -  -  - Effect: a 0-level sequence is extended to a single-level sequence or a double-level sequence -  - Input: layer1 must be a 0-level sequence to be extended; layer2 can be a single-level sequence or a double-level sequence that provides the extended length information -  - Output: a single-level sequence or a double-level sequence; the type of the output sequence and the number of elements contained in the sequence are the same as layer2. If the output is a single-level sequence, each element of the single-level sequence will be a copy of the layer1 element. If the output is a double-level sequence, each element in the double-level sequence will be a copy of the layer1 element -  -- When `expand_level=ExpandLevel.FROM_SEQUENCE`: -  -  - Effect: a single-level sequence is extended to a double-level sequence -  - Input: layer1 must be a single-level sequence to be extended; layer2 must be a double-level sequence providing extended length information -  - Output: a double-level sequence with the same number of elements as that of layer2. It is required that the number of elements in the single-level sequence be the same as the number of subseq in the double-level sequences. The i-th element of the single-level sequence (the 0-level sequence) is expanded into a single-level sequence that constitutes the i-th subseq of the output, the double-level sequence. diff --git a/doc/v2/howto/rnn/hrnn_rnn_api_compare_cn.rst b/doc/v2/howto/rnn/hrnn_rnn_api_compare_cn.rst deleted file mode 100644 index 9d6d417075485dceb1ee71f527b408aa6a6638ea..0000000000000000000000000000000000000000 --- a/doc/v2/howto/rnn/hrnn_rnn_api_compare_cn.rst +++ /dev/null @@ -1,226 +0,0 @@ -.. _algo_hrnn_rnn_api_compare: - -##################### -单双层RNN API对比介绍 -##################### - -本文以PaddlePaddle的双层RNN单元测试为示例,用多对效果完全相同的、分别使用单双层RNN作为网络配置的模型,来讲解如何使用双层RNN。本文中所有的例子,都只是介绍双层RNN的API接口,并不是使用双层RNN解决实际的问题。如果想要了解双层RNN在具体问题中的使用,请参考\ :ref:`algo_hrnn_demo`\ 。本文中示例所使用的单元测试文件是\ `test_RecurrentGradientMachine.cpp `_\ 。 - -示例1:双层RNN,子序列间无Memory -================================ - -在双层RNN中的经典情况是将内层的每一个时间序列数据,分别进行序列操作;并且内层的序列操作之间独立无依赖,即不需要使用Memory\ 。 - -在本示例中,单层RNN和双层RNN的网络配置,都是将每一句分好词后的句子,使用LSTM作为encoder,压缩成一个向量。区别是RNN使用两层序列模型,将多句话看成一个整体同时使用encoder压缩。二者语意上完全一致。这组语义相同的示例配置如下: - -* 单层RNN\: `sequence_layer_group.conf `_ -* 双层RNN\: `sequence_nest_layer_group.conf `_ - - -读取双层序列数据 ----------------- - -首先,本示例中使用的原始数据如下\: - -- 本例中的原始数据一共有10个样本。每个样本由两部分组成,一个label(此处都为2)和一个已经分词后的句子。这个数据也被单层RNN网络直接使用。 - -.. literalinclude:: ../../../../paddle/legacy/gserver/tests/Sequence/tour_train_wdseg - :language: text - - -- 双层序列数据一共有4个样本。 每个样本间用空行分开,整体数据和原始数据完全一样。但于双层序列的LSTM来说,第一个样本同时encode两条数据成两个向量。这四条数据同时处理的句子数量为\ :code:`[2, 3, 2, 3]`\ 。 - -.. literalinclude:: ../../../../paddle/legacy/gserver/tests/Sequence/tour_train_wdseg.nest - :language: text - -其次,对于两种不同的输入数据类型,不同DataProvider对比如下(`sequenceGen.py `_)\: - -.. literalinclude:: ../../../../paddle/legacy/gserver/tests/sequenceGen.py - :language: python - :lines: 21-39 - :linenos: - -- 这是普通的单层时间序列的DataProvider代码,其说明如下: - - * DataProvider共返回两个数据,分别是words和label。即上述代码中的第19行。 - - - words是原始数据中的每一句话,所对应的词表index数组。它是integer_value_sequence类型的,即整数数组。words即为这个数据中的单层时间序列。 - - label是原始数据中对于每一句话的分类标签,它是integer_value类型的。 - -.. literalinclude:: ../../../../paddle/legacy/gserver/tests/sequenceGen.py - :language: python - :lines: 42-71 - :linenos: - -- 对于同样的数据,双层时间序列的DataProvider的代码。其说明如下: - - - DataProvider共返回两组数据,分别是sentences和labels。即在双层序列的原始数据中,每一组内的所有句子和labels - - sentences是双层时间序列的数据。由于它内部包含了每组数据中的所有句子,且每个句子表示为对应的词表索引数组,因此它是integer_value_sub_sequence 类型的,即双层时间序列。 - - labels是每组内每个句子的标签,故而是一个单层时间序列。 - - -模型配置的模型配置 ------------------------------------------- - -首先,我们看一下单层RNN的配置。代码中9-15行(高亮部分)即为单层RNN序列的使用代码。这里使用了PaddlePaddle预定义好的RNN处理函数。在这个函数中,RNN对于每一个时间步通过了一个LSTM网络。 - -.. literalinclude:: ../../../../paddle/legacy/gserver/tests/sequence_layer_group.conf - :language: python - :lines: 38-63 - :linenos: - :emphasize-lines: 9-15 - - -其次,我们看一下语义相同的双层RNN的网络配置\: - -* PaddlePaddle中的许多layer并不在意输入是否是时间序列,例如\ :code:`embedding_layer`\ 。在这些layer中,所有的操作都是针对每一个时间步来进行的。 - -* 在该配置的7-26行(高亮部分),将双层时间序列数据先变换成单层时间序列数据,再对每一个单层时间序列进行处理。 - - * 使用\ :code:`recurrent_group`\ 这个函数进行变换,在变换时需要将输入序列传入。由于我们想要的变换是双层时间序列=> 单层时间序列,所以我们需要将输入数据标记成\ :code:`SubsequenceInput`\ 。 - - * 在本例中,我们将原始数据的每一组,通过\ :code:`recurrent_group`\ 进行拆解,拆解成的每一句话再通过一个LSTM网络。这和单层RNN的配置是等价的。 - -* 与单层RNN的配置类似,我们只需要使用LSTM encode成的最后一个向量。所以对\ :code:`recurrent_group`\ 进行了\ :code:`last_seq`\ 操作。但和单层RNN不同,我们是对每一个子序列取最后一个元素,因此\ :code:`agg_level=AggregateLevel.TO_SEQUENCE`\ 。 - -* 至此,\ :code:`lstm_last`\ 便和单层RNN配置中的\ :code:`lstm_last`\ 具有相同的结果了。 - -.. literalinclude:: ../../../../paddle/legacy/gserver/tests/sequence_nest_layer_group.conf - :language: python - :lines: 38-64 - :linenos: - :emphasize-lines: 7-26 - -示例2:双层RNN,子序列间有Memory -================================ - -本示例意图使用单层RNN和双层RNN实现两个完全等价的全连接RNN。 - -* 对于单层RNN,输入数据为一个完整的时间序列,例如\ :code:`[4, 5, 2, 0, 9, 8, 1, 4]`\ 。 - -* 对于双层RNN,输入数据为在单层RNN数据里面,任意将一些数据组合成双层时间序列,例如\ :code:`[ [4, 5, 2], [0, 9], [8, 1, 4]]`。 - -模型配置的模型配置 ------------------- - -我们选取单双层序列配置中的不同部分,来对比分析两者语义相同的原因。 - -- 单层RNN:过了一个很简单的recurrent_group。每一个时间步,当前的输入y和上一个时间步的输出rnn_state做了一个全链接。 - -.. literalinclude:: ../../../../paddle/legacy/gserver/tests/sequence_rnn.conf - :language: python - :lines: 36-48 - -- 双层RNN,外层memory是一个元素: - - - 内层inner_step的recurrent_group和单层序列的几乎一样。除了boot_layer=outer_mem,表示将外层的outer_mem作为内层memory的初始状态。外层outer_step中,outer_mem是一个子句的最后一个向量,即整个双层group是将前一个子句的最后一个向量,作为下一个子句memory的初始状态。 - - 从输入数据上看,单双层序列的句子是一样的,只是双层序列将其又做了子序列划分。因此双层序列的配置中,必须将前一个子句的最后一个元素,作为boot_layer传给下一个子句的memory,才能保证和单层序列的配置中“每个时间步都用了上一个时间步的输出结果”一致。 - -.. literalinclude:: ../../../../paddle/legacy/gserver/tests/sequence_nest_rnn.conf - :language: python - :lines: 39-66 - -.. warning:: - PaddlePaddle目前只支持在每个时间步中,Memory的时间序列长度一致的情况。 - -示例3:双层RNN,输入不等长 -========================== - -.. role:: red - -.. raw:: html - - - -**输入不等长** 是指recurrent_group的多个输入序列,在每个时间步的子序列长度可以不相等。但序列输出时,需要指定与某一个输入的序列信息是一致的。使用\ :red:`targetInlink`\ 可以指定哪一个输入和输出序列信息一致,默认指定第一个输入。 - -示例3的配置分别为\ `单层不等长RNN `_\ 和\ `双层不等长RNN `_\ 。 - -示例3对于单层RNN和双层RNN数据完全相同。 - -* 对于单层RNN的数据一共有两个样本,他们分别是\ :code:`[1, 2, 4, 5, 2], [5, 4, 1, 3, 1]`\ 和\ :code:`[0, 2, 2, 5, 0, 1, 2], [1, 5, 4, 2, 3, 6, 1]`\ 。对于每一个单层RNN的数据,均有两组特征。 - -* 在单层数据的基础上,双层RNN数据随意加了一些隔断,例如将第一条数据转化为\ :code:`[[0, 2], [2, 5], [0, 1, 2]],[[1, 5], [4], [2, 3, 6, 1]]`\ 。 - -* 需要注意的是PaddlePaddle目前只支持子序列数目一样的多输入双层RNN。例如本例中的两个特征,均有三个子序列。每个子序列长度可以不一致,但是子序列的数目必须一样。 - - -模型配置 --------- - -和示例2中的配置类似,示例3的配置使用了单层RNN和双层RNN,实现两个完全等价的全连接RNN。 - -* 单层RNN\: - -.. literalinclude:: ../../../../paddle/legacy/gserver/tests/sequence_rnn_multi_unequalength_inputs.py - :language: python - :lines: 42-59 - :linenos: - -* 双层RNN\ \: - -.. literalinclude:: ../../../../paddle/legacy/gserver/tests/sequence_nest_rnn_multi_unequalength_inputs.py - :language: python - :lines: 41-80 - :linenos: - -在上面代码中,单层和双层序列的使用和示例2中的示例类似,区别是同时处理了两个输入。而对于双层序列,两个输入的子序列长度也并不相同。但是,我们使用了\ :code:`targetInlink`\ 参数设置了外层\ :code:`recurrent_group`\ 的输出格式。所以外层输出的序列形状,和\ :code:`emb2`\ 的序列形状一致。 - - -词汇表 -====== - -.. _glossary_memory: - -Memory ------- - -Memory是PaddlePaddle实现RNN时候使用的一个概念。RNN即时间递归神经网络,通常要求时间步之间具有一些依赖性,即当前时间步下的神经网络依赖前一个时间步神经网络中某一个神经元输出。如下图所示。 - -.. graphviz:: src/glossary_rnn.dot - -上图中虚线的连接,即是跨越时间步的网络连接。PaddlePaddle在实现RNN的时候,将这种跨越时间步的连接用一个特殊的神经网络单元实现。这个神经网络单元就叫Memory。Memory可以缓存上一个时刻某一个神经元的输出,然后在下一个时间步输入给另一个神经元。使用Memory的RNN实现便如下图所示。 - -.. graphviz:: src/glossary_rnn_with_memory.dot - -使用这种方式,PaddlePaddle可以比较简单的判断哪些输出是应该跨越时间步的,哪些不是。 - -.. _glossary_timestep: - -时间步 ------- - -参考时间序列。 - - -.. _glossary_sequence: - -时间序列 --------- - -时间序列(time series)是指一系列的特征数据。这些特征数据之间的顺序是有意义的。即特征的数组,而不是特征的集合。而这每一个数组元素,或者每一个系列里的特征数据,即为一个时间步(time step)。值得注意的是,时间序列、时间步的概念,并不真正的和『时间』有关。只要一系列特征数据中的『顺序』是有意义的,即为时间序列的输入。 - -举例说明,例如文本分类中,我们通常将一句话理解成一个时间序列。比如一句话中的每一个单词,会变成词表中的位置。而这一句话就可以表示成这些位置的数组。例如 :code:`[9, 2, 3, 5, 3]` 。 - -关于时间序列(time series)的更详细准确的定义,可以参考 `维基百科页面 Time series `_ 或者 `维基百科中文页面 时间序列 `_ 。 - -另外,Paddle中经常会将时间序列成为 :code:`Sequence` 。他们在Paddle的文档和API中是一个概念。 - -.. _glossary_RNN: - -RNN ---- - -RNN 在PaddlePaddle的文档中,一般表示 :code:`Recurrent neural network`,即时间递归神经网络。详细介绍可以参考 `维基百科页面 Recurrent neural network `_ 或者 `中文维基百科页面 `_ 中关于时间递归神经网络的介绍。 - -RNN 一般在PaddlePaddle中,指对于一个时间序列输入数据,每一个时间步之间的神经网络具有一定的相关性。例如,某一个神经元的一个输入为上一个时间步网络中某一个神经元的输出。或者,从每一个时间步来看,神经网络的网络结构中具有有向环结构。 - -.. _glossary_双层RNN: - -双层RNN -------- - -双层RNN顾名思义,即RNN之间有一次嵌套关系。输入数据整体上是一个时间序列,而对于每一个内层特征数据而言,也是一个时间序列。即二维数组,或者数组的数组这个概念。 而双层RNN是可以处理这种输入数据的网络结构。 - -例如,对于段落的文本分类,即将一段话进行分类。我们将一段话看成句子的数组,每个句子又是单词的数组。这便是一种双层RNN的输入数据。而将这个段落的每一句话用lstm编码成一个向量,再对每一句话的编码向量用lstm编码成一个段落的向量。再对这个段落向量进行分类,即为这个双层RNN的网络结构。 - diff --git a/doc/v2/howto/rnn/hrnn_rnn_api_compare_en.rst b/doc/v2/howto/rnn/hrnn_rnn_api_compare_en.rst deleted file mode 100644 index a4485f7b5edf21871444801230ab1ee191b1137b..0000000000000000000000000000000000000000 --- a/doc/v2/howto/rnn/hrnn_rnn_api_compare_en.rst +++ /dev/null @@ -1,226 +0,0 @@ -.. _algo_hrnn_rnn_api_compare: - -##################### -API comparision between RNN and hierarchical RNN -##################### - -This article takes PaddlePaddle's hierarchical RNN unit test as an example. We will use several examples to illestrate the usage of single-layer and hierarchical RNNs. Each example has two model configurations, one for single-layer, and the other for hierarchical RNN. Although the implementations are different, both the two model configurations' effects are the same. All of the examples in this article only describe the API interface of the hierarchical RNN, while we do not use this hierarchical RNN to solve practical problems. If you want to understand the use of hierarchical RNN in specific issues, please refer to \ :ref:`algo_hrnn_demo`\ 。The unit test file used in this article's example is \ `test_RecurrentGradientMachine.cpp `_\ 。 - -Example 1:Hierarchical RNN without Memory between subsequences -================================ - -The classical case in the hierarchical RNN is to perform sequence operations on each time series data in the inner layers seperately. And the sequence operations in the inner layers is independent, that is, it does not need to use Memory. - -In this example, the network configuration of single-layer RNNs and hierarchical RNNs are all to use LSTM as en encoder to compress a word-segmented sentence into a vector. The difference is that, RNN uses a hierarchical RNN model, treating multiple sentences as a whole to use encoder to compress simultaneously. They are completely consistent in their semantic meanings. This pair of semantically identical example configurations is as follows: - -* RNN\: `sequence_layer_group.conf `_ -* Hierarchical RNN\: `sequence_nest_layer_group.conf `_ - - -Reading hierarchical sequence data ----------------- - -Firstly, the original data in this example is as follows \: - -- The original data in this example has 10 samples. Each of the sample includes two components: a lable(all 2 here), and a word-segmented sentence. This data is used by single RNN as well. - -.. literalinclude:: ../../../../paddle/legacy/gserver/tests/Sequence/tour_train_wdseg - :language: text - - -- The data for hierarchical RNN has 4 samples. Every sample is seperated by a blank line, while the content of the data is the same as the original data. But as for hierarchical LSTM, the first sample will encode two sentences into two vectors simultaneously. The sentence count dealed simultaneously by this 4 samples are \ :code:`[2, 3, 2, 3]`\ . - -.. literalinclude:: ../../../../paddle/legacy/gserver/tests/Sequence/tour_train_wdseg.nest - :language: text - -Secondly, as for these two types of different input data formats, the contrast of different DataProviders are as follows (`sequenceGen.py `_)\: - -.. literalinclude:: ../../../../paddle/legacy/gserver/tests/sequenceGen.py - :language: python - :lines: 21-39 - :linenos: - -- This is the DataProvider code for an ordinary single-layer time series. Its description is as follows: - - * DataProvider returns two parts, that are "words" and "label",as line 19 in the above code. - - - "words" is a list of word table indices corresponding to each word in the sentence in the original data. Its data type is integer_value_sequence, that is integer list. So, "words" is a singler-layer time series in the data. - - "label" is the categorical label of each sentence, whose data type is integer_value. - -.. literalinclude:: ../../../../paddle/legacy/gserver/tests/sequenceGen.py - :language: python - :lines: 42-71 - :linenos: - -- As for the same data, the DataProvider code for hierarchical time series. Its description is as follows: - - - DataProvider returns two lists of data, that are "sentences" and "labels", corresponding to the sentences and labels in each group in the original data of hierarchical time series. - - "sentences" comes from the hierarchical time series original data. As it contains every sentences in each group internally, and each sentences are represented by a list of word table indices, so its data type is integer_value_sub_sequence, which is hierarchical time series. - - "labels" is the categorical lable of each sentence, so it is a sigle-layer time series. - - -Model configuration ------------------------------------------- - -Firstly, let's look at the configuration of single-layer RNN. The hightlighted part of line 9 to line 15 is the usage of single-layer RNN. Here we use the pre-defined RNN process function in PaddlePaddle. In this function, for each time step, RNN passes through an LSTM network. - -.. literalinclude:: ../../../../paddle/legacy/gserver/tests/sequence_layer_group.conf - :language: python - :lines: 38-63 - :linenos: - :emphasize-lines: 9-15 - - -Secondly, let's look at the model configuration of hierarchical RNN which has the same semantic meaning. \: - -* Most layers in PaddlePaddle do not care about whether the input is time series or not, e.g. \ :code:`embedding_layer`\ . In these layers, every operation is processed on each time step. - -* In the hightlighted part of line 7 to line 26 of this configuration, we transform the hierarchical time series data into single-layer time series data, then process each single-layer time series. - - * Use the function \ :code:`recurrent_group`\ to transform. Input sequences need to be passed in when transforming. As we want to transform hierarchical time series into single-layer sequences, we need to lable the input data as \ :code:`SubsequenceInput`\ . - - * In this example, we disassemble every group of the original data into sentences using \ :code:`recurrent_group`\ . Each of the disassembled sentences passes through an LSTM network. This is equivalent to single-layer RNN configuration. - -* Similar to single-layer RNN configuration, we only use the last vector after the encode of LSTM. So we use the operation of \ :code:`last_seq`\ to \ :code:`recurrent_group`\ . But unlike single-layer RNN, we use the last element of every subsequence, so we need to set \ :code:`agg_level=AggregateLevel.TO_SEQUENCE`\ . - -* Till now, \ :code:`lstm_last`\ has the same result as \ :code:`lstm_last`\ in single-layer RNN configuration. - -.. literalinclude:: ../../../../paddle/legacy/gserver/tests/sequence_nest_layer_group.conf - :language: python - :lines: 38-64 - :linenos: - :emphasize-lines: 7-26 - -Example 2:Hierarchical RNN with Memory between subsequences -================================ - -This example is intended to implement two fully-equivalent fully-connected RNNs using single-layer RNN and hierarchical RNN. - -* As for single-layer RNN, input is a full time series, e.g. \ :code:`[4, 5, 2, 0, 9, 8, 1, 4]`\ . - -* As for hierarchical RNN, input is a hierarchical time series which elements are arbitrarily combination of data in single-layer RNN, e.g. \ :code:`[ [4, 5, 2], [0, 9], [8, 1, 4]]`. - -model configuration ------------------- - -We select the different parts between single-layer RNN and hierarchical RNN configurations, to compare and analyze the reason why they have same semantic meanings. - -- single-layer RNN:passes through a simple recurrent_group. For each time step, the current input y and the last time step's output rnn_state pass through a fully-connected layer. - -.. literalinclude:: ../../../../paddle/legacy/gserver/tests/sequence_rnn.conf - :language: python - :lines: 36-48 - -- hierarchical RNN, the outer layer's memory is an element. - - - The recurrent_group of inner layer's inner_step is nearly the same as single-layer sequence, except for the case of boot_layer=outer_mem, which means using the outer layer's outer_mem as the initial state for the inner layer's memory. In the outer layer's out_step, outer_mem is the last vector of a subsequence, that is, the whole hierarchical group uses the last vector of the previous subsequence as the initial state for the next subsequence's memory. - - From the aspect of the input data, sentences from single-layer and hierarchical RNN are the same. The only difference is that, hierarchical RNN disassembes the sequence into subsequences. So in the hierarchical RNN configuration, we must use the last element of the previous subsequence as a boot_layer for the memory of the next subsequence, so that it makes no difference with "every time step uses the output of last time step" in the sigle-layer RNN configuration. - -.. literalinclude:: ../../../../paddle/legacy/gserver/tests/sequence_nest_rnn.conf - :language: python - :lines: 39-66 - -.. warning:: - Currently PaddlePaddle only supports the case that the lengths of the time series of Memory in each time step are the same. - -Example 3:hierarchical RNN with unequal length inputs -========================== - -.. role:: red - -.. raw:: html - - - -**unequal length inputs** means in the multiple input sequences of recurrent_group, the lengths of subsequences can be unequal. But the output of the sequence, needs to be consistent with one of the input sequences. Using \ :red:`targetInlink`\ can help you specify which of the input sequences and the output sequence can be consistent, by default is the first input. - -The configurations of Example 3 are \ `sequence_rnn_multi_unequalength_inputs `_ \ and \ `sequence_nest_rnn_multi_unequalength_inputs `_\ . - -The data for the configurations of Example 3's single-layer RNN and hierarchical RNN are exactly the same. - -* For the single-layer RNN, the data has two samples, which are \ :code:`[1, 2, 4, 5, 2], [5, 4, 1, 3, 1]`\ and \ :code:`[0, 2, 2, 5, 0, 1, 2], [1, 5, 4, 2, 3, 6, 1]`\ . Each of the data for the single-layer RNN has two group of features. - -* On the basis of the single-layer's data, hierarchical RNN's data randomly adds some partitions. For example, the first sample is transformed to \ :code:`[[0, 2], [2, 5], [0, 1, 2]],[[1, 5], [4], [2, 3, 6, 1]]`\ . - -* You need to pay attention that, PaddlePaddle only supports multiple input hierarchical RNNs that have same amount of subsequences currently. In this example, the two features both have 3 subsequences. Although the length of each subsequence can be different, the amount of subsequences should be the same. - - -model configuration --------- - -Similar to Example 2's configuration, Example 3's configuration uses single-layer and hierarchical RNN to implement 2 fully-equivalent fully-connected RNNs. - -* single-layer RNN\: - -.. literalinclude:: ../../../../paddle/legacy/gserver/tests/sequence_rnn_multi_unequalength_inputs.py - :language: python - :lines: 42-59 - :linenos: - -* hierarchical RNN\ \: - -.. literalinclude:: ../../../../paddle/legacy/gserver/tests/sequence_nest_rnn_multi_unequalength_inputs.py - :language: python - :lines: 41-80 - :linenos: - -In the above code, the usage of single-layer and hierarchical RNNs are similar to Example 2, which difference is that it processes 2 inputs simultaneously. As for the hierarchical RNN, the lengths of the 2 input's subsequences are not equal. But we use the parameter \ :code:`targetInlink` \ to set the outper layer's \ :code:`recurrent_group` \ 's output format, so the shape of outer layer's output is the same as the shape of \ :code:`emb2`\ . - - -Glossary -====== - -.. _glossary_memory: - -Memory ------- - -Memory is a concept when PaddlePaddle is implementing RNN. RNN, recurrent neural network, usually requires some dependency between time steps, that is, the neural network in current time step depends on one of the neurons in the neural network in previous time steps, as the following figure shows: - -.. graphviz:: src/glossary_rnn.dot - -The dotted connections in the figure, is the network connections across time steps. When PaddlePaddle is implementing RNN, this connection accross time steps is implemented using a special neural network unit, called Memory. Memory can cache the output of one of the neurons in previous time step, then can be passed to another neuron in next time step. The implementation of an RNN using Memory is as follows: - -.. graphviz:: src/glossary_rnn_with_memory.dot - -With this method, PaddlePaddle can easily determine which outputs should cross time steps, and which should not. - -.. _glossary_timestep: - -time step ------- - -refers to time series - - -.. _glossary_sequence: - -time series --------- - -Time series is a series of featured data. The order among these featured data is meaningful. So it is a list of features, not a set of features. As for each element of this list, or the featured data in each series, is called a time step. It must be noted that, the concepts of time series and time steps, are not necessarrily related to "time". As long as the "order" in a series of featured data is meaningful, it can be the input of time series. - -For example, in text classification task, we regard a sentence as a time series. So, each word in the sentence can become the index of the word in the word table. So this sentence can be represented as a list of these indices, e.g.:code:`[9, 2, 3, 5, 3]` . - -For a more detailed and accurate definition of the time series, please refer to `Wikipedia of Time series `_ or `Chinese Wikipedia of time series `_ . - -In additioin, Paddle always calls time series as :code:`Sequence` . They are a same concept in Paddle's documentations and APIs. - -.. _glossary_RNN: - -RNN ---- - -In PaddlePaddle's documentations, RNN is usually represented as :code:`Recurrent neural network` . For more information, please refer to `Wikipedia Recurrent neural network `_ or `Chinese Wikipedia `_ . - -In PaddlePaddle, RNN usually means, for the input data of a time series, the neural network between each time steps has a certain relevance. For example, the input of a certain neuron is the output of a certain neuron in the neural network of the last time step. Or, as for each time step, the network structure of the neural network has a directed ring structure. - -.. _glossary_hierarchical_RNN: - -hierarchical RNN -------- - -Hierarchical RNN, as the name suggests, means there is a nested relationship in RNNs. The input data is a time series, but for each of the inner featured data, it is also a time series, namely 2-dimentional array, or, array of array. Hierarchical RNN is a neural network that can process this type of input data. - -For example, the task of text classification of a paragragh, meaning to classify a paragraph of sentences. We can treat a paragraph as an array of sentences, and each sentence is an array of words. This is a type of the input data for the hierarchical RNN. We encode each sentence of this paragraph into a vector using LSTM, then encode each of the encoded vectors into a vector of this paragraph using LSTM. Finally we use this paragraph vector perform classification, which is the neural network structure of this hierarchical RNN. - diff --git a/doc/v2/howto/rnn/index_cn.rst b/doc/v2/howto/rnn/index_cn.rst deleted file mode 100644 index 2032fb9e296ab024c68da1348064580c8c88d5be..0000000000000000000000000000000000000000 --- a/doc/v2/howto/rnn/index_cn.rst +++ /dev/null @@ -1,34 +0,0 @@ -RNN模型 -=========== -循环神经网络(RNN)是对序列数据建模的重要工具。PaddlePaddle提供了灵活的接口以支持复杂循环神经网络的构建。 -这里将分为以下四个部分详细介绍如何使用PaddlePaddle搭建循环神经网络。 - -第一部分由浅入深的展示了使用PaddlePaddle搭建循环神经网络的全貌:首先以简单的循环神经网络(vanilla RNN)为例, -说明如何封装配置循环神经网络组件;然后更进一步的通过序列到序列(sequence to sequence)模型,逐步讲解如何构建完整而复杂的循环神经网络模型。 - -.. toctree:: - :maxdepth: 1 - - rnn_config_cn.rst - -Recurrent Group是PaddlePaddle中实现复杂循环神经网络的关键,第二部分阐述了PaddlePaddle中Recurrent Group的相关概念和原理, -对Recurrent Group接口进行了详细说明。另外,对双层RNN(对应的输入为双层序列)及Recurrent Group在其中的使用进行了介绍。 - -.. toctree:: - :maxdepth: 1 - - recurrent_group_cn.md - -第三部分对双层序列进行了解释说明,列出了PaddlePaddle中支持双层序列作为输入的Layer,并对其使用进行了逐一介绍。 - -.. toctree:: - :maxdepth: 1 - - hierarchical_layer_cn.rst - -第四部分以PaddlePaddle的双层RNN单元测试中的网络配置为示例,辅以效果相同的单层RNN网络配置作为对比,讲解了多种情况下双层RNN的使用。 - -.. toctree:: - :maxdepth: 1 - - hrnn_rnn_api_compare_cn.rst diff --git a/doc/v2/howto/rnn/index_en.rst b/doc/v2/howto/rnn/index_en.rst deleted file mode 100644 index 6e8b5c61b23ca2725dc0c9761c8dd4165033973c..0000000000000000000000000000000000000000 --- a/doc/v2/howto/rnn/index_en.rst +++ /dev/null @@ -1,32 +0,0 @@ -RNN Models -========== -Recurrent neural networks(RNN) are an important tool to model sequential data. PaddlePaddle provides flexible interface for building complex recurrent neural network. We will demonstrate how to use PaddlePaddle to build RNN models in the following 4 parts. - -In the first part, we will guide you how to configure recurrent neural network in PaddlePaddle from simple to complex. First, we will use a vanilla recurrent neural network as an example to show how to configure recurrent neural network architecture. Then We will use the sequence to sequence model as an example to demonstrate how you can configure complex recurrent neural network models gradually. - -.. toctree:: - :maxdepth: 1 - - rnn_config_en.rst - -Recurrent Group is the key unit to build complex recurrent neural network models. The second part describes related concepts and Basic principles of Recurrent Group, and give a detailed description of Recurrent Group API interface. In addition, it also introduces Sequence-level RNN(hierarchical sequence as input) and the usage of Recurrent Group in it. - -.. toctree:: - :maxdepth: 1 - - recurrent_group_en.md - -In the third part, two-level sequence is demonstrated briefly and then layers supporting two-level sequence as input are listed and described respectively. - -.. toctree:: - :maxdepth: 1 - - hierarchical_layer_en.rst - -In the last part, the unit test of hierarchical RNN is presented as an example to explain how to use hierarchical RNN. We will use two-level sequence RNN and single-layer sequence RNN which have same effects with former as the network configuration seperately in unit test. - -.. toctree:: - :maxdepth: 1 - - hrnn_rnn_api_compare_en.rst - diff --git a/doc/v2/howto/rnn/recurrent_group_cn.md b/doc/v2/howto/rnn/recurrent_group_cn.md deleted file mode 100644 index 06dc9e089ab2b2b926fcb1bd034262f2c846f06f..0000000000000000000000000000000000000000 --- a/doc/v2/howto/rnn/recurrent_group_cn.md +++ /dev/null @@ -1,96 +0,0 @@ -# Recurrent Group教程 - -## 概述 - -序列数据是自然语言处理任务面对的一种主要输入数据类型。 - -一句话是由词语构成的序列,多句话进一步构成了段落。因此,段落可以看作是一个嵌套的双层的序列,这个序列的每个元素又是一个序列。 - -双层序列是PaddlePaddle支持的一种非常灵活的数据组织方式,帮助我们更好地描述段落、多轮对话等更为复杂的语言数据。基于双层序列输入,我们可以设计搭建一个灵活的、层次化的RNN,分别从词语和句子级别编码输入数据,同时也能够引入更加复杂的记忆机制,更好地完成一些复杂的语言理解任务。 - -在PaddlePaddle中,`recurrent_group`是一种任意复杂的RNN单元,用户只需定义RNN在一个时间步内完成的计算,PaddlePaddle负责完成信息和误差在时间序列上的传播。 - -更进一步,`recurrent_group`同样可以扩展到双层序列的处理上。通过两个嵌套的`recurrent_group`分别定义子句级别和词语级别上需要完成的运算,最终实现一个层次化的复杂RNN。 - -目前,在PaddlePaddle中,能够对双向序列进行处理的有`recurrent_group`和部分Layer,具体可参考文档:支持双层序列作为输入的Layer。 - -## 相关概念 - -### 基本原理 -`recurrent_group` 是PaddlePaddle支持的一种任意复杂的RNN单元。使用者只需要关注于设计RNN在一个时间步之内完成的计算,PaddlePaddle负责完成信息和梯度在时间序列上的传播。 - -PaddlePaddle中,`recurrent_group`的一个简单调用如下: - -``` python -recurrent_group(step, input, reverse) -``` -- step:一个可调用的函数,定义一个时间步之内RNN单元完成的计算 -- input:输入,必须是一个单层序列,或者一个双层序列 -- reverse:是否以逆序处理输入序列 - -使用`recurrent_group`的核心是设计step函数的计算逻辑。step函数内部可以自由组合PaddlePaddle支持的各种layer,完成任意的运算逻辑。`recurrent_group` 的输入(即input)会成为step函数的输入,由于step 函数只关注于RNN一个时间步之内的计算,在这里`recurrent_group`替我们完成了原始输入数据的拆分。 - -### 输入 -`recurrent_group`处理的输入序列主要分为以下三种类型: - -- **数据输入**:一个双层序列进入`recurrent_group`会被拆解为一个单层序列,一个单层序列进入`recurrent_group`会被拆解为非序列,然后交给step函数,这一过程对用户是完全透明的。可以有以下两种:1)通过data_layer拿到的用户输入;2)其它layer的输出。 - -- **只读Memory输入**:`StaticInput` 定义了一个只读的Memory,由`StaticInput`指定的输入不会被`recurrent_group`拆解,`recurrent_group` 循环展开的每个时间步总是能够引用所有输入,可以是一个非序列,或者一个单层序列。 - -- **序列生成任务的输入**:`GeneratedInput`只用于在序列生成任务中指定输入数据。 - -### 输入示例 - -序列生成任务大多遵循encoder-decoer架构,encoder和decoder可以是能够处理序列的任意神经网络单元,而RNN是最流行的选择。 - -给定encoder输出和当前词,decoder每次预测产生下一个最可能的词语。在这种结构中,decoder接受两个输入: - -- 要生成的目标序列:是decoder的数据输入,也是decoder循环展开的依据,`recurrent_group`会对这类输入进行拆解。 - -- encoder输出,可以是一个非序列,或者一个单层序列:是一个unbounded memory,decoder循环展开的每一个时间步会引用全部结果,不应该被拆解,这种类型的输入必须通过`StaticInput`指定。关于Unbounded Memory的更多讨论请参考论文 [Neural Turning Machine](https://arxiv.org/abs/1410.5401)。 - -在序列生成任务中,decoder RNN总是引用上一时刻预测出的词的词向量,作为当前时刻输入。`GeneratedInput`自动完成这一过程。 - -### 输出 -`step`函数必须返回一个或多个Layer的输出,这个Layer的输出会作为整个`recurrent_group` 最终的输出结果。在输出的过程中,`recurrent_group` 会将每个时间步的输出拼接,这个过程对用户也是透明的。 - -### memory -memory只能在`recurrent_group`中定义和使用。memory不能独立存在,必须指向一个PaddlePaddle定义的Layer。引用memory得到这layer上一时刻输出,因此,可以将memory理解为一个时延操作。 - -可以显示地指定一个layer的输出用于初始化memory。不指定时,memory默认初始化为0。 - -## 双层RNN介绍 -`recurrent_group`帮助我们完成对输入序列的拆分,对输出的合并,以及计算逻辑在序列上的循环展开。 - -利用这种特性,两个嵌套的`recurrent_group`能够处理双层序列,实现词语和句子两个级别的双层RNN结构。 - -- 单层(word-level)RNN:每个状态(state)对应一个词(word)。 -- 双层(sequence-level)RNN:一个双层RNN由多个单层RNN组成,每个单层RNN(即双层RNN的每个状态)对应一个子句(subseq)。 - -为了描述方便,下文以NLP任务为例,将含有子句(subseq)的段落定义为一个双层序列,将含有词语的句子定义为一个单层序列,那么0层序列即为一个词语。 - -## 双层RNN的使用 - -### 训练流程的使用方法 -使用 `recurrent_group`需要遵循以下约定: - -- **单进单出**:输入和输出都是单层序列。 - - 如果有多个输入,不同输入序列含有的词语数必须严格相等。 - - 输出一个单层序列,输出序列的词语数和输入序列一致。 - - memory:在step函数中定义 memory指向一个layer,通过引用memory得到这个layer上一个时刻输出,形成recurrent 连接。memory的is_seq参数必须为false。如果没有定义memory,每个时间步之内的运算是独立的。 - - boot_layer:memory的初始状态,默认初始状为0,memory的is_seq参数必须为false。 - -- **双进双出**:输入和输出都是双层序列。 - - 如果有多个输入序列,不同输入含有的子句(subseq)数必须严格相等,但子句含有的词语数可以不相等。 - - 输出一个双层序列,子句(subseq)数、子句的单词数和指定的一个输入序列一致,默认为第一个输入。 - - memory:在step函数中定义memory,指向一个layer,通过引用memory得到这个layer上一个时刻的输出,形成recurrent连接。定义在外层`recurrent_group` step函数中的memory,能够记录上一个subseq 的状态,可以是一个单层序列(只作为read-only memory),也可以是一个词语。如果没有定义memory,那么 subseq 之间的运算是独立的。 - - boot_layer:memory 初始状态,可以是一个单层序列(只作为read-only memory)或一个向量。默认不设置,即初始状态为0。 - -- **双进单出**:目前还未支持,会报错"In hierachical RNN, all out links should be from sequences now"。 - - -### 生成流程的使用方法 -使用`beam_search`需要遵循以下约定: - -- 单层RNN:从一个word生成下一个word。 -- 双层RNN:即把单层RNN生成后的subseq给拼接成一个新的双层seq。从语义上看,也不存在一个subseq直接生成下一个subseq的情况。 diff --git a/doc/v2/howto/rnn/recurrent_group_en.md b/doc/v2/howto/rnn/recurrent_group_en.md deleted file mode 100644 index de6b60f29eb97029a54609cd2194bb7faf3ffec5..0000000000000000000000000000000000000000 --- a/doc/v2/howto/rnn/recurrent_group_en.md +++ /dev/null @@ -1,96 +0,0 @@ -# Recurrent Group Tutorial - -## Overview - -Sequential data is common in natural language processing. - -A sentence is a sequence of words and many sentences form a paragraph further. Therefore, a paragraph can be viewed as a nested sequence with two level, where each element of the sequence is another sequence. That is to say, sequential data could be recursive. An example of two-level recursive sequential data is that an article is composed of a sequence of sentences, and each sentence a sequence of words. - -PaddlePaddle and PaddlePaddle v2 support two-level recursive sequential data. The two-level sequence is a very flexible data, which helps us to better describe more complex language data such as discribing paragraphs and several rounds of dialogues. Based on two-level sequence input, we can design and build a flexible, hierarchical RNN model that encodes input data from the word and sentence level. For the support of arbitrary levels, please refer to PaddlePaddle Fluid. - -In PaddlePaddle, `recurrent_group` is an arbitrarily complex RNN unit. The user only needs to define the calculation that the RNN will complete in one time step. PaddlePaddle is responsible for the propagation of information and error in time series. - -Furthermore, `recurrent_group` can also be extended to handle two-level sequence. By defining two nested `recurrent_group` operations at the clause level and the word level respectively, a hierarchical and complex RNN is finally achieved. - -Currently, in the PaddlePaddle, there are `recurrent_group` and some Layers that can process bidirectional sequences. For details, refer to the document: Layers for supporting double-layer sequences as input. - -## Related Concepts - -### Basic Principle -`recurrent_group` is an arbitrarily complex RNN unit supported by PaddlePaddle. The user only needs to focus on the calculations that the RNN is designed to complete within a single time step. The PaddlePaddle is responsible for completing the propagation of information and gradients over time. - -In PaddlePaddle, a simple call to `recurrent_group` is as follows: - -``` python -recurrent_group(step, input, reverse) -``` -- step: A callable function that defines the calculations completed by the RNN unit within a time step -- input: The input must be a single-layer sequence or a double-layer sequence -- reverse: Whether to process the input sequence in reverse order - -The core of using `recurrent_group` is to design the logic of the step function. The step function can be freely combined with various layers supported by PaddlePaddle to complete arbitrary arithmetic logic. The input of `recurrent_group` (input) becomes the input of the step function. Since the step function only focuses on the calculation within one time step of RNN, here `recurrent_group` completes the splitting of the original input data for us. - -### Input -The input sequence processed by `recurrent_group` is mainly divided into the following three types: - -- **Input Data**: When putting a two-level sequence into `recurrent_group`, it will be disassembled into a single-level sequence. When putting a single-level sequence into `recurrent_group`, it will be disassembled into a non-sequence and then passed to the step function. This process is completely transparent to the user. There are two possible types: 1) User input via data_layer; 2) Output from other layers. - -- **Read-only Memory Input**: `StaticInput` defines a read-only Memory. The input specified by `StaticInput` will not be disassembled by `recurrent_group`, and each time step of the `recurrent_group` loop will always be able to reference all inputs. It may be a non-sequence or a single-layer sequence. - -- **Input of Sequence Generation Task**: `GeneratedInput` is only used to specify input data in a sequence generation task. - -### Input Example - -Sequence generation tasks mostly follow the encoder-decoer architecture. The encoder and decoder can be arbitrary neural network units capable of processing sequences and RNN is the most popular choice. - -Given the encoder output and the current word, the decoder predicts the next most likely word each time. In this structure, the decoder accepts two inputs: - -- Target sequence to be generated: a input of the decoder and the basis of the decoder loop. `recurrent_group` will disassemble this input type. - -- Encoder output, an non-sequencce or single-sequence: a unbounded memory. Each time step in the decoder loop will reference the entire result and should not be disassembled. This type of input must be specified via `StaticInput`. For more discussion on Unbounded Memory, please refer to the paper [Neural Turning Machine](https://arxiv.org/abs/1410.5401). - -In a sequence generation task, the decoder RNN always refers to the word vector of the word predicted at the previous moment as the current time input. `GeneratedInput` will automate this process. - -### Output -The `step` function must return the output of one or more Layers. The output of this Layer will be the final output of the entire `recurrent_group`. In the output process, `recurrent_group` will concatenate the output of each time step, which is also transparent to the user. - -### Memory -Memory can only be defined and used in `recurrent_group`. Memory cannot exist independently and must point to a layer defined by PaddlePaddle. Memory is referenced to get a momentary output from this layer, so memory can be interpreted as a delay operation. - -The user can explicitly specify the output of a layer to initialize the memory. When not specified, memory is initialized to 0 by default. - -## Sequence-level RNN Introduction - -`recurrent_group` helps us to split the input sequence, merge the output, and loop through the sequence of computational logic. - -Using this feature, the two nested `recurrent_group` can handle the nested two-level sequences, implementing sequence-level RNN structures at both the word and sentence levels. - -- Word-level RNN: each state corresponds to a word. -- Sequence-level RNN: a sequence-layer RNN consists of multiple word-layer RNNs. Each word-layer RNN (ie, each state of a sequence-layer RNN) has a subsequence. - -For convenience of description, the following takes the NLP task as an example. A paragraph containing a subsequence is defined as a two-level sequence, and a sentence containing a word is defined as a single-layer sequence. Then, the zero-level sequence is a word. - -## Usage of Sequence-level RNN - -### Usage of Training Process -Using `recurrent_group` requires the following conventions: - -- **Single-input Single-output**: Both input and output are single layer sequences. - - If there are multiple inputs, the number of words in different input sequences must be exactly equal. - - A single-layer sequence is output, and the number of words in the output sequence is the same as the input sequence. - - memory: define memory to point to a layer in the step function, get a moment output from this layer by referencing memory to form a recurrent connection. The is_seq parameter of memory must be false. If memory is not defined, the operations within each time step are independent. - - boot_layer: the initial state of memory, set 0 by default. is_seq in memory must be false. - -- **Double-input Double-output**: Both input and output are two-level sequence. - - If there are multiple input sequences, the number of subsequence contained in different inputs must be strictly equal, but the number of words in the subsequence may not be equal. - - output a two-level sequence. The number of subsequence and the number of words are the same as the specified input sequence and the first input is default. - - memory: defining memory in the step function, pointing to a layer, by referring to the memory to get the output of this layer at a time, forming a recurrent connection. The memory defined in the outer `recurrent_group` step function can record the state of the previous subsequence, either as a single-level sequence (only as read-only memory) or as a word. If memory is not defined, the operations between subsequence are independent. - - boot_layer: the initial state of memory. It is either a single-level sequence (only as read-only memory) or a vector. The default is not set, that is, the initial state is 0. - -- **Double-input Single-output**: not support for now, and output the error with "In hierachical RNN, all out links should be from sequences now". - -### Usage of Generation Process -Using `beam_search` need follow those conventions: - -- Word-level RNN: generate the next word from a word. -- Sequence-level RNN: the single-layer RNN generated subsequence is concatenated into a new double-layer sequence. Semantically, there is no case where a subsequence generates the next subseq directly. diff --git a/doc/v2/howto/rnn/rnn_config_cn.rst b/doc/v2/howto/rnn/rnn_config_cn.rst deleted file mode 100644 index 63fa161fafed0f3a8ec8799af21304cbec62d813..0000000000000000000000000000000000000000 --- a/doc/v2/howto/rnn/rnn_config_cn.rst +++ /dev/null @@ -1,261 +0,0 @@ -RNN配置 -======== - -本教程将指导你如何在 PaddlePaddle -中配置循环神经网络(RNN)。PaddlePaddle -高度支持灵活和高效的循环神经网络配置。 在本教程中,您将了解如何: - -- 配置循环神经网络架构。 -- 使用学习完成的循环神经网络模型生成序列。 - -我们将使用 vanilla 循环神经网络和 sequence to sequence -模型来指导你完成这些步骤。sequence to sequence -模型的代码可以在 `book/08.machine_translation `_ 找到。 -wmt14数据的提供文件在 `python/paddle/v2/dataset/wmt14.py `_ 。 - -配置循环神经网络架构 --------------------- - -简单门控循环神经网络(Gated Recurrent Neural Network) -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -循环神经网络在每个时间步骤顺序地处理序列。下面列出了 LSTM 的架构的示例。 - -.. image:: src/bi_lstm.jpg - :align: center - -一般来说,循环网络从 :math:`t=1` 到 :math:`t=T` 或者反向地从 :math:`t=T` 到 :math:`t=1` 执行以下操作。 - -.. math:: - - x_{t+1} = f_x(x_t), y_t = f_y(x_t) - -其中 :math:`f_x(.)` 称为\ **单步函数**\ (即单时间步执行的函数,step -function),而 :math:`f_y(.)` 称为\ **输出函数**\ 。在 vanilla -循环神经网络中,单步函数和输出函数都非常简单。然而,PaddlePaddle -可以通过修改这两个函数来实现复杂的网络配置。我们将使用 sequence to -sequence -模型演示如何配置复杂的循环神经网络模型。在本节中,我们将使用简单的 -vanilla -循环神经网络作为使用\ ``recurrent_group``\ 配置简单循环神经网络的例子。 -注意,如果你只需要使用简单的RNN,GRU或LSTM,那么推荐使用\ ``grumemory``\ 和\ ``lstmemory``\ ,因为它们的计算效率比\ ``recurrent_group``\ 更高。 - -对于 vanilla RNN,在每个时间步长,\ **单步函数**\ 为: - -.. math:: - - x_{t+1} = W_x x_t + W_i I_t + b - -其中 :math:`x_t` 是RNN状态,并且 :math:`I_t` 是输入,:math:`W_x` 和 -:math:`W_i` 分别是RNN状态和输入的变换矩阵。:math:`b` 是偏差。它的\ **输出函数**\ 只需要 :math:`x_t` 作为输出。 - -``recurrent_group``\ 是构建循环神经网络的最重要的工具。 -它定义了\ **单步函数**\ ,\ **输出函数**\ 和循环神经网络的输入。注意,这个函数的\ ``step``\ 参数需要实现\ ``step function``\ (单步函数)和\ ``output function``\ (输出函数): - -.. code:: python - - def simple_rnn(input, - size=None, - name=None, - reverse=False, - rnn_bias_attr=None, - act=None, - rnn_layer_attr=None): - def __rnn_step__(ipt): - out_mem = paddle.layer.memory(name=name, size=size) - rnn_out = paddle.layer.mixed(input = [paddle.layer.full_matrix_projection(input=ipt), - paddle.layer.full_matrix_projection(input=out_mem)], - name = name, - bias_attr = rnn_bias_attr, - act = act, - layer_attr = rnn_layer_attr, - size = size) - return rnn_out - return paddle.layer.recurrent_group(name='%s_recurrent_group' % name, - step=__rnn_step__, - reverse=reverse, - input=input) - -PaddlePaddle -使用“Memory”(记忆模块)实现单步函数。\ **Memory**\ 是在PaddlePaddle中构造循环神经网络时最重要的概念。 -Memory是在单步函数中循环使用的状态,例如 :math:`x_{t+1} = f_x(x_t)` 。 -一个Memory包含\ **输出**\ 和\ **输入**\ 。当前时间步处的Memory的输出作为下一时间步Memory的输入。Memory也可以具有\ **boot -layer(引导层)**\ ,其输出被用作Memory的初始值。 -在我们的例子中,门控循环单元的输出被用作输出Memory。请注意,\ ``rnn_out``\ 层的名称与\ ``out_mem``\ 的名称相同。这意味着\ ``rnn_out`` -(*x*\ \ *t* + 1)的输出被用作\ ``out_mem``\ Memory的\ **输出**\ 。 - -Memory也可以是序列。在这种情况下,在每个时间步中,我们有一个序列作为循环神经网络的状态。这在构造非常复杂的循环神经网络时是有用的。 -其他高级功能包括定义多个Memory,以及使用子序列来定义分级循环神经网络架构。 - -我们在函数的结尾返回\ ``rnn_out``\ 。 这意味着 ``rnn_out`` -层的输出被用作门控循环神经网络的\ **输出**\ 函数。 - -Sequence to Sequence Model with Attention -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -我们将使用 sequence to sequence model with attention -作为例子演示如何配置复杂的循环神经网络模型。该模型的说明如下图所示。 - -.. image:: src/encoder-decoder-attention-model.png - :align: center - -在这个模型中,源序列 :math:`S = \{s_1, \dots, s_T\}` -用双向门控循环神经网络编码。双向门控循环神经网络的隐藏状态 -:math:`H_S = \{H_1, \dots, H_T\}` 被称为 -*编码向量*\ 。解码器是门控循环神经网络。当解读每一个 :math:`y_t` 时, -这个门控循环神经网络生成一系列权重 :math:`W_S^t = \{W_1^t, \dots, W_T^t\}` , -用于计算编码向量的加权和。加权和用来生成 :math:`y_t` 。 - -模型的编码器部分如下所示。它叫做\ ``grumemory``\ 来表示门控循环神经网络。如果网络架构简单,那么推荐使用循环神经网络的方法,因为它比 -``recurrent_group`` -更快。我们已经实现了大多数常用的循环神经网络架构,可以参考 :ref:`api_trainer_config_helpers_layers` 了解更多细节。 - -我们还将编码向量投射到 ``decoder_size`` -维空间。这通过获得反向循环网络的第一个实例,并将其投射到 -``decoder_size`` 维空间完成: - -.. code:: python - - # 定义源语句的数据层 - src_word_id = paddle.layer.data( - name='source_language_word', - type=paddle.data_type.integer_value_sequence(source_dict_dim)) - # 计算每个词的词向量 - src_embedding = paddle.layer.embedding( - input=src_word_id, - size=word_vector_dim, - param_attr=paddle.attr.ParamAttr(name='_source_language_embedding')) - # 应用前向循环神经网络 - src_forward = paddle.networks.simple_gru( - input=src_embedding, size=encoder_size) - # 应用反向递归神经网络(reverse=True表示反向循环神经网络) - src_backward = paddle.networks.simple_gru( - input=src_embedding, size=encoder_size, reverse=True) - # 将循环神经网络的前向和反向部分混合在一起 - encoded_vector = paddle.layer.concat(input=[src_forward, src_backward]) - - # 投射编码向量到 decoder_size - encoded_proj = paddle.layer.mixed( - size=decoder_size, - input=paddle.layer.full_matrix_projection(encoded_vector)) - - # 计算反向RNN的第一个实例 - backward_first = paddle.layer.first_seq(input=src_backward) - - # 投射反向RNN的第一个实例到 decoder size - decoder_boot = paddle.layer.mixed( - size=decoder_size, - act=paddle.activation.Tanh(), - input=paddle.layer.full_matrix_projection(backward_first)) - -解码器使用 ``recurrent_group`` 来定义循环神经网络。单步函数和输出函数在 -``gru_decoder_with_attention`` 中定义: - -.. code:: python - - group_input1 = paddle.layer.StaticInput(input=encoded_vector, is_seq=True) - group_input2 = paddle.layer.StaticInput(input=encoded_proj, is_seq=True) - group_inputs = [group_input1, group_input2] - trg_embedding = paddle.layer.embedding( - input=paddle.layer.data( - name='target_language_word', - type=paddle.data_type.integer_value_sequence(target_dict_dim)), - size=word_vector_dim, - param_attr=paddle.attr.ParamAttr(name='_target_language_embedding')) - group_inputs.append(trg_embedding) - group_inputs.append(trg_embedding) - - # 对于配备有注意力机制的解码器,在训练中, - # 目标向量(groudtruth)是数据输入, - # 而源序列的编码向量可以被无边界的memory访问 - # StaticInput 意味着不同时间步的输入都是相同的值, - # 否则它以一个序列输入,不同时间步的输入是不同的。 - # 所有输入序列应该有相同的长度。 - decoder = paddle.layer.recurrent_group( - name=decoder_group_name, - step=gru_decoder_with_attention, - input=group_inputs) - -单步函数的实现如下所示。首先,它定义解码网络的\ **Memory**\ 。然后定义 -attention,门控循环单元单步函数和输出函数: - -.. code:: python - - def gru_decoder_with_attention(enc_vec, enc_proj, current_word): - # 定义解码器的Memory - # Memory的输出定义在 gru_step 内 - # 注意 gru_step 应该与它的Memory名字相同 - decoder_mem = paddle.layer.memory( - name='gru_decoder', size=decoder_size, boot_layer=decoder_boot) - # 计算 attention 加权编码向量 - context = paddle.networks.simple_attention( - encoded_sequence=enc_vec, - encoded_proj=enc_proj, - decoder_state=decoder_mem) - # 混合当前词向量和attention加权编码向量 - decoder_inputs = paddle.layer.mixed( - size=decoder_size * 3, - input=[ - paddle.layer.full_matrix_projection(input=context), - paddle.layer.full_matrix_projection(input=current_word) - ]) - # 定义门控循环单元循环神经网络单步函数 - gru_step = paddle.layer.gru_step( - name='gru_decoder', - input=decoder_inputs, - output_mem=decoder_mem, - size=decoder_size) - # 定义输出函数 - out = paddle.layer.mixed( - size=target_dict_dim, - bias_attr=True, - act=paddle.activation.Softmax(), - input=paddle.layer.full_matrix_projection(input=gru_step)) - return out - -生成序列 --------- - -训练模型后,我们可以使用它来生成序列。通常的做法是使用\ **beam search** -生成序列。以下代码片段定义 beam search 算法。注意,\ ``beam_search`` -函数假设 ``step`` 的输出函数返回的是下一个时刻输出词的 softmax -归一化概率向量。我们对模型进行了以下更改。 - -- 使用 ``GeneratedInput`` 来表示 trg\_embedding。 ``GeneratedInput`` - 将上一时间步所生成的词的向量来作为当前时间步的输入。 -- 使用 ``beam_search`` 函数。这个函数需要设置: - - - ``bos_id``: 开始标记。每个句子都以开始标记开头。 - - ``eos_id``: 结束标记。每个句子都以结束标记结尾。 - - ``beam_size``: beam search 算法中的beam大小。 - - ``max_length``: 生成序列的最大长度。 - -代码如下: - -.. code:: python - - group_input1 = paddle.layer.StaticInput(input=encoded_vector, is_seq=True) - group_input2 = paddle.layer.StaticInput(input=encoded_proj, is_seq=True) - group_inputs = [group_input1, group_input2] - # 在生成时,解码器基于编码源序列和最后生成的目标词预测下一目标词。 - # 编码源序列(编码器输出)必须由只读Memory的 StaticInput 指定。 - # 这里, GeneratedInputs 自动获取上一个生成的词,并在最开始初始化为起始词,如 。 - trg_embedding = paddle.layer.GeneratedInput( - size=target_dict_dim, - embedding_name='_target_language_embedding', - embedding_size=word_vector_dim) - group_inputs.append(trg_embedding) - beam_gen = paddle.layer.beam_search( - name=decoder_group_name, - step=gru_decoder_with_attention, - input=group_inputs, - bos_id=0, # Beginnning token. - eos_id=1, # End of sentence token. - beam_size=beam_size, - max_length=max_length) - - return beam_gen - -注意,这种生成技术只用于类似解码器的生成过程。如果你正在处理序列标记任务,请参阅 `book/06.understand_sentiment `_ 了解更多详细信息。 - -完整的配置文件在 `book/08.machine_translation/train.py `_ 。 diff --git a/doc/v2/howto/rnn/rnn_config_en.rst b/doc/v2/howto/rnn/rnn_config_en.rst deleted file mode 100644 index f92edd108ff5c10a31b5f181f0f6dcb7a3f119f3..0000000000000000000000000000000000000000 --- a/doc/v2/howto/rnn/rnn_config_en.rst +++ /dev/null @@ -1,235 +0,0 @@ -RNN Configuration -================= - -This tutorial will guide you how to configure recurrent neural network in PaddlePaddle. PaddlePaddle supports highly flexible and efficient recurrent neural network configuration. In this tutorial, you will learn how to: - -- configure recurrent neural network architecture. -- generate sequence with learned recurrent neural network models. - -We will use vanilla recurrent neural network, and sequence to sequence model to guide you through these steps. The code of sequence to sequence model can be found at `book/08.machine_translation `_ . -And the data preparation of this model can be found at `python/paddle/v2/dataset/wmt14.py `_ - -=============================================== -Configure Recurrent Neural Network Architecture -=============================================== - -------------------------------------- -Simple Gated Recurrent Neural Network -------------------------------------- - -Recurrent neural network process a sequence at each time step sequentially. An example of the architecture of LSTM is listed below. - -.. image:: src/bi_lstm.jpg - :align: center - -Generally speaking, a recurrent network perform the following operations from :math:`t=1` to :math:`t=T`, or reversely from :math:`t=T` to :math:`t=1`. - -.. math:: - - x_{t+1} = f_x(x_t), y_t = f_y(x_t) - - -where :math:`f_x(.)` is called **step function**, and :math:`f_y(.)` is called **output function**. In vanilla recurrent neural network, both of the step function and output function are very simple. However, PaddlePaddle supports the configuration of very complex architectures by modifying these two functions. We will use the sequence to sequence model with attention as an example to demonstrate how you can configure complex recurrent neural network models. In this section, we will use a simple vanilla recurrent neural network as an example of configuring simple recurrent neural network using :code:`recurrent_group`. Notice that if you only need to use simple RNN, GRU, or LSTM, then :code:`grumemory` and :code:`lstmemory` is recommended because they are more computationally efficient than :code:`recurrent_group`. - -For vanilla RNN, at each time step, the **step function** is: - -.. math:: - - x_{t+1} = W_x x_t + W_i I_t + b - -where :math:`x_t` is the RNN state, and :math:`I_t` is the input, :math:`W_x` and :math:`W_i` are transformation matrices for RNN states and inputs, respectively. :math:`b` is the bias. -Its **output function** simply takes :math:`x_t` as the output. - -:code:`recurrent_group` is the most important tools for constructing recurrent neural networks. It defines the **step function**, **output function** and the inputs of the recurrent neural network. Notice that the :code:`step` argument of this function implements both the :code:`step function` and the :code:`output function`: - -.. code-block:: python - - def simple_rnn(input, - size=None, - name=None, - reverse=False, - rnn_bias_attr=None, - act=None, - rnn_layer_attr=None): - def __rnn_step__(ipt): - out_mem = paddle.layer.memory(name=name, size=size) - rnn_out = paddle.layer.mixed(input = [paddle.layer.full_matrix_projection(input=ipt), - paddle.layer.full_matrix_projection(input=out_mem)], - name = name, - bias_attr = rnn_bias_attr, - act = act, - layer_attr = rnn_layer_attr, - size = size) - return rnn_out - return paddle.layer.recurrent_group(name='%s_recurrent_group' % name, - step=__rnn_step__, - reverse=reverse, - input=input) - - -PaddlePaddle uses memory to construct step function. **Memory** is the most important concept when constructing recurrent neural networks in PaddlePaddle. A memory is a state that is used recurrently in step functions, such as :math:`x_{t+1} = f_x(x_t)`. One memory contains an **output** and a **input**. The output of memory at the current time step is utilized as the input of the memory at the next time step. A memory can also has a **boot layer**, whose output is utilized as the initial value of the memory. In our case, the output of the gated recurrent unit is employed as the output memory. Notice that the name of the layer :code:`rnn_out` is the same as the name of :code:`out_mem`. This means the output of the layer :code:`rnn_out` (:math:`x_{t+1}`) is utilized as the **output** of :code:`out_mem` memory. - -A memory can also be a sequence. In this case, at each time step, we have a sequence as the state of the recurrent neural network. This can be useful when constructing very complex recurrent neural network. Other advanced functions include defining multiple memories, and defining hierarchical recurrent neural network architecture using sub-sequence. - -We return :code:`rnn_out` at the end of the function. It means that the output of the layer :code:`rnn_out` is utilized as the **output** function of the gated recurrent neural network. - ------------------------------------------ -Sequence to Sequence Model with Attention ------------------------------------------ -We will use the sequence to sequence model with attention as an example to demonstrate how you can configure complex recurrent neural network models. An illustration of the sequence to sequence model with attention is shown in the following figure. - -.. image:: src/encoder-decoder-attention-model.png - :align: center - -In this model, the source sequence :math:`S = \{s_1, \dots, s_T\}` is encoded with a bidirectional gated recurrent neural networks. The hidden states of the bidirectional gated recurrent neural network :math:`H_S = \{H_1, \dots, H_T\}` is called *encoder vector* The decoder is a gated recurrent neural network. When decoding each token :math:`y_t`, the gated recurrent neural network generates a set of weights :math:`W_S^t = \{W_1^t, \dots, W_T^t\}`, which are used to compute a weighted sum of the encoder vector. The weighted sum of the encoder vector is utilized to condition the generation of the token :math:`y_t`. - -The encoder part of the model is listed below. It calls :code:`grumemory` to represent gated recurrent neural network. It is the recommended way of using recurrent neural network if the network architecture is simple, because it is faster than :code:`recurrent_group`. We have implemented most of the commonly used recurrent neural network architectures, you can refer to :ref:`api_trainer_config_helpers_layers` for more details. - -We also project the encoder vector to :code:`decoder_size` dimensional space, get the first instance of the backward recurrent network, and project it to :code:`decoder_size` dimensional space: - -.. code-block:: python - - # Define the data layer of the source sentence. - src_word_id = paddle.layer.data( - name='source_language_word', - type=paddle.data_type.integer_value_sequence(source_dict_dim)) - # Calculate the word embedding of each word. - src_embedding = paddle.layer.embedding( - input=src_word_id, - size=word_vector_dim, - param_attr=paddle.attr.ParamAttr(name='_source_language_embedding')) - # Apply forward recurrent neural network. - src_forward = paddle.networks.simple_gru( - input=src_embedding, size=encoder_size) - # Apply backward recurrent neural network. reverse=True means backward recurrent neural network. - src_backward = paddle.networks.simple_gru( - input=src_embedding, size=encoder_size, reverse=True) - # Mix the forward and backward parts of the recurrent neural network together. - encoded_vector = paddle.layer.concat(input=[src_forward, src_backward]) - - # Project encoding vector to decoder_size. - encoded_proj = paddle.layer.mixed( - size=decoder_size, - input=paddle.layer.full_matrix_projection(encoded_vector)) - - # Compute the first instance of the backward RNN. - backward_first = paddle.layer.first_seq(input=src_backward) - - # Project the first instance of backward RNN to decoder size. - decoder_boot = paddle.layer.mixed( - size=decoder_size, - act=paddle.activation.Tanh(), - input=paddle.layer.full_matrix_projection(backward_first)) - - -The decoder uses :code:`recurrent_group` to define the recurrent neural network. The step and output functions are defined in :code:`gru_decoder_with_attention`: - -.. code-block:: python - - group_input1 = paddle.layer.StaticInput(input=encoded_vector, is_seq=True) - group_input2 = paddle.layer.StaticInput(input=encoded_proj, is_seq=True) - group_inputs = [group_input1, group_input2] - trg_embedding = paddle.layer.embedding( - input=paddle.layer.data( - name='target_language_word', - type=paddle.data_type.integer_value_sequence(target_dict_dim)), - size=word_vector_dim, - param_attr=paddle.attr.ParamAttr(name='_target_language_embedding')) - group_inputs.append(trg_embedding) - group_inputs.append(trg_embedding) - - # For decoder equipped with attention mechanism, in training, - # target embedding (the groudtruth) is the data input, - # while encoded source sequence is accessed to as an unbounded memory. - # StaticInput means the same value is utilized at different time steps. - # Otherwise, it is a sequence input. Inputs at different time steps are different. - # All sequence inputs should have the same length. - decoder = paddle.layer.recurrent_group( - name=decoder_group_name, - step=gru_decoder_with_attention, - input=group_inputs) - - -The implementation of the step function is listed as below. First, it defines the **memory** of the decoder network. Then it defines attention, gated recurrent unit step function, and the output function: - -.. code-block:: python - - def gru_decoder_with_attention(enc_vec, enc_proj, current_word): - # Defines the memory of the decoder. - # The output of this memory is defined in gru_step. - # Notice that the name of gru_step should be the same as the name of this memory. - decoder_mem = paddle.layer.memory( - name='gru_decoder', size=decoder_size, boot_layer=decoder_boot) - # Compute attention weighted encoder vector. - context = paddle.networks.simple_attention( - encoded_sequence=enc_vec, - encoded_proj=enc_proj, - decoder_state=decoder_mem) - # Mix the current word embedding and the attention weighted encoder vector. - decoder_inputs = paddle.layer.mixed( - size=decoder_size * 3, - input=[ - paddle.layer.full_matrix_projection(input=context), - paddle.layer.full_matrix_projection(input=current_word) - ]) - # Define Gated recurrent unit recurrent neural network step function. - gru_step = paddle.layer.gru_step( - name='gru_decoder', - input=decoder_inputs, - output_mem=decoder_mem, - size=decoder_size) - # Defines the output function. - out = paddle.layer.mixed( - size=target_dict_dim, - bias_attr=True, - act=paddle.activation.Softmax(), - input=paddle.layer.full_matrix_projection(input=gru_step)) - return out - - -================= -Generate Sequence -================= -After training the model, we can use it to generate sequences. A common practice is to use **beam search** to generate sequences. The following code snippets defines a beam search algorithm. Notice that :code:`beam_search` function assumes the output function of the :code:`step` returns a softmax normalized probability vector of the next token. We made the following changes to the model. - -* use :code:`GeneratedInput` for trg_embedding. :code:`GeneratedInput` computes the embedding of the generated token at the last time step for the input at the current time step. -* use :code:`beam_search` function. This function needs to set: - - - :code:`bos_id`: the start token. Every sentence starts with the start token. - - :code:`eos_id`: the end token. Every sentence ends with the end token. - - :code:`beam_size`: the beam size used in beam search. - - :code:`max_length`: the maximum length of the generated sentences. - -The code is listed below: - -.. code-block:: python - - group_input1 = paddle.layer.StaticInput(input=encoded_vector, is_seq=True) - group_input2 = paddle.layer.StaticInput(input=encoded_proj, is_seq=True) - group_inputs = [group_input1, group_input2] - # In generation, decoder predicts a next target word based on - # the encoded source sequence and the last generated target word. - # The encoded source sequence (encoder's output) must be specified by - # StaticInput which is a read-only memory. - # Here, GeneratedInputs automatically fetchs the last generated word, - # which is initialized by a start mark, such as . - trg_embedding = paddle.layer.GeneratedInput( - size=target_dict_dim, - embedding_name='_target_language_embedding', - embedding_size=word_vector_dim) - group_inputs.append(trg_embedding) - beam_gen = paddle.layer.beam_search( - name=decoder_group_name, - step=gru_decoder_with_attention, - input=group_inputs, - bos_id=0, # Beginnning token. - eos_id=1, # End of sentence token. - beam_size=beam_size, - max_length=max_length) - - return beam_gen - - -Notice that this generation technique is only useful for decoder like generation process. If you are working on sequence tagging tasks, please refer to `book/06.understand_sentiment `_ for more details. - -The full configuration file is located at `book/08.machine_translation/train.py `_ . diff --git a/doc/v2/howto/rnn/src/bi_lstm.jpg b/doc/v2/howto/rnn/src/bi_lstm.jpg deleted file mode 100644 index adec1606d64d6e35ffe7e62abfa9a09309b05c84..0000000000000000000000000000000000000000 Binary files a/doc/v2/howto/rnn/src/bi_lstm.jpg and /dev/null differ diff --git a/doc/v2/howto/rnn/src/encoder-decoder-attention-model.png b/doc/v2/howto/rnn/src/encoder-decoder-attention-model.png deleted file mode 100644 index 79f911d4ba12ac0c0d1a936c9df639c302786914..0000000000000000000000000000000000000000 Binary files a/doc/v2/howto/rnn/src/encoder-decoder-attention-model.png and /dev/null differ diff --git a/doc/v2/howto/rnn/src/glossary_rnn.dot b/doc/v2/howto/rnn/src/glossary_rnn.dot deleted file mode 100644 index 2cd0fb1820c44b0e8e0b869f9d39fcad27efa758..0000000000000000000000000000000000000000 --- a/doc/v2/howto/rnn/src/glossary_rnn.dot +++ /dev/null @@ -1,42 +0,0 @@ -digraph G{ - subgraph cluster_timestep0 { - label="recurrent timestep i-1" - bgcolor=lightgray - node [style=filled,color=white] - fc0_0 [label="fc 0"] - fc0_1 [label="fc 1"] - fc0_2 [label="fc 2"] - - fc0_0 -> fc0_1 - fc0_1 -> fc0_2 - } - - subgraph cluster_timestep1 { - label="recurrent timestep i" - node [style=filled]; - fc1_0 [label="fc 0"] - fc1_1 [label="fc 1"] - fc1_2 [label="fc 2"] - color=blue - - fc1_0 -> fc1_1 - fc1_1 -> fc1_2 - } - - subgraph cluster_timestep2 { - label="recurrent timestep i+1" - bgcolor=lightgray - node [style=filled,color=white] - fc2_0 [label="fc 0"] - fc2_1 [label="fc 1"] - fc2_2 [label="fc 2"] - - fc2_0 -> fc2_1 - fc2_1 -> fc2_2 - } - - - fc0_1 -> fc1_1 [style="dotted" constraint=false] - fc1_1 -> fc2_1 [style="dotted" constraint=false] - -} \ No newline at end of file diff --git a/doc/v2/howto/rnn/src/glossary_rnn_with_memory.dot b/doc/v2/howto/rnn/src/glossary_rnn_with_memory.dot deleted file mode 100644 index 0f101ec2d8f15aec76c57f328046b6b55cf0c7eb..0000000000000000000000000000000000000000 --- a/doc/v2/howto/rnn/src/glossary_rnn_with_memory.dot +++ /dev/null @@ -1,48 +0,0 @@ -digraph G{ - subgraph cluster_timestep0 { - label="recurrent timestep i-1" - bgcolor=lightgray - node [style=filled,color=white] - fc0_0 [label="fc 0"] - fc0_1 [label="fc 1"] - fc0_2 [label="fc 2"] - m0 [label="memory"] - fc0_0 -> fc0_1 - fc0_1 -> fc0_2 - fc0_1 -> m0 - m0 -> fc0_1 - } - - subgraph cluster_timestep1 { - label="recurrent timestep i" - node [style=filled]; - fc1_0 [label="fc 0"] - fc1_1 [label="fc 1"] - fc1_2 [label="fc 2"] - m1 [label="memory"] - color=blue - fc1_0 -> fc1_1 - fc1_1 -> fc1_2 - fc1_1 -> m1 - m1 -> fc1_1 - } - - subgraph cluster_timestep2 { - label="recurrent timestep i+1" - bgcolor=lightgray - node [style=filled,color=white] - fc2_0 [label="fc 0"] - fc2_1 [label="fc 1"] - fc2_2 [label="fc 2"] - m2 [label="memory"] - fc2_0 -> fc2_1 - fc2_1 -> fc2_2 - fc2_1 -> m2 - m2 -> fc2_1 - } - - - m0 -> m1 [style="dotted" constraint=false] - m1 -> m2 [style="dotted" constraint=false] - -} \ No newline at end of file diff --git a/doc/v2/howto/rnn/src/simple_full_hierarchical_recurrent.dot b/doc/v2/howto/rnn/src/simple_full_hierarchical_recurrent.dot deleted file mode 100644 index ff278a0323bb2c3ef07bf6f016a3a8df05783581..0000000000000000000000000000000000000000 --- a/doc/v2/howto/rnn/src/simple_full_hierarchical_recurrent.dot +++ /dev/null @@ -1,30 +0,0 @@ -digraph G { - rankdir=LR; - - subgraph cluster_t0 { - a [label="4"] - b [label="5"] - c [label="2"] - } - - subgraph cluster_t1 { - d [label="0"] - e [label="9"] - } - - subgraph cluster_t2 { - f [label="8"] - g [label="1"] - h [label="4"] - } - - a -> b; - b -> c; - c -> d [constraint=false]; - - d -> e; - e -> f [constraint=false]; - - f -> g; - g -> h; -} \ No newline at end of file diff --git a/doc/v2/howto/rnn/src/simple_full_recurrent.dot b/doc/v2/howto/rnn/src/simple_full_recurrent.dot deleted file mode 100644 index cee281fbac993afbd0cc3416570f95965cdf0a59..0000000000000000000000000000000000000000 --- a/doc/v2/howto/rnn/src/simple_full_recurrent.dot +++ /dev/null @@ -1,19 +0,0 @@ -digraph G { - rankdir=LR; - a [label="4"] - b [label="5"] - c [label="2"] - d [label="0"] - e [label="9"] - f [label="8"] - g [label="1"] - h [label="4"] - - a -> b; - b -> c; - c -> d; - d -> e; - e -> f; - f -> g; - g -> h; -} \ No newline at end of file diff --git a/doc/v2/images/FullyConnected.jpg b/doc/v2/images/FullyConnected.jpg deleted file mode 100644 index b2241f401434e527f95ee4e0e541a3f2ff78fd1e..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/FullyConnected.jpg and /dev/null differ diff --git a/doc/v2/images/add_security_group.png b/doc/v2/images/add_security_group.png deleted file mode 100644 index bd34f46c9b0ada7027fd53e553e7d033255d25fc..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/add_security_group.png and /dev/null differ diff --git a/doc/v2/images/bi_lstm.jpg b/doc/v2/images/bi_lstm.jpg deleted file mode 100644 index adec1606d64d6e35ffe7e62abfa9a09309b05c84..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/bi_lstm.jpg and /dev/null differ diff --git a/doc/v2/images/checkpointing.png b/doc/v2/images/checkpointing.png deleted file mode 100644 index c221e8474f90f37e31416cbb19c9452207a0d14c..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/checkpointing.png and /dev/null differ diff --git a/doc/v2/images/create_efs.png b/doc/v2/images/create_efs.png deleted file mode 100644 index e5f1526033d1daf401700989af1d25919bcb7675..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/create_efs.png and /dev/null differ diff --git a/doc/v2/images/csr.png b/doc/v2/images/csr.png deleted file mode 100644 index 3dc10b8de4f6d3f517624956b1694b689405a031..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/csr.png and /dev/null differ diff --git a/doc/v2/images/data_dispatch.png b/doc/v2/images/data_dispatch.png deleted file mode 100644 index 5bdcc24d6a6d193cb014f8c38b362451fded5e54..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/data_dispatch.png and /dev/null differ diff --git a/doc/v2/images/dataset.graffle b/doc/v2/images/dataset.graffle deleted file mode 100644 index c10a423ed16a23229a9ee33d11bfc82bb59646c8..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/dataset.graffle and /dev/null differ diff --git a/doc/v2/images/dataset.png b/doc/v2/images/dataset.png deleted file mode 100644 index 2fb7f1cce3b6dd21489392557826e95a9f207c34..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/dataset.png and /dev/null differ diff --git a/doc/v2/images/doc_en.png b/doc/v2/images/doc_en.png deleted file mode 100644 index ed6b9178fba91a3bdf45ae797a9924f84146fbc8..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/doc_en.png and /dev/null differ diff --git a/doc/v2/images/efs_mount.png b/doc/v2/images/efs_mount.png deleted file mode 100644 index 0f9e3cab98445707e5e9baa18ddabe15cdf04576..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/efs_mount.png and /dev/null differ diff --git a/doc/v2/images/encoder-decoder-attention-model.png b/doc/v2/images/encoder-decoder-attention-model.png deleted file mode 100644 index 79f911d4ba12ac0c0d1a936c9df639c302786914..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/encoder-decoder-attention-model.png and /dev/null differ diff --git a/doc/v2/images/engine.png b/doc/v2/images/engine.png deleted file mode 100644 index 1f5f65c2cc765a514a3ba9e7b7f468e1dc4b0c3b..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/engine.png and /dev/null differ diff --git a/doc/v2/images/file_storage.graffle b/doc/v2/images/file_storage.graffle deleted file mode 100644 index 50a17e70fa255495337c529a3bf12a5c0024a5be..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/file_storage.graffle and /dev/null differ diff --git a/doc/v2/images/file_storage.png b/doc/v2/images/file_storage.png deleted file mode 100644 index fccb4e3e7e738224c7f1584326bd5f351ce799aa..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/file_storage.png and /dev/null differ diff --git a/doc/v2/images/glossary_rnn.dot b/doc/v2/images/glossary_rnn.dot deleted file mode 100644 index 2cd0fb1820c44b0e8e0b869f9d39fcad27efa758..0000000000000000000000000000000000000000 --- a/doc/v2/images/glossary_rnn.dot +++ /dev/null @@ -1,42 +0,0 @@ -digraph G{ - subgraph cluster_timestep0 { - label="recurrent timestep i-1" - bgcolor=lightgray - node [style=filled,color=white] - fc0_0 [label="fc 0"] - fc0_1 [label="fc 1"] - fc0_2 [label="fc 2"] - - fc0_0 -> fc0_1 - fc0_1 -> fc0_2 - } - - subgraph cluster_timestep1 { - label="recurrent timestep i" - node [style=filled]; - fc1_0 [label="fc 0"] - fc1_1 [label="fc 1"] - fc1_2 [label="fc 2"] - color=blue - - fc1_0 -> fc1_1 - fc1_1 -> fc1_2 - } - - subgraph cluster_timestep2 { - label="recurrent timestep i+1" - bgcolor=lightgray - node [style=filled,color=white] - fc2_0 [label="fc 0"] - fc2_1 [label="fc 1"] - fc2_2 [label="fc 2"] - - fc2_0 -> fc2_1 - fc2_1 -> fc2_2 - } - - - fc0_1 -> fc1_1 [style="dotted" constraint=false] - fc1_1 -> fc2_1 [style="dotted" constraint=false] - -} \ No newline at end of file diff --git a/doc/v2/images/glossary_rnn_with_memory.dot b/doc/v2/images/glossary_rnn_with_memory.dot deleted file mode 100644 index 0f101ec2d8f15aec76c57f328046b6b55cf0c7eb..0000000000000000000000000000000000000000 --- a/doc/v2/images/glossary_rnn_with_memory.dot +++ /dev/null @@ -1,48 +0,0 @@ -digraph G{ - subgraph cluster_timestep0 { - label="recurrent timestep i-1" - bgcolor=lightgray - node [style=filled,color=white] - fc0_0 [label="fc 0"] - fc0_1 [label="fc 1"] - fc0_2 [label="fc 2"] - m0 [label="memory"] - fc0_0 -> fc0_1 - fc0_1 -> fc0_2 - fc0_1 -> m0 - m0 -> fc0_1 - } - - subgraph cluster_timestep1 { - label="recurrent timestep i" - node [style=filled]; - fc1_0 [label="fc 0"] - fc1_1 [label="fc 1"] - fc1_2 [label="fc 2"] - m1 [label="memory"] - color=blue - fc1_0 -> fc1_1 - fc1_1 -> fc1_2 - fc1_1 -> m1 - m1 -> fc1_1 - } - - subgraph cluster_timestep2 { - label="recurrent timestep i+1" - bgcolor=lightgray - node [style=filled,color=white] - fc2_0 [label="fc 0"] - fc2_1 [label="fc 1"] - fc2_2 [label="fc 2"] - m2 [label="memory"] - fc2_0 -> fc2_1 - fc2_1 -> fc2_2 - fc2_1 -> m2 - m2 -> fc2_1 - } - - - m0 -> m1 [style="dotted" constraint=false] - m1 -> m2 [style="dotted" constraint=false] - -} \ No newline at end of file diff --git a/doc/v2/images/gradients.png b/doc/v2/images/gradients.png deleted file mode 100644 index f031bcf8e4cec14e63075b8b9d2c7bbd9f1b1a3c..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/gradients.png and /dev/null differ diff --git a/doc/v2/images/init_lock.graffle b/doc/v2/images/init_lock.graffle deleted file mode 100644 index fa9149f21b1311eed48ef72ec55e556559d0fc94..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/init_lock.graffle and /dev/null differ diff --git a/doc/v2/images/init_lock.png b/doc/v2/images/init_lock.png deleted file mode 100644 index 92404ee6d6c0f9a7727952bae3c869ba338ecd7f..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/init_lock.png and /dev/null differ diff --git a/doc/v2/images/k8s-paddle-arch.png b/doc/v2/images/k8s-paddle-arch.png deleted file mode 100644 index b3800c4fe81302d35e49f7dbacb9221c4dfa5cde..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/k8s-paddle-arch.png and /dev/null differ diff --git a/doc/v2/images/layers.png b/doc/v2/images/layers.png deleted file mode 100644 index 306f79b7a844610915eb8944128f57d2b7a3065a..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/layers.png and /dev/null differ diff --git a/doc/v2/images/managed_policy.png b/doc/v2/images/managed_policy.png deleted file mode 100644 index c7ecda555b81d7750e9292a9ab72d2f517f76a2a..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/managed_policy.png and /dev/null differ diff --git a/doc/v2/images/matrix.png b/doc/v2/images/matrix.png deleted file mode 100644 index c33ce9cf0335e47cc8c1253304d0fe179186e6f2..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/matrix.png and /dev/null differ diff --git a/doc/v2/images/nvvp1.png b/doc/v2/images/nvvp1.png deleted file mode 100644 index 1af23ac3c52929b2b0645d2f9fa4d4c6db1f6e77..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/nvvp1.png and /dev/null differ diff --git a/doc/v2/images/nvvp2.png b/doc/v2/images/nvvp2.png deleted file mode 100644 index 177c9db708da6863d1075f3e615f5962dbe18b29..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/nvvp2.png and /dev/null differ diff --git a/doc/v2/images/nvvp3.png b/doc/v2/images/nvvp3.png deleted file mode 100644 index d8f393667d6569b6f1e61ffccac43fae5888b6db..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/nvvp3.png and /dev/null differ diff --git a/doc/v2/images/nvvp4.png b/doc/v2/images/nvvp4.png deleted file mode 100644 index 51f2f3e183295de6cf8ddaf2b3b8a0862aa35f01..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/nvvp4.png and /dev/null differ diff --git a/doc/v2/images/overview.png b/doc/v2/images/overview.png deleted file mode 100644 index 8fb7bbb9dd654bf363d701d0c8cd4a557043d188..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/overview.png and /dev/null differ diff --git a/doc/v2/images/paddle-cloud-in-data-center.png b/doc/v2/images/paddle-cloud-in-data-center.png deleted file mode 100644 index da5d1a77562480ad1d886f5f21dbd84001d3d508..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/paddle-cloud-in-data-center.png and /dev/null differ diff --git a/doc/v2/images/paddle-etcd.graffle b/doc/v2/images/paddle-etcd.graffle deleted file mode 100644 index f973dc9b9dbf72e9bc31e2d32822916cd281f8d9..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/paddle-etcd.graffle and /dev/null differ diff --git a/doc/v2/images/paddle-etcd.png b/doc/v2/images/paddle-etcd.png deleted file mode 100644 index 57981ceb4b94f0f7d6dfa63f3d28c0402bf9cc31..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/paddle-etcd.png and /dev/null differ diff --git a/doc/v2/images/paddle-model-sharding.graffle b/doc/v2/images/paddle-model-sharding.graffle deleted file mode 100644 index fba30f0ca2b47f0d202a432821d95e55aac37ec8..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/paddle-model-sharding.graffle and /dev/null differ diff --git a/doc/v2/images/paddle-model-sharding.png b/doc/v2/images/paddle-model-sharding.png deleted file mode 100644 index 8c3f6724ef46c6527e63a4cd8cb0b50fe0167124..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/paddle-model-sharding.png and /dev/null differ diff --git a/doc/v2/images/paddle-ps-0.png b/doc/v2/images/paddle-ps-0.png deleted file mode 100644 index 47ef32806f182cab003da77f1556823b3f6d1721..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/paddle-ps-0.png and /dev/null differ diff --git a/doc/v2/images/paddle-ps-1.png b/doc/v2/images/paddle-ps-1.png deleted file mode 100644 index f3125db73096c52bac6e7c60e1675552857c0774..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/paddle-ps-1.png and /dev/null differ diff --git a/doc/v2/images/paddle-ps.graffle b/doc/v2/images/paddle-ps.graffle deleted file mode 100644 index 0e536ffdd91cd696008b4c01bad3cb53edebdc16..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/paddle-ps.graffle and /dev/null differ diff --git a/doc/v2/images/paddle-task-queues.graffle b/doc/v2/images/paddle-task-queues.graffle deleted file mode 100644 index 4263ed8bfd2ef0e55058828bf23f2fac3595e5fd..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/paddle-task-queues.graffle and /dev/null differ diff --git a/doc/v2/images/paddle-task-queues.png b/doc/v2/images/paddle-task-queues.png deleted file mode 100644 index 5f980266795776752cebd0c346b85c4a75a47780..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/paddle-task-queues.png and /dev/null differ diff --git a/doc/v2/images/paddle-task-states.graffle b/doc/v2/images/paddle-task-states.graffle deleted file mode 100644 index cf1a0b9246d9386a949d2dbb8c32fe84f72eea83..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/paddle-task-states.graffle and /dev/null differ diff --git a/doc/v2/images/paddle-task-states.png b/doc/v2/images/paddle-task-states.png deleted file mode 100644 index 4ae43cb66c071aee9eb90d875e2373b29af9c3e0..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/paddle-task-states.png and /dev/null differ diff --git a/doc/v2/images/ps_cn.png b/doc/v2/images/ps_cn.png deleted file mode 100644 index f9525739cc8bc6506adde642aafa0a85ae3ebebc..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/ps_cn.png and /dev/null differ diff --git a/doc/v2/images/ps_en.png b/doc/v2/images/ps_en.png deleted file mode 100644 index 6537d3d56589ca9f19a77a50a970e4b5275e6ce0..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/ps_en.png and /dev/null differ diff --git a/doc/v2/images/pserver_and_trainer.png b/doc/v2/images/pserver_and_trainer.png deleted file mode 100644 index f41fe48920590333ad332bb51eb18e03dc251541..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/pserver_and_trainer.png and /dev/null differ diff --git a/doc/v2/images/pserver_init.graffle b/doc/v2/images/pserver_init.graffle deleted file mode 100644 index 5f3f1f52be8aa7f9049a8fcd6b7c93c8560c1676..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/pserver_init.graffle and /dev/null differ diff --git a/doc/v2/images/pserver_init.png b/doc/v2/images/pserver_init.png deleted file mode 100644 index dfe491ff98dd7db1c336093c80964a260df2cd90..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/pserver_init.png and /dev/null differ diff --git a/doc/v2/images/route53_create_recordset.png b/doc/v2/images/route53_create_recordset.png deleted file mode 100644 index 34e476c7beac30fcdde13fccc4cc8d08b4be3d35..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/route53_create_recordset.png and /dev/null differ diff --git a/doc/v2/images/route53_create_zone.png b/doc/v2/images/route53_create_zone.png deleted file mode 100644 index 25b7ddb831c5cba97f4b2edddd27da3234d621af..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/route53_create_zone.png and /dev/null differ diff --git a/doc/v2/images/sequence_data.png b/doc/v2/images/sequence_data.png deleted file mode 100644 index 6e47a46b8955dfe977e85898fe3c9f33ed28de7e..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/sequence_data.png and /dev/null differ diff --git a/doc/v2/images/simple_full_hierarchical_recurrent.dot b/doc/v2/images/simple_full_hierarchical_recurrent.dot deleted file mode 100644 index ff278a0323bb2c3ef07bf6f016a3a8df05783581..0000000000000000000000000000000000000000 --- a/doc/v2/images/simple_full_hierarchical_recurrent.dot +++ /dev/null @@ -1,30 +0,0 @@ -digraph G { - rankdir=LR; - - subgraph cluster_t0 { - a [label="4"] - b [label="5"] - c [label="2"] - } - - subgraph cluster_t1 { - d [label="0"] - e [label="9"] - } - - subgraph cluster_t2 { - f [label="8"] - g [label="1"] - h [label="4"] - } - - a -> b; - b -> c; - c -> d [constraint=false]; - - d -> e; - e -> f [constraint=false]; - - f -> g; - g -> h; -} \ No newline at end of file diff --git a/doc/v2/images/simple_full_recurrent.dot b/doc/v2/images/simple_full_recurrent.dot deleted file mode 100644 index cee281fbac993afbd0cc3416570f95965cdf0a59..0000000000000000000000000000000000000000 --- a/doc/v2/images/simple_full_recurrent.dot +++ /dev/null @@ -1,19 +0,0 @@ -digraph G { - rankdir=LR; - a [label="4"] - b [label="5"] - c [label="2"] - d [label="0"] - e [label="9"] - f [label="8"] - g [label="1"] - h [label="4"] - - a -> b; - b -> c; - c -> d; - d -> e; - e -> f; - f -> g; - g -> h; -} \ No newline at end of file diff --git a/doc/v2/images/submit-job.graffle b/doc/v2/images/submit-job.graffle deleted file mode 100644 index 677cdfb6d9a32168bf71729eb841fa1ca0dd31d6..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/submit-job.graffle and /dev/null differ diff --git a/doc/v2/images/submit-job.png b/doc/v2/images/submit-job.png deleted file mode 100644 index 3046a460a7ba708079e88a560debaa215a694680..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/submit-job.png and /dev/null differ diff --git a/doc/v2/images/trainer.graffle b/doc/v2/images/trainer.graffle deleted file mode 100644 index 43415ed8cf61a5acfa34f8e56b9577f338dbf254..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/trainer.graffle and /dev/null differ diff --git a/doc/v2/images/trainer.png b/doc/v2/images/trainer.png deleted file mode 100644 index 6537d3d56589ca9f19a77a50a970e4b5275e6ce0..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/trainer.png and /dev/null differ diff --git a/doc/v2/images/trainer_cn.png b/doc/v2/images/trainer_cn.png deleted file mode 100644 index f9525739cc8bc6506adde642aafa0a85ae3ebebc..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/trainer_cn.png and /dev/null differ diff --git a/doc/v2/images/worker_security_group.png b/doc/v2/images/worker_security_group.png deleted file mode 100644 index 57eb0265a34ad4223b69600d2a3dd355482e0bf5..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/worker_security_group.png and /dev/null differ diff --git a/doc/v2/images/workflow_of_CAPI.png b/doc/v2/images/workflow_of_CAPI.png deleted file mode 100644 index a4399ade048b3fe10d2d9c714bc34333ca068edb..0000000000000000000000000000000000000000 Binary files a/doc/v2/images/workflow_of_CAPI.png and /dev/null differ diff --git a/doc/v2/index_cn.rst b/doc/v2/index_cn.rst deleted file mode 100644 index 0f645db6fc5d0f84bbe0cbb335677752e3a355ea..0000000000000000000000000000000000000000 --- a/doc/v2/index_cn.rst +++ /dev/null @@ -1,11 +0,0 @@ -PaddlePaddle 文档 -====================== - -.. toctree:: - :maxdepth: 1 - - getstarted/index_cn.rst - build_and_install/index_cn.rst - howto/index_cn.rst - dev/index_cn.rst - faq/index_cn.rst diff --git a/doc/v2/index_en.rst b/doc/v2/index_en.rst deleted file mode 100644 index 909f035cca3db2a02fd38462acc451375eceff40..0000000000000000000000000000000000000000 --- a/doc/v2/index_en.rst +++ /dev/null @@ -1,11 +0,0 @@ -PaddlePaddle Documentation -========================== - -.. toctree:: - :maxdepth: 1 - - getstarted/index_en.rst - build_and_install/index_en.rst - howto/index_en.rst - dev/index_en.rst - faq/index_en.rst diff --git a/go/.gitignore b/go/.gitignore deleted file mode 100644 index 398d70ca375ffceccdbfc82a4851a6830ca31264..0000000000000000000000000000000000000000 --- a/go/.gitignore +++ /dev/null @@ -1,3 +0,0 @@ -vendor/ -.glide/ -proto/*.go diff --git a/go/CMakeLists.txt b/go/CMakeLists.txt deleted file mode 100644 index f3a9296c2c66cd96419cae37c3ac2c93c2b033f5..0000000000000000000000000000000000000000 --- a/go/CMakeLists.txt +++ /dev/null @@ -1,23 +0,0 @@ -# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# - -add_subdirectory(pserver/client/c) -add_subdirectory(cmd/pserver) -add_subdirectory(cmd/master) -add_subdirectory(master/c) -add_subdirectory(master) -add_subdirectory(pserver) -add_subdirectory(pserver/client) -add_subdirectory(utils/networkhelper) diff --git a/go/cmd/master/CMakeLists.txt b/go/cmd/master/CMakeLists.txt deleted file mode 100644 index fc99d8d3bd1ec1941b7a068cf8417f0663dea8c0..0000000000000000000000000000000000000000 --- a/go/cmd/master/CMakeLists.txt +++ /dev/null @@ -1,15 +0,0 @@ -# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -go_binary(master SRC master.go) diff --git a/go/cmd/master/master.go b/go/cmd/master/master.go deleted file mode 100644 index 537df59c860a3cb77ecd8287cd352397d7f7a4e4..0000000000000000000000000000000000000000 --- a/go/cmd/master/master.go +++ /dev/null @@ -1,120 +0,0 @@ -// Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at - -// http://www.apache.org/licenses/LICENSE-2.0 - -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -package main - -import ( - "fmt" - "net" - "net/http" - "net/rpc" - "os" - "os/signal" - "strconv" - "strings" - "time" - - log "github.com/inconshreveable/log15" - "github.com/namsral/flag" - - "github.com/PaddlePaddle/Paddle/go/master" - "github.com/PaddlePaddle/Paddle/go/utils/networkhelper" -) - -func main() { - port := flag.Int("port", 8080, "port of the master server.") - ttlSec := flag.Int("ttl", 60, "etcd lease TTL in seconds.") - endpoints := flag.String("endpoints", "http://127.0.0.1:2379", "comma separated etcd endpoints. If empty, fault tolerance will not be enabled.") - taskTimeoutDur := flag.Duration("task-timout-dur", 20*time.Minute, "task timout duration.") - taskTimeoutMax := flag.Int("task-timeout-max", 3, "max timtout count for each task before it being declared failed task.") - chunkPerTask := flag.Int("chunk-per-task", 10, "chunk per task.") - logLevel := flag.String("log-level", "info", - "log level, possible values: debug, info, warn, error, crit") - flag.Parse() - - lvl, err := log.LvlFromString(*logLevel) - if err != nil { - panic(err) - } - - log.Root().SetHandler( - log.LvlFilterHandler(lvl, log.CallerStackHandler("%+v", log.StderrHandler)), - ) - - if *endpoints == "" { - log.Warn("-endpoints not set, fault tolerance not be enabled.") - } - - var store master.Store - if *endpoints != "" { - eps := strings.Split(*endpoints, ",") - ip, err := networkhelper.GetExternalIP() - if err != nil { - log.Crit("get external ip error", log.Ctx{"error": err}) - panic(err) - } - - addr := fmt.Sprintf("%s:%d", ip, *port) - store, err = master.NewEtcdClient(eps, addr, master.DefaultLockPath, master.DefaultAddrPath, master.DefaultStatePath, *ttlSec) - if err != nil { - log.Crit("error creating etcd client.", log.Ctx{"error": err}) - panic(err) - } - } else { - store = &master.InMemStore{} - } - - shutdown := func() { - log.Info("shutting down gracefully") - err := store.Shutdown() - if err != nil { - log.Error("shutdown error", log.Ctx{"error": err}) - } - } - - // Guaranteed to run even panic happens. - defer shutdown() - - c := make(chan os.Signal, 1) - signal.Notify(c, os.Interrupt) - - s, err := master.NewService(store, *chunkPerTask, *taskTimeoutDur, *taskTimeoutMax) - if err != nil { - log.Crit("error creating new service.", log.Ctx{"error": err}) - panic(err) - } - - err = rpc.Register(s) - if err != nil { - log.Crit("error registering to etcd.", log.Ctx{"error": err}) - panic(err) - } - - rpc.HandleHTTP() - l, err := net.Listen("tcp", ":"+strconv.Itoa(*port)) - if err != nil { - log.Crit("error listing to port", log.Ctx{"error": err, "port": *port}) - panic(err) - } - - go func() { - err = http.Serve(l, nil) - if err != nil { - log.Crit("error serving HTTP", log.Ctx{"error": err}) - panic(err) - } - }() - - <-c -} diff --git a/go/cmd/pserver/.gitignore b/go/cmd/pserver/.gitignore deleted file mode 100644 index fffd9adc4fde9681ad2a58fcf594d20bdd86ab45..0000000000000000000000000000000000000000 --- a/go/cmd/pserver/.gitignore +++ /dev/null @@ -1 +0,0 @@ -pserver diff --git a/go/cmd/pserver/CMakeLists.txt b/go/cmd/pserver/CMakeLists.txt deleted file mode 100644 index 20d033c938648d1b1e5c5ed1b8a738a543c325cf..0000000000000000000000000000000000000000 --- a/go/cmd/pserver/CMakeLists.txt +++ /dev/null @@ -1,15 +0,0 @@ -# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -go_binary(pserver SRCS pserver.go DEPS paddle_go_optimizer) diff --git a/go/cmd/pserver/pserver.go b/go/cmd/pserver/pserver.go deleted file mode 100644 index 271274cafc5c94a2c89ac211dba7a3a2bd232026..0000000000000000000000000000000000000000 --- a/go/cmd/pserver/pserver.go +++ /dev/null @@ -1,108 +0,0 @@ -// Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at - -// http://www.apache.org/licenses/LICENSE-2.0 - -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -package main - -import ( - "net" - "net/http" - "net/rpc" - "os" - "os/signal" - "strconv" - "time" - - "github.com/namsral/flag" - "github.com/topicai/candy" - - "github.com/PaddlePaddle/Paddle/go/pserver" - log "github.com/inconshreveable/log15" -) - -func main() { - port := flag.Int("port", 8001, "port of the pserver") - index := flag.Int("index", -1, "index of the pserver, set to -1 if use etcd for auto pserver index registry") - etcdEndpoint := flag.String("etcd-endpoint", "http://127.0.0.1:2379", - "comma separated endpoint string for pserver to connect to etcd") - dialTimeout := flag.Duration("dial-timeout", 5*time.Second, "dial timeout") - etcdTTL := flag.Int("etcd-ttl", 5, "etcd time to live in seconds") - numPservers := flag.Int("num-pservers", 1, "total pserver count in a training job") - checkpointPath := flag.String("checkpoint-path", "/checkpoints/", "save checkpoint path") - checkpointInterval := flag.Duration("checkpoint-interval", 600*time.Second, "save checkpoint per interval seconds") - logLevel := flag.String("log-level", "info", - "log level, possible values: debug, info, warn, error, crit") - flag.Parse() - - lvl, err := log.LvlFromString(*logLevel) - if err != nil { - panic(err) - } - - log.Root().SetHandler( - log.LvlFilterHandler(lvl, log.CallerStackHandler("%+v", log.StderrHandler)), - ) - - var idx int - - var cp pserver.Checkpoint - var e *pserver.EtcdClient - if *index >= 0 { - idx = *index - } else { - e = pserver.NewEtcdClient(*etcdEndpoint, *numPservers, *dialTimeout, *etcdTTL) - idx, err = e.Register(*port) - candy.Must(err) - - cp, err = pserver.LoadCheckpoint(e, idx) - if err != nil { - if err == pserver.ErrCheckpointNotFound { - log.Info("load checkpoint error", "error", err) - } else { - panic(err) - } - } - } - - shutdown := func() { - log.Info("shutting down gracefully") - sErr := e.Shutdown() - if sErr != nil { - log.Error("error shutting down", log.Ctx{"error": sErr}) - } - } - - // Guaranteed to run even panic happens. - defer shutdown() - - c := make(chan os.Signal, 1) - signal.Notify(c, os.Interrupt) - - s, err := pserver.NewService(idx, *checkpointInterval, *checkpointPath, e, cp) - candy.Must(err) - - err = rpc.Register(s) - candy.Must(err) - - rpc.HandleHTTP() - l, err := net.Listen("tcp", ":"+strconv.Itoa(*port)) - candy.Must(err) - - go func() { - log.Info("serving pserver", log.Ctx{"port": *port}) - err = http.Serve(l, nil) - candy.Must(err) - }() - - <-c -} diff --git a/go/connection/conn.go b/go/connection/conn.go deleted file mode 100644 index b8353e8e18ed7b40bab057d6226637df1e6e569a..0000000000000000000000000000000000000000 --- a/go/connection/conn.go +++ /dev/null @@ -1,120 +0,0 @@ -// Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at - -// http://www.apache.org/licenses/LICENSE-2.0 - -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -package connection - -import ( - "errors" - "net/rpc" - "sync" - - log "github.com/sirupsen/logrus" -) - -// TODO(helin): add TCP re-connect logic - -// Conn is a connection to a parameter server -type Conn struct { - mu sync.Mutex - client *rpc.Client - waitConn chan struct{} -} - -// New creates a new connection. -func New() *Conn { - c := &Conn{} - return c -} - -// Close closes the connection. -func (c *Conn) Close() error { - c.mu.Lock() - defer c.mu.Unlock() - - if c.client == nil { - return nil - } - - return c.client.Close() -} - -// Connect connects the connection to a address. -func (c *Conn) Connect(addr string) error { - c.mu.Lock() - if c.client != nil { - err := c.client.Close() - if err != nil { - c.mu.Unlock() - return err - } - - c.client = nil - } - c.mu.Unlock() - - client, err := rpc.DialHTTP("tcp", addr) - if err != nil { - return err - } - - c.mu.Lock() - defer c.mu.Unlock() - - if c.client == nil { - c.client = client - if c.waitConn != nil { - close(c.waitConn) - c.waitConn = nil - } - } else { - err := client.Close() - if err != nil { - log.Errorln(err) - } - - return errors.New("client already set from a concurrent goroutine") - } - - return nil -} - -// TODO(helin): refactor Call to be able to perform given retry -// policy. - -// Call make a RPC call. -// -// Call will be blocked until the connection to remote RPC service -// being established. -func (c *Conn) Call(serviceMethod string, args interface{}, reply interface{}) error { - c.mu.Lock() - client := c.client - var waitCh chan struct{} - if client == nil { - if c.waitConn != nil { - waitCh = c.waitConn - } else { - waitCh = make(chan struct{}) - c.waitConn = waitCh - } - } - c.mu.Unlock() - - if waitCh != nil { - // wait until new connection being established - <-waitCh - return c.Call(serviceMethod, args, reply) - } - - return client.Call(serviceMethod, args, reply) -} diff --git a/go/master/CMakeLists.txt b/go/master/CMakeLists.txt deleted file mode 100644 index b5101c3479d708418dd662b84e09ad74af86adbe..0000000000000000000000000000000000000000 --- a/go/master/CMakeLists.txt +++ /dev/null @@ -1,17 +0,0 @@ -# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# -if(WITH_TESTING) - go_test(master_test) -endif() diff --git a/go/master/c/CMakeLists.txt b/go/master/c/CMakeLists.txt deleted file mode 100644 index 58b44e6445b63e12eb7d9bfdee93239cf1fab899..0000000000000000000000000000000000000000 --- a/go/master/c/CMakeLists.txt +++ /dev/null @@ -1,15 +0,0 @@ -# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# -go_library(paddle_master SHARED DEPS paddle_go_optimizer) diff --git a/go/master/c/client.go b/go/master/c/client.go deleted file mode 100644 index 42c176d00bd56f989b05e1d128b5ce030d220c77..0000000000000000000000000000000000000000 --- a/go/master/c/client.go +++ /dev/null @@ -1,196 +0,0 @@ -// Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at - -// http://www.apache.org/licenses/LICENSE-2.0 - -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -package main - -/* -#include -#include -#include -#define PADDLE_MASTER_OK 0 -#define PADDLE_MASTER_ERROR -1 - -#define PADDLE_SAVE_MODEL_OK 1 -#define PADDLE_SAVE_MODEL_SKIP 0 - -typedef int paddle_master_client; -*/ -import "C" - -import ( - "strings" - "sync" - "time" - "unsafe" - - "github.com/PaddlePaddle/Paddle/go/master" - log "github.com/inconshreveable/log15" -) - -var mu sync.Mutex -var handleMap = make(map[C.paddle_master_client]*master.Client) -var curHandle C.paddle_master_client - -func init() { - log.Root().SetHandler( - log.LvlFilterHandler(log.LvlWarn, log.CallerStackHandler("%+v", log.StderrHandler)), - ) -} - -func add(c *master.Client) C.paddle_master_client { - mu.Lock() - defer mu.Unlock() - client := curHandle - curHandle++ - handleMap[client] = c - return client -} - -func get(client C.paddle_master_client) *master.Client { - mu.Lock() - defer mu.Unlock() - return handleMap[client] -} - -func remove(client C.paddle_master_client) *master.Client { - mu.Lock() - defer mu.Unlock() - h := handleMap[client] - delete(handleMap, client) - return h -} - -//export paddle_new_etcd_master_client -// -// bufSize is the record buffer size. -func paddle_new_etcd_master_client(etcdEndpoints *C.char, timeout int, bufSize int) C.paddle_master_client { - p := C.GoString(etcdEndpoints) - endpoints := strings.Split(p, ",") - c, err := master.NewClient( - master.WithEtcd(endpoints, time.Duration(timeout)*time.Second), - master.WithBuffer(bufSize), - ) - if err != nil { - panic(err) - } - - return add(c) -} - -//export paddle_new_master_client -// -// bufSize is the record buffer size. -func paddle_new_master_client(addr *C.char, bufSize int) C.paddle_master_client { - a := C.GoString(addr) - c, err := master.NewClient(master.WithAddr(a), master.WithBuffer(bufSize)) - if err != nil { - panic(err) - } - - return add(c) -} - -//export paddle_release_master_client -func paddle_release_master_client(client C.paddle_master_client) { - remove(client) -} - -//export paddle_start_get_records -func paddle_start_get_records(client C.paddle_master_client, pass C.int) { - c := get(client) - c.StartGetRecords(int(pass)) -} - -//export paddle_set_dataset -func paddle_set_dataset(client C.paddle_master_client, path **C.char, size C.int) C.int { - c := get(client) - var paths []string - for i := 0; i < int(size); i++ { - ptr := (**C.char)(unsafe.Pointer(uintptr(unsafe.Pointer(path)) + uintptr(i)*unsafe.Sizeof(*path))) - str := C.GoString(*ptr) - paths = append(paths, str) - } - err := c.SetDataset(paths) - if err != nil { - log.Error("error set dataset", - log.Ctx{"error": err, "paths": paths}) - return C.PADDLE_MASTER_ERROR - } - - return C.PADDLE_MASTER_OK -} - -// paddle_next_record gets the nexts training record. -// -// returns number of bytes of the records if success, -1 if failed, -2 if pass end. -// -//export paddle_next_record -func paddle_next_record(client C.paddle_master_client, record **C.uchar) C.int { - c := get(client) - r, err := c.NextRecord() - if err != nil { - // NOTE: use errors to indicate pass ends - if err.Error() == master.ErrAllTaskFailed.Error() || - err.Error() == master.ErrNoMoreAvailable.Error() || - err.Error() == master.ErrPassBefore.Error() { - return -2 - } - *record = (*C.uchar)(nil) - return -1 - } - - if len(r) == 0 { - // Empty record - *record = (*C.uchar)(nil) - return 0 - } - - size := C.size_t(len(r)) - *record = (*C.uchar)(C.malloc(size)) - C.memcpy(unsafe.Pointer(*record), unsafe.Pointer(&r[0]), size) - return C.int(size) -} - -// paddle_request_save_model requests the master server to approve the -// caller to save the model. -// -// returns 1 if the save the model request is approved, 0 if the -// request is rejected because other trainer is saving the model, -1 -// if error happened. -// -//export paddle_request_save_model -func paddle_request_save_model(client C.paddle_master_client, trainerID string, blockMS int) C.int { - c := get(client) - need, err := c.RequestSaveModel(trainerID, time.Duration(blockMS)*time.Millisecond) - if err != nil { - log.Error("error request save model", log.Ctx{"error": err}) - return C.PADDLE_MASTER_ERROR - } - - if need { - return C.PADDLE_SAVE_MODEL_OK - } - - return C.PADDLE_SAVE_MODEL_SKIP -} - -//export mem_free -func mem_free(p unsafe.Pointer) { - // "free" may be a better name for this function, but doing so - // will cause calling any function of this library from Python - // ctypes hanging. - C.free(p) -} - -func main() {} diff --git a/go/master/client.go b/go/master/client.go deleted file mode 100644 index e43903dd14e74047119d9dcea2adc431357305ee..0000000000000000000000000000000000000000 --- a/go/master/client.go +++ /dev/null @@ -1,255 +0,0 @@ -// Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at - -// http://www.apache.org/licenses/LICENSE-2.0 - -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -package master - -import ( - "os" - "time" - - "github.com/PaddlePaddle/Paddle/go/connection" - "github.com/PaddlePaddle/recordio" - "github.com/coreos/etcd/clientv3" - log "github.com/inconshreveable/log15" -) - -// Client is the client of the master server. -type Client struct { - conn *connection.Conn - ch chan record - bufSize int -} - -type record struct { - r []byte - err error -} - -// WithBuffer sets the client to buffer the training record. -// -// bufSize is the record buffer size. NextRecord will read from this -// buffer. -func WithBuffer(bufSize int) func(*Client) error { - return func(c *Client) error { - if bufSize <= 0 { - return nil - } - c.bufSize = bufSize - return nil - } -} - -// WithAddr sets the client to use fixed master address. -func WithAddr(addr string) func(c *Client) error { - return func(c *Client) error { - ch := make(chan string, 1) - ch <- addr - go c.monitorMaster(ch) - return nil - } -} - -// WithEtcd sets the client to use etcd for master discovery. -func WithEtcd(endpoints []string, timeout time.Duration) func(*Client) error { - return func(c *Client) error { - var cli *clientv3.Client - f := func() error { - var err error - cli, err = clientv3.New(clientv3.Config{ - Endpoints: endpoints, - DialTimeout: timeout, - }) - return err - } - for { - err := f() - if err != nil { - log.Warn("create etcd client error", log.Ctx{"error": err}) - } else { - break - } - time.Sleep(time.Second) - } - - ch := make(chan string, 1) - a, err := GetKey(cli, DefaultAddrPath, timeout) - if err != nil { - return err - } - - if a != "" { - // Master is registered, send to the master address - // channel. - ch <- a - } - - go watchKey(cli, DefaultAddrPath, ch) - go c.monitorMaster(ch) - return nil - } -} - -// NewClient creates a new Client. -func NewClient(opts ...func(*Client) error) (*Client, error) { - c := &Client{} - c.conn = connection.New() - - for _, opt := range opts { - err := opt(c) - if err != nil { - return nil, err - } - } - c.ch = make(chan record, c.bufSize) - return c, nil -} - -// StartGetRecords must be called at beginning of each pass -func (c *Client) StartGetRecords(passID int) { - go c.getRecords(passID) -} - -func (c *Client) getRecords(passID int) { - i := 0 - for { - t, err := c.getTask(passID) - if err != nil { - if err.Error() == ErrPassBefore.Error() || - err.Error() == ErrNoMoreAvailable.Error() || - err.Error() == ErrAllTaskFailed.Error() { - c.ch <- record{nil, err} - break - } - - if i%60 == 0 { - log.Debug("getTask of passID error.", - log.Ctx{"error": err, "passID": passID}) - i = 0 - } - - // if err.Error() == ErrPassAfter.Error() - // wait util last pass finishes - // if other error such as network error - // wait to reconnect or task time out - time.Sleep(time.Second * 3) - i += 3 - continue - } - - for _, chunk := range t.Chunks { - f, e := os.Open(chunk.Path) - if e != nil { - log.Error("error open chunk", log.Ctx{"error": e}) - continue - } - - s := recordio.NewRangeScanner(f, &chunk.Index, -1, -1) - for s.Scan() { - c.ch <- record{s.Record(), nil} - } - - if s.Err() != nil { - c.ch <- record{nil, s.Err()} - log.Error( - "error scan chunk", - log.Ctx{"error": err, "path": chunk.Path}, - ) - } - - err = f.Close() - if err != nil { - log.Error("error close record file", log.Ctx{"error": err}) - } - } - - // We treat a task as finished whenever the last data - // instance of the task is read. This is not exactly - // correct, but a reasonable approximation. - err = c.taskFinished(t.Meta.ID) - if err != nil { - log.Error("task finish callback error.", log.Ctx{"error": err}) - } - } -} - -func (c *Client) monitorMaster(addrCh <-chan string) { - lastMaster := "" - for curMaster := range addrCh { - // connect to the new address once address changed. - if curMaster != lastMaster { - if curMaster == "" { - err := c.conn.Close() - if err != nil { - log.Error("close old master addr error", log.Ctx{"error": err}) - } - } else { - err := c.conn.Connect(curMaster) - if err != nil { - log.Error("connect to new master addr error", log.Ctx{"error": err}) - - // connect to addr failed, set - // to last known addr in order - // to retry next time. - curMaster = lastMaster - } - } - } - lastMaster = curMaster - } -} - -// SetDataset sets dataset to dispatch for the master server. -// -// SetDataset can be call multiple times at one pass. But only the first call -// will be honored. -// -// After all tasks are done, another call of SetDataset will start another pass. -func (c *Client) SetDataset(globPaths []string) error { - err := c.conn.Call("Service.SetDataset", globPaths, nil) - return err -} - -// getTask gets a new task from the master server. -func (c *Client) getTask(passID int) (Task, error) { - var t Task - err := c.conn.Call("Service.GetTask", passID, &t) - return t, err -} - -// TaskFinished tells the master server a task is finished. -func (c *Client) taskFinished(taskID int) error { - return c.conn.Call("Service.TaskFinished", taskID, nil) -} - -// TaskFailed tell the master server as task is failed. -func (c *Client) taskFailed(meta TaskMeta) error { - return c.conn.Call("Service.TaskFailed", meta, nil) -} - -// NextRecord returns next record in the dataset. -// -// NextRecord will block until the next record is available. It is -// thread-safe. -func (c *Client) NextRecord() ([]byte, error) { - r := <-c.ch - return r.r, r.err -} - -// RequestSaveModel requests the master server to approve the caller -// to save the model. -func (c *Client) RequestSaveModel(trainerID string, blockDur time.Duration) (bool, error) { - var need bool - err := c.conn.Call("Service.RequestSaveModel", SaveModelRequest{TrainerID: trainerID, BlockDur: blockDur}, &need) - return need, err -} diff --git a/go/master/client_internal_test.go b/go/master/client_internal_test.go deleted file mode 100644 index 37028a9e1f884f6660bf1c5630980dccae2beb01..0000000000000000000000000000000000000000 --- a/go/master/client_internal_test.go +++ /dev/null @@ -1,152 +0,0 @@ -// Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at - -// http://www.apache.org/licenses/LICENSE-2.0 - -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -package master - -import ( - "fmt" - "net" - "net/http" - "net/rpc" - "os" - "strconv" - "strings" - "testing" - "time" - - "github.com/PaddlePaddle/Paddle/go/connection" - "github.com/PaddlePaddle/recordio" -) - -const ( - totalTask = 20 - chunkPerTask = 10 -) - -func TestGetFinishTask(t *testing.T) { - const path = "/tmp/master_client_test_0" - - l, err := net.Listen("tcp", ":0") - if err != nil { - panic(err) - } - - ss := strings.Split(l.Addr().String(), ":") - p, err := strconv.Atoi(ss[len(ss)-1]) - if err != nil { - panic(err) - } - go func(l net.Listener) { - s, sErr := NewService(&InMemStore{}, chunkPerTask, time.Second, 1) - if sErr != nil { - panic(sErr) - } - - server := rpc.NewServer() - sErr = server.Register(s) - if sErr != nil { - panic(sErr) - } - - mux := http.NewServeMux() - mux.Handle(rpc.DefaultRPCPath, server) - sErr = http.Serve(l, mux) - if sErr != nil { - panic(sErr) - } - }(l) - - f, err := os.Create(path) - if err != nil { - panic(err) - } - - for i := 0; i < totalTask*chunkPerTask; i++ { - w := recordio.NewWriter(f, -1, -1) - _, err = w.Write(nil) - if err != nil { - panic(err) - } - - // call Close to force RecordIO writing a chunk. - err = w.Close() - if err != nil { - panic(err) - } - } - err = f.Close() - if err != nil { - panic(err) - } - - // Manually intialize client to avoid calling c.getRecords() - c := &Client{} - c.conn = connection.New() - addr := fmt.Sprintf(":%d", p) - ch := make(chan string, 1) - ch <- addr - go c.monitorMaster(ch) - - err = c.SetDataset([]string{path}) - if err != nil { - panic(err) - } - - checkOnePass := func(i int) { - var tasks []Task - for idx := 0; idx < totalTask; idx++ { - task, cErr := c.getTask(i) - if cErr != nil && cErr.Error() != ErrNoMoreAvailable.Error() && cErr.Error() != ErrPassAfter.Error() { - t.Fatalf("error: %v, pass: %d\n", cErr, i) - } - tasks = append(tasks, task) - } - - // getting task before task finishes should return error - _, cErr := c.getTask(i) - if cErr == nil { - t.Fatalf("Should get error, pass: %d\n", i) - } - - cErr = c.taskFinished(tasks[0].Meta.ID) - if cErr != nil { - t.Fatalf("Error: %v, pass: %d\n", cErr, i) - } - // call taskFailed once won't put the task to failed queue, just ensure - // the call - cErr = c.taskFailed(tasks[0].Meta) - if cErr != nil { - t.Fatalf("Error: %v, pass: %d\n", cErr, i) - } - - tasks = tasks[1:] - _, cErr = c.getTask(i) - if cErr != nil && cErr.Error() != ErrNoMoreAvailable.Error() && cErr.Error() != ErrPassAfter.Error() { - t.Fatalf("Should be ErrNoMoreAvailable or ErrPassAfter: %s", cErr) - } - - for _, task := range tasks { - cErr = c.taskFinished(task.Meta.ID) - if cErr != nil { - t.Fatal(cErr) - } - } - } - - for i := 0; i < 10; i++ { - // init pass data - c.StartGetRecords(i) - checkOnePass(i) - } -} diff --git a/go/master/client_test.go b/go/master/client_test.go deleted file mode 100644 index 01ecad2deada7978e6fe030a6f2d25e533749568..0000000000000000000000000000000000000000 --- a/go/master/client_test.go +++ /dev/null @@ -1,150 +0,0 @@ -// Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at - -// http://www.apache.org/licenses/LICENSE-2.0 - -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -package master_test - -import ( - "fmt" - "net" - "net/http" - "net/rpc" - "os" - "runtime" - "strconv" - "strings" - "sync" - "testing" - "time" - - "github.com/PaddlePaddle/Paddle/go/master" - "github.com/PaddlePaddle/recordio" -) - -// tool function for testing output goroutine ids -func goid() int { - var buf [64]byte - n := runtime.Stack(buf[:], false) - idField := strings.Fields(strings.TrimPrefix(string(buf[:n]), "goroutine "))[0] - id, err := strconv.Atoi(idField) - if err != nil { - panic(fmt.Sprintf("cannot get goroutine id: %v", err)) - } - return id -} - -func TestNextRecord(t *testing.T) { - const ( - path = "/tmp/master_client_TestFull" - total = 50 - ) - l, err := net.Listen("tcp", ":0") - if err != nil { - panic(err) - } - - ss := strings.Split(l.Addr().String(), ":") - p, err := strconv.Atoi(ss[len(ss)-1]) - if err != nil { - panic(err) - } - go func(l net.Listener) { - s, err := master.NewService(&master.InMemStore{}, 1, time.Second*60, 1) - if err != nil { - panic(err) - } - - server := rpc.NewServer() - err = server.Register(s) - if err != nil { - panic(err) - } - - mux := http.NewServeMux() - mux.Handle(rpc.DefaultRPCPath, server) - err = http.Serve(l, mux) - if err != nil { - panic(err) - } - }(l) - - f, err := os.Create(path) - if err != nil { - panic(err) - } - - w := recordio.NewWriter(f, 1, -1) - for i := 0; i < total; i++ { - _, err = w.Write([]byte{byte(i)}) - if err != nil { - panic(err) - } - } - - err = w.Close() - if err != nil { - panic(err) - } - - err = f.Close() - if err != nil { - panic(err) - } - - // start several client to test task fetching - var wg sync.WaitGroup - for i := 0; i < 4; i++ { - wg.Add(1) - // test for multiple concurrent clients - go func() { - defer wg.Done() - // each go-routine needs a single client connection instance - c, e := master.NewClient(master.WithAddr(fmt.Sprintf(":%d", p)), master.WithBuffer(1)) - if e != nil { - t.Fatal(e) - } - e = c.SetDataset([]string{path}) - if e != nil { - panic(e) - } - - // test for n passes - for pass := 0; pass < 10; pass++ { - c.StartGetRecords(pass) - - received := make(map[byte]bool) - taskid := 0 - for { - r, e := c.NextRecord() - if e != nil { - // ErrorPassAfter will wait, else break for next pass - if e.Error() == master.ErrPassBefore.Error() || - e.Error() == master.ErrNoMoreAvailable.Error() { - break - } - t.Fatal(pass, taskid, "Read error:", e) - } - if len(r) != 1 { - t.Fatal(pass, taskid, "Length should be 1.", r) - } - if received[r[0]] { - t.Fatal(pass, taskid, "Received duplicate.", received, r) - } - taskid++ - received[r[0]] = true - } - } - }() - } - wg.Wait() -} diff --git a/go/master/etcd_client.go b/go/master/etcd_client.go deleted file mode 100644 index 36fe61127443dc8f6386295acb1a711d6a93b11c..0000000000000000000000000000000000000000 --- a/go/master/etcd_client.go +++ /dev/null @@ -1,201 +0,0 @@ -// Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at - -// http://www.apache.org/licenses/LICENSE-2.0 - -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -package master - -import ( - "context" - "time" - - "github.com/coreos/etcd/clientv3" - "github.com/coreos/etcd/clientv3/concurrency" - log "github.com/inconshreveable/log15" -) - -const ( - // DefaultLockPath is the default etcd master lock path. - DefaultLockPath = "/master/lock" - // DefaultStatePath is the default etcd key for master state. - DefaultStatePath = "/master/state" - // DefaultAddrPath is the default etcd key for master address. - DefaultAddrPath = "/master/addr" -) - -// EtcdClient is the etcd client that the master uses for fault -// tolerance and service registry. -type EtcdClient struct { - lockPath string - statePath string - client *clientv3.Client - lock *concurrency.Mutex - sess *concurrency.Session -} - -// NewEtcdClient creates a new EtcdClient. -func NewEtcdClient(endpoints []string, addr string, lockPath, addrPath, statePath string, ttlSec int) (*EtcdClient, error) { - log.Debug("Connecting to etcd", log.Ctx{"endpoint": endpoints}) - cli, err := clientv3.New(clientv3.Config{ - Endpoints: endpoints, - DialTimeout: dialTimeout, - }) - if err != nil { - return nil, err - } - - sess, err := concurrency.NewSession(cli, concurrency.WithTTL(ttlSec)) - if err != nil { - return nil, err - } - - lock := concurrency.NewMutex(sess, lockPath) - // It's fine for the lock to get stuck, in this case we have - // multiple master servers running (only configured to have - // one master running, but split-brain problem may cause - // multiple master servers running), and the cluster management - // software will kill one of them. - log.Info("Trying to acquire lock.", log.Ctx{"path": lockPath}) - err = lock.Lock(context.TODO()) - if err != nil { - return nil, err - } - log.Info("Successfully acquired lock at %s.", log.Ctx{"path": lockPath}) - - put := clientv3.OpPut(addrPath, addr) - resp, err := cli.Txn(context.Background()).If(lock.IsOwner()).Then(put).Commit() - if err != nil { - return nil, err - } - - if !resp.Succeeded { - log.Crit("No longer owns the master lock. Exiting.") - panic("No longer owns the master lock. Exiting.") - } - - e := &EtcdClient{ - lockPath: lockPath, - statePath: statePath, - client: cli, - lock: lock, - sess: sess, - } - - return e, nil -} - -// Save saves the state into the etcd. -func (e *EtcdClient) Save(state []byte) error { - ctx := context.TODO() - put := clientv3.OpPut(e.statePath, string(state)) - resp, err := e.client.Txn(ctx).If(e.lock.IsOwner()).Then(put).Commit() - if err != nil { - return err - } - - if !resp.Succeeded { - log.Error("No longer owns the lock, trying to lock again") - ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second) - err := e.lock.Lock(ctx) - cancel() - if err != nil { - // We lost the master lock and can not acquire - // it back, it means some other master is - // already started. We don't want cluster - // management system to kill the master server - // who is holding the lock and running - // correctly. So the most feasible solution is - // to kill current master server. The current - // state is not saved, but the trainer's RPC - // call will fail, so the trainer will retry. - log.Crit("Could not acquire the lock at %s: %v. Exiting.", log.Ctx{"path": e.lockPath, "error": err}) - panic("Could not acquire the lock at %s: %v. Exiting.") - } - log.Info("Successfully acquired lock at %s.", e.lockPath) - return e.Save(state) - } - - return nil -} - -// Load loads the state from etcd. -func (e *EtcdClient) Load() ([]byte, error) { - ctx := context.TODO() - get := clientv3.OpGet(e.statePath) - - resp, err := e.client.Txn(ctx).If(e.lock.IsOwner()).Then(get).Commit() - if err != nil { - return nil, err - } - - if !resp.Succeeded { - log.Error("No longer owns the lock, trying to lock and load again.") - err = e.lock.Lock(context.Background()) - if err != nil { - return nil, err - } - - return e.Load() - } - - kvs := resp.Responses[0].GetResponseRange().Kvs - if len(kvs) == 0 { - // No state exists - return nil, nil - } - - state := kvs[0].Value - return state, nil -} - -// Shutdown shuts down the etcd client gracefully. -func (e *EtcdClient) Shutdown() error { - err := e.sess.Close() - newErr := e.client.Close() - if newErr != nil { - if err == nil { - err = newErr - } else { - log.Error("shutdown error", log.Ctx{"error": newErr}) - } - } - - return err -} - -// GetKey gets the value by the specify key. -func GetKey(c *clientv3.Client, key string, timeout time.Duration) (string, error) { - ctx, cancel := context.WithTimeout(context.Background(), timeout) - resp, err := c.Get(ctx, key) - cancel() - if err != nil { - return "", err - } - kvs := resp.Kvs - if len(kvs) == 0 { - return "", nil - } - v := kvs[0].Value - return string(v), nil -} - -// watchKey watches the specify key and send to valChan if there is some event. -func watchKey(c *clientv3.Client, key string, valChan chan<- string) { - rch := c.Watch(context.Background(), key) - for wresp := range rch { - for _, ev := range wresp.Events { - // if received event is DELETE, the value will be an empty string - log.Info("received event.", log.Ctx{"type": ev.Type, "key": ev.Kv.Key, "value": ev.Kv.Value}) - valChan <- string(ev.Kv.Value) - } - } -} diff --git a/go/master/inmem_store.go b/go/master/inmem_store.go deleted file mode 100644 index 33b4714317ff3f1ebbf312ac3a231cd9383bf224..0000000000000000000000000000000000000000 --- a/go/master/inmem_store.go +++ /dev/null @@ -1,47 +0,0 @@ -// Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at - -// http://www.apache.org/licenses/LICENSE-2.0 - -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -package master - -import "sync" - -// InMemStore is an in memory implementation of Store interface. -// -// It does not tolerate the fault that causes the program to crash. -type InMemStore struct { - mu sync.Mutex - buf []byte -} - -// Save saves the state into the in-memory store. -func (m *InMemStore) Save(state []byte) error { - m.mu.Lock() - defer m.mu.Unlock() - - m.buf = state - return nil -} - -// Load loads the state from the in-memory store. -func (m *InMemStore) Load() ([]byte, error) { - m.mu.Lock() - defer m.mu.Unlock() - - return m.buf, nil -} - -// Shutdown shuts down the in mem store. -func (m *InMemStore) Shutdown() error { - return nil -} diff --git a/go/master/service.go b/go/master/service.go deleted file mode 100644 index 39f746e528e0c91ecf0f3ccacb01520bab81e0a4..0000000000000000000000000000000000000000 --- a/go/master/service.go +++ /dev/null @@ -1,510 +0,0 @@ -// Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at - -// http://www.apache.org/licenses/LICENSE-2.0 - -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -package master - -import ( - "bytes" - "compress/gzip" - "encoding/gob" - "errors" - "math/rand" - "os" - "path/filepath" - "sync" - "time" - - log "github.com/inconshreveable/log15" - - "github.com/PaddlePaddle/recordio" -) - -const ( - dialTimeout = 5 * time.Second -) - -// ErrAllTaskFailed occur when tasks are in done or failed state. -var ErrAllTaskFailed = errors.New("all task finished") - -// ErrNoMoreAvailable occur when no task in todo and yet not all done or fail. -var ErrNoMoreAvailable = errors.New("no more available task") - -// ErrPassBefore client side pass number does not match with master counter. -var ErrPassBefore = errors.New("pass number smaller than master") - -// ErrPassAfter client side pass number does not match with master counter. -var ErrPassAfter = errors.New("pass number larger than master") - -// Store is the interface for save and load the master state. -type Store interface { - Save([]byte) error - Load() ([]byte, error) - Shutdown() error -} - -// Chunk is a chunk of data consisted of several data instances. -type Chunk struct { - Path string - Index recordio.Index // chunk index -} - -// TaskMeta is a struct which stores task's meta info. -type TaskMeta struct { - ID int - Epoch int -} - -// Task is the basic unit of data instances assigned to trainers. -type Task struct { - Meta TaskMeta - Chunks []Chunk -} - -type taskEntry struct { - Task Task - // A task fails if it's timeout or trainer reports it exits unnormally. - NumFailure int -} - -type masterState struct { - Todo []taskEntry - Pending map[int]taskEntry // map from task ID to task entry - Done []taskEntry - Failed []taskEntry - CurPass int -} - -// Service is the master server service. -type Service struct { - chunksPerTask int - timeoutDur time.Duration - failureMax int - store Store - - ready chan struct{} - initDone bool - - mu sync.Mutex - // State to be persisted to snapshot. - state masterState - // The trainer that is currently saving model. This state is - // transient, does not need to be persisted to snapshot. - savingTrainer string -} - -func partition(chunks []Chunk, chunksPerTask int) []taskEntry { - // generate uniq id across job using nanosecond + randint + counter - // FIXME(typhoonzero): this is a workaround, use uuid - randStart := rand.Int() - counter := 0 - timestamp := time.Now().Nanosecond() - id := timestamp + randStart + counter - if chunksPerTask <= 0 { - chunksPerTask = 1 - } - - var result []taskEntry - var cur taskEntry - for i, c := range chunks { - if i%chunksPerTask == 0 && len(cur.Task.Chunks) > 0 { - cur.Task.Meta.ID = id - counter++ - id = timestamp + randStart + counter - result = append(result, cur) - cur.Task.Chunks = nil - } - - cur.Task.Chunks = append(cur.Task.Chunks, c) - } - - if len(cur.Task.Chunks) > 0 { - cur.Task.Meta.ID = id - result = append(result, cur) - } - - return result -} - -// NewService creates a new service. -func NewService(store Store, chunksPerTask int, timeoutDur time.Duration, failureMax int) (*Service, error) { - s := &Service{} - s.chunksPerTask = chunksPerTask - s.timeoutDur = timeoutDur - s.failureMax = failureMax - s.state = masterState{} - s.state.Pending = make(map[int]taskEntry) - s.ready = make(chan struct{}) - s.store = store - recovered, err := s.recover() - if err != nil { - return nil, err - } - - if recovered { - // Recovered. Now the state is already initialized, - // and the master is ready. - s.initDone = true - close(s.ready) - log.Info("Master recovered from saved state.") - } - - return s, nil -} - -// recover recovers service state from etcd. -func (s *Service) recover() (bool, error) { - state, err := s.store.Load() - if err != nil { - return false, err - } - - if state == nil { - log.Info("No state exists, not recovered.") - return false, nil - } - - log.Info("Loaded snapshot.", log.Ctx{"size": len(state)}) - gr, err := gzip.NewReader(bytes.NewReader(state)) - if err != nil { - return false, err - } - - dec := gob.NewDecoder(gr) - var tqs masterState - err = dec.Decode(&tqs) - if err != nil { - return false, err - } - - err = gr.Close() - if err != nil { - // Only close failed, recover actually succeed, so - // just log error. - log.Error("error close recover file.", log.Ctx{"error": err}) - } - - s.state = tqs - log.Info("Master recovered from snapshot, scheduling pending task timeout check.", s.logCtx()) - for _, t := range s.state.Pending { - time.AfterFunc(s.timeoutDur, s.checkTimeoutFunc(t.Task.Meta.ID, t.Task.Meta.Epoch)) - } - - return true, nil -} - -// snapshot *must* be called with s.mu being held. -func (s *Service) snapshot() error { - // TODO(helin): etcd request has a size limit, so the snapshot - // size is limited by the max request size. We should either - // divide the snapshot into smaller chunks and save under - // different keys, or configure the request size to be big - // enough: - // https://github.com/coreos/etcd/blob/2f84f3d8d8ed8f9537ab6ffa44a3a1c7eddfa9b1/embed/config.go#L44 - var buf bytes.Buffer - gw := gzip.NewWriter(&buf) - enc := gob.NewEncoder(gw) - err := enc.Encode(s.state) - if err != nil { - return err - } - err = gw.Close() - if err != nil { - return err - } - - state := buf.Bytes() - log.Info("Saving snapshot.", log.Ctx{"size bytes": len(state)}) - return s.store.Save(state) -} - -func readChunks(globPaths []string) ([]Chunk, error) { - var chunks []Chunk - var paths []string - - for _, s := range globPaths { - match, err := filepath.Glob(s) - if err != nil { - return nil, err - } - paths = append(paths, match...) - } - - if len(paths) == 0 { - return nil, errors.New("no valid dataset specified") - } - - for _, path := range paths { - f, err := os.Open(path) - if err != nil { - return nil, err - } - - index, err := recordio.LoadIndex(f) - if err != nil { - return nil, err - } - err = f.Close() - if err != nil { - return nil, err - } - - count := index.NumChunks() - log.Info("reading chunks.", log.Ctx{"path": path, "num chunks": count}) - for i := 0; i < count; i++ { - chunk := Chunk{ - Path: path, - Index: *index.ChunkIndex(i), - } - chunks = append(chunks, chunk) - } - } - - return chunks, nil -} - -// SetDataset sets dataset to dispatch for the master server. -// -// SetDataset can be call multiple times. But only the first call will -// be honored. -func (s *Service) SetDataset(globPaths []string, _ *int) error { - if len(globPaths) == 0 { - return errors.New("no dataset specified") - } - - s.mu.Lock() - defer s.mu.Unlock() - if s.initDone { - // Already initialized. All trainer will call - // SetDataset, but we only handle the first one. Treat - // other calls as successful but do nothing. - return nil - } - - chunks, err := readChunks(globPaths) - if err != nil { - return err - } - - s.state.Todo = partition(chunks, s.chunksPerTask) - - err = s.snapshot() - if err != nil { - log.Error("snapshot error", log.Ctx{"error": err}) - return err - } - close(s.ready) - s.initDone = true - return nil -} - -// processFailedTask retry s.failureMax times for failed task. -// return true if all task are done or failed. -func (s *Service) processFailedTask(t taskEntry, epoch int) { - if t.Task.Meta.Epoch != epoch { - // new epoch, task launched after the - // schedule of this timeout check or failed status report. - return - } - - defer func() { - err := s.snapshot() - if err != nil { - log.Error("snapshot error", log.Ctx{"error": err}) - } - }() - - delete(s.state.Pending, t.Task.Meta.ID) - - t.NumFailure++ - if t.NumFailure > s.failureMax { - log.Warn("Task failed to many times, discard.", log.Ctx{"task": t.Task, "num failed": t.NumFailure}) - s.state.Failed = append(s.state.Failed, t) - return - } - - log.Warn("Task failed, re-dispatch.", log.Ctx{"task": t.Task, "num failed": t.NumFailure}) - s.state.Todo = append(s.state.Todo, t) - return -} - -func (s *Service) checkTimeoutFunc(taskID int, epoch int) func() { - return func() { - s.mu.Lock() - defer s.mu.Unlock() - - t, ok := s.state.Pending[taskID] - if !ok { - return - } - - s.processFailedTask(t, epoch) - } -} - -// must be called with lock held. -func (s *Service) logCtx() log.Ctx { - return log.Ctx{ - "todoLen": len(s.state.Todo), - "pendingLen": len(s.state.Pending), - "doneLen": len(s.state.Done), - "failedLen": len(s.state.Failed), - "curPass": s.state.CurPass, - } -} - -// GetTask gets a new task from the service. -// passID is the client side pass count -func (s *Service) GetTask(passID int, task *Task) error { - select { - case <-s.ready: - } - - s.mu.Lock() - defer s.mu.Unlock() - if passID < s.state.CurPass { - return ErrPassBefore - } - if passID > s.state.CurPass { - // Client may get run to pass after master when one client faster than the - // other - return ErrPassAfter - } - - if len(s.state.Todo) == 0 { - if len(s.state.Done) == 0 && len(s.state.Pending) == 0 { - log.Warn("All tasks failed, may start next pass", s.logCtx()) - return ErrAllTaskFailed - } - log.Warn("No more available task.", s.logCtx()) - return ErrNoMoreAvailable - } - - t := s.state.Todo[0] - t.Task.Meta.Epoch++ - s.state.Todo = s.state.Todo[1:] - s.state.Pending[t.Task.Meta.ID] = t - err := s.snapshot() - if err != nil { - return err - } - - *task = t.Task - ctx := s.logCtx() - ctx["task meta"] = t.Task.Meta - log.Info("Task dispatched.", ctx) - time.AfterFunc(s.timeoutDur, s.checkTimeoutFunc(t.Task.Meta.ID, t.Task.Meta.Epoch)) - return nil -} - -// TaskFinished tell the service that a task is finished. -func (s *Service) TaskFinished(taskID int, dummy *int) error { - select { - case <-s.ready: - } - - s.mu.Lock() - defer s.mu.Unlock() - - t, ok := s.state.Pending[taskID] - if !ok { - ctx := s.logCtx() - ctx["task id"] = taskID - log.Warn("Pending task not found.", ctx) - return nil - } - - // task finished, reset timeout - t.NumFailure = 0 - s.state.Done = append(s.state.Done, t) - delete(s.state.Pending, taskID) - - ctx := s.logCtx() - ctx["task id"] = taskID - log.Info("Task finished.", ctx) - if len(s.state.Todo) == 0 && len(s.state.Pending) == 0 { - // increase master side pass count if all tasks finished - s.state.CurPass++ - s.state.Todo = append(s.state.Done, s.state.Failed...) - s.state.Done = []taskEntry{} - // TODO(typhoonzero): deal with failed tasks - s.state.Failed = []taskEntry{} - ctx := s.logCtx() - ctx["new pass"] = s.state.CurPass - log.Warn("all task finished, add new pass data.", ctx) - } - - err := s.snapshot() - if err != nil { - log.Error("snapshot error", log.Ctx{"error": err}) - } - return err -} - -// TaskFailed tells the service that a task is failed. -func (s *Service) TaskFailed(meta TaskMeta, dummy *int) error { - select { - case <-s.ready: - } - - s.mu.Lock() - defer s.mu.Unlock() - - t, ok := s.state.Pending[meta.ID] - if !ok { - log.Warn("TaskFailed:Pending task not found.", log.Ctx{"task": t.Task.Meta}) - return nil - } - - s.processFailedTask(t, meta.Epoch) - return nil -} - -// SaveModelRequest is the request for saving model -type SaveModelRequest struct { - TrainerID string - BlockDur time.Duration -} - -// RequestSaveModel requests the master server to approve the caller -// to save the model. -func (s *Service) RequestSaveModel(req SaveModelRequest, need *bool) error { - s.mu.Lock() - defer s.mu.Unlock() - - if req.TrainerID == "" { - return errors.New("trainer id is empty") - } - - if s.savingTrainer == "" { - *need = true - } else { - if req.TrainerID == s.savingTrainer { - // save trainer asked to save model again - *need = true - } else { - *need = false - } - } - - if *need { - s.savingTrainer = req.TrainerID - time.AfterFunc(req.BlockDur, func() { - s.mu.Lock() - s.savingTrainer = "" - s.mu.Unlock() - }) - } - - return nil -} diff --git a/go/master/service_internal_test.go b/go/master/service_internal_test.go deleted file mode 100644 index dd22f3d548b99ac11250735c74bca3dfca46cf86..0000000000000000000000000000000000000000 --- a/go/master/service_internal_test.go +++ /dev/null @@ -1,52 +0,0 @@ -// Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at - -// http://www.apache.org/licenses/LICENSE-2.0 - -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -package master - -import "testing" - -func TestPartitionCount(t *testing.T) { - cs := make([]Chunk, 100) - ts := partition(cs, 5) - if len(ts) != 20 { - t.Error(len(ts)) - } - - cs = make([]Chunk, 101) - ts = partition(cs, 5) - if len(ts) != 21 { - t.Error(len(ts)) - } - - ts = partition(cs, 1) - if len(ts) != 101 { - t.Error(len(ts)) - } - - ts = partition(cs, 0) - if len(ts) != 101 { - t.Error(len(ts)) - } -} - -func TestPartionIndex(t *testing.T) { - cs := make([]Chunk, 100) - ts := partition(cs, 20) - for i := range ts { - // test auto increament ids - if i > 0 && ts[i].Task.Meta.ID != ts[i-1].Task.Meta.ID+1 { - t.Error(ts[i], i) - } - } -} diff --git a/go/master/service_test.go b/go/master/service_test.go deleted file mode 100644 index 2d00c22d6feb7177da5c19c557fd16d7925ef6d1..0000000000000000000000000000000000000000 --- a/go/master/service_test.go +++ /dev/null @@ -1,72 +0,0 @@ -package master_test - -import ( - "io/ioutil" - "net/url" - "os" - "strings" - "testing" - "time" - - "github.com/PaddlePaddle/Paddle/go/master" - "github.com/coreos/etcd/clientv3" - "github.com/coreos/etcd/embed" - "github.com/stretchr/testify/assert" -) - -func TestNewServiceWithEtcd(t *testing.T) { - // setup an embed etcd server - etcdDir, err := ioutil.TempDir("", "") - if err != nil { - t.Fatal(err) - } - cfg := embed.NewConfig() - lpurl, _ := url.Parse("http://localhost:0") - lcurl, _ := url.Parse("http://localhost:0") - cfg.LPUrls = []url.URL{*lpurl} - cfg.LCUrls = []url.URL{*lcurl} - cfg.Dir = etcdDir - e, err := embed.StartEtcd(cfg) - if err != nil { - t.Fatal(err) - } - defer func() { - e.Close() - if err := os.RemoveAll(etcdDir); err != nil { - t.Fatal(err) - } - }() - - <-e.Server.ReadyNotify() - - port := strings.Split(e.Clients[0].Addr().String(), ":")[1] - endpoint := "127.0.0.1:" + port - - ep := []string{endpoint} - masterAddr := "127.0.0.1:3306" - store, err := master.NewEtcdClient(ep, masterAddr, master.DefaultLockPath, master.DefaultAddrPath, master.DefaultStatePath, 30) - if err != nil { - t.Fatal(err) - } - - _, err = master.NewService(store, 10, 10, 3) - if err != nil { - t.Fatal(err) - } - cli, err := clientv3.New(clientv3.Config{ - Endpoints: ep, - DialTimeout: 3 * time.Second, - }) - if err != nil { - t.Fatal(err) - } - v, err := master.GetKey(cli, master.DefaultAddrPath, 3*time.Second) - if err != nil { - t.Fatal(err) - } - if err := cli.Close(); err != nil { - t.Fatal(err) - } - // test master process registry itself into etcd server. - assert.Equal(t, masterAddr, v, "master process should registry itself into etcd server.") -} diff --git a/go/proto/.gitignore b/go/proto/.gitignore deleted file mode 100644 index 5e7d2734cfc60289debf74293817c0a8f572ff32..0000000000000000000000000000000000000000 --- a/go/proto/.gitignore +++ /dev/null @@ -1,4 +0,0 @@ -# Ignore everything in this directory -* -# Except this file -!.gitignore diff --git a/go/pserver/CMakeLists.txt b/go/pserver/CMakeLists.txt deleted file mode 100644 index 32f3b2baba37238f0ca75e9177f9afa3dcfd4156..0000000000000000000000000000000000000000 --- a/go/pserver/CMakeLists.txt +++ /dev/null @@ -1,17 +0,0 @@ -# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# -if(WITH_TESTING) - go_test(pserver_test DEPS paddle_go_optimizer gen_proto_go) -endif() diff --git a/go/pserver/client/CMakeLists.txt b/go/pserver/client/CMakeLists.txt deleted file mode 100644 index 1d6f45a6642fa8819050f8a21c212369b52d1112..0000000000000000000000000000000000000000 --- a/go/pserver/client/CMakeLists.txt +++ /dev/null @@ -1,17 +0,0 @@ -# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# -if(WITH_TESTING) - go_test(pserver_client_test DEPS paddle_go_optimizer) -endif() diff --git a/go/pserver/client/c/.gitignore b/go/pserver/client/c/.gitignore deleted file mode 100644 index 4bf05c85386dfcef83453a663dffc5d62efcbcc0..0000000000000000000000000000000000000000 --- a/go/pserver/client/c/.gitignore +++ /dev/null @@ -1 +0,0 @@ -libpaddle_go_optimizer.a diff --git a/go/pserver/client/c/CMakeLists.txt b/go/pserver/client/c/CMakeLists.txt deleted file mode 100644 index 78776219dee06da09e8b6956cd7bc132fb28552b..0000000000000000000000000000000000000000 --- a/go/pserver/client/c/CMakeLists.txt +++ /dev/null @@ -1,30 +0,0 @@ -# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# -cc_library(paddle_go_optimizer DEPS paddle_optimizer paddle_proto glog gflags protobuf) -target_link_libraries(paddle_go_optimizer stdc++ m) - -# Copy library to the required place. -# See: go/pserver/optimizer.go: -# // #cgo LDFLAGS: ${SRCDIR}/client/c/libpaddle_go_optimizer.a -lstdc++ -lm -add_custom_command(TARGET paddle_go_optimizer POST_BUILD - COMMAND cp "${CMAKE_CURRENT_BINARY_DIR}/libpaddle_go_optimizer.a" "${CMAKE_CURRENT_SOURCE_DIR}" - ) - -go_library(paddle_pserver_cclient STATIC DEPS paddle_go_optimizer) -if(WITH_TESTING) - # FIXME: this test requires pserver which is not managed by the test - # we need some kind of e2e testing machanism. - # add_subdirectory(test) -endif() diff --git a/go/pserver/client/c/cclient.go b/go/pserver/client/c/cclient.go deleted file mode 100644 index cddc28e46f48799f8643732283c94216e1f5cfb1..0000000000000000000000000000000000000000 --- a/go/pserver/client/c/cclient.go +++ /dev/null @@ -1,300 +0,0 @@ -// Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at - -// http://www.apache.org/licenses/LICENSE-2.0 - -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -package main - -/* -#include -typedef enum { - PADDLE_ELEMENT_TYPE_INT32 = 0, - PADDLE_ELEMENT_TYPE_UINT32 = 1, - PADDLE_ELEMENT_TYPE_INT64 = 2, - PADDLE_ELEMENT_TYPE_UINT64 = 3, - PADDLE_ELEMENT_TYPE_FLOAT32 = 4, - PADDLE_ELEMENT_TYPE_FLOAT64 = 5, -} paddle_element_type; - -typedef struct { - char* name; - paddle_element_type element_type; - unsigned char* content; - int content_len; -} paddle_parameter, paddle_gradient; - -typedef int paddle_pserver_client; -#define PSERVER_ERROR -1 -#define PSERVER_OK 0 -*/ -import "C" - -import ( - "strings" - "sync" - "unsafe" - - "github.com/PaddlePaddle/Paddle/go/pserver" - "github.com/PaddlePaddle/Paddle/go/pserver/client" - log "github.com/inconshreveable/log15" -) - -func init() { - log.Root().SetHandler( - log.LvlFilterHandler(log.LvlWarn, log.CallerStackHandler("%+v", log.StderrHandler)), - ) -} - -var mu sync.Mutex -var handleMap = make(map[C.paddle_pserver_client]*client.Client) -var curHandle C.paddle_pserver_client - -func add(c *client.Client) C.paddle_pserver_client { - mu.Lock() - defer mu.Unlock() - cli := curHandle - curHandle++ - handleMap[cli] = c - return cli -} - -func get(client C.paddle_pserver_client) *client.Client { - mu.Lock() - defer mu.Unlock() - return handleMap[client] -} - -func remove(client C.paddle_pserver_client) *client.Client { - mu.Lock() - defer mu.Unlock() - h := handleMap[client] - delete(handleMap, client) - return h -} - -func cArrayToSlice(p unsafe.Pointer, len int) []byte { - if p == nil { - return nil - } - - // create a Go clice backed by a C array, reference: - // https://github.com/golang/go/wiki/cgo#turning-c-arrays-into-go-slices - // - // Go garbage collector will not interact with this data, need - // to be freed properly. - return (*[1 << 30]byte)(p)[:len:len] -} - -type selector bool - -func (s selector) Select() (bool, error) { - return bool(s), nil -} - -func (s selector) Done() error { - return nil -} - -type lister []client.Server - -func (l lister) List() []client.Server { - return l -} - -//export paddle_new_pserver_client -func paddle_new_pserver_client(addrs *C.char, selected int) C.paddle_pserver_client { - a := C.GoString(addrs) - as := strings.Split(a, ",") - servers := make([]client.Server, len(as)) - for i := range as { - servers[i].Index = i - servers[i].Addr = as[i] - } - c := client.NewClient(lister(servers), len(as), selector(selected != 0)) - return add(c) -} - -//export paddle_new_etcd_pserver_client -func paddle_new_etcd_pserver_client(etcdEndpoints *C.char) C.paddle_pserver_client { - addr := C.GoString(etcdEndpoints) - etcdClient := client.NewEtcd(addr) - c := client.NewClient(etcdClient, etcdClient.Desired(), etcdClient) - return add(c) -} - -//export paddle_pserver_client_release -func paddle_pserver_client_release(client C.paddle_pserver_client) { - remove(client) -} - -// paddle_begin_init_params tells trainer if it needs to init the -// parameters. -// -// returns 1 if the trainer needs to init the parameters. 0 if the -// trainer does not need to init the parameters. -// -//export paddle_begin_init_params -func paddle_begin_init_params(client C.paddle_pserver_client) C.int { - c := get(client) - selected, err := c.BeginInitParams() - if err != nil { - panic(err) - } - - if selected { - return 1 - } - return 0 -} - -//export paddle_init_param -func paddle_init_param(client C.paddle_pserver_client, param C.paddle_parameter, paramConfig unsafe.Pointer, configLen C.int) C.int { - et := pserver.ElementType(param.element_type) - name := C.GoString(param.name) - content := cArrayToSlice(unsafe.Pointer(param.content), int(param.content_len)) - pc := pserver.ParameterWithConfig{ - Param: pserver.Parameter{Name: name, ElementType: et, Content: content}, - Config: cArrayToSlice(paramConfig, int(configLen)), - } - c := get(client) - err := c.InitParam(pc) - - if err != nil { - if err.Error() == pserver.AlreadyInitialized { - log.Warn( - "parameter already initialized, treat paddle_init_param as successful.", - log.Ctx{"parameter": name}, - ) - return C.PSERVER_OK - } - log.Error("error init param", log.Ctx{"error": err}) - return C.PSERVER_ERROR - } - - return C.PSERVER_OK -} - -//export paddle_finish_init_params -func paddle_finish_init_params(client C.paddle_pserver_client) C.int { - c := get(client) - err := c.FinishInitParams() - if err != nil { - if err.Error() == pserver.AlreadyInitialized { - log.Warn("parameters already initialized, treat paddle_finish_init_params as successful.") - return C.PSERVER_OK - } - - log.Error("error finish init params", log.Ctx{"error": err}) - return C.PSERVER_ERROR - } - - return C.PSERVER_OK -} - -//export paddle_send_grads -func paddle_send_grads(client C.paddle_pserver_client, grads **C.paddle_gradient, total C.int) C.int { - var gs []pserver.Gradient - for i := 0; i < int(total); i++ { - grad := *(**C.paddle_gradient)(unsafe.Pointer((uintptr(unsafe.Pointer(grads)) + uintptr(i)*unsafe.Sizeof(*grads)))) - et := pserver.ElementType(grad.element_type) - name := C.GoString(grad.name) - content := cArrayToSlice(unsafe.Pointer(grad.content), int(grad.content_len)) - gs = append(gs, pserver.Gradient{Name: name, ElementType: et, Content: content}) - } - - c := get(client) - err := c.SendGrads(gs) - if err != nil { - log.Error("error send grads", log.Ctx{"error": err}) - return C.PSERVER_ERROR - } - - return C.PSERVER_OK -} - -//export paddle_get_params -func paddle_get_params(client C.paddle_pserver_client, dst **C.paddle_parameter, total C.int) C.int { - var ns []string - for i := 0; i < int(total); i++ { - param := *(**C.paddle_parameter)(unsafe.Pointer((uintptr(unsafe.Pointer(dst)) + uintptr(i)*unsafe.Sizeof(*dst)))) - ns = append(ns, C.GoString(param.name)) - } - c := get(client) - ps, err := c.GetParams(ns) - if err != nil { - log.Error("error get params", log.Ctx{"error": err}) - return C.PSERVER_ERROR - } - - if len(ps) != len(ns) { - pn := make([]string, len(ps)) - for i, p := range ps { - pn[i] = p.Name - } - log.Error( - "pserver returned wrong number of parameters.", - log.Ctx{ - "Requested": strings.Join(pn, ", "), - "Returned": strings.Join(ns, ", "), - }, - ) - return C.PSERVER_ERROR - } - - for i := range ps { - if ns[i] != ps[i].Name { - pn := make([]string, len(ps)) - for i, p := range ps { - pn[i] = p.Name - } - log.Error( - "pserver returned wrong parameters, or not in requested order.", - log.Ctx{ - "Requested": strings.Join(pn, ", "), - "Returned": strings.Join(ns, ", "), - }, - ) - return C.PSERVER_ERROR - } - } - - for i := 0; i < int(total); i++ { - p := ps[i] - param := *(**C.paddle_parameter)(unsafe.Pointer((uintptr(unsafe.Pointer(dst)) + uintptr(i)*unsafe.Sizeof(*dst)))) - - if unsafe.Pointer(param) == nil { - log.Error("must pre-allocate parameter.") - return C.PSERVER_ERROR - } - - if unsafe.Pointer(param.content) != nil { - if int(param.content_len) != len(p.Content) { - log.Error( - "the pre-allocated content len does not match parameter content len.", - log.Ctx{ - "Pre-allocated len": param.content_len, - "Returned len": len(p.Content), - }, - ) - return C.PSERVER_ERROR - } - } - - C.memcpy(unsafe.Pointer(param.content), unsafe.Pointer(&p.Content[0]), C.size_t(len(p.Content))) - param.content_len = C.int(len(p.Content)) - param.element_type = C.paddle_element_type(p.ElementType) - } - - return C.PSERVER_OK -} - -func main() {} // Required but ignored diff --git a/go/pserver/client/c/test/CMakeLists.txt b/go/pserver/client/c/test/CMakeLists.txt deleted file mode 100644 index 4500b1f288372ed0e2d9d383234df97ae976c60b..0000000000000000000000000000000000000000 --- a/go/pserver/client/c/test/CMakeLists.txt +++ /dev/null @@ -1,15 +0,0 @@ -# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# -cc_test(test_cclient SRCS test_cclient.c DEPS paddle_pserver_cclient paddle_go_optimizer) diff --git a/go/pserver/client/c/test/test_cclient.c b/go/pserver/client/c/test/test_cclient.c deleted file mode 100644 index 0116e42a0a67f757a786aa6dc9f8097af656d8b2..0000000000000000000000000000000000000000 --- a/go/pserver/client/c/test/test_cclient.c +++ /dev/null @@ -1,115 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include -#include - -#include "libpaddle_pserver_cclient.h" - -// TODO(helin): Fix: gtest using cmake is not working, using this -// hacky way for now. -#define fail() \ - fprintf(stderr, "info: %s:%d: ", __FILE__, __LINE__); \ - exit(-1); - -void sendGrads(paddle_pserver_client c) { - unsigned char grad_a[2000] = {2}; - unsigned char grad_b[3000] = {3}; - paddle_gradient grad1 = { - "param_a", PADDLE_ELEMENT_TYPE_FLOAT32, grad_a, 2000}; - paddle_gradient grad2 = { - "param_b", PADDLE_ELEMENT_TYPE_FLOAT32, grad_b, 3000}; - paddle_gradient *grads[2] = {&grad1, &grad2}; - if (paddle_send_grads(c, grads, 2)) { - fail(); - } -} - -void getParams(paddle_pserver_client c) { - paddle_parameter param_a; - paddle_parameter param_b; - char name_a[] = "param_a"; - char name_b[] = "param_b"; - // Must pre-allocate the prameter content before calling paddle_get_params. - unsigned char content_a[2000] = {}; - unsigned char content_b[3000] = {}; - param_a.element_type = PADDLE_ELEMENT_TYPE_FLOAT32; - param_a.name = name_a; - param_a.content = content_a; - param_a.content_len = 2000; - param_b.element_type = PADDLE_ELEMENT_TYPE_FLOAT32; - param_b.name = name_b; - param_b.content = content_b; - param_b.content_len = 3000; - - paddle_parameter *params[2] = {¶m_a, ¶m_b}; - if (paddle_get_params(c, params, 2)) { - fail(); - } -} - -int main() { - char addr[] = "localhost:3000"; - paddle_pserver_client c = paddle_new_pserver_client(addr, 1); - char *config_proto; - size_t config_proto_len = 0; - ssize_t nread; - FILE *fp = fopen("testdata/optimizer.pb", "r"); - if (!fp) { - fail(); - } - while ((nread = getline(&config_proto, &config_proto_len, fp)) != -1) { - printf("%s", config_proto); - } - fclose(fp); -retry: - if (paddle_begin_init_params(c)) { - paddle_parameter param; - char name_a[] = "param_a"; - char name_b[] = "param_b"; - unsigned char content_a[2000] = {1}; - unsigned char content_b[3000] = {0}; - param.element_type = PADDLE_ELEMENT_TYPE_FLOAT32; - param.name = name_a; - param.content = content_a; - param.content_len = 2000; - int error = - paddle_init_param(c, param, (void *)config_proto, config_proto_len); - if (error != 0) { - goto retry; - } - - param.element_type = PADDLE_ELEMENT_TYPE_FLOAT32; - param.name = name_b; - param.content = content_b; - param.content_len = 3000; - error = paddle_init_param(c, param, (void *)config_proto, config_proto_len); - if (error != 0) { - goto retry; - } - - error = paddle_finish_init_params(c); - if (error != 0) { - goto retry; - } - } - - int i; - for (i = 0; i < 100; i++) { - sendGrads(c); - getParams(c); - } - - return 0; -} diff --git a/go/pserver/client/c/test/test_mnist.py b/go/pserver/client/c/test/test_mnist.py deleted file mode 100644 index 97f63aeb6d4cdfc639b0d778d4df817525c51430..0000000000000000000000000000000000000000 --- a/go/pserver/client/c/test/test_mnist.py +++ /dev/null @@ -1,145 +0,0 @@ -# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import paddle.v2 as paddle -import gzip - - -def softmax_regression(img): - predict = paddle.layer.fc(input=img, - size=10, - act=paddle.activation.Softmax()) - return predict - - -def multilayer_perceptron(img): - # The first fully-connected layer - hidden1 = paddle.layer.fc(input=img, size=128, act=paddle.activation.Relu()) - # The second fully-connected layer and the according activation function - hidden2 = paddle.layer.fc(input=hidden1, - size=64, - act=paddle.activation.Relu()) - # The thrid fully-connected layer, note that the hidden size should be 10, - # which is the number of unique digits - predict = paddle.layer.fc(input=hidden2, - size=10, - act=paddle.activation.Softmax()) - return predict - - -def convolutional_neural_network(img): - # first conv layer - conv_pool_1 = paddle.networks.simple_img_conv_pool( - input=img, - filter_size=5, - num_filters=20, - num_channel=1, - pool_size=2, - pool_stride=2, - act=paddle.activation.Tanh()) - # second conv layer - conv_pool_2 = paddle.networks.simple_img_conv_pool( - input=conv_pool_1, - filter_size=5, - num_filters=50, - num_channel=20, - pool_size=2, - pool_stride=2, - act=paddle.activation.Tanh()) - # The first fully-connected layer - fc1 = paddle.layer.fc(input=conv_pool_2, - size=128, - act=paddle.activation.Tanh()) - # The softmax layer, note that the hidden size should be 10, - # which is the number of unique digits - predict = paddle.layer.fc(input=fc1, - size=10, - act=paddle.activation.Softmax()) - return predict - - -def main(): - paddle.init(use_gpu=False, trainer_count=1) - - # define network topology - images = paddle.layer.data( - name='pixel', type=paddle.data_type.dense_vector(784)) - label = paddle.layer.data( - name='label', type=paddle.data_type.integer_value(10)) - - # Here we can build the prediction network in different ways. Please - # choose one by uncomment corresponding line. - predict = softmax_regression(images) - #predict = multilayer_perceptron(images) - #predict = convolutional_neural_network(images) - - cost = paddle.layer.classification_cost(input=predict, label=label) - parameters = paddle.parameters.create(cost) - - optimizer = paddle.optimizer.Momentum( - learning_rate=0.1 / 128.0, - momentum=0.9, - regularization=paddle.optimizer.L2Regularization(rate=0.0005 * 128)) - - trainer = paddle.trainer.SGD(cost=cost, - parameters=parameters, - update_equation=optimizer, - is_local=False, - pserver_spec="localhost:3000") - - lists = [] - - def event_handler(event): - if isinstance(event, paddle.event.EndIteration): - if event.batch_id % 1000 == 0: - print "Pass %d, Batch %d, Cost %f, %s" % ( - event.pass_id, event.batch_id, event.cost, event.metrics) - - elif isinstance(event, paddle.event.EndPass): - result = trainer.test(reader=paddle.batch( - paddle.dataset.mnist.test(), batch_size=128)) - print "Test with Pass %d, Cost %f, %s\n" % ( - event.pass_id, result.cost, result.metrics) - lists.append((event.pass_id, result.cost, - result.metrics['classification_error_evaluator'])) - - trainer.train( - reader=paddle.batch( - paddle.reader.shuffle( - paddle.dataset.mnist.train(), buf_size=8192), - batch_size=128), - event_handler=event_handler, - num_passes=100) - - # find the best pass - best = sorted(lists, key=lambda list: float(list[1]))[0] - print 'Best pass is %s, testing Avgcost is %s' % (best[0], best[1]) - print 'The classification accuracy is %.2f%%' % (100 - float(best[2]) * 100) - - test_creator = paddle.dataset.mnist.test() - test_data = [] - for item in test_creator(): - test_data.append((item[0], )) - if len(test_data) == 100: - break - - # output is a softmax layer. It returns probabilities. - # Shape should be (100, 10) - probs = paddle.infer( - output_layer=predict, parameters=parameters, input=test_data) - print probs.shape - - -if __name__ == '__main__': - main() diff --git a/go/pserver/client/c/test/test_train.py b/go/pserver/client/c/test/test_train.py deleted file mode 100644 index 2db5a0bf6a520b8fa29d13ea854b638ebcbbb7d9..0000000000000000000000000000000000000000 --- a/go/pserver/client/c/test/test_train.py +++ /dev/null @@ -1,89 +0,0 @@ -# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import paddle.v2 as paddle -import paddle.v2.dataset.uci_housing as uci_housing -import paddle.v2.master as master -import os -import cPickle as pickle -from paddle.v2.reader.creator import cloud_reader - -etcd_ip = os.getenv("MASTER_IP", "127.0.0.1") -etcd_endpoints = "http://" + etcd_ip + ":2379" -print "etcd endpoints: ", etcd_endpoints - - -def main(): - # init - paddle.init(use_gpu=False, trainer_count=1) - - # network config - x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(13)) - y_predict = paddle.layer.fc(input=x, - param_attr=paddle.attr.Param(name='w'), - size=1, - act=paddle.activation.Linear(), - bias_attr=paddle.attr.Param(name='b')) - y = paddle.layer.data(name='y', type=paddle.data_type.dense_vector(1)) - cost = paddle.layer.mse_cost(input=y_predict, label=y) - - # create parameters - parameters = paddle.parameters.create(cost) - - # create optimizer of new remote updater to pserver - optimizer = paddle.optimizer.Momentum(momentum=0, learning_rate=1e-3) - - trainer = paddle.trainer.SGD(cost=cost, - parameters=parameters, - update_equation=optimizer, - is_local=False, - pserver_spec=etcd_endpoints, - use_etcd=True) - - # event_handler to print training and testing info - def event_handler(event): - if isinstance(event, paddle.event.EndIteration): - # FIXME: for cloud data reader, pass number is managed by master - # should print the server side pass number - if event.batch_id % 100 == 0: - print "Pass %d, Batch %d, Cost %f" % ( - event.pass_id, event.batch_id, event.cost) - - if isinstance(event, paddle.event.EndPass): - if (event.pass_id + 1) % 10 == 0: - result = trainer.test( - reader=paddle.batch( - uci_housing.test(), batch_size=2), - feeding={'x': 0, - 'y': 1}) - print "Test %d, %.2f" % (event.pass_id, result.cost) - - # training - # NOTE: use uci_housing.train() as reader for non-paddlecloud training - trainer.train( - reader=paddle.batch( - paddle.reader.shuffle( - cloud_reader( - ["/pfs/dlnel/public/dataset/uci_housing/uci_housing*"], - etcd_endpoints), - buf_size=500), - batch_size=2), - feeding={'x': 0, - 'y': 1}, - event_handler=event_handler, - num_passes=30) - - -if __name__ == '__main__': - main() diff --git a/go/pserver/client/c/test/testdata/optimizer.pb b/go/pserver/client/c/test/testdata/optimizer.pb deleted file mode 100644 index 27dd3bc5f19e2964b4b674cff8860233cbdb445a..0000000000000000000000000000000000000000 Binary files a/go/pserver/client/c/test/testdata/optimizer.pb and /dev/null differ diff --git a/go/pserver/client/client.go b/go/pserver/client/client.go deleted file mode 100644 index 2a8f66a07c79906288c4179db5cab703cc2b8b61..0000000000000000000000000000000000000000 --- a/go/pserver/client/client.go +++ /dev/null @@ -1,237 +0,0 @@ -// Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at - -// http://www.apache.org/licenses/LICENSE-2.0 - -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -package client - -import ( - "errors" - "hash/fnv" - "sort" - "time" - - "github.com/PaddlePaddle/Paddle/go/connection" - "github.com/PaddlePaddle/Paddle/go/pserver" - log "github.com/inconshreveable/log15" -) - -// TODO(helin): add RPC call retry logic - -// Selector selects if the client should initialize parameters and -// reports the initialization process done. -type Selector interface { - // Select selects if the client should initialize parameter servers. - Select() (bool, error) - // Done indicates the initialization process is done. - Done() error -} - -// Server is the identification of a parameter Server. -type Server struct { - Index int - Addr string -} - -// Lister lists currently available parameter servers. -type Lister interface { - List() []Server -} - -// Client is the client to parameter servers. -type Client struct { - sel Selector - pservers []*connection.Conn -} - -// NewClient creates a new client. -func NewClient(l Lister, pserverNum int, sel Selector) *Client { - c := &Client{sel: sel} - c.pservers = make([]*connection.Conn, pserverNum) - for i := 0; i < pserverNum; i++ { - c.pservers[i] = connection.New() - } - go c.monitorPservers(l, pserverNum) - return c -} - -// monitorPservers monitors pserver addresses, and updates connection -// when the address changes. -func (c *Client) monitorPservers(l Lister, pserverNum int) { - lastServers := make([]Server, pserverNum) - ticker := time.NewTicker(10 * time.Second) - monitor := func() { - curServers := make([]Server, pserverNum) - list := l.List() - for _, l := range list { - curServers[l.Index] = l - } - - for i := range lastServers { - if lastServers[i].Addr == curServers[i].Addr { - continue - } - - if curServers[i].Addr == "" { - err := c.pservers[i].Close() - if err != nil { - log.Error("error closing connection to pserver", log.Ctx{"error": err}) - } - - continue - } - - err := c.pservers[i].Connect(curServers[i].Addr) - if err != nil { - log.Error("error connecting to pserver", log.Ctx{"error": err}) - - // connect to addr failed, set - // to last known addr in order - // to retry next time. - curServers[i].Addr = lastServers[i].Addr - } - - } - - lastServers = curServers - } - - monitor() - for range ticker.C { - monitor() - } -} - -// BeginInitParams begins to initialize parameters on parameter -// servers. -// -// BeginInitParams will be called from multiple trainers, only one -// trainer will be selected to initialize the parameters on parameter -// servers. Other trainers will be blocked until the initialization is -// done, and they need to get the initialized parameters from -// parameter servers using GetParams. -func (c *Client) BeginInitParams() (bool, error) { - return c.sel.Select() -} - -// InitParam initializes the parameter on parameter servers. -func (c *Client) InitParam(paramWithConfigs pserver.ParameterWithConfig) error { - return c.pservers[c.partition(paramWithConfigs.Param.Name)].Call("Service.InitParam", paramWithConfigs, nil) -} - -// FinishInitParams tells parameter servers client has sent all -// parameters to parameter servers as initialization. -func (c *Client) FinishInitParams() error { - for _, p := range c.pservers { - err := p.Call("Service.FinishInitParams", 0, nil) - if err != nil { - return err - } - } - return c.sel.Done() -} - -// SendGrads sends gradients to parameter servers for updating -// parameters. -func (c *Client) SendGrads(grads []pserver.Gradient) error { - if len(grads) == 0 { - return errors.New("no gradient received") - } - errCh := make(chan error, len(grads)) - for _, g := range grads { - go func(g pserver.Gradient) { - err := c.pservers[c.partition(g.Name)].Call("Service.SendGrad", g, nil) - errCh <- err - }(g) - } - - recv := 0 - for err := range errCh { - if err != nil { - return err - } - - recv++ - if recv == len(grads) { - break - } - } - return nil -} - -type result struct { - idx int - param pserver.Parameter - err error -} - -type results []result - -func (r results) Len() int { - return len(r) -} - -func (r results) Less(i int, j int) bool { - return r[i].idx < r[j].idx -} - -func (r results) Swap(i int, j int) { - r[i], r[j] = r[j], r[i] -} - -// GetParams gets parameters from parameter servers. -func (c *Client) GetParams(names []string) ([]pserver.Parameter, error) { - rCh := make(chan result, len(names)) - - for idx, name := range names { - go func(name string, idx int) { - var parameter pserver.Parameter - err := c.pservers[c.partition(name)].Call("Service.GetParam", name, ¶meter) - rCh <- result{idx: idx, param: parameter, err: err} - }(name, idx) - } - - var rs results - recv := 0 - for r := range rCh { - if r.err != nil { - return nil, r.err - } - rs = append(rs, r) - - recv++ - if recv == len(names) { - break - } - } - sort.Sort(rs) - - ps := make([]pserver.Parameter, len(rs)) - for i := range rs { - ps[i] = rs[i].param - } - - return ps, nil -} - -func strHash(s string) uint32 { - h := fnv.New32a() - _, _ = h.Write([]byte(s)) - return h.Sum32() -} - -// TODO(helin): now partition only select which parameter server to -// send the entire parameter. We need to partition a parameter into -// small blocks and send to different parameter servers. -func (c *Client) partition(key string) int { - return int(strHash(key) % uint32(len(c.pservers))) -} diff --git a/go/pserver/client/client_test.go b/go/pserver/client/client_test.go deleted file mode 100644 index 3a067ff5188fad8f6a13de88a2802e3e5866e59c..0000000000000000000000000000000000000000 --- a/go/pserver/client/client_test.go +++ /dev/null @@ -1,268 +0,0 @@ -// Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at - -// http://www.apache.org/licenses/LICENSE-2.0 - -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -package client_test - -import ( - "context" - "io/ioutil" - "math/rand" - "net" - "net/http" - "net/rpc" - "strconv" - "strings" - "sync" - "testing" - "time" - - "github.com/PaddlePaddle/Paddle/go/pserver" - "github.com/PaddlePaddle/Paddle/go/pserver/client" - "github.com/coreos/etcd/clientv3" - log "github.com/inconshreveable/log15" -) - -const ( - numPserver = 10 - etcdEndpoints = "127.0.0.1:2379" - timeout = 2 * time.Second -) - -var pserverClientPorts [numPserver]int - -// this function init pserver client and return their ports in an array. -func initClient() [numPserver]int { - var ports [numPserver]int - for i := 0; i < numPserver; i++ { - l, err := net.Listen("tcp", ":0") - if err != nil { - panic(err) - } - - ss := strings.Split(l.Addr().String(), ":") - p, err := strconv.Atoi(ss[len(ss)-1]) - if err != nil { - panic(err) - } - ports[i] = p - - go func(l net.Listener) { - var cp pserver.Checkpoint - s, err := pserver.NewService(0, time.Hour, "", nil, cp) - if err != nil { - panic(err) - } - server := rpc.NewServer() - err = server.Register(s) - if err != nil { - panic(err) - } - - mux := http.NewServeMux() - mux.Handle(rpc.DefaultRPCPath, server) - err = http.Serve(l, mux) - if err != nil { - panic(err) - } - }(l) - } - return ports -} - -func initNativeClient() { - pserverClientPorts = initClient() -} - -func initEtcdClient() { - client, err := clientv3.New(clientv3.Config{ - Endpoints: []string{etcdEndpoints}, - DialTimeout: time.Second * time.Duration(1), - }) - if err != nil { - log.Error("error init etcd client", log.Ctx{"error": err}) - } - ctx, cancel := context.WithTimeout(context.Background(), timeout) - _, err = client.Delete(ctx, pserver.PsDesired) - if err != nil { - panic(err) - } - - _, err = client.Delete(ctx, pserver.PsPath) - if err != nil { - panic(err) - } - - _, err = client.Put(ctx, pserver.PsDesired, strconv.Itoa(numPserver)) - if err != nil { - panic(err) - } - - ports := initClient() - for i := 0; i < numPserver; i++ { - _, err = client.Put(ctx, pserver.PsPath+strconv.Itoa(i), ":"+strconv.Itoa(ports[i])) - if err != nil { - panic(err) - } - } - cancel() - err = client.Close() - if err != nil { - panic(err) - } -} - -type selector bool - -func (s selector) Select() (bool, error) { - return bool(s), nil -} - -func (s selector) Done() error { - return nil -} - -type lister []client.Server - -func (l lister) List() []client.Server { - return l -} - -func testClient(t *testing.T, c *client.Client) { - selected, err := c.BeginInitParams() - if err != nil { - t.Fatal(err) - } - - if !selected { - t.Fatal("should be selected.") - } - - const numParameter = 1000 - config, err := ioutil.ReadFile("./c/test/testdata/optimizer.pb") - if err != nil { - t.Fatalf("read optimizer proto failed") - } - - var wg sync.WaitGroup - for i := 0; i < numParameter; i++ { - wg.Add(1) - go func(i int) { - var p pserver.Parameter - p.Name = "p_" + strconv.Itoa(i) - p.ElementType = pserver.Float32 - p.Content = make([]byte, (i+1)*100) - err := c.InitParam(pserver.ParameterWithConfig{Param: p, Config: config}) - if err != nil { - t.Fatal(err) - } - wg.Done() - }(i) - } - wg.Wait() - - err = c.FinishInitParams() - if err != nil { - t.Fatal(err) - } - - var grads []pserver.Gradient - for i := 0; i < numParameter; i++ { - var g pserver.Gradient - g.Name = "p_" + strconv.Itoa(i) - g.ElementType = pserver.Float32 - g.Content = make([]byte, (i+1)*100) - grads = append(grads, g) - } - - const paramPerGroup = 10 - const numGroups = numParameter / paramPerGroup - - // shuffle send grads order - for i := range grads { - j := rand.Intn(i + 1) - grads[i], grads[j] = grads[j], grads[i] - } - - for i := 0; i < numGroups; i++ { - var gs []pserver.Gradient - if i == numGroups-1 { - gs = grads[i*paramPerGroup:] - } else { - gs = grads[i*paramPerGroup : (i+1)*paramPerGroup] - } - - wg.Add(1) - go func(gs []pserver.Gradient) { - err := c.SendGrads(gs) - if err != nil { - t.Fatal(err) - } - wg.Done() - }(gs) - } - - names := make([]string, numParameter) - for i := 0; i < numParameter; i++ { - names[i] = "p_" + strconv.Itoa(i) - } - - for i := 0; i < numGroups; i++ { - var ns []string - if i == numGroups-1 { - ns = names[i*paramPerGroup:] - } else { - ns = names[i*paramPerGroup : (i+1)*paramPerGroup] - } - - wg.Add(1) - go func(ns []string) { - params, err := c.GetParams(ns) - if err != nil { - t.Fatal(err) - } - - if len(ns) != len(params) { - t.Fatalf("parameter size not match, need: %d, have: %d", len(names), len(params)) - } - - for i := range params { - if ns[i] != params[i].Name { - t.Fatalf("order of returned parameter does not required: parameter name: %s, required name: %s", ns[i], params[i].Name) - } - } - wg.Done() - }(ns) - } - - wg.Wait() -} - -func TestNativeClient(t *testing.T) { - initNativeClient() - servers := make([]client.Server, numPserver) - for i := 0; i < numPserver; i++ { - servers[i] = client.Server{Index: i, Addr: ":" + strconv.Itoa(pserverClientPorts[i])} - } - c1 := client.NewClient(lister(servers), len(servers), selector(true)) - testClient(t, c1) -} - -// EtcdClient is a disabled test, since we have not embedded etcd into -// our test. -func EtcdClient(t *testing.T) { - initEtcdClient() - etcdClient := client.NewEtcd(etcdEndpoints) - c2 := client.NewClient(etcdClient, etcdClient.Desired(), selector(true)) - testClient(t, c2) -} diff --git a/go/pserver/client/etcd_client.go b/go/pserver/client/etcd_client.go deleted file mode 100644 index 3fb835a6e165b7493df1a1fbb7440ee27109bbad..0000000000000000000000000000000000000000 --- a/go/pserver/client/etcd_client.go +++ /dev/null @@ -1,266 +0,0 @@ -// Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at - -// http://www.apache.org/licenses/LICENSE-2.0 - -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -package client - -import ( - "context" - "errors" - "fmt" - "strconv" - "strings" - "time" - - "github.com/PaddlePaddle/Paddle/go/pserver" - "github.com/coreos/etcd/clientv3" - "github.com/coreos/etcd/clientv3/concurrency" - log "github.com/inconshreveable/log15" -) - -const ( - defaultEtcdTimeout time.Duration = 5 * time.Second - - initLockPath = "/init_ps/lock" - initDonePath = "/init_ps/done" - initDoneVal = "1" -) - -// Etcd is used by pserver client that is a part of trainer process. -// TODO: -// 1. add watcher to watch the change state of pservers. -type Etcd struct { - client *clientv3.Client - timeout time.Duration - endpoints []string - lock *concurrency.Mutex -} - -// Desired read ps desired number from etcd. -func (e *Etcd) Desired() int { - var psDesired int - for { - ctx, cancel := context.WithTimeout(context.Background(), e.timeout) - resp, err := e.client.Get(ctx, pserver.PsDesired) - cancel() - if err != nil { - log.Error( - "Get ps dresire number failed! reconnecting...", - log.Ctx{"error": err}, - ) - time.Sleep(e.timeout) - continue - } - - kvs := resp.Kvs - if len(kvs) == 0 { - log.Info("Waiting for ps desired registered ...") - time.Sleep(e.timeout) - continue - } - - psDesired, err = strconv.Atoi(string(resp.Kvs[0].Value)) - if err != nil { - log.Error("atoi failed", log.Ctx{"error": err}) - time.Sleep(e.timeout) - continue - } - - log.Debug("Got psDesired", log.Ctx{"psDesired": psDesired}) - break - } - return psDesired -} - -// List return the pserver list read from etcd. -func (e *Etcd) List() []Server { - psDesired := e.Desired() - - servers := make([]Server, psDesired) - for { - for i := 0; i < psDesired; i++ { - ctx, cancel := context.WithTimeout(context.Background(), e.timeout) - psKey := pserver.PsPath + strconv.Itoa(i) - log.Debug("looking for pserver", log.Ctx{"ps key": psKey}) - resp, err := e.client.Get(ctx, psKey) - cancel() - if err != nil { - log.Info( - "Get psKey error", - log.Ctx{"ps key": psKey, "error": err}, - ) - time.Sleep(e.timeout) - continue - } - kvs := resp.Kvs - if len(kvs) == 0 { - log.Info("Waiting for ps addr registered ...") - time.Sleep(e.timeout) - continue - } - - psAddr := string(resp.Kvs[0].Value) - // TODO(Longfei) check the ps address - if psAddr == "" { - log.Info( - "Value under psKey is empty", - log.Ctx{"psKey": psKey}, - ) - time.Sleep(e.timeout) - continue - } - log.Debug( - "got psAddr given psKey", - log.Ctx{"psAddr": psAddr, "psKey": psKey}, - ) - servers[i].Index = i - servers[i].Addr = psAddr - } - break - } - return servers -} - -// NewEtcd create a etcd client to return the state of pserver on etcd. -func NewEtcd(endpoints string) *Etcd { - ep := strings.Split(endpoints, ",") - var cli *clientv3.Client - var err error - for { - cli, err = clientv3.New(clientv3.Config{ - Endpoints: ep, - DialTimeout: defaultEtcdTimeout, - }) - if err != nil { - log.Error("Init etcd connection failed", log.Ctx{"error": err}) - time.Sleep(defaultEtcdTimeout) - continue - } - break - } - log.Info("Connected to etcd endpoint", log.Ctx{"endpoint": endpoints}) - client := &Etcd{ - client: cli, - timeout: defaultEtcdTimeout, - endpoints: ep, - } - return client -} - -// Select indicates if the current trainer is selected to initialize -// the pserver parameters. -func (e *Etcd) Select() (bool, error) { - sess, err := concurrency.NewSession(e.client, concurrency.WithTTL(5)) - if err != nil { - return false, err - } - - lock := concurrency.NewMutex(sess, initLockPath) - log.Info("Trying to acquire lock", log.Ctx{"lock path": initLockPath}) - // Do not use timeout context here, since we don't know how - // long does it take for other trainers to initialize the - // parameters. - err = lock.Lock(context.Background()) - if err != nil { - return false, err - } - log.Info("Successfully acquired lock", log.Ctx{"lock path": initLockPath}) - - get := clientv3.OpGet(initDonePath) - ctx, cancel := context.WithTimeout(context.Background(), e.timeout) - tresp, err := e.client.Txn(ctx).If(lock.IsOwner()).Then(get).Commit() - cancel() - if err != nil { - return false, err - } - - if !tresp.Succeeded { - return false, errors.New("no longer the owner of the lock") - } - - resp := tresp.Responses[0].GetResponseRange() - - if len(resp.Kvs) == 0 { - // Key value not set, select current trainer. - e.lock = lock - log.Info("Trainer selected.") - return true, nil - } - - if string(resp.Kvs[0].Value) == initDoneVal { - log.Info("Initialization is already done.") - ctx, cancel = context.WithTimeout(context.Background(), e.timeout) - err = lock.Unlock(ctx) - cancel() - if err != nil { - log.Error("error unlocking", log.Ctx{"error": err}) - } - return false, nil - } - - return false, fmt.Errorf("key %s have unexpected value: %v", initDonePath, resp.Kvs[0].Value) -} - -// Done indicates the parameter initialization process is done. -func (e *Etcd) Done() error { - if e.lock == nil { - return errors.New("lock is nil, Done called unexpectedly") - } - - put := clientv3.OpPut(initDonePath, initDoneVal) - ctx, cancel := context.WithTimeout(context.Background(), e.timeout) - tresp, err := e.client.Txn(ctx).If(e.lock.IsOwner()).Then(put).Commit() - cancel() - if err != nil { - return err - } - - if !tresp.Succeeded { - return errors.New("no longer the owner of the lock") - } - - ctx, cancel = context.WithTimeout(context.Background(), e.timeout) - err = e.lock.Unlock(ctx) - cancel() - if err != nil { - log.Error("error unlocking", log.Ctx{"error": err}) - } else { - e.lock = nil - } - - return nil -} - -// Close closes the etcd client. -func (e *Etcd) Close() error { - var err error - if e.lock != nil { - ctx, cancel := context.WithTimeout(context.Background(), e.timeout) - err = e.lock.Unlock(ctx) - cancel() - if err == nil { - e.lock = nil - } - } - - cErr := e.client.Close() - if cErr != nil { - if err != nil { - log.Error("error closing etcd client", log.Ctx{"error": cErr}) - return err - } - return cErr - } - - return err -} diff --git a/go/pserver/client/etcd_client_test.go b/go/pserver/client/etcd_client_test.go deleted file mode 100644 index 08742433e7a266fbd39e34f4b92ac4cc4caeb0fb..0000000000000000000000000000000000000000 --- a/go/pserver/client/etcd_client_test.go +++ /dev/null @@ -1,106 +0,0 @@ -package client_test - -import ( - "io/ioutil" - "net/url" - "os" - "strings" - "sync" - "testing" - - "github.com/PaddlePaddle/Paddle/go/pserver/client" - "github.com/coreos/etcd/embed" -) - -func TestSelector(t *testing.T) { - etcdDir, err := ioutil.TempDir("", "") - if err != nil { - t.Fatal(err) - } - cfg := embed.NewConfig() - lpurl, _ := url.Parse("http://localhost:0") - lcurl, _ := url.Parse("http://localhost:0") - cfg.LPUrls = []url.URL{*lpurl} - cfg.LCUrls = []url.URL{*lcurl} - cfg.Dir = etcdDir - e, err := embed.StartEtcd(cfg) - if err != nil { - t.Fatal(err) - } - - defer func() { - e.Close() - if err := os.RemoveAll(etcdDir); err != nil { - t.Fatal(err) - } - }() - - <-e.Server.ReadyNotify() - - port := strings.Split(e.Clients[0].Addr().String(), ":")[1] - endpoint := "127.0.0.1:" + port - - var mu sync.Mutex - selectedCount := 0 - var wg sync.WaitGroup - selectAndDone := func(c *client.Etcd) { - defer wg.Done() - - selected, err := c.Select() - if err != nil { - panic(err) - } - - if selected { - mu.Lock() - selectedCount++ - mu.Unlock() - err = c.Done() - if err != nil { - t.Fatal(err) - } - } - } - - c0 := client.NewEtcd(endpoint) - c1 := client.NewEtcd(endpoint) - c2 := client.NewEtcd(endpoint) - c3 := client.NewEtcd(endpoint) - wg.Add(3) - go selectAndDone(c0) - go selectAndDone(c1) - go selectAndDone(c2) - wg.Wait() - - // simulate trainer crashed and restarted after the - // initialization process. - wg.Add(1) - go selectAndDone(c3) - wg.Wait() - - mu.Lock() - if selectedCount != 1 { - t.Fatal("selected count wrong:", selectedCount) - } - mu.Unlock() - - err = c0.Close() - if err != nil { - t.Fatal(err) - } - - err = c1.Close() - if err != nil { - t.Fatal(err) - } - - err = c2.Close() - if err != nil { - t.Fatal(err) - } - - err = c3.Close() - if err != nil { - t.Fatal(err) - } -} diff --git a/go/pserver/etcd_client.go b/go/pserver/etcd_client.go deleted file mode 100644 index 719013b1bb4e80ff1f3040394803706d97514516..0000000000000000000000000000000000000000 --- a/go/pserver/etcd_client.go +++ /dev/null @@ -1,253 +0,0 @@ -// Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at - -// http://www.apache.org/licenses/LICENSE-2.0 - -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -package pserver - -import ( - "context" - "errors" - "strconv" - "strings" - "time" - - "github.com/PaddlePaddle/Paddle/go/utils/networkhelper" - "github.com/coreos/etcd/clientv3" - "github.com/coreos/etcd/clientv3/concurrency" - log "github.com/inconshreveable/log15" -) - -const ( - // PsDesired is etcd path for store desired pserver count - PsDesired = "/ps_desired" - // PsPath is the base dir for pserver to store their addr - PsPath = "/ps/" - // PsCheckpoint is the etcd path for store checkpoints information - PsCheckpoint = "/checkpoints/" - - retryTimeout = 5 * time.Second -) - -// EtcdClient is the etcd client that the pserver uses for fault -// tolerance, service registry and coordination. -type EtcdClient struct { - numPservers int - endpoints string - client *clientv3.Client - sess *concurrency.Session - dialTimeout time.Duration - ttlSec int - // FIXME: ensure GetExternalIP gets the correct ip for trainers to connect. - externalIP string - // desired number of pservers in the job. - // assume desired will not change during one training job. - desired int -} - -// NewEtcdClient creates an EtcdClient -func NewEtcdClient(endpoints string, numPservers int, dialtimeout time.Duration, ttlSec int) *EtcdClient { - return &EtcdClient{ - dialTimeout: dialtimeout, - ttlSec: ttlSec, - numPservers: numPservers, - endpoints: endpoints, - } -} - -// Register registers the pserver on etcd -// -// Register returns the index of the current pserver. -func (e *EtcdClient) Register(port int) (int, error) { - var err error - e.externalIP, err = networkhelper.GetExternalIP() - if err != nil { - return 0, err - } - - // initialize connection to etcd. - ep := strings.Split(e.endpoints, ",") - for { - cli, err := clientv3.New(clientv3.Config{ - Endpoints: ep, - DialTimeout: e.dialTimeout, - }) - if err != nil { - log.Error("connect to etcd error", log.Ctx{"error": err}) - time.Sleep(retryTimeout) - continue - } - e.client = cli - sess, err := concurrency.NewSession(cli, concurrency.WithTTL(e.ttlSec)) - if err != nil { - log.Error("create etcd session error", log.Ctx{"error": err}) - time.Sleep(retryTimeout) - continue - } - e.sess = sess - log.Debug("connected to etcd", log.Ctx{"endpoint": e.endpoints}) - break - } - // init /ps_desired using transaction, for multiple pservers may want to write - // it at the same time. - for { - ctx, cancel := context.WithTimeout(context.Background(), time.Second) - _, err := e.initDesiredPservers(ctx, e.numPservers) - cancel() - if err != nil { - log.Warn("pserver init error", log.Ctx{"error": err, "num pservers": e.numPservers}) - time.Sleep(retryTimeout) - continue - } - break - } - // TODO: when implementing extending or reducing pservers, /ps_desired is - // changed, then we need to watch /ps_desired node for events. For now, just - // write once when init and read from it. - // wait and set s.desired init value - for { - ctx, cancel := context.WithTimeout(context.Background(), time.Second) - resp, err := e.client.Get(ctx, PsDesired) - cancel() - if err != nil { - log.Error("get etcd key error", log.Ctx{"key": PsDesired, "error": err}) - time.Sleep(retryTimeout) - continue - } - if len(resp.Kvs) != 0 { - e.desired, err = strconv.Atoi(string(resp.Kvs[0].Value)) - if err != nil { - log.Error( - "psDesired atoi error", - log.Ctx{"error": err, "value": string(resp.Kvs[0].Value)}, - ) - time.Sleep(retryTimeout) - // NOTE: wait util ps_desired value change - continue - } - break - } - } - - var pserverIdx int - // try register pserver node on etcd - for { - ctx, cancel := context.WithTimeout(context.Background(), time.Second) - var err error - pserverIdx, err = e.registerPserverEtcd(ctx, port) - cancel() - if err != nil { - log.Warn("register pserver on etcd error", log.Ctx{"error": err}) - time.Sleep(retryTimeout) - continue - } - break - } - - return pserverIdx, nil -} - -func (e *EtcdClient) initDesiredPservers(ctx context.Context, numPservers int) (*clientv3.TxnResponse, error) { - return concurrency.NewSTM(e.client, func(c concurrency.STM) error { - dsStr := c.Get(PsDesired) - if dsStr == "" { - c.Put(PsDesired, strconv.Itoa(numPservers), clientv3.WithLease(e.sess.Lease())) - } - return nil - }, concurrency.WithAbortContext(ctx), concurrency.WithIsolation(concurrency.RepeatableReads)) -} - -// registerPserverEtcd registers pserver node on etcd using transaction. -func (e *EtcdClient) registerPserverEtcd(ctx context.Context, port int) (int, error) { - var idx int - _, err := concurrency.NewSTM(e.client, func(c concurrency.STM) error { - registered := false - for i := 0; i < e.desired; i++ { - psKey := PsPath + strconv.Itoa(i) - ps := c.Get(psKey) - log.Debug( - "register pserver got value", - log.Ctx{"value": ps, "key": psKey}, - ) - - if ps == "" { - // find the first id and write info - pserverAddr := e.externalIP + ":" + strconv.Itoa(port) - c.Put(psKey, pserverAddr, clientv3.WithLease(e.sess.Lease())) - log.Debug("register finished", log.Ctx{"key": psKey, "value": pserverAddr}) - idx = i - registered = true - break - } - } - if registered { - return nil - } - return errors.New("not registered, may due to already have enough pservers") - }, concurrency.WithAbortContext(ctx), concurrency.WithIsolation(concurrency.RepeatableReads)) - - if err != nil { - return 0, err - } - - return idx, nil -} - -// GetKey gets the value by the specified key -func (e *EtcdClient) GetKey(key string, timeout time.Duration) ([]byte, error) { - ctx, cancel := context.WithTimeout(context.Background(), timeout) - resp, err := e.client.Get(ctx, key) - cancel() - if err != nil { - return []byte{}, err - } - - kvs := resp.Kvs - if len(kvs) == 0 { - return []byte{}, nil - } - v := kvs[0].Value - return v, nil -} - -// PutKey put into etcd with value by key specified -func (e *EtcdClient) PutKey(key string, value []byte, timeout time.Duration, withLease bool) error { - ctx, cancel := context.WithTimeout(context.Background(), timeout) - var err error - if withLease { - _, err = e.client.Put(ctx, key, string(value), clientv3.WithLease(e.sess.Lease())) - } else { - _, err = e.client.Put(ctx, key, string(value)) - } - cancel() - return err -} - -// Shutdown shuts down the etcd client gracefully. -func (e *EtcdClient) Shutdown() error { - var err error - if e.sess != nil { - err = e.sess.Close() - } - - if e.client != nil { - newErr := e.client.Close() - if newErr != nil { - if err != nil { - log.Error("shutdown error", log.Ctx{"error": newErr}) - } else { - err = newErr - } - } - } - return err -} diff --git a/go/pserver/optimizer.go b/go/pserver/optimizer.go deleted file mode 100644 index eba0c47e195a80fc298f0fdd78c8d6345e963be8..0000000000000000000000000000000000000000 --- a/go/pserver/optimizer.go +++ /dev/null @@ -1,132 +0,0 @@ -// Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at - -// http://www.apache.org/licenses/LICENSE-2.0 - -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -package pserver - -// #cgo CFLAGS: -I ../../ -// #cgo LDFLAGS: ${SRCDIR}/client/c/libpaddle_go_optimizer.a -lstdc++ -lm -// #include "paddle/legacy/optimizer/optimizer.h" -// #include -// #include -import "C" - -import ( - "fmt" - "unsafe" - - log "github.com/inconshreveable/log15" -) - -type optimizer struct { - opt *C.struct_paddle_optimizer - elementType ElementType - contentLen int - config []byte -} - -func cArrayToSlice(p unsafe.Pointer, len int) []byte { - if p == nil { - return nil - } - - // create a Go clice backed by a C array, reference: - // https://github.com/golang/go/wiki/cgo#turning-c-arrays-into-go-slices - // - // Go garbage collector will not interact with this data, need - // to be freed properly. - return (*[1 << 30]byte)(p)[:len:len] -} - -func newOptimizer(paramWithConfigs ParameterWithConfig, State []byte) *optimizer { - o := &optimizer{} - o.elementType = paramWithConfigs.Param.ElementType - o.contentLen = len(paramWithConfigs.Param.Content) - p := paramWithConfigs.Param - c := paramWithConfigs.Config - s := State - paramBufferSize := C.size_t(len(p.Content)) - log.Info("New Optimizer Created with config", log.Ctx{ - "ElementType": p.ElementType, - "ParamSize": paramBufferSize, - "ConfigSize": len(c), - "StateSize": len(s), - }) - var cbuffer unsafe.Pointer - cbuffer = C.malloc(paramBufferSize) - - C.memcpy(cbuffer, unsafe.Pointer(&p.Content[0]), paramBufferSize) - var cstate unsafe.Pointer - if len(s) != 0 { - cstate = unsafe.Pointer(&s[0]) - } - - var cptr (*C.uchar) - if len(c) > 0 { - cptr = (*C.uchar)(&c[0]) - } else { - log.Error("empty config", "param name", paramWithConfigs.Param.Name) - } - o.config = c - o.opt = C.paddle_create_optimizer( - cptr, - C.int(len(c)), - C.paddle_element_type(p.ElementType), - cbuffer, - C.int(paramBufferSize), - (*C.char)(cstate), - C.int(len(s)), - ) - return o -} - -func (o *optimizer) GetWeights() []byte { - var buffer unsafe.Pointer - // we do not own the buffer, no need to free later. - bufferLen := C.paddle_optimizer_get_weights(o.opt, &buffer) - return cArrayToSlice(buffer, int(bufferLen)*C.sizeof_float) -} - -func (o *optimizer) GetStates() []byte { - var cbuffer *C.char - // we owns the state buffer, need to free later. - cbufferLen := C.paddle_optimizer_get_state(o.opt, &cbuffer) - buf := cArrayToSlice(unsafe.Pointer(cbuffer), int(cbufferLen)) - cpy := make([]byte, len(buf)) - copy(cpy, buf) - C.free(unsafe.Pointer(cbuffer)) - return cpy -} - -func (o *optimizer) UpdateParameter(g Gradient) error { - if o.elementType != g.ElementType { - return fmt.Errorf("Name: %s, parameter and gradient element type not match, parameter: %v, gradient: %v", g.Name, o.elementType, g.ElementType) - } - - if o.contentLen != len(g.Content) { - return fmt.Errorf("Name: %s, parameter and gradient does not have same content len, parameter: %d, gradient: %d", g.Name, o.contentLen, len(g.Content)) - } - - r := C.paddle_update_parameter(o.opt, C.paddle_element_type(g.ElementType), unsafe.Pointer(&g.Content[0]), C.int(len(g.Content))) - if r != 0 { - return fmt.Errorf("optimizer update returned error code: %d", r) - } - return nil -} - -func (o *optimizer) Cleanup() { - if unsafe.Pointer(o.opt) != nil { - C.paddle_release_optimizer(o.opt) - o.opt = (*C.struct_paddle_optimizer)(nil) - } -} diff --git a/go/pserver/optimizer_test.go b/go/pserver/optimizer_test.go deleted file mode 100644 index 3b923879d5ec77675d707ccd40bf44a5148105fb..0000000000000000000000000000000000000000 --- a/go/pserver/optimizer_test.go +++ /dev/null @@ -1,78 +0,0 @@ -// Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at - -// http://www.apache.org/licenses/LICENSE-2.0 - -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -package pserver - -import ( - "encoding/binary" - "io/ioutil" - "math" - "testing" - - "github.com/stretchr/testify/assert" -) - -func TestOptimizerCreateRelease(t *testing.T) { - p := Parameter{ - Name: "a", - ElementType: Int32, - } - p.Content = []byte{1, 3} - config, err := ioutil.ReadFile("./client/c/test/testdata/optimizer.pb") - if err != nil { - t.Fatalf("read optimizer proto failed") - } - param := ParameterWithConfig{ - Param: p, - Config: config, - } - o := newOptimizer(param, nil) - o.Cleanup() -} - -func float32Bytes(float float32) []byte { - bits := math.Float32bits(float) - bytes := make([]byte, 4) - binary.LittleEndian.PutUint32(bytes, bits) - return bytes -} - -func TestOptimizerState(t *testing.T) { - p := Parameter{ - Name: "a", - ElementType: Int32, - } - weights := float32Bytes(100) - p.Content = weights - config, err := ioutil.ReadFile("./client/c/test/testdata/optimizer.pb") - if err != nil { - t.Fatalf("read optimizer proto failed") - } - param := ParameterWithConfig{ - Param: p, - Config: config, - } - o := newOptimizer(param, nil) - s := o.GetStates() - - // clear param content and check if the state is restored. - param.Param.Content = float32Bytes(300) - o1 := newOptimizer(param, s) - s1 := o1.GetStates() - assert.Equal(t, s, s1) - assert.Equal(t, weights, o.GetWeights()) - assert.Equal(t, weights, o1.GetWeights()) - o.Cleanup() - o1.Cleanup() -} diff --git a/go/pserver/service.go b/go/pserver/service.go deleted file mode 100644 index d6ead774af522ad78e9fe717f0d27bdf24d86246..0000000000000000000000000000000000000000 --- a/go/pserver/service.go +++ /dev/null @@ -1,450 +0,0 @@ -// Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at - -// http://www.apache.org/licenses/LICENSE-2.0 - -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -package pserver - -import ( - "bufio" - "bytes" - "encoding/binary" - "encoding/gob" - "encoding/json" - "errors" - "fmt" - "hash/crc32" - "io/ioutil" - "os" - "path" - "strconv" - "strings" - "sync" - "time" - - "github.com/golang/protobuf/proto" - uuid "github.com/satori/go.uuid" - - pb "github.com/PaddlePaddle/Paddle/go/proto" - - log "github.com/inconshreveable/log15" -) - -// ElementType is the type of elements of a Parameter. -type ElementType int - -// ErrCheckpointNotFound indicates that the pserver checkpoint could -// not be found. -var ErrCheckpointNotFound = errors.New("checkpoint not found in etcd") - -// RPC error message. -const ( - AlreadyInitialized = "pserver already initialized" - Uninitialized = "pserver not fully initialized" - WrongChecksum = "checkpoint file checksum validation failed" -) - -// Supported element types. -const ( - Int32 ElementType = iota - UInt32 - Int64 - UInt64 - Float32 - Float64 -) - -// Parameter is a piece of data to sync with the parameter server. -type Parameter struct { - Name string - ElementType ElementType - Content []byte -} - -func float32ToString(b []byte) string { - f := make([]float32, len(b)/4) - buf := bytes.NewReader(b) - err := binary.Read(buf, binary.LittleEndian, &f) - if err != nil { - return "" - } - return fmt.Sprintf("%v", f) -} - -func float32ByteToString(c []byte) string { - var a []byte - var b []byte - if len(c) <= 80 { - a = c - } else { - a = c[0:40] - b = c[len(c)-40:] - } - - var s string - s = float32ToString(a) - - if b == nil { - return s - } - - s = strings.Replace(s, "]", "", -1) + "..." + strings.Replace(float32ToString(b), "[", "", -1) - return s -} - -func (p Parameter) String() string { - if p.ElementType != Float32 { - return fmt.Sprintf("name:%v ElementType:%v", - p.Name, p.ElementType) - } - - return float32ByteToString(p.Content) -} - -// ParameterWithConfig contains the parameter and the configuration. -type ParameterWithConfig struct { - Param Parameter - Config []byte // parameter configuration in Proto Buffer format -} - -// checkpointMeta saves checkpoint metadata -type checkpointMeta struct { - UUID string `json:"uuid"` - Path string `json:"path"` - CRC32 uint32 `json:"crc32"` - Timestamp int64 `json:"timestamp"` -} - -// Checkpoint is the pserver shard persist in file. -type Checkpoint []parameterCheckpoint - -// Gradient is the gradient of the parameter. -type Gradient Parameter - -// Service is the RPC service for pserver. -type Service struct { - initialized chan struct{} - idx int - checkpointInterval time.Duration - checkpointPath string - client KVStore - - mu sync.Mutex - optMap map[string]*optimizer -} - -// parameterCheckpoint saves parameter checkpoint. -type parameterCheckpoint struct { - ParameterWithConfig - State []byte -} - -type KVStore interface { - GetKey(key string, timeout time.Duration) ([]byte, error) - PutKey(key string, value []byte, timeout time.Duration, withLease bool) error -} - -func loadMeta(e KVStore, idx int) (meta checkpointMeta, err error) { - v, err := e.GetKey(PsCheckpoint+strconv.Itoa(idx), 3*time.Second) - if err != nil { - return - } - - if len(v) == 0 { - err = ErrCheckpointNotFound - return - } - - if err = json.Unmarshal(v, &meta); err != nil { - return - } - - return -} - -// LoadCheckpoint loads checkpoint from file. -func LoadCheckpoint(e KVStore, idx int) (Checkpoint, error) { - log.Info("Loading checkpoint", "pserver index", idx) - defer traceTime(time.Now(), "load checkpoint") - - cpMeta, err := loadMeta(e, idx) - if err != nil { - return nil, err - } - - content, err := ioutil.ReadFile(cpMeta.Path) - if err != nil { - return nil, err - } - - crc32 := crc32.ChecksumIEEE(content) - if crc32 != cpMeta.CRC32 { - return nil, errors.New(WrongChecksum) - } - - dec := gob.NewDecoder(bytes.NewReader(content)) - var cp Checkpoint - if err = dec.Decode(&cp); err != nil { - return nil, err - } - - return cp, nil -} - -// NewService creates a new service, will bypass etcd registration if no -// endpoints specified. It will recovery from checkpoint file if a exists a specified checkpoint. -func NewService(idx int, interval time.Duration, path string, client KVStore, cp Checkpoint) (*Service, error) { - s := &Service{ - idx: idx, - checkpointInterval: interval, - checkpointPath: path, - client: client, - } - s.optMap = make(map[string]*optimizer) - s.initialized = make(chan struct{}) - - if cp != nil { - for _, item := range cp { - p := ParameterWithConfig{ - Param: item.Param, - Config: item.Config, - } - s.optMap[p.Param.Name] = newOptimizer(p, item.State) - } - close(s.initialized) - } - return s, nil -} - -// InitParam initializes a parameter. -func (s *Service) InitParam(paramWithConfigs ParameterWithConfig, _ *int) error { - select { - case <-s.initialized: - log.Warn("init param called but parameters already initialized.") - return errors.New(AlreadyInitialized) - default: - } - - c := &pb.OptimizerConfig{} - proto.Unmarshal(paramWithConfigs.Config, c) - log.Debug(fmt.Sprintf("OptimizerConfig:%v", c)) - - s.mu.Lock() - defer s.mu.Unlock() - - // TODO(helin): check if paramWithConfigs.Param.Content is - // properly memory aligned, if not, make copy to a memory - // aligned region. - s.optMap[paramWithConfigs.Param.Name] = newOptimizer(paramWithConfigs, nil) - log.Info( - "init parameter", - "name", paramWithConfigs.Param.Name, - "config len", len(paramWithConfigs.Config), - "param len", len(paramWithConfigs.Param.Content), - "type", paramWithConfigs.Param.ElementType, - ) - return nil -} - -// FinishInitParams tells the parameter server that the parameter -// initialization has finished. -func (s *Service) FinishInitParams(_ int, _ *int) error { - select { - case <-s.initialized: - log.Warn("finished init param called but parameters already initialized.") - return errors.New(AlreadyInitialized) - default: - } - - close(s.initialized) - go func() { - t := time.Tick(s.checkpointInterval) - for range t { - err := s.checkpoint() - if err != nil { - log.Error("checkpoint error", log.Ctx{"error": err}) - } - } - }() - - log.Info("init parameter finished.") - return nil -} - -// SendGrad sends gradient to parameter servers for parameter -// optimization. -func (s *Service) SendGrad(g Gradient, _ *int) error { - select { - case <-s.initialized: - default: - log.Warn("received gradient before initialization.", - "name", g.Name, "size", len(g.Content), "type", g.ElementType) - return errors.New(Uninitialized) - } - - s.mu.Lock() - defer s.mu.Unlock() - - o, ok := s.optMap[g.Name] - if !ok { - log.Warn("received gradient but can't find name.", - "name", g.Name, "size", len(g.Content), "type", g.ElementType) - return fmt.Errorf("parameter: %s does not exist", g.Name) - } - - log.Debug(Parameter(g).String()) - log.Info("received gradient from trainer, updating gradient.", - "name", g.Name, "size", len(g.Content), "type", g.ElementType) - return o.UpdateParameter(g) -} - -// GetParam gets parameters from the parameter server. -func (s *Service) GetParam(name string, parameter *Parameter) error { - <-s.initialized - s.mu.Lock() - defer s.mu.Unlock() - - opt, ok := s.optMap[name] - if !ok { - log.Warn("trainer wants to get a parameter that does not exist.", "name", name) - return fmt.Errorf("parameter: %s does not exist", name) - } - - // The parameter content (a byte slice) may change - // during RPC serialization due to write from other - // goroutine, we allow it since mini-batch based deep - // learning optimization methods are stochastic in - // nature. This race condition is allowed deliberately - // to save the program from making a copy of the - // parameter content. - parameter.Name = name - parameter.ElementType = opt.elementType - parameter.Content = opt.GetWeights() - log.Debug(parameter.String()) - log.Info("sending parameter to the trainer", "name", parameter.Name, "size", len(parameter.Content), "type", parameter.ElementType) - return nil -} - -func traceTime(start time.Time, name string) { - elapsed := time.Since(start) - log.Info("time elapsed", log.Ctx{"name": name, "elapsed": elapsed}) -} - -// checkpoint saves checkpoint to disk. -// -// checkpoint should be only called after the parameters are -// initialized. -func (s *Service) checkpoint() (err error) { - log.Info("Begin save checkpoint.") - defer traceTime(time.Now(), "save checkpoint") - - s.mu.Lock() - cp := make([]parameterCheckpoint, len(s.optMap)) - index := 0 - // TODO(helin): write checkpoint incrementally to reduce memory - // footprint during checkpoint. - for name, opt := range s.optMap { - var pc parameterCheckpoint - pc.Param.Name = name - pc.Param.ElementType = opt.elementType - pc.Param.Content = opt.GetWeights() - pc.Config = opt.config - pc.State = opt.GetStates() - cp[index] = pc - index++ - } - s.mu.Unlock() - - var buf bytes.Buffer - encoder := gob.NewEncoder(&buf) - err = encoder.Encode(cp) - if err != nil { - return - } - - if _, err = os.Stat(s.checkpointPath); os.IsNotExist(err) { - err = os.MkdirAll(s.checkpointPath, os.ModePerm) - if err != nil { - return - } - } - - id := uuid.NewV4().String() - p := path.Join(s.checkpointPath, id) - f, err := os.Create(p) - if err != nil { - return - } - - defer func() { - closeErr := f.Close() - if closeErr != nil { - if err != nil { - log.Error("error close checkpoint file", log.Ctx{"error": closeErr}) - } else { - // Set closeErr as return value. - err = closeErr - } - } - }() - - writer := bufio.NewWriter(f) - _, err = writer.Write(buf.Bytes()) - if err != nil { - return - } - - err = writer.Flush() - if err != nil { - return - } - - oldMeta, err := loadMeta(s.client, s.idx) - if err == ErrCheckpointNotFound { - log.Info("old meta not found, skip removing old meta") - err = nil - } else if err == nil { - log.Info("removing old meta") - if oldMeta.Path != "" { - rmErr := os.Remove(oldMeta.Path) - if rmErr != nil { - // log error, but still treat checkpoint as - // successful. - log.Error("remove old meta file error", log.Ctx{"error": rmErr}) - } - } - } - - if err != nil { - return - } - - crc32 := crc32.ChecksumIEEE(buf.Bytes()) - cpMeta := checkpointMeta{ - UUID: id, - Timestamp: time.Now().UnixNano(), - CRC32: crc32, - Path: p, - } - - json, err := json.Marshal(cpMeta) - if err != nil { - return - } - - err = s.client.PutKey(PsCheckpoint+strconv.Itoa(s.idx), json, 3*time.Second, false) - if err != nil { - return - } - - return -} diff --git a/go/pserver/service_internal_test.go b/go/pserver/service_internal_test.go deleted file mode 100644 index 36eca5112b3117cf295288de0de957c4af040f03..0000000000000000000000000000000000000000 --- a/go/pserver/service_internal_test.go +++ /dev/null @@ -1,86 +0,0 @@ -package pserver - -import ( - "bytes" - "encoding/binary" - "fmt" - "testing" - "time" - - "github.com/stretchr/testify/assert" -) - -const testDir = "./test_data" - -type myKV struct { - m map[string][]byte -} - -func (m *myKV) GetKey(key string, timeout time.Duration) ([]byte, error) { - if m.m == nil { - m.m = make(map[string][]byte) - } - return m.m[key], nil -} - -func (m *myKV) PutKey(key string, value []byte, timeout time.Duration, withLease bool) error { - if m.m == nil { - m.m = make(map[string][]byte) - } - m.m[key] = value - return nil -} - -func TestCheckpoint(t *testing.T) { - kv := &myKV{} - s, err := NewService(0, time.Hour, testDir, kv, nil) - assert.Nil(t, err) - err = s.checkpoint() - assert.Nil(t, err) - _, err = LoadCheckpoint(kv, 0) - assert.Nil(t, err) -} - -func float32ToByte(f float32) []byte { - var buf bytes.Buffer - err := binary.Write(&buf, binary.LittleEndian, f) - if err != nil { - fmt.Println("binary.Write failed:", err) - } - return buf.Bytes() -} - -func TestCheckpointWithData(t *testing.T) { - kv := &myKV{} - s, err := NewService(0, time.Hour, testDir, kv, nil) - assert.Nil(t, err) - - var content []byte - for i := 0; i < 50000; i++ { - content = append(content, float32ToByte(float32(i))...) - } - - p1 := Parameter{Name: "p1", ElementType: 1, Content: content} - err = s.InitParam(ParameterWithConfig{Param: p1}, nil) - assert.Nil(t, err) - - err = s.FinishInitParams(0, nil) - assert.Nil(t, err) - - var p2 Parameter - err = s.GetParam(p1.Name, &p2) - assert.Nil(t, err) - assert.Equal(t, p1, p2) - - err = s.checkpoint() - assert.Nil(t, err) - cp, err := LoadCheckpoint(kv, 0) - assert.Nil(t, err) - s1, err := NewService(0, time.Hour, testDir, kv, cp) - assert.Nil(t, err) - - var p3 Parameter - err = s1.GetParam(p1.Name, &p3) - assert.Nil(t, err) - assert.Equal(t, p1, p3) -} diff --git a/go/pserver/service_test.go b/go/pserver/service_test.go deleted file mode 100644 index 6949348e933e74d53a99f3b6c8fb928b9b5140f5..0000000000000000000000000000000000000000 --- a/go/pserver/service_test.go +++ /dev/null @@ -1,211 +0,0 @@ -// Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at - -// http://www.apache.org/licenses/LICENSE-2.0 - -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -package pserver_test - -import ( - "fmt" - "io/ioutil" - "reflect" - "sync" - "testing" - "time" - - "github.com/PaddlePaddle/Paddle/go/pserver" -) - -const ( - OptimizerConfig = "./client/c/test/testdata/optimizer.pb" -) - -func TestServiceFull(t *testing.T) { - var cp pserver.Checkpoint - s, err := pserver.NewService(0, time.Hour, "", nil, cp) - if err != nil { - t.Error(err) - } - var p pserver.Parameter - p.Name = "param_a" - p.Content = []byte{1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0} - p.ElementType = pserver.Int32 - config, err := ioutil.ReadFile(OptimizerConfig) - if err != nil { - t.Fatalf("read optimizer proto failed") - } - - err = s.InitParam(pserver.ParameterWithConfig{Param: p, Config: config}, nil) - if err != nil { - t.Fatal(err) - } - - var p1 pserver.Parameter - p1.Name = "param_b" - p1.Content = []byte{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} - p1.ElementType = pserver.Float32 - err = s.InitParam(pserver.ParameterWithConfig{Param: p1, Config: config}, nil) - if err != nil { - t.Fatal(err) - } - - err = s.FinishInitParams(0, nil) - if err != nil { - t.Fatal(err) - } - - var param pserver.Parameter - err = s.GetParam("param_b", ¶m) - if err != nil { - t.Fatal(err) - } - - if !reflect.DeepEqual(param, p1) { - t.Fatal("not equal:", param, p1) - } - - g1, g2 := pserver.Gradient(p1), pserver.Gradient(p) - - err = s.SendGrad(g1, nil) - if err != nil { - t.Fatal(err) - } - err = s.SendGrad(g2, nil) - - if err != nil { - t.Fatal(err) - } - - var param1 pserver.Parameter - err = s.GetParam("param_a", ¶m1) - if err != nil { - t.Fatal(err) - } - - // don't compare content, since it's already changed by - // gradient update. - param1.Content = nil - p.Content = nil - - if !reflect.DeepEqual(param1, p) { - t.Fatal("not equal:", param1, p) - } -} - -func TestMultipleInit(t *testing.T) { - var cp pserver.Checkpoint - s, err := pserver.NewService(0, time.Hour, "", nil, cp) - if err != nil { - t.Fatal(err) - } - err = s.FinishInitParams(0, nil) - if err != nil { - t.Fatal(err) - } - - err = s.FinishInitParams(0, nil) - if err.Error() != pserver.AlreadyInitialized { - t.Fatal(err) - } -} - -func TestUninitialized(t *testing.T) { - var cp pserver.Checkpoint - s, err := pserver.NewService(0, time.Hour, "", nil, cp) - err = s.SendGrad(pserver.Gradient{}, nil) - if err.Error() != pserver.Uninitialized { - t.Fatal(err) - } -} - -func TestBlockUntilInitialized(t *testing.T) { - var cp pserver.Checkpoint - s, err := pserver.NewService(0, time.Hour, "", nil, cp) - if err != nil { - t.Error(err) - } - ch := make(chan struct{}, 2) - errCh := make(chan error, 2) - var wg sync.WaitGroup - wg.Add(1) - go func() { - var param pserver.Parameter - err := s.GetParam("param_a", ¶m) - if err != nil { - errCh <- err - } - wg.Done() - ch <- struct{}{} - }() - - time.Sleep(50 * time.Millisecond) - - select { - case <-ch: - // some function returned before initialization is completed. - t.FailNow() - case <-errCh: - t.FailNow() - default: - } - - var p pserver.Parameter - p.Name = "param_a" - p.Content = []byte{1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0} - p.ElementType = pserver.Int32 - config, err := ioutil.ReadFile(OptimizerConfig) - if err != nil { - t.Fatalf("read optimizer proto failed") - } - err = s.InitParam(pserver.ParameterWithConfig{Param: p, Config: config}, nil) - - if err != nil { - t.Fatal(err) - } - - err = s.FinishInitParams(0, nil) - if err != nil { - t.Fatal(err) - } - - wg.Wait() -} - -func TestGradientString(t *testing.T) { - g := pserver.Parameter{} - g.ElementType = pserver.Float32 - g.Content = []byte{0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40} - if g.String() != "[3.3702806e+12 2.142699 3.3702806e+12 2.142699]" { - t.Fatal("get float data error!") - } - - g.Content = []byte{0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, - 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, - 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, - 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, - 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, - 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, - 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, - 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, - 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, - 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, - 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, - 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, - 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, - 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, - 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, - 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40} - if g.String() != "[3.3702806e+12 2.142699 3.3702806e+12 2.142699 3.3702806e+12 2.142699 3.3702806e+12 2.142699 3.3702806e+12 2.142699...3.3702806e+12 2.142699 3.3702806e+12 2.142699 3.3702806e+12 2.142699 3.3702806e+12 2.142699 3.3702806e+12 2.142699]" { - t.Fatal("get float data error!", g.String()) - } - fmt.Println(g) -} diff --git a/go/utils/networkhelper/CMakeLists.txt b/go/utils/networkhelper/CMakeLists.txt deleted file mode 100644 index 3100f2b5a527720b5e8edfb4219b42a8a874f67a..0000000000000000000000000000000000000000 --- a/go/utils/networkhelper/CMakeLists.txt +++ /dev/null @@ -1,17 +0,0 @@ -# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# -if(WITH_TESTING) - go_test(network_helper_test) -endif() diff --git a/go/utils/networkhelper/helper.go b/go/utils/networkhelper/helper.go deleted file mode 100644 index d205b6c50202148c6634bb378a03adcca7b074a0..0000000000000000000000000000000000000000 --- a/go/utils/networkhelper/helper.go +++ /dev/null @@ -1,59 +0,0 @@ -// Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at - -// http://www.apache.org/licenses/LICENSE-2.0 - -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -package networkhelper - -import ( - "errors" - "net" -) - -// GetExternalIP returns the ip address of local network interface, not the -// loopback device. -func GetExternalIP() (string, error) { - ifaces, err := net.Interfaces() - if err != nil { - return "", err - } - for _, iface := range ifaces { - if iface.Flags&net.FlagUp == 0 { - continue // interface down - } - if iface.Flags&net.FlagLoopback != 0 { - continue // loopback interface - } - addrs, err := iface.Addrs() - if err != nil { - return "", err - } - for _, addr := range addrs { - var ip net.IP - switch v := addr.(type) { - case *net.IPNet: - ip = v.IP - case *net.IPAddr: - ip = v.IP - } - if ip == nil || ip.IsLoopback() { - continue - } - ip = ip.To4() - if ip == nil { - continue // not an ipv4 address - } - return ip.String(), nil - } - } - return "", errors.New("are you connected to the network?") -} diff --git a/go/utils/networkhelper/helper_test.go b/go/utils/networkhelper/helper_test.go deleted file mode 100644 index 60b520fae15484e024cccddf169c2c8072c2e990..0000000000000000000000000000000000000000 --- a/go/utils/networkhelper/helper_test.go +++ /dev/null @@ -1,24 +0,0 @@ -// Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at - -// http://www.apache.org/licenses/LICENSE-2.0 - -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -package networkhelper - -import "testing" - -func TestGetIP(t *testing.T) { - _, err := GetExternalIP() - if err != nil { - t.Errorf("GetExternalIP returns error : %v\n", err) - } -} diff --git a/paddle/CMakeLists.txt b/paddle/CMakeLists.txt index 6653244507742b33d9524a7a0e4a5b2b575d358a..c0c04d475959de2bfd6505b6ed30d5c18cbd99da 100644 --- a/paddle/CMakeLists.txt +++ b/paddle/CMakeLists.txt @@ -1,29 +1,4 @@ -if(NOT WITH_FLUID_ONLY) - add_subdirectory(legacy/cuda) - add_subdirectory(legacy/function) - add_subdirectory(legacy/utils) - add_subdirectory(legacy/math) - add_subdirectory(legacy/gserver) - add_subdirectory(legacy/parameter) - - if(MOBILE_INFERENCE) - add_subdirectory(legacy/capi) - else() - add_subdirectory(legacy/pserver) - add_subdirectory(legacy/trainer) - add_subdirectory(scripts) - - if(WITH_C_API) - add_subdirectory(legacy/capi) - endif() - - if(WITH_SWIG_PY) - add_subdirectory(legacy/api) - endif() - endif() -endif() - +add_subdirectory(scripts) add_subdirectory(testing) -if(NOT MOBILE_INFERENCE AND NOT RPI AND NOT WITH_C_API) - add_subdirectory(fluid) -endif() +set(PYTHON_TESTS_DIR ${PADDLE_BINARY_DIR}/python/paddle/fluid/tests CACHE INTERNAL "python tests directory") +add_subdirectory(fluid) diff --git a/paddle/contrib/float16/README.md b/paddle/contrib/float16/README.md index 58b4a50666bfb622af8acbce29355f2a4a870a82..a1f8cb42451dd5e84c97d6830216d284cc8bd819 100644 --- a/paddle/contrib/float16/README.md +++ b/paddle/contrib/float16/README.md @@ -5,13 +5,13 @@ Kexin Zhao ## Introduction Deep learning is usually a two-stage work: training and inference. The training stage estimates model parameters (weights) from data. The inference stage loads the weights and uses them to interpret inputs. Typically, weights are 32-bit float values (float32). Some new devices, including NVIDIA Volta GPUs, support higher speed computation using 16-bit float values (float16). -This article explains our efforts with PaddlePaddle to train using float32 and to inference using float16. We describe a [*transpiler*](https://github.com/PaddlePaddle/Paddle/blob/a4d3de0071e1f3912230c3ab3f9ac74cf06b093a/doc/fluid/design/motivation/fluid_compiler.md), which converts a PaddlePaddle Fluid model, which, to be precise, should be called a [Fluid *program*](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/concepts/program.md), into the inference program, and converts the weights from float32 into float16. +This article explains our efforts with PaddlePaddle to train using float32 and to inference using float16. We describe a [*transpiler*](https://github.com/PaddlePaddle/Paddle/blob/a4d3de0071e1f3912230c3ab3f9ac74cf06b093a/doc/fluid/design/motivation/fluid_compiler.md), which converts a PaddlePaddle Fluid model, which, to be precise, should be called a [Fluid *program*](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/concepts/program.md), into the inference program, and converts the weights from float32 into float16. ## What is float16? float16 (or FP16) is a half-precision floating-point format that uses 16 bits in memory to represent a value. The advantage over 32-bit single-precision floating-point format (commonly known as float or float32 data type) is that it requires half the storage and bandwidth at the expense of precision and range. Fortunately, DNN inference has a high tolerance for the loss of precision and range when using float16 to represent the weights, and the inference accuracy will only be minimally affected in most cases, which gives us the opportunity to use float16 data type to speed up the inference. -Interested readers can refer to our [design doc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/data_type/float16.md) and [code](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/platform/float16.h) for more details on how we implement the float16 data type. +Interested readers can refer to our [design doc](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/data_type/float16.md) and [code](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/platform/float16.h) for more details on how we implement the float16 data type. ## Why float16? The trend in today's deep learning community is to use bigger and deeper model, which translates to larger memory footprint, higher computation demands, and as a result higher energy consumption on computing devices. The advantages of float16 over float32 are correspondingly three-fold: @@ -24,12 +24,12 @@ The trend in today's deep learning community is to use bigger and deeper model, ## Fluid implementation of float16 inference ### Overview -Fluid use [Program](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/modules/python_api.md#program) instead of computation graph to describe a neural network model and the optimization procedure. Fluid program is a python wrapper around a protobuf message called [ProgramDesc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/concepts/program.md). Similar to programming languages, the basic structure of a Fluid program is some nested [blocks](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/modules/python_api.md#block), where each block consists of some [variable](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/modules/python_api.md#variable) definitions and a sequence of [operators](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/modules/python_api.md#operator). An [executor](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/concepts/executor.md) will run a given program by sequentially executing the operators in the entrance block. +Fluid use [Program](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/modules/python_api.md#program) instead of computation graph to describe a neural network model and the optimization procedure. Fluid program is a python wrapper around a protobuf message called [ProgramDesc](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/concepts/program.md). Similar to programming languages, the basic structure of a Fluid program is some nested [blocks](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/modules/python_api.md#block), where each block consists of some [variable](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/modules/python_api.md#variable) definitions and a sequence of [operators](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/modules/python_api.md#operator). An [executor](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/concepts/executor.md) will run a given program by sequentially executing the operators in the entrance block. ### Basic requirement When an executor runs an operator, it uses a kernel to perform computations on tensors contained in the input variables, and then writes the results to the tensors in the output variables. Each operator has multiple kernels for different combinations of data types, devices, and library types, respectively. The operator will select the appropriate kernel to run based on, among other things, the data type of the input tensors. By default, every Fluid operator has a kernel for float data type that takes float inputs and generates float outputs. -If we provide float input to the first operator in a program, then each operator will use float kernel to compute float output and send it as input to the next operator to trigger its float kernel. This chain effect will make the program run in float mode and gives us a final output of float data type. +If we provide float input to the first operator in a program, then each operator will use float kernel to compute float output and send it as input to the next operator to trigger its float kernel. This chain effect will make the program run in float mode and gives us a final output of float data type. The same principle applies if we want a program to run in float16 mode. We provide input variable of the float16 data type to the first operator, and every subsequent operator will invoke the float16 kernel until we get the final output in float16. So the preliminary requirements for float16 inference are to add float16 kernels to operators that are needed in a specific kind of neural networks. Our current focus is on Convolutional Neural Networks (CNN) and hence we have added float16 kernels to the following operators: convolution, pooling, GEMM, elementwise addition, batch norm, dropout, various activations including relu and tanh, and softmax. @@ -75,7 +75,7 @@ In this scenario, we already have a float32 inference program and some associate We can then run various inference experiments in float16 mode and save the float16 program and weights on disk for future deployment. To enhance the code usability, we maintain a consistent API so that user can use the same float32 input data to run inference program in either float32 and float16 mode and obtain output data both of float32 data type. Consequently, we need to add cast operators in the float16 inference program for conversions between the float16 tensor and float32 tensor. -The float16 transpiler is implemented to fulfill the requirements mentioned above. The details of the float16 transpiler can be found [here](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/data_type/float16.md#float16-inference). +The float16 transpiler is implemented to fulfill the requirements mentioned above. The details of the float16 transpiler can be found [here](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/data_type/float16.md#float16-inference). ### Experiment results Simply running the following commands to reproduce the experiment results presented in this section: @@ -113,7 +113,7 @@ We repeat the test ten times and get the following results: | #10 | 62.53% | 62.48% | | average| 62.63% | 62.62% | -We can see that the accuracy of float16 inference is very close to that of float32 inference in every experiment (within 0.05% difference) and is overall 0.01% better than its float32 counterpart averaged over ten tests. +We can see that the accuracy of float16 inference is very close to that of float32 inference in every experiment (within 0.05% difference) and is overall 0.01% better than its float32 counterpart averaged over ten tests. #### Performance benchmark Currently, Fluid only supports float16 inference on NVIDIA GPUs. There is no motivation to support float16 inference on non-ARM CPUs where float16 is not natively supported, and float16 calculation will only be slower than its float32 counterpart. @@ -132,7 +132,7 @@ Average inference time for one mini-batch on Vgg16 model tested on ImageNet data |float16| 3.32 | 4.11 | 5.88 | 9.41 | 16.54 | 30.47 | 60.23 | |Speedup| 4.22 | 2.36  | 3.91 | 3.00 | 3.26  | 2.77 | 2.97 | -We can see that float16 inference provides **2x ~ 4x** speedup on different batch sizes. +We can see that float16 inference provides **2x ~ 4x** speedup on different batch sizes. Convolution operation is ususally the computational bottleneck of CNN, so we also check the average time spent on the Fluid convolution operators for one mini-batch as follows: @@ -162,7 +162,7 @@ We find that the speedup provided by float16 inference starts relatively small a We also did the same benchmark on a single NVIDIA GeForce GTX 1080 Ti GPU that does not support Tensor Core. The results show that for Vgg16, float16 inference provides consistent small speedup (around 1.15x) for all mini-batch sizes, while for Resnet50, float16 inference is slower than its float32 counterpart in small batch sizes (mb = 1 and 2) and then delivers around 1.15x speedup for all larger batch sizes. By comparing the benchmarks on 1080 Ti and V100, we find that Tensor Core, which is specialized for float16 computations, is a critical component of high performance float16 inference. -Please refer to [here](https://github.com/PaddlePaddle/Paddle/blob/develop/contrib/float16/float16_benchmark.md) for complete benchmark results. +Please refer to [here](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/contrib/float16/float16_benchmark.md) for complete benchmark results. ### Summary 1. Fluid is now able to run inference in float16 mode via a float16 transpiler. We currently support CNN programs, including Vgg and Resnet, to run in float16 inference mode. diff --git a/paddle/contrib/float16/float16_transpiler.py b/paddle/contrib/float16/float16_transpiler.py index 8d95dc0591e1d6bd815cc697528191c2ee8c1cfe..500f64bed9898fa874cbad2ea69aae05df58023e 100644 --- a/paddle/contrib/float16/float16_transpiler.py +++ b/paddle/contrib/float16/float16_transpiler.py @@ -60,7 +60,7 @@ class Float16Transpiler: raise TypeError("place should be as CPUPlace/CUDAPlace type") if scope is None: scope = global_scope() - if not isinstance(scope, core.Scope): + if not isinstance(scope, core._Scope): raise TypeError("scope should be as Scope type or None") self.scope = scope diff --git a/paddle/contrib/float16/run_float16_demo.sh b/paddle/contrib/float16/run_float16_demo.sh index 031225a85dabb26e5d9ea06f58909c049e7f0c08..34cb7a12db171915f2bc7df8787dd62cd381de68 100755 --- a/paddle/contrib/float16/run_float16_demo.sh +++ b/paddle/contrib/float16/run_float16_demo.sh @@ -14,9 +14,7 @@ cmake .. -DWITH_AVX=OFF \ -DWITH_MKL=OFF \ -DWITH_GPU=ON \ -DWITH_TESTING=ON \ - -DWITH_TIMER=ON \ -DWITH_PROFILER=ON \ - -DWITH_FLUID_ONLY=ON make -j `nproc` pip install -U "$WHEEL_PATH/$(ls $WHEEL_PATH)" diff --git a/paddle/fluid/API.spec b/paddle/fluid/API.spec index 850ccbfb397cd9722d02ed8c4923d85dae3d943b..951514710b29d1d6fb3165acd92b65d997f43772 100644 --- a/paddle/fluid/API.spec +++ b/paddle/fluid/API.spec @@ -1,395 +1,562 @@ -paddle.fluid.Program.__init__ ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None) -paddle.fluid.Program.block ArgSpec(args=['self', 'index'], varargs=None, keywords=None, defaults=None) -paddle.fluid.Program.clone ArgSpec(args=['self', 'for_test'], varargs=None, keywords=None, defaults=(False,)) -paddle.fluid.Program.current_block ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None) -paddle.fluid.Program.global_block ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None) -paddle.fluid.Program.list_vars ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None) -paddle.fluid.Program.parse_from_string ArgSpec(args=['binary_str'], varargs=None, keywords=None, defaults=None) -paddle.fluid.Program.to_string ArgSpec(args=['self', 'throw_on_error', 'with_details'], varargs=None, keywords=None, defaults=(False,)) -paddle.fluid.default_startup_program ArgSpec(args=[], varargs=None, keywords=None, defaults=None) -paddle.fluid.default_main_program ArgSpec(args=[], varargs=None, keywords=None, defaults=None) -paddle.fluid.program_guard ArgSpec(args=[], varargs='args', keywords='kwds', defaults=None) -paddle.fluid.name_scope ArgSpec(args=[], varargs='args', keywords='kwds', defaults=None) -paddle.fluid.Executor.__init__ ArgSpec(args=['self', 'place'], varargs=None, keywords=None, defaults=None) -paddle.fluid.Executor.close ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None) -paddle.fluid.Executor.run ArgSpec(args=['self', 'program', 'feed', 'fetch_list', 'feed_var_name', 'fetch_var_name', 'scope', 'return_numpy', 'use_program_cache'], varargs=None, keywords=None, defaults=(None, None, None, 'feed', 'fetch', None, True, False)) -paddle.fluid.global_scope ArgSpec(args=[], varargs=None, keywords=None, defaults=None) -paddle.fluid.scope_guard ArgSpec(args=[], varargs='args', keywords='kwds', defaults=None) -paddle.fluid.DistributeTranspiler.__init__ ArgSpec(args=['self', 'config'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.DistributeTranspiler.get_pserver_program ArgSpec(args=['self', 'endpoint'], varargs=None, keywords=None, defaults=None) -paddle.fluid.DistributeTranspiler.get_pserver_programs ArgSpec(args=['self', 'endpoint'], varargs=None, keywords=None, defaults=None) -paddle.fluid.DistributeTranspiler.get_startup_program ArgSpec(args=['self', 'endpoint', 'pserver_program', 'startup_program'], varargs=None, keywords=None, defaults=(None, None)) -paddle.fluid.DistributeTranspiler.get_trainer_program ArgSpec(args=['self', 'wait_port'], varargs=None, keywords=None, defaults=(True,)) -paddle.fluid.DistributeTranspiler.transpile ArgSpec(args=['self', 'trainer_id', 'program', 'pservers', 'trainers', 'sync_mode', 'startup_program', 'current_endpoint'], varargs=None, keywords=None, defaults=(None, '127.0.0.1:6174', 1, True, None, '127.0.0.1:6174')) -paddle.fluid.memory_optimize ArgSpec(args=['input_program', 'skip_opt_set', 'print_log', 'level', 'skip_grads'], varargs=None, keywords=None, defaults=(None, False, 0, False)) -paddle.fluid.release_memory ArgSpec(args=['input_program', 'skip_opt_set'], varargs=None, keywords=None, defaults=(None,)) +paddle.fluid.Program.__init__ (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.Program.block (ArgSpec(args=['self', 'index'], varargs=None, keywords=None, defaults=None), ('document', 'af5346376065ff4cf6832a8ac0ae0945')) +paddle.fluid.Program.clone (ArgSpec(args=['self', 'for_test'], varargs=None, keywords=None, defaults=(False,)), ('document', 'ebb7765b2962bd2be041d19720e49d0f')) +paddle.fluid.Program.current_block (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '5e162d3bf8dd625703463d9e4be36adb')) +paddle.fluid.Program.global_block (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', 'cfb7e05a002b2e64650778cabde7301c')) +paddle.fluid.Program.list_vars (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '1c8647b14fe57c7824b1c9562394dd3c')) +paddle.fluid.Program.parse_from_string (ArgSpec(args=['binary_str'], varargs=None, keywords=None, defaults=None), ('document', 'b6a7ffb239a30bf2ce58cfaca8d8b8d5')) +paddle.fluid.Program.to_string (ArgSpec(args=['self', 'throw_on_error', 'with_details'], varargs=None, keywords=None, defaults=(False,)), ('document', 'faec17e5a04af28e3776160e34504d15')) +paddle.fluid.default_startup_program (ArgSpec(args=[], varargs=None, keywords=None, defaults=None), ('document', '99e5d53d92d82797093332719c9e3ccd')) +paddle.fluid.default_main_program (ArgSpec(args=[], varargs=None, keywords=None, defaults=None), ('document', '5430f54ab4895f9f47db6bebbaf71659')) +paddle.fluid.program_guard (ArgSpec(args=['main_program', 'startup_program'], varargs=None, keywords=None, defaults=(None,)), ('document', 'b54f403e57825a1592aece03afe3afb6')) +paddle.fluid.name_scope (ArgSpec(args=['prefix'], varargs=None, keywords=None, defaults=(None,)), ('document', '0ef753f5cec69fef9ae6ad8b867b33a2')) +paddle.fluid.cuda_places (ArgSpec(args=['device_ids'], varargs=None, keywords=None, defaults=(None,)), ('document', '7d9a51fc9cf3c5245b5227080a8064c3')) +paddle.fluid.cpu_places (ArgSpec(args=['device_count'], varargs=None, keywords=None, defaults=(None,)), ('document', '4c0cd83f0b401fc2ff84c70974e5d210')) +paddle.fluid.cuda_pinned_places (ArgSpec(args=['device_count'], varargs=None, keywords=None, defaults=(None,)), ('document', 'd0c3ebd813c39958c92b78e3eef7e912')) +paddle.fluid.in_dygraph_mode (ArgSpec(args=[], varargs=None, keywords=None, defaults=None), ('document', 'f06314a1cb30c96b5808dde2219c2dae')) +paddle.fluid.Executor.__init__ (ArgSpec(args=['self', 'place'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.Executor.close (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', 'f5369953dd0c443961cf79f7a00e1a03')) +paddle.fluid.Executor.infer_from_dataset (ArgSpec(args=['self', 'program', 'dataset', 'scope', 'thread', 'debug', 'fetch_list', 'fetch_info', 'print_period'], varargs=None, keywords=None, defaults=(None, None, None, 0, False, None, None, 100)), ('document', '9c7decb955b9c4f718114179c8985581')) +paddle.fluid.Executor.run (ArgSpec(args=['self', 'program', 'feed', 'fetch_list', 'feed_var_name', 'fetch_var_name', 'scope', 'return_numpy', 'use_program_cache'], varargs=None, keywords=None, defaults=(None, None, None, 'feed', 'fetch', None, True, False)), ('document', 'f482e93b38b4018796969a2e1dde479d')) +paddle.fluid.Executor.train_from_dataset (ArgSpec(args=['self', 'program', 'dataset', 'scope', 'thread', 'debug', 'fetch_list', 'fetch_info', 'print_period'], varargs=None, keywords=None, defaults=(None, None, None, 0, False, None, None, 100)), ('document', 'd521011d79e71080fe9b5bb179b43518')) +paddle.fluid.global_scope (ArgSpec(args=[], varargs=None, keywords=None, defaults=None), ('document', 'e148d3ab1ed8edf3e928212a375959c0')) +paddle.fluid.scope_guard (ArgSpec(args=['scope'], varargs=None, keywords=None, defaults=None), ('document', 'b94d1f6bcc29c4fb58fc0058561250c2')) +paddle.fluid.DistributeTranspiler.__init__ (ArgSpec(args=['self', 'config'], varargs=None, keywords=None, defaults=(None,)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.DistributeTranspiler.get_pserver_program (ArgSpec(args=['self', 'endpoint'], varargs=None, keywords=None, defaults=None), ('document', '292ab72977afbe58e6a3bde175452680')) +paddle.fluid.DistributeTranspiler.get_pserver_programs (ArgSpec(args=['self', 'endpoint'], varargs=None, keywords=None, defaults=None), ('document', '78f4949aedf317666a89ca74b3748ba8')) +paddle.fluid.DistributeTranspiler.get_startup_program (ArgSpec(args=['self', 'endpoint', 'pserver_program', 'startup_program'], varargs=None, keywords=None, defaults=(None, None)), ('document', 'd796fc0c8d51503b556fcf6dc15c4f0c')) +paddle.fluid.DistributeTranspiler.get_trainer_program (ArgSpec(args=['self', 'wait_port'], varargs=None, keywords=None, defaults=(True,)), ('document', '736330e31a7a54abccc0c7fd9119d9ff')) +paddle.fluid.DistributeTranspiler.transpile (ArgSpec(args=['self', 'trainer_id', 'program', 'pservers', 'trainers', 'sync_mode', 'startup_program', 'current_endpoint'], varargs=None, keywords=None, defaults=(None, '127.0.0.1:6174', 1, True, None, '127.0.0.1:6174')), ('document', '06ce55338dfe96311ad1078235ab3bf4')) +paddle.fluid.memory_optimize (ArgSpec(args=['input_program', 'skip_opt_set', 'print_log', 'level', 'skip_grads'], varargs=None, keywords=None, defaults=(None, False, 0, False)), ('document', 'eda17d0f1639bc6ca215cecf87f588a4')) +paddle.fluid.release_memory (ArgSpec(args=['input_program', 'skip_opt_set'], varargs=None, keywords=None, defaults=(None,)), ('document', 'ac4114d3df16264f1946deb3a8434a6f')) paddle.fluid.DistributeTranspilerConfig.__init__ -paddle.fluid.ParallelExecutor.__init__ ArgSpec(args=['self', 'use_cuda', 'loss_name', 'main_program', 'share_vars_from', 'exec_strategy', 'build_strategy', 'num_trainers', 'trainer_id', 'scope'], varargs=None, keywords=None, defaults=(None, None, None, None, None, 1, 0, None)) -paddle.fluid.ParallelExecutor.run ArgSpec(args=['self', 'fetch_list', 'feed', 'feed_dict', 'return_numpy'], varargs=None, keywords=None, defaults=(None, None, True)) -paddle.fluid.ExecutionStrategy.__init__ __init__(self: paddle.fluid.core.ExecutionStrategy) -> None -paddle.fluid.BuildStrategy.GradientScaleStrategy.__init__ __init__(self: paddle.fluid.core.GradientScaleStrategy, arg0: int) -> None -paddle.fluid.BuildStrategy.ReduceStrategy.__init__ __init__(self: paddle.fluid.core.ReduceStrategy, arg0: int) -> None -paddle.fluid.BuildStrategy.__init__ __init__(self: paddle.fluid.core.BuildStrategy) -> None -paddle.fluid.create_lod_tensor ArgSpec(args=['data', 'recursive_seq_lens', 'place'], varargs=None, keywords=None, defaults=None) -paddle.fluid.create_random_int_lodtensor ArgSpec(args=['recursive_seq_lens', 'base_shape', 'place', 'low', 'high'], varargs=None, keywords=None, defaults=None) -paddle.fluid.io.save_vars ArgSpec(args=['executor', 'dirname', 'main_program', 'vars', 'predicate', 'filename'], varargs=None, keywords=None, defaults=(None, None, None, None)) -paddle.fluid.io.save_params ArgSpec(args=['executor', 'dirname', 'main_program', 'filename'], varargs=None, keywords=None, defaults=(None, None)) -paddle.fluid.io.save_persistables ArgSpec(args=['executor', 'dirname', 'main_program', 'filename'], varargs=None, keywords=None, defaults=(None, None)) -paddle.fluid.io.load_vars ArgSpec(args=['executor', 'dirname', 'main_program', 'vars', 'predicate', 'filename'], varargs=None, keywords=None, defaults=(None, None, None, None)) -paddle.fluid.io.load_params ArgSpec(args=['executor', 'dirname', 'main_program', 'filename'], varargs=None, keywords=None, defaults=(None, None)) -paddle.fluid.io.load_persistables ArgSpec(args=['executor', 'dirname', 'main_program', 'filename'], varargs=None, keywords=None, defaults=(None, None)) -paddle.fluid.io.save_inference_model ArgSpec(args=['dirname', 'feeded_var_names', 'target_vars', 'executor', 'main_program', 'model_filename', 'params_filename', 'export_for_deployment'], varargs=None, keywords=None, defaults=(None, None, None, True)) -paddle.fluid.io.load_inference_model ArgSpec(args=['dirname', 'executor', 'model_filename', 'params_filename', 'pserver_endpoints'], varargs=None, keywords=None, defaults=(None, None, None)) -paddle.fluid.initializer.ConstantInitializer.__init__ ArgSpec(args=['self', 'value', 'force_cpu'], varargs=None, keywords=None, defaults=(0.0, False)) -paddle.fluid.initializer.UniformInitializer.__init__ ArgSpec(args=['self', 'low', 'high', 'seed'], varargs=None, keywords=None, defaults=(-1.0, 1.0, 0)) -paddle.fluid.initializer.NormalInitializer.__init__ ArgSpec(args=['self', 'loc', 'scale', 'seed'], varargs=None, keywords=None, defaults=(0.0, 1.0, 0)) -paddle.fluid.initializer.TruncatedNormalInitializer.__init__ ArgSpec(args=['self', 'loc', 'scale', 'seed'], varargs=None, keywords=None, defaults=(0.0, 1.0, 0)) -paddle.fluid.initializer.XavierInitializer.__init__ ArgSpec(args=['self', 'uniform', 'fan_in', 'fan_out', 'seed'], varargs=None, keywords=None, defaults=(True, None, None, 0)) -paddle.fluid.initializer.BilinearInitializer.__init__ ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None) -paddle.fluid.initializer.MSRAInitializer.__init__ ArgSpec(args=['self', 'uniform', 'fan_in', 'seed'], varargs=None, keywords=None, defaults=(True, None, 0)) -paddle.fluid.initializer.force_init_on_cpu ArgSpec(args=[], varargs=None, keywords=None, defaults=None) -paddle.fluid.initializer.init_on_cpu ArgSpec(args=[], varargs='args', keywords='kwds', defaults=None) -paddle.fluid.layers.fc ArgSpec(args=['input', 'size', 'num_flatten_dims', 'param_attr', 'bias_attr', 'act', 'is_test', 'name'], varargs=None, keywords=None, defaults=(1, None, None, None, False, None)) -paddle.fluid.layers.embedding ArgSpec(args=['input', 'size', 'is_sparse', 'is_distributed', 'padding_idx', 'param_attr', 'dtype'], varargs=None, keywords=None, defaults=(False, False, None, None, 'float32')) -paddle.fluid.layers.dynamic_lstm ArgSpec(args=['input', 'size', 'h_0', 'c_0', 'param_attr', 'bias_attr', 'use_peepholes', 'is_reverse', 'gate_activation', 'cell_activation', 'candidate_activation', 'dtype', 'name'], varargs=None, keywords=None, defaults=(None, None, None, None, True, False, 'sigmoid', 'tanh', 'tanh', 'float32', None)) -paddle.fluid.layers.dynamic_lstmp ArgSpec(args=['input', 'size', 'proj_size', 'param_attr', 'bias_attr', 'use_peepholes', 'is_reverse', 'gate_activation', 'cell_activation', 'candidate_activation', 'proj_activation', 'dtype', 'name'], varargs=None, keywords=None, defaults=(None, None, True, False, 'sigmoid', 'tanh', 'tanh', 'tanh', 'float32', None)) -paddle.fluid.layers.dynamic_gru ArgSpec(args=['input', 'size', 'param_attr', 'bias_attr', 'is_reverse', 'gate_activation', 'candidate_activation', 'h_0'], varargs=None, keywords=None, defaults=(None, None, False, 'sigmoid', 'tanh', None)) -paddle.fluid.layers.gru_unit ArgSpec(args=['input', 'hidden', 'size', 'param_attr', 'bias_attr', 'activation', 'gate_activation'], varargs=None, keywords=None, defaults=(None, None, 'tanh', 'sigmoid')) -paddle.fluid.layers.linear_chain_crf ArgSpec(args=['input', 'label', 'param_attr'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.crf_decoding ArgSpec(args=['input', 'param_attr', 'label'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.cos_sim ArgSpec(args=['X', 'Y'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.cross_entropy ArgSpec(args=['input', 'label', 'soft_label', 'ignore_index'], varargs=None, keywords=None, defaults=(False, -100)) -paddle.fluid.layers.square_error_cost ArgSpec(args=['input', 'label'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.chunk_eval ArgSpec(args=['input', 'label', 'chunk_scheme', 'num_chunk_types', 'excluded_chunk_types'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.sequence_conv ArgSpec(args=['input', 'num_filters', 'filter_size', 'filter_stride', 'padding', 'bias_attr', 'param_attr', 'act', 'name'], varargs=None, keywords=None, defaults=(3, 1, None, None, None, None, None)) -paddle.fluid.layers.conv2d ArgSpec(args=['input', 'num_filters', 'filter_size', 'stride', 'padding', 'dilation', 'groups', 'param_attr', 'bias_attr', 'use_cudnn', 'act', 'name'], varargs=None, keywords=None, defaults=(1, 0, 1, None, None, None, True, None, None)) -paddle.fluid.layers.conv3d ArgSpec(args=['input', 'num_filters', 'filter_size', 'stride', 'padding', 'dilation', 'groups', 'param_attr', 'bias_attr', 'use_cudnn', 'act', 'name'], varargs=None, keywords=None, defaults=(1, 0, 1, None, None, None, True, None, None)) -paddle.fluid.layers.sequence_pool ArgSpec(args=['input', 'pool_type'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.sequence_softmax ArgSpec(args=['input', 'use_cudnn', 'name'], varargs=None, keywords=None, defaults=(False, None)) -paddle.fluid.layers.softmax ArgSpec(args=['input', 'use_cudnn', 'name'], varargs=None, keywords=None, defaults=(True, None)) -paddle.fluid.layers.pool2d ArgSpec(args=['input', 'pool_size', 'pool_type', 'pool_stride', 'pool_padding', 'global_pooling', 'use_cudnn', 'ceil_mode', 'name'], varargs=None, keywords=None, defaults=(-1, 'max', 1, 0, False, True, False, None)) -paddle.fluid.layers.pool3d ArgSpec(args=['input', 'pool_size', 'pool_type', 'pool_stride', 'pool_padding', 'global_pooling', 'use_cudnn', 'ceil_mode', 'name'], varargs=None, keywords=None, defaults=(-1, 'max', 1, 0, False, True, False, None)) -paddle.fluid.layers.batch_norm ArgSpec(args=['input', 'act', 'is_test', 'momentum', 'epsilon', 'param_attr', 'bias_attr', 'data_layout', 'in_place', 'name', 'moving_mean_name', 'moving_variance_name', 'do_model_average_for_mean_and_var', 'fuse_with_relu'], varargs=None, keywords=None, defaults=(None, False, 0.9, 1e-05, None, None, 'NCHW', False, None, None, None, False, False)) -paddle.fluid.layers.beam_search_decode ArgSpec(args=['ids', 'scores', 'beam_size', 'end_id', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.conv2d_transpose ArgSpec(args=['input', 'num_filters', 'output_size', 'filter_size', 'padding', 'stride', 'dilation', 'groups', 'param_attr', 'bias_attr', 'use_cudnn', 'act', 'name'], varargs=None, keywords=None, defaults=(None, None, 0, 1, 1, None, None, None, True, None, None)) -paddle.fluid.layers.conv3d_transpose ArgSpec(args=['input', 'num_filters', 'output_size', 'filter_size', 'padding', 'stride', 'dilation', 'groups', 'param_attr', 'bias_attr', 'use_cudnn', 'act', 'name'], varargs=None, keywords=None, defaults=(None, None, 0, 1, 1, None, None, None, True, None, None)) -paddle.fluid.layers.sequence_expand ArgSpec(args=['x', 'y', 'ref_level', 'name'], varargs=None, keywords=None, defaults=(-1, None)) -paddle.fluid.layers.sequence_expand_as ArgSpec(args=['x', 'y', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.sequence_pad ArgSpec(args=['x', 'pad_value', 'maxlen', 'name'], varargs=None, keywords=None, defaults=(None, None)) -paddle.fluid.layers.sequence_unpad ArgSpec(args=['x', 'length', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.lstm_unit ArgSpec(args=['x_t', 'hidden_t_prev', 'cell_t_prev', 'forget_bias', 'param_attr', 'bias_attr', 'name'], varargs=None, keywords=None, defaults=(0.0, None, None, None)) -paddle.fluid.layers.reduce_sum ArgSpec(args=['input', 'dim', 'keep_dim', 'name'], varargs=None, keywords=None, defaults=(None, False, None)) -paddle.fluid.layers.reduce_mean ArgSpec(args=['input', 'dim', 'keep_dim', 'name'], varargs=None, keywords=None, defaults=(None, False, None)) -paddle.fluid.layers.reduce_max ArgSpec(args=['input', 'dim', 'keep_dim', 'name'], varargs=None, keywords=None, defaults=(None, False, None)) -paddle.fluid.layers.reduce_min ArgSpec(args=['input', 'dim', 'keep_dim', 'name'], varargs=None, keywords=None, defaults=(None, False, None)) -paddle.fluid.layers.reduce_prod ArgSpec(args=['input', 'dim', 'keep_dim', 'name'], varargs=None, keywords=None, defaults=(None, False, None)) -paddle.fluid.layers.sequence_first_step ArgSpec(args=['input'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.sequence_last_step ArgSpec(args=['input'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.sequence_slice ArgSpec(args=['input', 'offset', 'length', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.dropout ArgSpec(args=['x', 'dropout_prob', 'is_test', 'seed', 'name'], varargs=None, keywords=None, defaults=(False, None, None)) -paddle.fluid.layers.split ArgSpec(args=['input', 'num_or_sections', 'dim', 'name'], varargs=None, keywords=None, defaults=(-1, None)) -paddle.fluid.layers.ctc_greedy_decoder ArgSpec(args=['input', 'blank', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.edit_distance ArgSpec(args=['input', 'label', 'normalized', 'ignored_tokens'], varargs=None, keywords=None, defaults=(True, None)) -paddle.fluid.layers.l2_normalize ArgSpec(args=['x', 'axis', 'epsilon', 'name'], varargs=None, keywords=None, defaults=(1e-12, None)) -paddle.fluid.layers.matmul ArgSpec(args=['x', 'y', 'transpose_x', 'transpose_y', 'alpha', 'name'], varargs=None, keywords=None, defaults=(False, False, 1.0, None)) -paddle.fluid.layers.topk ArgSpec(args=['input', 'k', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.warpctc ArgSpec(args=['input', 'label', 'blank', 'norm_by_times'], varargs=None, keywords=None, defaults=(0, False)) -paddle.fluid.layers.sequence_reshape ArgSpec(args=['input', 'new_dim'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.transpose ArgSpec(args=['x', 'perm', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.im2sequence ArgSpec(args=['input', 'filter_size', 'stride', 'padding', 'input_image_size', 'out_stride', 'name'], varargs=None, keywords=None, defaults=(1, 1, 0, None, 1, None)) -paddle.fluid.layers.nce ArgSpec(args=['input', 'label', 'num_total_classes', 'sample_weight', 'param_attr', 'bias_attr', 'num_neg_samples', 'name'], varargs=None, keywords=None, defaults=(None, None, None, None, None)) -paddle.fluid.layers.hsigmoid ArgSpec(args=['input', 'label', 'num_classes', 'param_attr', 'bias_attr', 'name'], varargs=None, keywords=None, defaults=(None, None, None)) -paddle.fluid.layers.beam_search ArgSpec(args=['pre_ids', 'pre_scores', 'ids', 'scores', 'beam_size', 'end_id', 'level', 'name'], varargs=None, keywords=None, defaults=(0, None)) -paddle.fluid.layers.row_conv ArgSpec(args=['input', 'future_context_size', 'param_attr', 'act'], varargs=None, keywords=None, defaults=(None, None)) -paddle.fluid.layers.multiplex ArgSpec(args=['inputs', 'index'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.layer_norm ArgSpec(args=['input', 'scale', 'shift', 'begin_norm_axis', 'epsilon', 'param_attr', 'bias_attr', 'act', 'name'], varargs=None, keywords=None, defaults=(True, True, 1, 1e-05, None, None, None, None)) -paddle.fluid.layers.softmax_with_cross_entropy ArgSpec(args=['logits', 'label', 'soft_label', 'ignore_index'], varargs=None, keywords=None, defaults=(False, -100)) -paddle.fluid.layers.smooth_l1 ArgSpec(args=['x', 'y', 'inside_weight', 'outside_weight', 'sigma'], varargs=None, keywords=None, defaults=(None, None, None)) -paddle.fluid.layers.one_hot ArgSpec(args=['input', 'depth'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.autoincreased_step_counter ArgSpec(args=['counter_name', 'begin', 'step'], varargs=None, keywords=None, defaults=(None, 1, 1)) -paddle.fluid.layers.reshape ArgSpec(args=['x', 'shape', 'actual_shape', 'act', 'inplace', 'name'], varargs=None, keywords=None, defaults=(None, None, True, None)) -paddle.fluid.layers.squeeze ArgSpec(args=['input', 'axes', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.unsqueeze ArgSpec(args=['input', 'axes', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.lod_reset ArgSpec(args=['x', 'y', 'target_lod'], varargs=None, keywords=None, defaults=(None, None)) -paddle.fluid.layers.lrn ArgSpec(args=['input', 'n', 'k', 'alpha', 'beta', 'name'], varargs=None, keywords=None, defaults=(5, 1.0, 0.0001, 0.75, None)) -paddle.fluid.layers.pad ArgSpec(args=['x', 'paddings', 'pad_value', 'name'], varargs=None, keywords=None, defaults=(0.0, None)) -paddle.fluid.layers.pad_constant_like ArgSpec(args=['x', 'y', 'pad_value', 'name'], varargs=None, keywords=None, defaults=(0.0, None)) -paddle.fluid.layers.label_smooth ArgSpec(args=['label', 'prior_dist', 'epsilon', 'dtype', 'name'], varargs=None, keywords=None, defaults=(None, 0.1, 'float32', None)) -paddle.fluid.layers.roi_pool ArgSpec(args=['input', 'rois', 'pooled_height', 'pooled_width', 'spatial_scale'], varargs=None, keywords=None, defaults=(1, 1, 1.0)) -paddle.fluid.layers.dice_loss ArgSpec(args=['input', 'label', 'epsilon'], varargs=None, keywords=None, defaults=(1e-05,)) -paddle.fluid.layers.image_resize ArgSpec(args=['input', 'out_shape', 'scale', 'name', 'resample'], varargs=None, keywords=None, defaults=(None, None, None, 'BILINEAR')) -paddle.fluid.layers.image_resize_short ArgSpec(args=['input', 'out_short_len', 'resample'], varargs=None, keywords=None, defaults=('BILINEAR',)) -paddle.fluid.layers.resize_bilinear ArgSpec(args=['input', 'out_shape', 'scale', 'name'], varargs=None, keywords=None, defaults=(None, None, None)) -paddle.fluid.layers.gather ArgSpec(args=['input', 'index'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.scatter ArgSpec(args=['input', 'index', 'updates', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.sequence_scatter ArgSpec(args=['input', 'index', 'updates', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.random_crop ArgSpec(args=['x', 'shape', 'seed'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.mean_iou ArgSpec(args=['input', 'label', 'num_classes'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.relu ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.log ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.crop ArgSpec(args=['x', 'shape', 'offsets', 'name'], varargs=None, keywords=None, defaults=(None, None, None)) -paddle.fluid.layers.rank_loss ArgSpec(args=['label', 'left', 'right', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.margin_rank_loss ArgSpec(args=['label', 'left', 'right', 'margin', 'name'], varargs=None, keywords=None, defaults=(0.1, None)) -paddle.fluid.layers.elu ArgSpec(args=['x', 'alpha', 'name'], varargs=None, keywords=None, defaults=(1.0, None)) -paddle.fluid.layers.relu6 ArgSpec(args=['x', 'threshold', 'name'], varargs=None, keywords=None, defaults=(6.0, None)) -paddle.fluid.layers.pow ArgSpec(args=['x', 'factor', 'name'], varargs=None, keywords=None, defaults=(1.0, None)) -paddle.fluid.layers.stanh ArgSpec(args=['x', 'scale_a', 'scale_b', 'name'], varargs=None, keywords=None, defaults=(0.6666666666666666, 1.7159, None)) -paddle.fluid.layers.hard_sigmoid ArgSpec(args=['x', 'slope', 'offset', 'name'], varargs=None, keywords=None, defaults=(0.2, 0.5, None)) -paddle.fluid.layers.swish ArgSpec(args=['x', 'beta', 'name'], varargs=None, keywords=None, defaults=(1.0, None)) -paddle.fluid.layers.prelu ArgSpec(args=['x', 'mode', 'param_attr', 'name'], varargs=None, keywords=None, defaults=(None, None)) -paddle.fluid.layers.brelu ArgSpec(args=['x', 't_min', 't_max', 'name'], varargs=None, keywords=None, defaults=(0.0, 24.0, None)) -paddle.fluid.layers.leaky_relu ArgSpec(args=['x', 'alpha', 'name'], varargs=None, keywords=None, defaults=(0.02, None)) -paddle.fluid.layers.soft_relu ArgSpec(args=['x', 'threshold', 'name'], varargs=None, keywords=None, defaults=(40.0, None)) -paddle.fluid.layers.flatten ArgSpec(args=['x', 'axis', 'name'], varargs=None, keywords=None, defaults=(1, None)) -paddle.fluid.layers.sequence_mask ArgSpec(args=['x', 'maxlen', 'dtype', 'name'], varargs=None, keywords=None, defaults=(None, 'int64', None)) -paddle.fluid.layers.stack ArgSpec(args=['x', 'axis'], varargs=None, keywords=None, defaults=(0,)) -paddle.fluid.layers.pad2d ArgSpec(args=['input', 'paddings', 'mode', 'pad_value', 'data_format', 'name'], varargs=None, keywords=None, defaults=([0, 0, 0, 0], 'constant', 0.0, 'NCHW', None)) -paddle.fluid.layers.unstack ArgSpec(args=['x', 'axis', 'num'], varargs=None, keywords=None, defaults=(0, None)) -paddle.fluid.layers.sequence_enumerate ArgSpec(args=['input', 'win_size', 'pad_value', 'name'], varargs=None, keywords=None, defaults=(0, None)) -paddle.fluid.layers.expand ArgSpec(args=['x', 'expand_times', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.sequence_concat ArgSpec(args=['input', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.scale ArgSpec(args=['x', 'scale', 'bias', 'bias_after_scale', 'act', 'name'], varargs=None, keywords=None, defaults=(1.0, 0.0, True, None, None)) -paddle.fluid.layers.elementwise_add ArgSpec(args=['x', 'y', 'axis', 'act', 'name'], varargs=None, keywords=None, defaults=(-1, None, None)) -paddle.fluid.layers.elementwise_div ArgSpec(args=['x', 'y', 'axis', 'act', 'name'], varargs=None, keywords=None, defaults=(-1, None, None)) -paddle.fluid.layers.elementwise_sub ArgSpec(args=['x', 'y', 'axis', 'act', 'name'], varargs=None, keywords=None, defaults=(-1, None, None)) -paddle.fluid.layers.elementwise_mul ArgSpec(args=['x', 'y', 'axis', 'act', 'name'], varargs=None, keywords=None, defaults=(-1, None, None)) -paddle.fluid.layers.elementwise_max ArgSpec(args=['x', 'y', 'axis', 'act', 'name'], varargs=None, keywords=None, defaults=(-1, None, None)) -paddle.fluid.layers.elementwise_min ArgSpec(args=['x', 'y', 'axis', 'act', 'name'], varargs=None, keywords=None, defaults=(-1, None, None)) -paddle.fluid.layers.elementwise_pow ArgSpec(args=['x', 'y', 'axis', 'act', 'name'], varargs=None, keywords=None, defaults=(-1, None, None)) -paddle.fluid.layers.uniform_random_batch_size_like ArgSpec(args=['input', 'shape', 'dtype', 'input_dim_idx', 'output_dim_idx', 'min', 'max', 'seed'], varargs=None, keywords=None, defaults=('float32', 0, 0, -1.0, 1.0, 0)) -paddle.fluid.layers.gaussian_random ArgSpec(args=['shape', 'mean', 'std', 'seed', 'dtype'], varargs=None, keywords=None, defaults=(0.0, 1.0, 0, 'float32')) -paddle.fluid.layers.sampling_id ArgSpec(args=['x', 'min', 'max', 'seed', 'dtype'], varargs=None, keywords=None, defaults=(0.0, 1.0, 0, 'float32')) -paddle.fluid.layers.gaussian_random_batch_size_like ArgSpec(args=['input', 'shape', 'input_dim_idx', 'output_dim_idx', 'mean', 'std', 'seed', 'dtype'], varargs=None, keywords=None, defaults=(0, 0, 0.0, 1.0, 0, 'float32')) -paddle.fluid.layers.sum ArgSpec(args=['x'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.slice ArgSpec(args=['input', 'axes', 'starts', 'ends'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.shape ArgSpec(args=['input'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.logical_and ArgSpec(args=['x', 'y', 'out', 'name'], varargs=None, keywords=None, defaults=(None, None)) -paddle.fluid.layers.logical_or ArgSpec(args=['x', 'y', 'out', 'name'], varargs=None, keywords=None, defaults=(None, None)) -paddle.fluid.layers.logical_xor ArgSpec(args=['x', 'y', 'out', 'name'], varargs=None, keywords=None, defaults=(None, None)) -paddle.fluid.layers.logical_not ArgSpec(args=['x', 'out', 'name'], varargs=None, keywords=None, defaults=(None, None)) -paddle.fluid.layers.clip ArgSpec(args=['x', 'min', 'max', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.clip_by_norm ArgSpec(args=['x', 'max_norm', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.mean ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.mul ArgSpec(args=['x', 'y', 'x_num_col_dims', 'y_num_col_dims', 'name'], varargs=None, keywords=None, defaults=(1, 1, None)) -paddle.fluid.layers.sigmoid_cross_entropy_with_logits ArgSpec(args=['x', 'label', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.maxout ArgSpec(args=['x', 'groups', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.affine_channel ArgSpec(args=['x', 'scale', 'bias', 'data_layout', 'name'], varargs=None, keywords=None, defaults=(None, None, 'NCHW', None)) -paddle.fluid.layers.data ArgSpec(args=['name', 'shape', 'append_batch_size', 'dtype', 'lod_level', 'type', 'stop_gradient'], varargs=None, keywords=None, defaults=(True, 'float32', 0, VarType.LOD_TENSOR, True)) -paddle.fluid.layers.open_files ArgSpec(args=['filenames', 'shapes', 'lod_levels', 'dtypes', 'thread_num', 'buffer_size', 'pass_num', 'is_test'], varargs=None, keywords=None, defaults=(None, None, 1, None)) -paddle.fluid.layers.read_file ArgSpec(args=['reader'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.shuffle ArgSpec(args=['reader', 'buffer_size'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.batch ArgSpec(args=['reader', 'batch_size'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.double_buffer ArgSpec(args=['reader', 'place', 'name'], varargs=None, keywords=None, defaults=(None, None)) -paddle.fluid.layers.random_data_generator ArgSpec(args=['low', 'high', 'shapes', 'lod_levels', 'for_parallel'], varargs=None, keywords=None, defaults=(True,)) -paddle.fluid.layers.py_reader ArgSpec(args=['capacity', 'shapes', 'dtypes', 'lod_levels', 'name', 'use_double_buffer'], varargs=None, keywords=None, defaults=(None, None, True)) -paddle.fluid.layers.Preprocessor.__init__ ArgSpec(args=['self', 'reader', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.Preprocessor.block ArgSpec(args=[], varargs='args', keywords='kwds', defaults=None) -paddle.fluid.layers.Preprocessor.inputs ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.Preprocessor.outputs ArgSpec(args=['self'], varargs='outs', keywords=None, defaults=None) -paddle.fluid.layers.load ArgSpec(args=['out', 'file_path', 'load_as_fp16'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.create_tensor ArgSpec(args=['dtype', 'name', 'persistable'], varargs=None, keywords=None, defaults=(None, False)) -paddle.fluid.layers.create_parameter ArgSpec(args=['shape', 'dtype', 'name', 'attr', 'is_bias', 'default_initializer'], varargs=None, keywords=None, defaults=(None, None, False, None)) -paddle.fluid.layers.create_global_var ArgSpec(args=['shape', 'value', 'dtype', 'persistable', 'force_cpu', 'name'], varargs=None, keywords=None, defaults=(False, False, None)) -paddle.fluid.layers.cast ArgSpec(args=['x', 'dtype'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.concat ArgSpec(args=['input', 'axis', 'name'], varargs=None, keywords=None, defaults=(0, None)) -paddle.fluid.layers.sums ArgSpec(args=['input', 'out'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.assign ArgSpec(args=['input', 'output'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.fill_constant_batch_size_like ArgSpec(args=['input', 'shape', 'dtype', 'value', 'input_dim_idx', 'output_dim_idx'], varargs=None, keywords=None, defaults=(0, 0)) -paddle.fluid.layers.fill_constant ArgSpec(args=['shape', 'dtype', 'value', 'force_cpu', 'out'], varargs=None, keywords=None, defaults=(False, None)) -paddle.fluid.layers.argmin ArgSpec(args=['x', 'axis'], varargs=None, keywords=None, defaults=(0,)) -paddle.fluid.layers.argmax ArgSpec(args=['x', 'axis'], varargs=None, keywords=None, defaults=(0,)) -paddle.fluid.layers.argsort ArgSpec(args=['input', 'axis', 'name'], varargs=None, keywords=None, defaults=(-1, None)) -paddle.fluid.layers.ones ArgSpec(args=['shape', 'dtype', 'force_cpu'], varargs=None, keywords=None, defaults=(False,)) -paddle.fluid.layers.zeros ArgSpec(args=['shape', 'dtype', 'force_cpu'], varargs=None, keywords=None, defaults=(False,)) -paddle.fluid.layers.reverse ArgSpec(args=['x', 'axis'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.has_inf ArgSpec(args=['x'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.has_nan ArgSpec(args=['x'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.isfinite ArgSpec(args=['x'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.While.__init__ ArgSpec(args=['self', 'cond', 'is_test', 'name'], varargs=None, keywords=None, defaults=(False, None)) -paddle.fluid.layers.While.block ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.Switch.__init__ ArgSpec(args=['self', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.Switch.case ArgSpec(args=['self', 'condition'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.Switch.default ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.increment ArgSpec(args=['x', 'value', 'in_place'], varargs=None, keywords=None, defaults=(1.0, True)) -paddle.fluid.layers.array_write ArgSpec(args=['x', 'i', 'array'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.create_array ArgSpec(args=['dtype'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.less_than ArgSpec(args=['x', 'y', 'force_cpu', 'cond'], varargs=None, keywords='ignored', defaults=(None, None)) -paddle.fluid.layers.equal ArgSpec(args=['x', 'y', 'cond'], varargs=None, keywords='ignored', defaults=(None,)) -paddle.fluid.layers.array_read ArgSpec(args=['array', 'i'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.array_length ArgSpec(args=['array'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.IfElse.__init__ ArgSpec(args=['self', 'cond', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.IfElse.false_block ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.IfElse.input ArgSpec(args=['self', 'x'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.IfElse.output ArgSpec(args=['self'], varargs='outs', keywords=None, defaults=None) -paddle.fluid.layers.IfElse.true_block ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.DynamicRNN.__init__ ArgSpec(args=['self', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.DynamicRNN.block ArgSpec(args=[], varargs='args', keywords='kwds', defaults=None) -paddle.fluid.layers.DynamicRNN.memory ArgSpec(args=['self', 'init', 'shape', 'value', 'need_reorder', 'dtype'], varargs=None, keywords=None, defaults=(None, None, 0.0, False, 'float32')) -paddle.fluid.layers.DynamicRNN.output ArgSpec(args=['self'], varargs='outputs', keywords=None, defaults=None) -paddle.fluid.layers.DynamicRNN.static_input ArgSpec(args=['self', 'x'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.DynamicRNN.step_input ArgSpec(args=['self', 'x'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.DynamicRNN.update_memory ArgSpec(args=['self', 'ex_mem', 'new_mem'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.StaticRNN.__init__ ArgSpec(args=['self', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.StaticRNN.memory ArgSpec(args=['self', 'init', 'shape', 'batch_ref', 'init_value', 'init_batch_dim_idx', 'ref_batch_dim_idx'], varargs=None, keywords=None, defaults=(None, None, None, 0.0, 0, 1)) -paddle.fluid.layers.StaticRNN.output ArgSpec(args=['self'], varargs='outputs', keywords=None, defaults=None) -paddle.fluid.layers.StaticRNN.step ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.StaticRNN.step_input ArgSpec(args=['self', 'x'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.StaticRNN.step_output ArgSpec(args=['self', 'o'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.StaticRNN.update_memory ArgSpec(args=['self', 'mem', 'var'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.reorder_lod_tensor_by_rank ArgSpec(args=['x', 'rank_table'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.Print ArgSpec(args=['input', 'first_n', 'message', 'summarize', 'print_tensor_name', 'print_tensor_type', 'print_tensor_shape', 'print_tensor_lod', 'print_phase'], varargs=None, keywords=None, defaults=(-1, None, -1, True, True, True, True, 'both')) -paddle.fluid.layers.is_empty ArgSpec(args=['x', 'cond'], varargs=None, keywords='ignored', defaults=(None,)) -paddle.fluid.layers.sigmoid ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.logsigmoid ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.exp ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.tanh ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.tanh_shrink ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.softshrink ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.sqrt ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.abs ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.ceil ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.floor ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.cos ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.sin ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.round ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.reciprocal ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.square ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.softplus ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.softsign ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.uniform_random ArgSpec(args=['shape', 'dtype', 'min', 'max', 'seed'], varargs=None, keywords=None, defaults=(None, None, None, None)) -paddle.fluid.layers.hard_shrink ArgSpec(args=['x', 'threshold'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.cumsum ArgSpec(args=['x', 'axis', 'exclusive', 'reverse'], varargs=None, keywords=None, defaults=(None, None, None)) -paddle.fluid.layers.thresholded_relu ArgSpec(args=['x', 'threshold'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.prior_box ArgSpec(args=['input', 'image', 'min_sizes', 'max_sizes', 'aspect_ratios', 'variance', 'flip', 'clip', 'steps', 'offset', 'name', 'min_max_aspect_ratios_order'], varargs=None, keywords=None, defaults=(None, [1.0], [0.1, 0.1, 0.2, 0.2], False, False, [0.0, 0.0], 0.5, None, False)) -paddle.fluid.layers.multi_box_head ArgSpec(args=['inputs', 'image', 'base_size', 'num_classes', 'aspect_ratios', 'min_ratio', 'max_ratio', 'min_sizes', 'max_sizes', 'steps', 'step_w', 'step_h', 'offset', 'variance', 'flip', 'clip', 'kernel_size', 'pad', 'stride', 'name', 'min_max_aspect_ratios_order'], varargs=None, keywords=None, defaults=(None, None, None, None, None, None, None, 0.5, [0.1, 0.1, 0.2, 0.2], True, False, 1, 0, 1, None, False)) -paddle.fluid.layers.bipartite_match ArgSpec(args=['dist_matrix', 'match_type', 'dist_threshold', 'name'], varargs=None, keywords=None, defaults=(None, None, None)) -paddle.fluid.layers.target_assign ArgSpec(args=['input', 'matched_indices', 'negative_indices', 'mismatch_value', 'name'], varargs=None, keywords=None, defaults=(None, None, None)) -paddle.fluid.layers.detection_output ArgSpec(args=['loc', 'scores', 'prior_box', 'prior_box_var', 'background_label', 'nms_threshold', 'nms_top_k', 'keep_top_k', 'score_threshold', 'nms_eta'], varargs=None, keywords=None, defaults=(0, 0.3, 400, 200, 0.01, 1.0)) -paddle.fluid.layers.ssd_loss ArgSpec(args=['location', 'confidence', 'gt_box', 'gt_label', 'prior_box', 'prior_box_var', 'background_label', 'overlap_threshold', 'neg_pos_ratio', 'neg_overlap', 'loc_loss_weight', 'conf_loss_weight', 'match_type', 'mining_type', 'normalize', 'sample_size'], varargs=None, keywords=None, defaults=(None, 0, 0.5, 3.0, 0.5, 1.0, 1.0, 'per_prediction', 'max_negative', True, None)) -paddle.fluid.layers.detection_map ArgSpec(args=['detect_res', 'label', 'class_num', 'background_label', 'overlap_threshold', 'evaluate_difficult', 'has_state', 'input_states', 'out_states', 'ap_version'], varargs=None, keywords=None, defaults=(0, 0.3, True, None, None, None, 'integral')) -paddle.fluid.layers.rpn_target_assign ArgSpec(args=['bbox_pred', 'cls_logits', 'anchor_box', 'anchor_var', 'gt_boxes', 'is_crowd', 'im_info', 'rpn_batch_size_per_im', 'rpn_straddle_thresh', 'rpn_fg_fraction', 'rpn_positive_overlap', 'rpn_negative_overlap', 'use_random'], varargs=None, keywords=None, defaults=(256, 0.0, 0.5, 0.7, 0.3, True)) -paddle.fluid.layers.anchor_generator ArgSpec(args=['input', 'anchor_sizes', 'aspect_ratios', 'variance', 'stride', 'offset', 'name'], varargs=None, keywords=None, defaults=(None, None, [0.1, 0.1, 0.2, 0.2], None, 0.5, None)) -paddle.fluid.layers.roi_perspective_transform ArgSpec(args=['input', 'rois', 'transformed_height', 'transformed_width', 'spatial_scale'], varargs=None, keywords=None, defaults=(1.0,)) -paddle.fluid.layers.generate_proposal_labels ArgSpec(args=['rpn_rois', 'gt_classes', 'is_crowd', 'gt_boxes', 'im_info', 'batch_size_per_im', 'fg_fraction', 'fg_thresh', 'bg_thresh_hi', 'bg_thresh_lo', 'bbox_reg_weights', 'class_nums', 'use_random'], varargs=None, keywords=None, defaults=(256, 0.25, 0.25, 0.5, 0.0, [0.1, 0.1, 0.2, 0.2], None, True)) -paddle.fluid.layers.generate_proposals ArgSpec(args=['scores', 'bbox_deltas', 'im_info', 'anchors', 'variances', 'pre_nms_top_n', 'post_nms_top_n', 'nms_thresh', 'min_size', 'eta', 'name'], varargs=None, keywords=None, defaults=(6000, 1000, 0.5, 0.1, 1.0, None)) -paddle.fluid.layers.iou_similarity ArgSpec(args=['x', 'y', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.box_coder ArgSpec(args=['prior_box', 'prior_box_var', 'target_box', 'code_type', 'box_normalized', 'name'], varargs=None, keywords=None, defaults=('encode_center_size', True, None)) -paddle.fluid.layers.polygon_box_transform ArgSpec(args=['input', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.accuracy ArgSpec(args=['input', 'label', 'k', 'correct', 'total'], varargs=None, keywords=None, defaults=(1, None, None)) -paddle.fluid.layers.auc ArgSpec(args=['input', 'label', 'curve', 'num_thresholds', 'topk', 'slide_steps'], varargs=None, keywords=None, defaults=('ROC', 4095, 1, 1)) -paddle.fluid.layers.exponential_decay ArgSpec(args=['learning_rate', 'decay_steps', 'decay_rate', 'staircase'], varargs=None, keywords=None, defaults=(False,)) -paddle.fluid.layers.natural_exp_decay ArgSpec(args=['learning_rate', 'decay_steps', 'decay_rate', 'staircase'], varargs=None, keywords=None, defaults=(False,)) -paddle.fluid.layers.inverse_time_decay ArgSpec(args=['learning_rate', 'decay_steps', 'decay_rate', 'staircase'], varargs=None, keywords=None, defaults=(False,)) -paddle.fluid.layers.polynomial_decay ArgSpec(args=['learning_rate', 'decay_steps', 'end_learning_rate', 'power', 'cycle'], varargs=None, keywords=None, defaults=(0.0001, 1.0, False)) -paddle.fluid.layers.piecewise_decay ArgSpec(args=['boundaries', 'values'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.noam_decay ArgSpec(args=['d_model', 'warmup_steps'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.append_LARS ArgSpec(args=['params_grads', 'learning_rate', 'weight_decay'], varargs=None, keywords=None, defaults=None) -paddle.fluid.contrib.InitState.__init__ ArgSpec(args=['self', 'init', 'shape', 'value', 'init_boot', 'need_reorder', 'dtype'], varargs=None, keywords=None, defaults=(None, None, 0.0, None, False, 'float32')) -paddle.fluid.contrib.StateCell.__init__ ArgSpec(args=['self', 'inputs', 'states', 'out_state', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.contrib.StateCell.compute_state ArgSpec(args=['self', 'inputs'], varargs=None, keywords=None, defaults=None) -paddle.fluid.contrib.StateCell.get_input ArgSpec(args=['self', 'input_name'], varargs=None, keywords=None, defaults=None) -paddle.fluid.contrib.StateCell.get_state ArgSpec(args=['self', 'state_name'], varargs=None, keywords=None, defaults=None) -paddle.fluid.contrib.StateCell.out_state ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None) -paddle.fluid.contrib.StateCell.set_state ArgSpec(args=['self', 'state_name', 'state_value'], varargs=None, keywords=None, defaults=None) -paddle.fluid.contrib.StateCell.state_updater ArgSpec(args=['self', 'updater'], varargs=None, keywords=None, defaults=None) -paddle.fluid.contrib.StateCell.update_states ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None) -paddle.fluid.contrib.TrainingDecoder.__init__ ArgSpec(args=['self', 'state_cell', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.contrib.TrainingDecoder.block ArgSpec(args=[], varargs='args', keywords='kwds', defaults=None) -paddle.fluid.contrib.TrainingDecoder.output ArgSpec(args=['self'], varargs='outputs', keywords=None, defaults=None) -paddle.fluid.contrib.TrainingDecoder.static_input ArgSpec(args=['self', 'x'], varargs=None, keywords=None, defaults=None) -paddle.fluid.contrib.TrainingDecoder.step_input ArgSpec(args=['self', 'x'], varargs=None, keywords=None, defaults=None) -paddle.fluid.contrib.BeamSearchDecoder.__init__ ArgSpec(args=['self', 'state_cell', 'init_ids', 'init_scores', 'target_dict_dim', 'word_dim', 'input_var_dict', 'topk_size', 'sparse_emb', 'max_len', 'beam_size', 'end_id', 'name'], varargs=None, keywords=None, defaults=({}, 50, True, 100, 1, 1, None)) -paddle.fluid.contrib.BeamSearchDecoder.block ArgSpec(args=[], varargs='args', keywords='kwds', defaults=None) -paddle.fluid.contrib.BeamSearchDecoder.decode ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None) -paddle.fluid.contrib.BeamSearchDecoder.early_stop ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None) -paddle.fluid.contrib.BeamSearchDecoder.read_array ArgSpec(args=['self', 'init', 'is_ids', 'is_scores'], varargs=None, keywords=None, defaults=(False, False)) -paddle.fluid.contrib.BeamSearchDecoder.update_array ArgSpec(args=['self', 'array', 'value'], varargs=None, keywords=None, defaults=None) -paddle.fluid.contrib.memory_usage ArgSpec(args=['program', 'batch_size'], varargs=None, keywords=None, defaults=None) -paddle.fluid.contrib.op_freq_statistic ArgSpec(args=['program'], varargs=None, keywords=None, defaults=None) -paddle.fluid.contrib.QuantizeTranspiler.__init__ ArgSpec(args=['self', 'weight_bits', 'activation_bits', 'activation_quantize_type', 'weight_quantize_type', 'window_size'], varargs=None, keywords=None, defaults=(8, 8, 'abs_max', 'abs_max', 10000)) -paddle.fluid.contrib.QuantizeTranspiler.convert_to_int8 ArgSpec(args=['self', 'program', 'place', 'scope'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.contrib.QuantizeTranspiler.freeze_program ArgSpec(args=['self', 'program', 'place', 'fuse_bn', 'scope'], varargs=None, keywords=None, defaults=(False, None)) -paddle.fluid.contrib.QuantizeTranspiler.training_transpile ArgSpec(args=['self', 'program', 'startup_program'], varargs=None, keywords=None, defaults=(None, None)) -paddle.fluid.transpiler.DistributeTranspiler.__init__ ArgSpec(args=['self', 'config'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.transpiler.DistributeTranspiler.get_pserver_program ArgSpec(args=['self', 'endpoint'], varargs=None, keywords=None, defaults=None) -paddle.fluid.transpiler.DistributeTranspiler.get_pserver_programs ArgSpec(args=['self', 'endpoint'], varargs=None, keywords=None, defaults=None) -paddle.fluid.transpiler.DistributeTranspiler.get_startup_program ArgSpec(args=['self', 'endpoint', 'pserver_program', 'startup_program'], varargs=None, keywords=None, defaults=(None, None)) -paddle.fluid.transpiler.DistributeTranspiler.get_trainer_program ArgSpec(args=['self', 'wait_port'], varargs=None, keywords=None, defaults=(True,)) -paddle.fluid.transpiler.DistributeTranspiler.transpile ArgSpec(args=['self', 'trainer_id', 'program', 'pservers', 'trainers', 'sync_mode', 'startup_program', 'current_endpoint'], varargs=None, keywords=None, defaults=(None, '127.0.0.1:6174', 1, True, None, '127.0.0.1:6174')) -paddle.fluid.transpiler.memory_optimize ArgSpec(args=['input_program', 'skip_opt_set', 'print_log', 'level', 'skip_grads'], varargs=None, keywords=None, defaults=(None, False, 0, False)) -paddle.fluid.transpiler.release_memory ArgSpec(args=['input_program', 'skip_opt_set'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.transpiler.HashName.__init__ ArgSpec(args=['self', 'pserver_endpoints'], varargs=None, keywords=None, defaults=None) -paddle.fluid.transpiler.HashName.dispatch ArgSpec(args=['self', 'varlist'], varargs=None, keywords=None, defaults=None) -paddle.fluid.transpiler.HashName.reset ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None) -paddle.fluid.transpiler.RoundRobin.__init__ ArgSpec(args=['self', 'pserver_endpoints'], varargs=None, keywords=None, defaults=None) -paddle.fluid.transpiler.RoundRobin.dispatch ArgSpec(args=['self', 'varlist'], varargs=None, keywords=None, defaults=None) -paddle.fluid.transpiler.RoundRobin.reset ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None) +paddle.fluid.ParallelExecutor.__init__ (ArgSpec(args=['self', 'use_cuda', 'loss_name', 'main_program', 'share_vars_from', 'exec_strategy', 'build_strategy', 'num_trainers', 'trainer_id', 'scope'], varargs=None, keywords=None, defaults=(None, None, None, None, None, 1, 0, None)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.ParallelExecutor.run (ArgSpec(args=['self', 'fetch_list', 'feed', 'feed_dict', 'return_numpy'], varargs=None, keywords=None, defaults=(None, None, True)), ('document', '2cb4bd74481861345c70228a0f57620c')) +paddle.fluid.create_lod_tensor (ArgSpec(args=['data', 'recursive_seq_lens', 'place'], varargs=None, keywords=None, defaults=None), ('document', '8e7bb21e83ff4604f5b379672e285b94')) +paddle.fluid.create_random_int_lodtensor (ArgSpec(args=['recursive_seq_lens', 'base_shape', 'place', 'low', 'high'], varargs=None, keywords=None, defaults=None), ('document', '368f638b99f1dfe59e9b02aa6f077752')) +paddle.fluid.DataFeedDesc.__init__ (ArgSpec(args=['self', 'proto_file'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.DataFeedDesc.desc (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '4294493e31c4bc9fc4bd48753044235f')) +paddle.fluid.DataFeedDesc.set_batch_size (ArgSpec(args=['self', 'batch_size'], varargs=None, keywords=None, defaults=None), ('document', '8d9f44601e0a99dd431f14fd9250cd21')) +paddle.fluid.DataFeedDesc.set_dense_slots (ArgSpec(args=['self', 'dense_slots_name'], varargs=None, keywords=None, defaults=None), ('document', 'eb894b464bbcd1b4bc8038398954f766')) +paddle.fluid.DataFeedDesc.set_use_slots (ArgSpec(args=['self', 'use_slots_name'], varargs=None, keywords=None, defaults=None), ('document', '415c56600ce4e198c071cad01409a690')) +paddle.fluid.AsyncExecutor.__init__ (ArgSpec(args=['self', 'place', 'run_mode'], varargs=None, keywords=None, defaults=(None, '')), ('document', '4e85874dddcd06c38f5717992d741589')) +paddle.fluid.AsyncExecutor.config_distributed_nodes (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '762980fe0181eb41e3d1081b26ed76b1')) +paddle.fluid.AsyncExecutor.download_data (ArgSpec(args=['self', 'afs_path', 'local_path', 'fs_default_name', 'ugi', 'file_cnt', 'hadoop_home', 'process_num'], varargs=None, keywords=None, defaults=('$HADOOP_HOME', 12)), ('document', '39e3ccddf8ea8db75ea85287c9147c3b')) +paddle.fluid.AsyncExecutor.get_instance (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', 'f8688f76a2db1243c7097a60c507b182')) +paddle.fluid.AsyncExecutor.init_model (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '504f39be2007404a17e5cabea1256c7d')) +paddle.fluid.AsyncExecutor.init_server (ArgSpec(args=['self', 'dist_desc'], varargs=None, keywords=None, defaults=None), ('document', '384fa5fbb99912db1baf7ef7784bd312')) +paddle.fluid.AsyncExecutor.init_worker (ArgSpec(args=['self', 'dist_desc', 'startup_program'], varargs=None, keywords=None, defaults=None), ('document', 'f0a36d7c8561039f60a6f6555c7fee0b')) +paddle.fluid.AsyncExecutor.run (ArgSpec(args=['self', 'program', 'data_feed', 'filelist', 'thread_num', 'fetch', 'mode', 'debug'], varargs=None, keywords=None, defaults=('', False)), ('document', '848fc53484e8326f6325feea87fe955c')) +paddle.fluid.AsyncExecutor.save_model (ArgSpec(args=['self', 'save_path'], varargs=None, keywords=None, defaults=None), ('document', '145b5c0da01bfff397142e51361f4b75')) +paddle.fluid.AsyncExecutor.stop (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '5f23d043607bb5d55e466ec3f578e093')) +paddle.fluid.CompiledProgram.__init__ (ArgSpec(args=['self', 'program_or_graph'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.CompiledProgram.with_data_parallel (ArgSpec(args=['self', 'loss_name', 'build_strategy', 'exec_strategy', 'share_vars_from', 'places'], varargs=None, keywords=None, defaults=(None, None, None, None, None)), ('document', 'a8c7793803cf976680d9478e378fa356')) +paddle.fluid.CompiledProgram.with_inference_optimize (ArgSpec(args=['self', 'config'], varargs=None, keywords=None, defaults=None), ('document', '9e5b009d850191a010e859189c127fd8')) +paddle.fluid.ExecutionStrategy.__init__ __init__(self: paddle.fluid.core.ParallelExecutor.ExecutionStrategy) -> None +paddle.fluid.BuildStrategy.GradientScaleStrategy.__init__ __init__(self: paddle.fluid.core.ParallelExecutor.BuildStrategy.GradientScaleStrategy, arg0: int) -> None +paddle.fluid.BuildStrategy.ReduceStrategy.__init__ __init__(self: paddle.fluid.core.ParallelExecutor.BuildStrategy.ReduceStrategy, arg0: int) -> None +paddle.fluid.BuildStrategy.__init__ __init__(self: paddle.fluid.core.ParallelExecutor.BuildStrategy) -> None +paddle.fluid.io.save_vars (ArgSpec(args=['executor', 'dirname', 'main_program', 'vars', 'predicate', 'filename'], varargs=None, keywords=None, defaults=(None, None, None, None)), ('document', 'b55d6193a1d4198d45b013fc5779e1f2')) +paddle.fluid.io.save_params (ArgSpec(args=['executor', 'dirname', 'main_program', 'filename'], varargs=None, keywords=None, defaults=(None, None)), ('document', '3a7a99abac3e1bf898871fe609354218')) +paddle.fluid.io.save_persistables (ArgSpec(args=['executor', 'dirname', 'main_program', 'filename'], varargs=None, keywords=None, defaults=(None, None)), ('document', '9141bb5f32caf7975eb3fd88c8a1b2da')) +paddle.fluid.io.load_vars (ArgSpec(args=['executor', 'dirname', 'main_program', 'vars', 'predicate', 'filename'], varargs=None, keywords=None, defaults=(None, None, None, None)), ('document', '0a5308f496632ab1ec3ba1f1377e6f95')) +paddle.fluid.io.load_params (ArgSpec(args=['executor', 'dirname', 'main_program', 'filename'], varargs=None, keywords=None, defaults=(None, None)), ('document', '41779819cef32f2246e83aebc5a002e2')) +paddle.fluid.io.load_persistables (ArgSpec(args=['executor', 'dirname', 'main_program', 'filename'], varargs=None, keywords=None, defaults=(None, None)), ('document', '28df5bfe26ca7a077f91156abb0fe6d2')) +paddle.fluid.io.save_inference_model (ArgSpec(args=['dirname', 'feeded_var_names', 'target_vars', 'executor', 'main_program', 'model_filename', 'params_filename', 'export_for_deployment'], varargs=None, keywords=None, defaults=(None, None, None, True)), ('document', '70f4f53f13572436ac72d1c8b5efeb9d')) +paddle.fluid.io.load_inference_model (ArgSpec(args=['dirname', 'executor', 'model_filename', 'params_filename', 'pserver_endpoints'], varargs=None, keywords=None, defaults=(None, None, None)), ('document', '7a5255386075dac3c75b7058254fcdcb')) +paddle.fluid.io.PyReader.__init__ (ArgSpec(args=['self', 'feed_list', 'capacity', 'use_double_buffer', 'iterable'], varargs=None, keywords=None, defaults=(True, False)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.io.PyReader.decorate_batch_generator (ArgSpec(args=['self', 'reader', 'places'], varargs=None, keywords=None, defaults=(None,)), ('document', 'a3fefec8bacd6ce83f49906a9d05e779')) +paddle.fluid.io.PyReader.decorate_sample_generator (ArgSpec(args=['self', 'sample_generator', 'batch_size', 'drop_last', 'places'], varargs=None, keywords=None, defaults=(True, None)), ('document', '7abd9cf7d695bab5bb6cf7ded5903cb2')) +paddle.fluid.io.PyReader.decorate_sample_list_generator (ArgSpec(args=['self', 'reader', 'places'], varargs=None, keywords=None, defaults=(None,)), ('document', 'faef298f73e91aedcfaf5d184f3109b7')) +paddle.fluid.io.PyReader.reset (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', 'ff1cc1e2beb8824d453656c72c28ddfb')) +paddle.fluid.io.PyReader.start (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', 'b7ea0a548991924e4cfe61a577b8e56d')) +paddle.fluid.initializer.ConstantInitializer.__init__ (ArgSpec(args=['self', 'value', 'force_cpu'], varargs=None, keywords=None, defaults=(0.0, False)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.initializer.UniformInitializer.__init__ (ArgSpec(args=['self', 'low', 'high', 'seed'], varargs=None, keywords=None, defaults=(-1.0, 1.0, 0)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.initializer.NormalInitializer.__init__ (ArgSpec(args=['self', 'loc', 'scale', 'seed'], varargs=None, keywords=None, defaults=(0.0, 1.0, 0)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.initializer.TruncatedNormalInitializer.__init__ (ArgSpec(args=['self', 'loc', 'scale', 'seed'], varargs=None, keywords=None, defaults=(0.0, 1.0, 0)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.initializer.XavierInitializer.__init__ (ArgSpec(args=['self', 'uniform', 'fan_in', 'fan_out', 'seed'], varargs=None, keywords=None, defaults=(True, None, None, 0)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.initializer.BilinearInitializer.__init__ (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', 'd389912dc079cbef432335a00017cec0')) +paddle.fluid.initializer.MSRAInitializer.__init__ (ArgSpec(args=['self', 'uniform', 'fan_in', 'seed'], varargs=None, keywords=None, defaults=(True, None, 0)), ('document', '53c757bed9345f2ad3361902531e7cf5')) +paddle.fluid.initializer.force_init_on_cpu (ArgSpec(args=[], varargs=None, keywords=None, defaults=None), ('document', '6d0f3e22c90d9d500d36ff57daf056ee')) +paddle.fluid.initializer.init_on_cpu (ArgSpec(args=[], varargs=None, keywords=None, defaults=None), ('document', 'a6d7011ca3d8c0d454dac3a56eae0c29')) +paddle.fluid.initializer.NumpyArrayInitializer.__init__ (ArgSpec(args=['self', 'value'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.layers.fc (ArgSpec(args=['input', 'size', 'num_flatten_dims', 'param_attr', 'bias_attr', 'act', 'is_test', 'name'], varargs=None, keywords=None, defaults=(1, None, None, None, False, None)), ('document', '424e898365195e3ccbc2e7dc8b63605e')) +paddle.fluid.layers.embedding (ArgSpec(args=['input', 'size', 'is_sparse', 'is_distributed', 'padding_idx', 'param_attr', 'dtype'], varargs=None, keywords=None, defaults=(False, False, None, None, 'float32')), ('document', '89c2c55a0b0656b106064048e068e77a')) +paddle.fluid.layers.dynamic_lstm (ArgSpec(args=['input', 'size', 'h_0', 'c_0', 'param_attr', 'bias_attr', 'use_peepholes', 'is_reverse', 'gate_activation', 'cell_activation', 'candidate_activation', 'dtype', 'name'], varargs=None, keywords=None, defaults=(None, None, None, None, True, False, 'sigmoid', 'tanh', 'tanh', 'float32', None)), ('document', 'dfbb624f85015df29e994ca6999e8ff6')) +paddle.fluid.layers.dynamic_lstmp (ArgSpec(args=['input', 'size', 'proj_size', 'param_attr', 'bias_attr', 'use_peepholes', 'is_reverse', 'gate_activation', 'cell_activation', 'candidate_activation', 'proj_activation', 'dtype', 'name', 'h_0', 'c_0', 'cell_clip', 'proj_clip'], varargs=None, keywords=None, defaults=(None, None, True, False, 'sigmoid', 'tanh', 'tanh', 'tanh', 'float32', None, None, None, None, None)), ('document', 'b4b608b986eb9617aa0525e1be21d32d')) +paddle.fluid.layers.dynamic_gru (ArgSpec(args=['input', 'size', 'param_attr', 'bias_attr', 'is_reverse', 'gate_activation', 'candidate_activation', 'h_0', 'origin_mode'], varargs=None, keywords=None, defaults=(None, None, False, 'sigmoid', 'tanh', None, False)), ('document', '4ec4845fd7d991bcac822f8b0dfc101f')) +paddle.fluid.layers.gru_unit (ArgSpec(args=['input', 'hidden', 'size', 'param_attr', 'bias_attr', 'activation', 'gate_activation', 'origin_mode'], varargs=None, keywords=None, defaults=(None, None, 'tanh', 'sigmoid', False)), ('document', 'e0e2439f7af069b57badca18a6ba60b8')) +paddle.fluid.layers.linear_chain_crf (ArgSpec(args=['input', 'label', 'param_attr'], varargs=None, keywords=None, defaults=(None,)), ('document', '7c49ef4bbf0adfd4b9a1d98e2e5f3fea')) +paddle.fluid.layers.crf_decoding (ArgSpec(args=['input', 'param_attr', 'label'], varargs=None, keywords=None, defaults=(None,)), ('document', '7642373ab65d3fc3b96d16d10fef1538')) +paddle.fluid.layers.cos_sim (ArgSpec(args=['X', 'Y'], varargs=None, keywords=None, defaults=None), ('document', 'd740824aa7316b807c4b4a3c6c8c0bbe')) +paddle.fluid.layers.cross_entropy (ArgSpec(args=['input', 'label', 'soft_label', 'ignore_index'], varargs=None, keywords=None, defaults=(False, -100)), ('document', '025b364dafb4b7975c801eb33e7831a1')) +paddle.fluid.layers.bpr_loss (ArgSpec(args=['input', 'label', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '30add751a0f99347a6257634c03ff254')) +paddle.fluid.layers.square_error_cost (ArgSpec(args=['input', 'label'], varargs=None, keywords=None, defaults=None), ('document', '44b6eef4a0f2bc15f7d9745782406736')) +paddle.fluid.layers.chunk_eval (ArgSpec(args=['input', 'label', 'chunk_scheme', 'num_chunk_types', 'excluded_chunk_types'], varargs=None, keywords=None, defaults=(None,)), ('document', 'ee152a7ba3036e7b9ede9184545179b4')) +paddle.fluid.layers.sequence_conv (ArgSpec(args=['input', 'num_filters', 'filter_size', 'filter_stride', 'padding', 'bias_attr', 'param_attr', 'act', 'name'], varargs=None, keywords=None, defaults=(3, 1, None, None, None, None, None)), ('document', 'b6543768e1afaa2ecb869709d6e9c7e2')) +paddle.fluid.layers.conv2d (ArgSpec(args=['input', 'num_filters', 'filter_size', 'stride', 'padding', 'dilation', 'groups', 'param_attr', 'bias_attr', 'use_cudnn', 'act', 'name'], varargs=None, keywords=None, defaults=(1, 0, 1, None, None, None, True, None, None)), ('document', '8ca6121acd6d23cd8806a93f493c2e17')) +paddle.fluid.layers.conv3d (ArgSpec(args=['input', 'num_filters', 'filter_size', 'stride', 'padding', 'dilation', 'groups', 'param_attr', 'bias_attr', 'use_cudnn', 'act', 'name'], varargs=None, keywords=None, defaults=(1, 0, 1, None, None, None, True, None, None)), ('document', '37042620f9bd3a2da6e5d3138b2f724b')) +paddle.fluid.layers.sequence_pool (ArgSpec(args=['input', 'pool_type', 'is_test'], varargs=None, keywords=None, defaults=(False,)), ('document', 'a194fb80614023f543df3949fbd0d0b8')) +paddle.fluid.layers.sequence_softmax (ArgSpec(args=['input', 'use_cudnn', 'name'], varargs=None, keywords=None, defaults=(False, None)), ('document', '19ef6f9cdd27feac8a1ae060f19c10b4')) +paddle.fluid.layers.softmax (ArgSpec(args=['input', 'use_cudnn', 'name', 'axis'], varargs=None, keywords=None, defaults=(False, None, -1)), ('document', '59b1c6bf2f0fa9dc649c85fef3a3b2ea')) +paddle.fluid.layers.pool2d (ArgSpec(args=['input', 'pool_size', 'pool_type', 'pool_stride', 'pool_padding', 'global_pooling', 'use_cudnn', 'ceil_mode', 'name', 'exclusive'], varargs=None, keywords=None, defaults=(-1, 'max', 1, 0, False, True, False, None, True)), ('document', 'bbd84e855e660cd1084bb71a2fd0cdaa')) +paddle.fluid.layers.pool3d (ArgSpec(args=['input', 'pool_size', 'pool_type', 'pool_stride', 'pool_padding', 'global_pooling', 'use_cudnn', 'ceil_mode', 'name', 'exclusive'], varargs=None, keywords=None, defaults=(-1, 'max', 1, 0, False, True, False, None, True)), ('document', '043de7333b79ee0ac55053c14ed81625')) +paddle.fluid.layers.adaptive_pool2d (ArgSpec(args=['input', 'pool_size', 'pool_type', 'require_index', 'name'], varargs=None, keywords=None, defaults=('max', False, None)), ('document', '859b887174d06f361658f69cb7c06d95')) +paddle.fluid.layers.adaptive_pool3d (ArgSpec(args=['input', 'pool_size', 'pool_type', 'require_index', 'name'], varargs=None, keywords=None, defaults=('max', False, None)), ('document', '120f4323a3d7ed9c0916f15a59f0e497')) +paddle.fluid.layers.batch_norm (ArgSpec(args=['input', 'act', 'is_test', 'momentum', 'epsilon', 'param_attr', 'bias_attr', 'data_layout', 'in_place', 'name', 'moving_mean_name', 'moving_variance_name', 'do_model_average_for_mean_and_var', 'fuse_with_relu', 'use_global_stats'], varargs=None, keywords=None, defaults=(None, False, 0.9, 1e-05, None, None, 'NCHW', False, None, None, None, False, False, False)), ('document', '320c6973b02ea179fa89fecc80796464')) +paddle.fluid.layers.data_norm (ArgSpec(args=['input', 'act', 'epsilon', 'param_attr', 'data_layout', 'in_place', 'name', 'moving_mean_name', 'moving_variance_name', 'do_model_average_for_mean_and_var'], varargs=None, keywords=None, defaults=(None, 1e-05, None, 'NCHW', False, None, None, None, False)), ('document', 'e45e09e65a2658e07cad987222f0d9ab')) +paddle.fluid.layers.beam_search_decode (ArgSpec(args=['ids', 'scores', 'beam_size', 'end_id', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', 'b0b8d53821716cd50c42e09b593f3feb')) +paddle.fluid.layers.conv2d_transpose (ArgSpec(args=['input', 'num_filters', 'output_size', 'filter_size', 'padding', 'stride', 'dilation', 'groups', 'param_attr', 'bias_attr', 'use_cudnn', 'act', 'name'], varargs=None, keywords=None, defaults=(None, None, 0, 1, 1, None, None, None, True, None, None)), ('document', '03993955ab1e6d3044c44e6f17fc85e9')) +paddle.fluid.layers.conv3d_transpose (ArgSpec(args=['input', 'num_filters', 'output_size', 'filter_size', 'padding', 'stride', 'dilation', 'groups', 'param_attr', 'bias_attr', 'use_cudnn', 'act', 'name'], varargs=None, keywords=None, defaults=(None, None, 0, 1, 1, None, None, None, True, None, None)), ('document', 'ec113c6a3686ac94f8fccd1a7953d445')) +paddle.fluid.layers.sequence_expand (ArgSpec(args=['x', 'y', 'ref_level', 'name'], varargs=None, keywords=None, defaults=(-1, None)), ('document', '79c375214fa427faac504043d162dae9')) +paddle.fluid.layers.sequence_expand_as (ArgSpec(args=['x', 'y', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '9d2611f84ab364c5da545e6a82f1770a')) +paddle.fluid.layers.sequence_pad (ArgSpec(args=['x', 'pad_value', 'maxlen', 'name'], varargs=None, keywords=None, defaults=(None, None)), ('document', '6a1adf3067b20f6e4bcb354d71c19184')) +paddle.fluid.layers.sequence_unpad (ArgSpec(args=['x', 'length', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', 'd12803c903c99aa36ec03aaac5f0cc5b')) +paddle.fluid.layers.lstm_unit (ArgSpec(args=['x_t', 'hidden_t_prev', 'cell_t_prev', 'forget_bias', 'param_attr', 'bias_attr', 'name'], varargs=None, keywords=None, defaults=(0.0, None, None, None)), ('document', '027723966f3ef0d7bc598f22287a96cc')) +paddle.fluid.layers.reduce_sum (ArgSpec(args=['input', 'dim', 'keep_dim', 'name'], varargs=None, keywords=None, defaults=(None, False, None)), ('document', 'b69998ce3ff4980fb21da0df05565f1b')) +paddle.fluid.layers.reduce_mean (ArgSpec(args=['input', 'dim', 'keep_dim', 'name'], varargs=None, keywords=None, defaults=(None, False, None)), ('document', 'd4d80dd98a1a5839f41eeb3a0f85f370')) +paddle.fluid.layers.reduce_max (ArgSpec(args=['input', 'dim', 'keep_dim', 'name'], varargs=None, keywords=None, defaults=(None, False, None)), ('document', '66a622db727551761ce4eb73eaa7f6a4')) +paddle.fluid.layers.reduce_min (ArgSpec(args=['input', 'dim', 'keep_dim', 'name'], varargs=None, keywords=None, defaults=(None, False, None)), ('document', 'd50ac552b5d131468ed466d08bb2d38c')) +paddle.fluid.layers.reduce_prod (ArgSpec(args=['input', 'dim', 'keep_dim', 'name'], varargs=None, keywords=None, defaults=(None, False, None)), ('document', 'fcd8301a0ce15f219c7a4bcd0c1e8eca')) +paddle.fluid.layers.sequence_first_step (ArgSpec(args=['input'], varargs=None, keywords=None, defaults=None), ('document', '2b290d3d77882bfe9bb8d331cac8cdd3')) +paddle.fluid.layers.sequence_last_step (ArgSpec(args=['input'], varargs=None, keywords=None, defaults=None), ('document', 'c16a892f44f7fe71bfa5afc32d3f34ce')) +paddle.fluid.layers.sequence_slice (ArgSpec(args=['input', 'offset', 'length', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', 'fdcea0e8b5bc7d8d4b1b072c521014e6')) +paddle.fluid.layers.dropout (ArgSpec(args=['x', 'dropout_prob', 'is_test', 'seed', 'name', 'dropout_implementation'], varargs=None, keywords=None, defaults=(False, None, None, 'downgrade_in_infer')), ('document', 'f1dd22f7351f7f9853212958e0d8aa7a')) +paddle.fluid.layers.split (ArgSpec(args=['input', 'num_or_sections', 'dim', 'name'], varargs=None, keywords=None, defaults=(-1, None)), ('document', '652625345c2acb900029c78cc75f8aa6')) +paddle.fluid.layers.ctc_greedy_decoder (ArgSpec(args=['input', 'blank', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', 'ebbf2adbd79683dc93db03454dfa18c2')) +paddle.fluid.layers.edit_distance (ArgSpec(args=['input', 'label', 'normalized', 'ignored_tokens'], varargs=None, keywords=None, defaults=(True, None)), ('document', '97f0262f97602644c83142789d784571')) +paddle.fluid.layers.l2_normalize (ArgSpec(args=['x', 'axis', 'epsilon', 'name'], varargs=None, keywords=None, defaults=(1e-12, None)), ('document', '35c6a241bcc1a1fc89508860d82ad62b')) +paddle.fluid.layers.matmul (ArgSpec(args=['x', 'y', 'transpose_x', 'transpose_y', 'alpha', 'name'], varargs=None, keywords=None, defaults=(False, False, 1.0, None)), ('document', 'b4cbe1ac451005df6dad12e9ffdccca9')) +paddle.fluid.layers.topk (ArgSpec(args=['input', 'k', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', 'd3570c02f71bcd78e60b3f31dc8f5b32')) +paddle.fluid.layers.warpctc (ArgSpec(args=['input', 'label', 'blank', 'norm_by_times', 'use_cudnn'], varargs=None, keywords=None, defaults=(0, False, False)), ('document', 'aaba49c038ba927f0a8e45c0c9a686ab')) +paddle.fluid.layers.sequence_reshape (ArgSpec(args=['input', 'new_dim'], varargs=None, keywords=None, defaults=None), ('document', 'a10ab9bf88d4a7e328882d411abb6fd1')) +paddle.fluid.layers.transpose (ArgSpec(args=['x', 'perm', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', 'a1feac48b843d679db82312dc85885f4')) +paddle.fluid.layers.im2sequence (ArgSpec(args=['input', 'filter_size', 'stride', 'padding', 'input_image_size', 'out_stride', 'name'], varargs=None, keywords=None, defaults=(1, 1, 0, None, 1, None)), ('document', '3ce01160ede80b1c26f776f8fef9340f')) +paddle.fluid.layers.nce (ArgSpec(args=['input', 'label', 'num_total_classes', 'sample_weight', 'param_attr', 'bias_attr', 'num_neg_samples', 'name', 'sampler', 'custom_dist', 'seed', 'is_sparse'], varargs=None, keywords=None, defaults=(None, None, None, None, None, 'uniform', None, 0, False)), ('document', 'fddad4896dee5193e1cdf70882c2a347')) +paddle.fluid.layers.sampled_softmax_with_cross_entropy (ArgSpec(args=['logits', 'label', 'num_samples', 'num_true', 'remove_accidental_hits', 'use_customized_samples', 'customized_samples', 'customized_probabilities', 'seed'], varargs=None, keywords=None, defaults=(1, True, False, None, None, 0)), ('document', '5db30b8a74e8c93687943a3e8d221da0')) +paddle.fluid.layers.hsigmoid (ArgSpec(args=['input', 'label', 'num_classes', 'param_attr', 'bias_attr', 'name', 'path_table', 'path_code', 'is_custom', 'is_sparse'], varargs=None, keywords=None, defaults=(None, None, None, None, None, False, False)), ('document', '80641ee6810b1cdc3fd6e14fc89ecc9d')) +paddle.fluid.layers.beam_search (ArgSpec(args=['pre_ids', 'pre_scores', 'ids', 'scores', 'beam_size', 'end_id', 'level', 'is_accumulated', 'name', 'return_parent_idx'], varargs=None, keywords=None, defaults=(0, True, None, False)), ('document', 'b350b9a30a18e7efd7e1bb740eef6996')) +paddle.fluid.layers.row_conv (ArgSpec(args=['input', 'future_context_size', 'param_attr', 'act'], varargs=None, keywords=None, defaults=(None, None)), ('document', '17485788fffe4e2d36dc58c2ac8d174e')) +paddle.fluid.layers.multiplex (ArgSpec(args=['inputs', 'index'], varargs=None, keywords=None, defaults=None), ('document', '2c4d1ae83da6ed35e3b36ba1b3b51d23')) +paddle.fluid.layers.layer_norm (ArgSpec(args=['input', 'scale', 'shift', 'begin_norm_axis', 'epsilon', 'param_attr', 'bias_attr', 'act', 'name'], varargs=None, keywords=None, defaults=(True, True, 1, 1e-05, None, None, None, None)), ('document', 'de6a906950bae9f3c245cb744d22b94e')) +paddle.fluid.layers.group_norm (ArgSpec(args=['input', 'groups', 'epsilon', 'param_attr', 'bias_attr', 'act', 'data_layout', 'name'], varargs=None, keywords=None, defaults=(1e-05, None, None, None, 'NCHW', None)), ('document', '419c3a24a83cc89219a029cf4092788b')) +paddle.fluid.layers.spectral_norm (ArgSpec(args=['weight', 'dim', 'power_iters', 'eps', 'name'], varargs=None, keywords=None, defaults=(0, 1, 1e-12, None)), ('document', '3f536aafba30d793287b52d231baff1b')) +paddle.fluid.layers.softmax_with_cross_entropy (ArgSpec(args=['logits', 'label', 'soft_label', 'ignore_index', 'numeric_stable_mode', 'return_softmax'], varargs=None, keywords=None, defaults=(False, -100, True, False)), ('document', 'bce1b75e3d95b75cacd1099655cbb3c3')) +paddle.fluid.layers.smooth_l1 (ArgSpec(args=['x', 'y', 'inside_weight', 'outside_weight', 'sigma'], varargs=None, keywords=None, defaults=(None, None, None)), ('document', 'c6b175d253c55baf4b9c0eca9b1dda88')) +paddle.fluid.layers.one_hot (ArgSpec(args=['input', 'depth'], varargs=None, keywords=None, defaults=None), ('document', '6148b6a555cbfb62fdcd030d8982c18c')) +paddle.fluid.layers.autoincreased_step_counter (ArgSpec(args=['counter_name', 'begin', 'step'], varargs=None, keywords=None, defaults=(None, 1, 1)), ('document', '3f6c828594720c9b2da89c464be94478')) +paddle.fluid.layers.reshape (ArgSpec(args=['x', 'shape', 'actual_shape', 'act', 'inplace', 'name'], varargs=None, keywords=None, defaults=(None, None, False, None)), ('document', '323c019f257e55ddea4a824a362de62f')) +paddle.fluid.layers.squeeze (ArgSpec(args=['input', 'axes', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '3229d06517f794e86ca3da14c38b1465')) +paddle.fluid.layers.unsqueeze (ArgSpec(args=['input', 'axes', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', 'bbd62da391b1df984a1909d069a759b2')) +paddle.fluid.layers.lod_reset (ArgSpec(args=['x', 'y', 'target_lod'], varargs=None, keywords=None, defaults=(None, None)), ('document', 'f122194c562bd674f6ecdccf33785f99')) +paddle.fluid.layers.lrn (ArgSpec(args=['input', 'n', 'k', 'alpha', 'beta', 'name'], varargs=None, keywords=None, defaults=(5, 1.0, 0.0001, 0.75, None)), ('document', '0795e9940e42dcd62953514ff7e09f77')) +paddle.fluid.layers.pad (ArgSpec(args=['x', 'paddings', 'pad_value', 'name'], varargs=None, keywords=None, defaults=(0.0, None)), ('document', '2f28153bdd2d5ea6f7bad5867bd03eeb')) +paddle.fluid.layers.pad_constant_like (ArgSpec(args=['x', 'y', 'pad_value', 'name'], varargs=None, keywords=None, defaults=(0.0, None)), ('document', 'd2e1f45fef51b2c214e3f2aa8976c46c')) +paddle.fluid.layers.label_smooth (ArgSpec(args=['label', 'prior_dist', 'epsilon', 'dtype', 'name'], varargs=None, keywords=None, defaults=(None, 0.1, 'float32', None)), ('document', '70c113658102a11cc5d8e3d45145737a')) +paddle.fluid.layers.roi_pool (ArgSpec(args=['input', 'rois', 'pooled_height', 'pooled_width', 'spatial_scale'], varargs=None, keywords=None, defaults=(1, 1, 1.0)), ('document', 'c317aa595deb31649083c8faa91cdb97')) +paddle.fluid.layers.roi_align (ArgSpec(args=['input', 'rois', 'pooled_height', 'pooled_width', 'spatial_scale', 'sampling_ratio', 'name'], varargs=None, keywords=None, defaults=(1, 1, 1.0, -1, None)), ('document', '12c5bbb8b38c42e623fbc47611d766e1')) +paddle.fluid.layers.dice_loss (ArgSpec(args=['input', 'label', 'epsilon'], varargs=None, keywords=None, defaults=(1e-05,)), ('document', '1ba0508d573f65feecf3564dce22aa1d')) +paddle.fluid.layers.image_resize (ArgSpec(args=['input', 'out_shape', 'scale', 'name', 'resample', 'actual_shape', 'align_corners', 'align_mode'], varargs=None, keywords=None, defaults=(None, None, None, 'BILINEAR', None, True, 1)), ('document', 'd1b08c11bb9277386fcf6ae70b6622d1')) +paddle.fluid.layers.image_resize_short (ArgSpec(args=['input', 'out_short_len', 'resample'], varargs=None, keywords=None, defaults=('BILINEAR',)), ('document', '06211aefc50c5a3e940d7204d859cdf7')) +paddle.fluid.layers.resize_bilinear (ArgSpec(args=['input', 'out_shape', 'scale', 'name', 'actual_shape', 'align_corners', 'align_mode'], varargs=None, keywords=None, defaults=(None, None, None, None, True, 1)), ('document', 'c45591fbc4f64a178fbca219e1546a58')) +paddle.fluid.layers.resize_nearest (ArgSpec(args=['input', 'out_shape', 'scale', 'name', 'actual_shape', 'align_corners'], varargs=None, keywords=None, defaults=(None, None, None, None, True)), ('document', 'ae6d73cdc7f3a138d8a338ecdb33c1ae')) +paddle.fluid.layers.gather (ArgSpec(args=['input', 'index'], varargs=None, keywords=None, defaults=None), ('document', '98f1c86716b9b7f4dda83f20e2adeee2')) +paddle.fluid.layers.scatter (ArgSpec(args=['input', 'index', 'updates', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '65f8e9d8ddfd0b412f940579c4faa342')) +paddle.fluid.layers.sequence_scatter (ArgSpec(args=['input', 'index', 'updates', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '15b522457dfef103f0c20ca9d397678b')) +paddle.fluid.layers.random_crop (ArgSpec(args=['x', 'shape', 'seed'], varargs=None, keywords=None, defaults=(None,)), ('document', 'c9ab9e460ef0a1823249935a30e82c66')) +paddle.fluid.layers.mean_iou (ArgSpec(args=['input', 'label', 'num_classes'], varargs=None, keywords=None, defaults=None), ('document', '35cbbdfa585d027bb490707c95a176b9')) +paddle.fluid.layers.relu (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '866ffa1cc93f29e23662b526a7596537')) +paddle.fluid.layers.selu (ArgSpec(args=['x', 'scale', 'alpha', 'name'], varargs=None, keywords=None, defaults=(None, None, None)), ('document', '9044c7fe667b76cb2d9264f2db11f417')) +paddle.fluid.layers.log (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '98247c59d1c9b40af6730001b2aea73d')) +paddle.fluid.layers.crop (ArgSpec(args=['x', 'shape', 'offsets', 'name'], varargs=None, keywords=None, defaults=(None, None, None)), ('document', '883104791204d3127e24234bb630b2e7')) +paddle.fluid.layers.rank_loss (ArgSpec(args=['label', 'left', 'right', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', 'c542e39ac6add24a6bef6e79bf5617e2')) +paddle.fluid.layers.margin_rank_loss (ArgSpec(args=['label', 'left', 'right', 'margin', 'name'], varargs=None, keywords=None, defaults=(0.1, None)), ('document', '6d19dcc19917080b7ff3e03bde451bc8')) +paddle.fluid.layers.elu (ArgSpec(args=['x', 'alpha', 'name'], varargs=None, keywords=None, defaults=(1.0, None)), ('document', '463258ee9f8b60760eb1e26357cc9bfa')) +paddle.fluid.layers.relu6 (ArgSpec(args=['x', 'threshold', 'name'], varargs=None, keywords=None, defaults=(6.0, None)), ('document', '6f367339caf6c7124bc262fe1475df70')) +paddle.fluid.layers.pow (ArgSpec(args=['x', 'factor', 'name'], varargs=None, keywords=None, defaults=(1.0, None)), ('document', 'a5117c1eb84aca2ac0b0abab337a4799')) +paddle.fluid.layers.stanh (ArgSpec(args=['x', 'scale_a', 'scale_b', 'name'], varargs=None, keywords=None, defaults=(0.6666666666666666, 1.7159, None)), ('document', '959936a477efc6c1447a9c8bf8ce94bb')) +paddle.fluid.layers.hard_sigmoid (ArgSpec(args=['x', 'slope', 'offset', 'name'], varargs=None, keywords=None, defaults=(0.2, 0.5, None)), ('document', 'c82059b6fea1aa730f9aac911807b756')) +paddle.fluid.layers.swish (ArgSpec(args=['x', 'beta', 'name'], varargs=None, keywords=None, defaults=(1.0, None)), ('document', 'ef745e55a48763ee7b46b21a81dc7e84')) +paddle.fluid.layers.prelu (ArgSpec(args=['x', 'mode', 'param_attr', 'name'], varargs=None, keywords=None, defaults=(None, None)), ('document', 'f6acef7ff7d887e49ff499fbb1dad4a9')) +paddle.fluid.layers.brelu (ArgSpec(args=['x', 't_min', 't_max', 'name'], varargs=None, keywords=None, defaults=(0.0, 24.0, None)), ('document', '3db337c195e156e6ef2b8b4a57113600')) +paddle.fluid.layers.leaky_relu (ArgSpec(args=['x', 'alpha', 'name'], varargs=None, keywords=None, defaults=(0.02, None)), ('document', 'f878486c82b576938151daad0de995a0')) +paddle.fluid.layers.soft_relu (ArgSpec(args=['x', 'threshold', 'name'], varargs=None, keywords=None, defaults=(40.0, None)), ('document', '869adce548c342d6cc1bd88a948d83c9')) +paddle.fluid.layers.flatten (ArgSpec(args=['x', 'axis', 'name'], varargs=None, keywords=None, defaults=(1, None)), ('document', 'cb295c13cb957db85cd9609269d7784d')) +paddle.fluid.layers.sequence_mask (ArgSpec(args=['x', 'maxlen', 'dtype', 'name'], varargs=None, keywords=None, defaults=(None, 'int64', None)), ('document', 'f0dd6eddd3bff015a3c05269d82fcbd8')) +paddle.fluid.layers.stack (ArgSpec(args=['x', 'axis'], varargs=None, keywords=None, defaults=(0,)), ('document', '367cfbb642839beacb5d117e2d2b4041')) +paddle.fluid.layers.pad2d (ArgSpec(args=['input', 'paddings', 'mode', 'pad_value', 'data_format', 'name'], varargs=None, keywords=None, defaults=([0, 0, 0, 0], 'constant', 0.0, 'NCHW', None)), ('document', '7f4d46320cc077ca2e8db600c35f4030')) +paddle.fluid.layers.unstack (ArgSpec(args=['x', 'axis', 'num'], varargs=None, keywords=None, defaults=(0, None)), ('document', '98eb9d633116efcfc6f90c114bd44fd6')) +paddle.fluid.layers.sequence_enumerate (ArgSpec(args=['input', 'win_size', 'pad_value', 'name'], varargs=None, keywords=None, defaults=(0, None)), ('document', 'f6028537085dc296103bbbd85fa7763d')) +paddle.fluid.layers.expand (ArgSpec(args=['x', 'expand_times', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '117d3607d1ffa0571835bbaebc7857ff')) +paddle.fluid.layers.sequence_concat (ArgSpec(args=['input', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '3a1d155dd1bf6e72a0a3e3e1519591d1')) +paddle.fluid.layers.scale (ArgSpec(args=['x', 'scale', 'bias', 'bias_after_scale', 'act', 'name'], varargs=None, keywords=None, defaults=(1.0, 0.0, True, None, None)), ('document', '30190413b2fa442e7466d6cf2ce5ea07')) +paddle.fluid.layers.elementwise_add (ArgSpec(args=['x', 'y', 'axis', 'act', 'name'], varargs=None, keywords=None, defaults=(-1, None, None)), ('document', '6bfbe72cbadc95ac7ab88c05ed5bf9f0')) +paddle.fluid.layers.elementwise_div (ArgSpec(args=['x', 'y', 'axis', 'act', 'name'], varargs=None, keywords=None, defaults=(-1, None, None)), ('document', 'cc6e6cc1cb942a152dde3ef08d5f165c')) +paddle.fluid.layers.elementwise_sub (ArgSpec(args=['x', 'y', 'axis', 'act', 'name'], varargs=None, keywords=None, defaults=(-1, None, None)), ('document', 'a12abdab09c3e57af5a6e1e9f138684a')) +paddle.fluid.layers.elementwise_mul (ArgSpec(args=['x', 'y', 'axis', 'act', 'name'], varargs=None, keywords=None, defaults=(-1, None, None)), ('document', '422c77dbfcff355a57b5fdd4ec876daa')) +paddle.fluid.layers.elementwise_max (ArgSpec(args=['x', 'y', 'axis', 'act', 'name'], varargs=None, keywords=None, defaults=(-1, None, None)), ('document', 'f0bb0b2c454541cfafa761021a5cc776')) +paddle.fluid.layers.elementwise_min (ArgSpec(args=['x', 'y', 'axis', 'act', 'name'], varargs=None, keywords=None, defaults=(-1, None, None)), ('document', '8a9cdefefbccbf9f6b0991c0946a21e9')) +paddle.fluid.layers.elementwise_pow (ArgSpec(args=['x', 'y', 'axis', 'act', 'name'], varargs=None, keywords=None, defaults=(-1, None, None)), ('document', '1aea4e197c552a284f83888a3c67a32e')) +paddle.fluid.layers.uniform_random_batch_size_like (ArgSpec(args=['input', 'shape', 'dtype', 'input_dim_idx', 'output_dim_idx', 'min', 'max', 'seed'], varargs=None, keywords=None, defaults=('float32', 0, 0, -1.0, 1.0, 0)), ('document', '129e0a3257f1d532a948eedf9d5bf671')) +paddle.fluid.layers.gaussian_random (ArgSpec(args=['shape', 'mean', 'std', 'seed', 'dtype'], varargs=None, keywords=None, defaults=(0.0, 1.0, 0, 'float32')), ('document', '389dafe36e099841b6a7fb18d11f1b4c')) +paddle.fluid.layers.sampling_id (ArgSpec(args=['x', 'min', 'max', 'seed', 'dtype'], varargs=None, keywords=None, defaults=(0.0, 1.0, 0, 'float32')), ('document', '840fdac643d1341c1cae218d4511dbb9')) +paddle.fluid.layers.gaussian_random_batch_size_like (ArgSpec(args=['input', 'shape', 'input_dim_idx', 'output_dim_idx', 'mean', 'std', 'seed', 'dtype'], varargs=None, keywords=None, defaults=(0, 0, 0.0, 1.0, 0, 'float32')), ('document', '840026b4766613c5705e06563cd103b6')) +paddle.fluid.layers.sum (ArgSpec(args=['x'], varargs=None, keywords=None, defaults=None), ('document', 'a418e3ccb5e2ac21bd60f5cc221d5860')) +paddle.fluid.layers.slice (ArgSpec(args=['input', 'axes', 'starts', 'ends'], varargs=None, keywords=None, defaults=None), ('document', '01dbb91e7c74cb11336cd531013de51a')) +paddle.fluid.layers.shape (ArgSpec(args=['input'], varargs=None, keywords=None, defaults=None), ('document', '17db0f814eb7bb5a3fac1ca6e60e16d8')) +paddle.fluid.layers.logical_and (ArgSpec(args=['x', 'y', 'out', 'name'], varargs=None, keywords=None, defaults=(None, None)), ('document', 'cdcf20c494c92060d10feb9374532f42')) +paddle.fluid.layers.logical_or (ArgSpec(args=['x', 'y', 'out', 'name'], varargs=None, keywords=None, defaults=(None, None)), ('document', '0eae3f726a4afe590757552fa3ced012')) +paddle.fluid.layers.logical_xor (ArgSpec(args=['x', 'y', 'out', 'name'], varargs=None, keywords=None, defaults=(None, None)), ('document', 'b0daaa3fa4a0aa62f9b58c43d959eb25')) +paddle.fluid.layers.logical_not (ArgSpec(args=['x', 'out', 'name'], varargs=None, keywords=None, defaults=(None, None)), ('document', 'cd1c8cf31e040427d4e05711044caeb6')) +paddle.fluid.layers.clip (ArgSpec(args=['x', 'min', 'max', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', 'b020b7aab59719be98a4ae229a76deba')) +paddle.fluid.layers.clip_by_norm (ArgSpec(args=['x', 'max_norm', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', 'a1ea0bc5a926f427458c4254ca022749')) +paddle.fluid.layers.mean (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', 'd638d915195ce86a8d7963b81110d4c8')) +paddle.fluid.layers.mul (ArgSpec(args=['x', 'y', 'x_num_col_dims', 'y_num_col_dims', 'name'], varargs=None, keywords=None, defaults=(1, 1, None)), ('document', 'ccd37fa6b53f074adbfb732d738c4c2d')) +paddle.fluid.layers.sigmoid_cross_entropy_with_logits (ArgSpec(args=['x', 'label', 'ignore_index', 'name', 'normalize'], varargs=None, keywords=None, defaults=(-100, None, False)), ('document', '180c284317ea45ef89a460d8d79c0b72')) +paddle.fluid.layers.maxout (ArgSpec(args=['x', 'groups', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '891870d069a6aea746d34cc53b61690c')) +paddle.fluid.layers.space_to_depth (ArgSpec(args=['x', 'blocksize', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', 'a9221eaef53884a00654e028551b78e2')) +paddle.fluid.layers.affine_grid (ArgSpec(args=['theta', 'out_shape', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '51def402b8910e163cbace9d0c0526ed')) +paddle.fluid.layers.sequence_reverse (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '77a6d80aa5551ca70324fc975c44507f')) +paddle.fluid.layers.affine_channel (ArgSpec(args=['x', 'scale', 'bias', 'data_layout', 'name', 'act'], varargs=None, keywords=None, defaults=(None, None, 'NCHW', None, None)), ('document', 'ab84fdc6dc60f3ad9aa397e6007e3bf9')) +paddle.fluid.layers.similarity_focus (ArgSpec(args=['input', 'axis', 'indexes', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '70e3b5182a18b40b47ecabd7c8490a35')) +paddle.fluid.layers.hash (ArgSpec(args=['input', 'hash_size', 'num_hash', 'name'], varargs=None, keywords=None, defaults=(1, None)), ('document', '9bb77f8dc002dd2ce75d4769eaaf5007')) +paddle.fluid.layers.grid_sampler (ArgSpec(args=['x', 'grid', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', 'd256cba1c41a5ed92ce3f31e24a2ca6d')) +paddle.fluid.layers.log_loss (ArgSpec(args=['input', 'label', 'epsilon', 'name'], varargs=None, keywords=None, defaults=(0.0001, None)), ('document', '4b5a2341023afe63157a066c14254f98')) +paddle.fluid.layers.add_position_encoding (ArgSpec(args=['input', 'alpha', 'beta', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '4b9c2e8af5817937d831820874b5aa77')) +paddle.fluid.layers.bilinear_tensor_product (ArgSpec(args=['x', 'y', 'size', 'act', 'name', 'param_attr', 'bias_attr'], varargs=None, keywords=None, defaults=(None, None, None, None)), ('document', 'aa7540a0fa73ff69a02e11b4091aab75')) +paddle.fluid.layers.merge_selected_rows (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', 'dc63315b84f591ac79ecca0c3632027a')) +paddle.fluid.layers.get_tensor_from_selected_rows (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '7ffc849e71f31dfe29030ff94e662de6')) +paddle.fluid.layers.lstm (ArgSpec(args=['input', 'init_h', 'init_c', 'max_len', 'hidden_size', 'num_layers', 'dropout_prob', 'is_bidirec', 'is_test', 'name', 'default_initializer', 'seed'], varargs=None, keywords=None, defaults=(0.0, False, False, None, None, -1)), ('document', 'd5e6c494ac35100e2ed4d4bd9a1ed932')) +paddle.fluid.layers.shuffle_channel (ArgSpec(args=['x', 'group', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '2fa6782d43d02ae64482d21235a82949')) +paddle.fluid.layers.temporal_shift (ArgSpec(args=['x', 'seg_num', 'shift_ratio', 'name'], varargs=None, keywords=None, defaults=(0.25, None)), ('document', 'fe4481fb31363b09cfdd228fc6776ddf')) +paddle.fluid.layers.py_func (ArgSpec(args=['func', 'x', 'out', 'backward_func', 'skip_vars_in_backward_input'], varargs=None, keywords=None, defaults=(None, None)), ('document', '8404e472ac12b4a30a505d3d3a3e5fdb')) +paddle.fluid.layers.psroi_pool (ArgSpec(args=['input', 'rois', 'output_channels', 'spatial_scale', 'pooled_height', 'pooled_width', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '1546136806fef5c08f6918544bd9151d')) +paddle.fluid.layers.teacher_student_sigmoid_loss (ArgSpec(args=['input', 'label', 'soft_max_up_bound', 'soft_max_lower_bound'], varargs=None, keywords=None, defaults=(15.0, -15.0)), ('document', '2f6ff96864054a31aa4bb659c6722c99')) +paddle.fluid.layers.huber_loss (ArgSpec(args=['input', 'label', 'delta'], varargs=None, keywords=None, defaults=None), ('document', '431a4301c35032166ec029f7432c80a7')) +paddle.fluid.layers.kldiv_loss (ArgSpec(args=['x', 'target', 'reduction', 'name'], varargs=None, keywords=None, defaults=('mean', None)), ('document', '776d536cac47c89073abc7ee524d5aec')) +paddle.fluid.layers.tree_conv (ArgSpec(args=['nodes_vector', 'edge_set', 'output_size', 'num_filters', 'max_depth', 'act', 'param_attr', 'bias_attr', 'name'], varargs=None, keywords=None, defaults=(1, 2, 'tanh', None, None, None)), ('document', '34ea12ac9f10a65dccbc50100d12e607')) +paddle.fluid.layers.npair_loss (ArgSpec(args=['anchor', 'positive', 'labels', 'l2_reg'], varargs=None, keywords=None, defaults=(0.002,)), ('document', '46994d10276dd4cb803b4062b5d14329')) +paddle.fluid.layers.pixel_shuffle (ArgSpec(args=['x', 'upscale_factor'], varargs=None, keywords=None, defaults=None), ('document', '731b21c62a4add60a33bd76d802ffc5c')) +paddle.fluid.layers.fsp_matrix (ArgSpec(args=['x', 'y'], varargs=None, keywords=None, defaults=None), ('document', 'b76ccca3735bea4a58a0dbf0d77c5393')) +paddle.fluid.layers.data (ArgSpec(args=['name', 'shape', 'append_batch_size', 'dtype', 'lod_level', 'type', 'stop_gradient'], varargs=None, keywords=None, defaults=(True, 'float32', 0, VarType.LOD_TENSOR, True)), ('document', '33bbd42027d872b3818b3d64ec52e139')) +paddle.fluid.layers.open_files (ArgSpec(args=['filenames', 'shapes', 'lod_levels', 'dtypes', 'thread_num', 'buffer_size', 'pass_num', 'is_test'], varargs=None, keywords=None, defaults=(None, None, 1, None)), ('document', 'b1ae2e1cc0750e58726374061ea90ecc')) +paddle.fluid.layers.read_file (ArgSpec(args=['reader'], varargs=None, keywords=None, defaults=None), ('document', 'b0a1c2fc51c27a106da28f3308c41f5e')) +paddle.fluid.layers.shuffle (ArgSpec(args=['reader', 'buffer_size'], varargs=None, keywords=None, defaults=None), ('document', 'f967a73426db26f970bc70bfb03cffca')) +paddle.fluid.layers.batch (ArgSpec(args=['reader', 'batch_size'], varargs=None, keywords=None, defaults=None), ('document', 'f563d376d35e1a4c4db100fd11b381a0')) +paddle.fluid.layers.double_buffer (ArgSpec(args=['reader', 'place', 'name'], varargs=None, keywords=None, defaults=(None, None)), ('document', '07e5b796674796eb1ef3fee9c10d24e3')) +paddle.fluid.layers.random_data_generator (ArgSpec(args=['low', 'high', 'shapes', 'lod_levels', 'for_parallel'], varargs=None, keywords=None, defaults=(True,)), ('document', '9b7f0f86ec24bbc97643cadcb6499cff')) +paddle.fluid.layers.py_reader (ArgSpec(args=['capacity', 'shapes', 'dtypes', 'lod_levels', 'name', 'use_double_buffer'], varargs=None, keywords=None, defaults=(None, None, True)), ('document', '4357643685cfd65454ba5a15f0151709')) +paddle.fluid.layers.create_py_reader_by_data (ArgSpec(args=['capacity', 'feed_list', 'name', 'use_double_buffer'], varargs=None, keywords=None, defaults=(None, True)), ('document', '350f74d93fab9adb2ac4950f1c26416b')) +paddle.fluid.layers.Preprocessor.__init__ (ArgSpec(args=['self', 'reader', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.layers.Preprocessor.block (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.layers.Preprocessor.inputs (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.layers.Preprocessor.outputs (ArgSpec(args=['self'], varargs='outs', keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.layers.load (ArgSpec(args=['out', 'file_path', 'load_as_fp16'], varargs=None, keywords=None, defaults=(None,)), ('document', '9d1a4bc97bbce9fa1d4f7a4200a771ff')) +paddle.fluid.layers.create_tensor (ArgSpec(args=['dtype', 'name', 'persistable'], varargs=None, keywords=None, defaults=(None, False)), ('document', 'c0c3d0194f83fff8ea99ce0820657dae')) +paddle.fluid.layers.create_parameter (ArgSpec(args=['shape', 'dtype', 'name', 'attr', 'is_bias', 'default_initializer'], varargs=None, keywords=None, defaults=(None, None, False, None)), ('document', 'd62b866c899bc1fedb5385f95b88e1f8')) +paddle.fluid.layers.create_global_var (ArgSpec(args=['shape', 'value', 'dtype', 'persistable', 'force_cpu', 'name'], varargs=None, keywords=None, defaults=(False, False, None)), ('document', 'ab914fac893607e29ac6e52bbdbea1a4')) +paddle.fluid.layers.cast (ArgSpec(args=['x', 'dtype'], varargs=None, keywords=None, defaults=None), ('document', '992eb42590fc1c380841a6db72ce78b3')) +paddle.fluid.layers.tensor_array_to_tensor (ArgSpec(args=['input', 'axis', 'name'], varargs=None, keywords=None, defaults=(1, None)), ('document', 'b12717d3d4567e6119589f7f655b0cbb')) +paddle.fluid.layers.concat (ArgSpec(args=['input', 'axis', 'name'], varargs=None, keywords=None, defaults=(0, None)), ('document', 'b19b79be4f05e85d1d6cec642c9fb535')) +paddle.fluid.layers.sums (ArgSpec(args=['input', 'out'], varargs=None, keywords=None, defaults=(None,)), ('document', '42912092418620b4be07f36af31e7816')) +paddle.fluid.layers.assign (ArgSpec(args=['input', 'output'], varargs=None, keywords=None, defaults=(None,)), ('document', 'b690184f3537df5501e4d9d8f31152a5')) +paddle.fluid.layers.fill_constant_batch_size_like (ArgSpec(args=['input', 'shape', 'dtype', 'value', 'input_dim_idx', 'output_dim_idx'], varargs=None, keywords=None, defaults=(0, 0)), ('document', 'd4059a2f5763036b07018d76429f9acb')) +paddle.fluid.layers.fill_constant (ArgSpec(args=['shape', 'dtype', 'value', 'force_cpu', 'out'], varargs=None, keywords=None, defaults=(False, None)), ('document', '1d8b14729639fa38509c79b9784740fa')) +paddle.fluid.layers.argmin (ArgSpec(args=['x', 'axis'], varargs=None, keywords=None, defaults=(0,)), ('document', '2778a1d34be49263a51211885599ea37')) +paddle.fluid.layers.argmax (ArgSpec(args=['x', 'axis'], varargs=None, keywords=None, defaults=(0,)), ('document', '04114996cfb98994ba222804a1a6109f')) +paddle.fluid.layers.argsort (ArgSpec(args=['input', 'axis', 'name'], varargs=None, keywords=None, defaults=(-1, None)), ('document', '68ec45c6fb6b93e47de9c9a0945fb98e')) +paddle.fluid.layers.ones (ArgSpec(args=['shape', 'dtype', 'force_cpu'], varargs=None, keywords=None, defaults=(False,)), ('document', 'b402489c62e668df42e7daceb63c142b')) +paddle.fluid.layers.zeros (ArgSpec(args=['shape', 'dtype', 'force_cpu'], varargs=None, keywords=None, defaults=(False,)), ('document', 'c155e2efc56ffa5ed4658cca0272e491')) +paddle.fluid.layers.reverse (ArgSpec(args=['x', 'axis'], varargs=None, keywords=None, defaults=None), ('document', '8ee7cb6ca639e7460e825f953b65d94d')) +paddle.fluid.layers.has_inf (ArgSpec(args=['x'], varargs=None, keywords=None, defaults=None), ('document', '8f8c0306117ea441f20dcbbdba1f0ecc')) +paddle.fluid.layers.has_nan (ArgSpec(args=['x'], varargs=None, keywords=None, defaults=None), ('document', '2e53e83127dbfd86e7098bdfe9a549e8')) +paddle.fluid.layers.isfinite (ArgSpec(args=['x'], varargs=None, keywords=None, defaults=None), ('document', '0a437011c3906079fd8947ed3e52d292')) +paddle.fluid.layers.range (ArgSpec(args=['start', 'end', 'step', 'dtype'], varargs=None, keywords=None, defaults=None), ('document', '2ec937ede953ded2fdff2675883900bb')) +paddle.fluid.layers.While.__init__ (ArgSpec(args=['self', 'cond', 'is_test', 'name'], varargs=None, keywords=None, defaults=(False, None)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.layers.While.block (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.layers.Switch.__init__ (ArgSpec(args=['self', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.layers.Switch.case (ArgSpec(args=['self', 'condition'], varargs=None, keywords=None, defaults=None), ('document', 'f7c7160014c1b46cfeda9dd5808d1789')) +paddle.fluid.layers.Switch.default (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '50853ae884df03d9c36703bb46d9ef07')) +paddle.fluid.layers.increment (ArgSpec(args=['x', 'value', 'in_place'], varargs=None, keywords=None, defaults=(1.0, True)), ('document', '73bb96ec4783ec1a11e760e8851b0e77')) +paddle.fluid.layers.array_write (ArgSpec(args=['x', 'i', 'array'], varargs=None, keywords=None, defaults=(None,)), ('document', '40b6d15f4c86b2b09df340d7778ad713')) +paddle.fluid.layers.create_array (ArgSpec(args=['dtype'], varargs=None, keywords=None, defaults=None), ('document', '2d4f20087080ba5105b55205ad5c5b6a')) +paddle.fluid.layers.less_than (ArgSpec(args=['x', 'y', 'force_cpu', 'cond'], varargs=None, keywords=None, defaults=(None, None)), ('document', '067bbc799c66289ca8b8924c26b6673f')) +paddle.fluid.layers.equal (ArgSpec(args=['x', 'y', 'cond'], varargs=None, keywords=None, defaults=(None,)), ('document', '80c29b1dc64718f0116de90d1ac88a77')) +paddle.fluid.layers.array_read (ArgSpec(args=['array', 'i'], varargs=None, keywords=None, defaults=None), ('document', 'dd68bead34dfbaf6b0a163fc1cc3c385')) +paddle.fluid.layers.array_length (ArgSpec(args=['array'], varargs=None, keywords=None, defaults=None), ('document', 'ffb8b9578ec66db565b223d313aa82a2')) +paddle.fluid.layers.IfElse.__init__ (ArgSpec(args=['self', 'cond', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.layers.IfElse.false_block (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.layers.IfElse.input (ArgSpec(args=['self', 'x'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.layers.IfElse.output (ArgSpec(args=['self'], varargs='outs', keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.layers.IfElse.true_block (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.layers.DynamicRNN.__init__ (ArgSpec(args=['self', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.layers.DynamicRNN.block (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '6d3e0a5d9aa519a9773a36e1620ea9b7')) +paddle.fluid.layers.DynamicRNN.memory (ArgSpec(args=['self', 'init', 'shape', 'value', 'need_reorder', 'dtype'], varargs=None, keywords=None, defaults=(None, None, 0.0, False, 'float32')), ('document', 'b9174d4e91505b0c8ecc193eb51e248d')) +paddle.fluid.layers.DynamicRNN.output (ArgSpec(args=['self'], varargs='outputs', keywords=None, defaults=None), ('document', 'b439a176a3328de8a75bdc5c08eece4a')) +paddle.fluid.layers.DynamicRNN.static_input (ArgSpec(args=['self', 'x'], varargs=None, keywords=None, defaults=None), ('document', 'f29ad2478b6b2ad4f413d2936a331ea0')) +paddle.fluid.layers.DynamicRNN.step_input (ArgSpec(args=['self', 'x', 'level'], varargs=None, keywords=None, defaults=(0,)), ('document', '7568c5ac7622a10288d3307a94134655')) +paddle.fluid.layers.DynamicRNN.update_memory (ArgSpec(args=['self', 'ex_mem', 'new_mem'], varargs=None, keywords=None, defaults=None), ('document', '5d83987da13b98363d6a807a52d8024f')) +paddle.fluid.layers.StaticRNN.__init__ (ArgSpec(args=['self', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.layers.StaticRNN.memory (ArgSpec(args=['self', 'init', 'shape', 'batch_ref', 'init_value', 'init_batch_dim_idx', 'ref_batch_dim_idx'], varargs=None, keywords=None, defaults=(None, None, None, 0.0, 0, 1)), ('document', 'c24e368e23afac1ed91a78a639d7a9c7')) +paddle.fluid.layers.StaticRNN.output (ArgSpec(args=['self'], varargs='outputs', keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.layers.StaticRNN.step (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.layers.StaticRNN.step_input (ArgSpec(args=['self', 'x'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.layers.StaticRNN.step_output (ArgSpec(args=['self', 'o'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.layers.StaticRNN.update_memory (ArgSpec(args=['self', 'mem', 'var'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.layers.reorder_lod_tensor_by_rank (ArgSpec(args=['x', 'rank_table'], varargs=None, keywords=None, defaults=None), ('document', '3545f529ef04e8f6ecb76b47fa3df01a')) +paddle.fluid.layers.Print (ArgSpec(args=['input', 'first_n', 'message', 'summarize', 'print_tensor_name', 'print_tensor_type', 'print_tensor_shape', 'print_tensor_lod', 'print_phase'], varargs=None, keywords=None, defaults=(-1, None, -1, True, True, True, True, 'both')), ('document', '5fef91b0e21c93610785f2b1f7161732')) +paddle.fluid.layers.is_empty (ArgSpec(args=['x', 'cond'], varargs=None, keywords=None, defaults=(None,)), ('document', 'bbe578dbb49ad13e15b014e98c22b519')) +paddle.fluid.layers.sigmoid (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '29a25ba78de79152076cacfc5443137d')) +paddle.fluid.layers.logsigmoid (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '81ccb7acafd06c7728e11581f5d342e3')) +paddle.fluid.layers.exp (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', 'e6b3e769413d96aab4176f96db25984b')) +paddle.fluid.layers.tanh (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', 'e9d586a0b5bd05f67ee78048f9d503b6')) +paddle.fluid.layers.atan (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '3a46e0b5f9ce82348406478e610f14c9')) +paddle.fluid.layers.tanh_shrink (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '1e521554b9fdda9061ec6d306f0709b7')) +paddle.fluid.layers.softshrink (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '9eef31597bbafa2bd49691e072296e13')) +paddle.fluid.layers.sqrt (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', 'e9e27491c39ac74d0b1ffe506aec0ebb')) +paddle.fluid.layers.abs (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '64650ac42cf82e9920cb0b172b1d29fd')) +paddle.fluid.layers.ceil (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', 'c75d67dc5fe28f68e4cfffead4f698ad')) +paddle.fluid.layers.floor (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '647b16c5da5ef909649ae02abb434973')) +paddle.fluid.layers.cos (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '485f2686bcc2fe37a4bd893769c8a3e2')) +paddle.fluid.layers.acos (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '920a47734482276c069ba24c61c26b25')) +paddle.fluid.layers.asin (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', 'cf4ee2c9b9d7293556f8c5173dfb5d2c')) +paddle.fluid.layers.sin (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '01f1766aa76eff1df30147505b59f7c4')) +paddle.fluid.layers.round (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', 'b47f5da13913d3e56bdb1e612a73f3f2')) +paddle.fluid.layers.reciprocal (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', 'cc6ac2f14f03c52aaa83a59bf83b8d26')) +paddle.fluid.layers.square (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '48dfb45d773dbc30126c3a7f777de5ee')) +paddle.fluid.layers.softplus (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '459c5781e9d1dd88283b7c5769d7872a')) +paddle.fluid.layers.softsign (ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '80846bcd4bd457207457a6d5411f4148')) +paddle.fluid.layers.uniform_random (ArgSpec(args=['shape', 'dtype', 'min', 'max', 'seed'], varargs=None, keywords=None, defaults=('float32', -1.0, 1.0, 0)), ('document', '308b619af849caa82bbc31e897f5e641')) +paddle.fluid.layers.hard_shrink (ArgSpec(args=['x', 'threshold'], varargs=None, keywords=None, defaults=(None,)), ('document', 'c142f5884f3255e0d6075c286bbd531e')) +paddle.fluid.layers.cumsum (ArgSpec(args=['x', 'axis', 'exclusive', 'reverse'], varargs=None, keywords=None, defaults=(None, None, None)), ('document', '944d7c03057f5fc88bc78acd4d82f926')) +paddle.fluid.layers.thresholded_relu (ArgSpec(args=['x', 'threshold'], varargs=None, keywords=None, defaults=(None,)), ('document', '90566ea449ea4c681435546e2f70610a')) +paddle.fluid.layers.prior_box (ArgSpec(args=['input', 'image', 'min_sizes', 'max_sizes', 'aspect_ratios', 'variance', 'flip', 'clip', 'steps', 'offset', 'name', 'min_max_aspect_ratios_order'], varargs=None, keywords=None, defaults=(None, [1.0], [0.1, 0.1, 0.2, 0.2], False, False, [0.0, 0.0], 0.5, None, False)), ('document', '14cac0ee643fa6e026ad82aeeee75bd8')) +paddle.fluid.layers.density_prior_box (ArgSpec(args=['input', 'image', 'densities', 'fixed_sizes', 'fixed_ratios', 'variance', 'clip', 'steps', 'offset', 'flatten_to_2d', 'name'], varargs=None, keywords=None, defaults=(None, None, None, [0.1, 0.1, 0.2, 0.2], False, [0.0, 0.0], 0.5, False, None)), ('document', 'a0d762bb08de9ce93bc780aa57cd5cd9')) +paddle.fluid.layers.multi_box_head (ArgSpec(args=['inputs', 'image', 'base_size', 'num_classes', 'aspect_ratios', 'min_ratio', 'max_ratio', 'min_sizes', 'max_sizes', 'steps', 'step_w', 'step_h', 'offset', 'variance', 'flip', 'clip', 'kernel_size', 'pad', 'stride', 'name', 'min_max_aspect_ratios_order'], varargs=None, keywords=None, defaults=(None, None, None, None, None, None, None, 0.5, [0.1, 0.1, 0.2, 0.2], True, False, 1, 0, 1, None, False)), ('document', 'a6ab47a2fe681e52fabb7057ddf0efdd')) +paddle.fluid.layers.bipartite_match (ArgSpec(args=['dist_matrix', 'match_type', 'dist_threshold', 'name'], varargs=None, keywords=None, defaults=(None, None, None)), ('document', '3ddb9b966f193900193a95a3df77c3c1')) +paddle.fluid.layers.target_assign (ArgSpec(args=['input', 'matched_indices', 'negative_indices', 'mismatch_value', 'name'], varargs=None, keywords=None, defaults=(None, None, None)), ('document', 'c0b334f917828f95056f6ebe10907b1c')) +paddle.fluid.layers.detection_output (ArgSpec(args=['loc', 'scores', 'prior_box', 'prior_box_var', 'background_label', 'nms_threshold', 'nms_top_k', 'keep_top_k', 'score_threshold', 'nms_eta'], varargs=None, keywords=None, defaults=(0, 0.3, 400, 200, 0.01, 1.0)), ('document', 'c33093a82a46e3091e789e5572588db1')) +paddle.fluid.layers.ssd_loss (ArgSpec(args=['location', 'confidence', 'gt_box', 'gt_label', 'prior_box', 'prior_box_var', 'background_label', 'overlap_threshold', 'neg_pos_ratio', 'neg_overlap', 'loc_loss_weight', 'conf_loss_weight', 'match_type', 'mining_type', 'normalize', 'sample_size'], varargs=None, keywords=None, defaults=(None, 0, 0.5, 3.0, 0.5, 1.0, 1.0, 'per_prediction', 'max_negative', True, None)), ('document', '6d5028fd09d01ab82d296adc0ea95aee')) +paddle.fluid.layers.detection_map (ArgSpec(args=['detect_res', 'label', 'class_num', 'background_label', 'overlap_threshold', 'evaluate_difficult', 'has_state', 'input_states', 'out_states', 'ap_version'], varargs=None, keywords=None, defaults=(0, 0.3, True, None, None, None, 'integral')), ('document', '1467d91b50c22cd52103b4aa1ee9d0a1')) +paddle.fluid.layers.rpn_target_assign (ArgSpec(args=['bbox_pred', 'cls_logits', 'anchor_box', 'anchor_var', 'gt_boxes', 'is_crowd', 'im_info', 'rpn_batch_size_per_im', 'rpn_straddle_thresh', 'rpn_fg_fraction', 'rpn_positive_overlap', 'rpn_negative_overlap', 'use_random'], varargs=None, keywords=None, defaults=(256, 0.0, 0.5, 0.7, 0.3, True)), ('document', '1dddef3eb4b3cbd4df8e03ac480dbf97')) +paddle.fluid.layers.anchor_generator (ArgSpec(args=['input', 'anchor_sizes', 'aspect_ratios', 'variance', 'stride', 'offset', 'name'], varargs=None, keywords=None, defaults=(None, None, [0.1, 0.1, 0.2, 0.2], None, 0.5, None)), ('document', '23337cc57bbf5be73884b6bd0f849603')) +paddle.fluid.layers.roi_perspective_transform (ArgSpec(args=['input', 'rois', 'transformed_height', 'transformed_width', 'spatial_scale'], varargs=None, keywords=None, defaults=(1.0,)), ('document', '5761f9ed83654314416e24372b33bb84')) +paddle.fluid.layers.generate_proposal_labels (ArgSpec(args=['rpn_rois', 'gt_classes', 'is_crowd', 'gt_boxes', 'im_info', 'batch_size_per_im', 'fg_fraction', 'fg_thresh', 'bg_thresh_hi', 'bg_thresh_lo', 'bbox_reg_weights', 'class_nums', 'use_random'], varargs=None, keywords=None, defaults=(256, 0.25, 0.25, 0.5, 0.0, [0.1, 0.1, 0.2, 0.2], None, True)), ('document', '87863717edeb7fe87a1268976cbc015d')) +paddle.fluid.layers.generate_proposals (ArgSpec(args=['scores', 'bbox_deltas', 'im_info', 'anchors', 'variances', 'pre_nms_top_n', 'post_nms_top_n', 'nms_thresh', 'min_size', 'eta', 'name'], varargs=None, keywords=None, defaults=(6000, 1000, 0.5, 0.1, 1.0, None)), ('document', '57ab49f3f324f310b7eed322e7c1057a')) +paddle.fluid.layers.generate_mask_labels (ArgSpec(args=['im_info', 'gt_classes', 'is_crowd', 'gt_segms', 'rois', 'labels_int32', 'num_classes', 'resolution'], varargs=None, keywords=None, defaults=None), ('document', 'f73706a65468e9ca3e0bee4a31521b0a')) +paddle.fluid.layers.iou_similarity (ArgSpec(args=['x', 'y', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '587845f60c5d97ffdf2dfd21da52eca1')) +paddle.fluid.layers.box_coder (ArgSpec(args=['prior_box', 'prior_box_var', 'target_box', 'code_type', 'box_normalized', 'name', 'axis'], varargs=None, keywords=None, defaults=('encode_center_size', True, None, 0)), ('document', '032d0f4b7d8f6235ee5d91e473344f0e')) +paddle.fluid.layers.polygon_box_transform (ArgSpec(args=['input', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '0e5ac2507723a0b5adec473f9556799b')) +paddle.fluid.layers.yolov3_loss (ArgSpec(args=['x', 'gtbox', 'gtlabel', 'anchors', 'anchor_mask', 'class_num', 'ignore_thresh', 'downsample_ratio', 'gtscore', 'use_label_smooth', 'name'], varargs=None, keywords=None, defaults=(None, True, None)), ('document', '57fa96922e42db8f064c3fb77f2255e8')) +paddle.fluid.layers.yolo_box (ArgSpec(args=['x', 'img_size', 'anchors', 'class_num', 'conf_thresh', 'downsample_ratio', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '5566169a5ab993d177792c023c7fb340')) +paddle.fluid.layers.box_clip (ArgSpec(args=['input', 'im_info', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '397e9e02b451d99c56e20f268fa03f2e')) +paddle.fluid.layers.multiclass_nms (ArgSpec(args=['bboxes', 'scores', 'score_threshold', 'nms_top_k', 'keep_top_k', 'nms_threshold', 'normalized', 'nms_eta', 'background_label', 'name'], varargs=None, keywords=None, defaults=(0.3, True, 1.0, 0, None)), ('document', 'ca7d1107b6c5d2d6d8221039a220fde0')) +paddle.fluid.layers.distribute_fpn_proposals (ArgSpec(args=['fpn_rois', 'min_level', 'max_level', 'refer_level', 'refer_scale', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '7bb011ec26bace2bc23235aa4a17647d')) +paddle.fluid.layers.box_decoder_and_assign (ArgSpec(args=['prior_box', 'prior_box_var', 'target_box', 'box_score', 'box_clip', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '005a5ae47d6c8fff721931d69d072b9f')) +paddle.fluid.layers.accuracy (ArgSpec(args=['input', 'label', 'k', 'correct', 'total'], varargs=None, keywords=None, defaults=(1, None, None)), ('document', '9808534c12c5e739a10f73ebb0b4eafd')) +paddle.fluid.layers.auc (ArgSpec(args=['input', 'label', 'curve', 'num_thresholds', 'topk', 'slide_steps'], varargs=None, keywords=None, defaults=('ROC', 4095, 1, 1)), ('document', 'e0e95334fce92d16c2d9db6e7caffc47')) +paddle.fluid.layers.exponential_decay (ArgSpec(args=['learning_rate', 'decay_steps', 'decay_rate', 'staircase'], varargs=None, keywords=None, defaults=(False,)), ('document', '98a5050bee8522fcea81aa795adaba51')) +paddle.fluid.layers.natural_exp_decay (ArgSpec(args=['learning_rate', 'decay_steps', 'decay_rate', 'staircase'], varargs=None, keywords=None, defaults=(False,)), ('document', '676a7bc2a218691db50bca233903d21e')) +paddle.fluid.layers.inverse_time_decay (ArgSpec(args=['learning_rate', 'decay_steps', 'decay_rate', 'staircase'], varargs=None, keywords=None, defaults=(False,)), ('document', 'd07e767d59c4a5e6c930f3e6756d3f82')) +paddle.fluid.layers.polynomial_decay (ArgSpec(args=['learning_rate', 'decay_steps', 'end_learning_rate', 'power', 'cycle'], varargs=None, keywords=None, defaults=(0.0001, 1.0, False)), ('document', '882634f420f626642f0874481263da40')) +paddle.fluid.layers.piecewise_decay (ArgSpec(args=['boundaries', 'values'], varargs=None, keywords=None, defaults=None), ('document', 'c717d9d1d78a53c809d01b8bc56f3cae')) +paddle.fluid.layers.noam_decay (ArgSpec(args=['d_model', 'warmup_steps'], varargs=None, keywords=None, defaults=None), ('document', 'd9a95746353fd574be36dc28d8726c28')) +paddle.fluid.layers.cosine_decay (ArgSpec(args=['learning_rate', 'step_each_epoch', 'epochs'], varargs=None, keywords=None, defaults=None), ('document', 'f8b2727bccf0f368c997d7cf05847e49')) +paddle.fluid.layers.linear_lr_warmup (ArgSpec(args=['learning_rate', 'warmup_steps', 'start_lr', 'end_lr'], varargs=None, keywords=None, defaults=None), ('document', '2ef3f5ca5cd71ea4217c418e5a7a0565')) +paddle.fluid.contrib.InitState.__init__ (ArgSpec(args=['self', 'init', 'shape', 'value', 'init_boot', 'need_reorder', 'dtype'], varargs=None, keywords=None, defaults=(None, None, 0.0, None, False, 'float32')), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.contrib.StateCell.__init__ (ArgSpec(args=['self', 'inputs', 'states', 'out_state', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.contrib.StateCell.compute_state (ArgSpec(args=['self', 'inputs'], varargs=None, keywords=None, defaults=None), ('document', '92973b3f222081a1d17069c683cf4a99')) +paddle.fluid.contrib.StateCell.get_input (ArgSpec(args=['self', 'input_name'], varargs=None, keywords=None, defaults=None), ('document', '6f24a007cfa184e32f01a960703bfd70')) +paddle.fluid.contrib.StateCell.get_state (ArgSpec(args=['self', 'state_name'], varargs=None, keywords=None, defaults=None), ('document', '630a4945cfe659ea4f307598fbbce5d2')) +paddle.fluid.contrib.StateCell.out_state (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '7ad681dff0393ddf13a724194e720f28')) +paddle.fluid.contrib.StateCell.set_state (ArgSpec(args=['self', 'state_name', 'state_value'], varargs=None, keywords=None, defaults=None), ('document', 'd4e0e08cd5d9d9a571cbc52d114f5ae9')) +paddle.fluid.contrib.StateCell.state_updater (ArgSpec(args=['self', 'updater'], varargs=None, keywords=None, defaults=None), ('document', 'd5afe1b7665d94fb023b15cf913ca510')) +paddle.fluid.contrib.StateCell.update_states (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', 'fe0b0f1338723516a35a30247899c81b')) +paddle.fluid.contrib.TrainingDecoder.__init__ (ArgSpec(args=['self', 'state_cell', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.contrib.TrainingDecoder.block (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '98d88fa1c989748410a12517c6a585bf')) +paddle.fluid.contrib.TrainingDecoder.output (ArgSpec(args=['self'], varargs='outputs', keywords=None, defaults=None), ('document', 'f0a457dee586559036202087ce2eff69')) +paddle.fluid.contrib.TrainingDecoder.static_input (ArgSpec(args=['self', 'x'], varargs=None, keywords=None, defaults=None), ('document', 'a024c72664fe815068423ba630b7658a')) +paddle.fluid.contrib.TrainingDecoder.step_input (ArgSpec(args=['self', 'x'], varargs=None, keywords=None, defaults=None), ('document', '4659db7a888a2495e71c1838a0483909')) +paddle.fluid.contrib.BeamSearchDecoder.__init__ (ArgSpec(args=['self', 'state_cell', 'init_ids', 'init_scores', 'target_dict_dim', 'word_dim', 'input_var_dict', 'topk_size', 'sparse_emb', 'max_len', 'beam_size', 'end_id', 'name'], varargs=None, keywords=None, defaults=({}, 50, True, 100, 1, 1, None)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.contrib.BeamSearchDecoder.block (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '98d88fa1c989748410a12517c6a585bf')) +paddle.fluid.contrib.BeamSearchDecoder.decode (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '1e47c60f080c1343ebb6ceaef89656b2')) +paddle.fluid.contrib.BeamSearchDecoder.early_stop (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '3a84a7454ed6707f79b9e954d92a7575')) +paddle.fluid.contrib.BeamSearchDecoder.read_array (ArgSpec(args=['self', 'init', 'is_ids', 'is_scores'], varargs=None, keywords=None, defaults=(False, False)), ('document', 'aa89eb8fd5e4cabaf5cc1bcae14665a4')) +paddle.fluid.contrib.BeamSearchDecoder.update_array (ArgSpec(args=['self', 'array', 'value'], varargs=None, keywords=None, defaults=None), ('document', '5754e9b3212b7c09497151516a0de5a7')) +paddle.fluid.contrib.memory_usage (ArgSpec(args=['program', 'batch_size'], varargs=None, keywords=None, defaults=None), ('document', '8fcb2f93bb743693baa8d4860a5ccc47')) +paddle.fluid.contrib.op_freq_statistic (ArgSpec(args=['program'], varargs=None, keywords=None, defaults=None), ('document', '4d43687113c4bf5b29d15aee2f4e4afa')) +paddle.fluid.contrib.QuantizeTranspiler.__init__ (ArgSpec(args=['self', 'weight_bits', 'activation_bits', 'activation_quantize_type', 'weight_quantize_type', 'window_size', 'moving_rate'], varargs=None, keywords=None, defaults=(8, 8, 'abs_max', 'abs_max', 10000, 0.9)), ('document', '14b39f1fcd5667ff556b1aad94357d1d')) +paddle.fluid.contrib.QuantizeTranspiler.convert_to_int8 (ArgSpec(args=['self', 'program', 'place', 'scope'], varargs=None, keywords=None, defaults=(None,)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.contrib.QuantizeTranspiler.freeze_program (ArgSpec(args=['self', 'program', 'place', 'fuse_bn', 'scope'], varargs=None, keywords=None, defaults=(False, None)), ('document', '909675a1ab055c69b436a7893fcae4fd')) +paddle.fluid.contrib.QuantizeTranspiler.training_transpile (ArgSpec(args=['self', 'program', 'startup_program'], varargs=None, keywords=None, defaults=(None, None)), ('document', '6dd9909f10b283ba2892a99058a72884')) +paddle.fluid.contrib.Calibrator.__init__ (ArgSpec(args=['self'], varargs='args', keywords='kwargs', defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.contrib.Calibrator.sample_data (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '3b8c85ca1e2cf753cc8c90a6c6992958')) +paddle.fluid.contrib.Calibrator.save_int8_model (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.contrib.reader.ctr_reader.ctr_reader (ArgSpec(args=['feed_dict', 'file_type', 'file_format', 'dense_slot_index', 'sparse_slot_index', 'capacity', 'thread_num', 'batch_size', 'file_list', 'slots', 'name'], varargs=None, keywords=None, defaults=(None,)), ('document', 'b2ebf3de2a6ef1af2c3b88d2db7591ab')) +paddle.fluid.contrib.Compressor.__init__ (ArgSpec(args=['self', 'place', 'scope', 'train_program', 'train_reader', 'train_feed_list', 'train_fetch_list', 'eval_program', 'eval_reader', 'eval_feed_list', 'eval_fetch_list', 'teacher_programs', 'checkpoint_path', 'train_optimizer', 'distiller_optimizer'], varargs=None, keywords=None, defaults=(None, None, None, None, None, None, None, [], './checkpoints', None, None)), ('document', '31ae143830c9bf6b43547dd546c5ba80')) +paddle.fluid.contrib.Compressor.config (ArgSpec(args=['self', 'config_file'], varargs=None, keywords=None, defaults=None), ('document', '780d9c007276ccbb95b292400d7807b0')) +paddle.fluid.contrib.Compressor.run (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', 'c6e43d6a078d307672283c1f36e04fe9')) +paddle.fluid.contrib.load_persistables_for_increment (ArgSpec(args=['dirname', 'executor', 'program', 'lookup_table_var', 'lookup_table_var_path'], varargs=None, keywords=None, defaults=None), ('document', '2ab36d4f7a564f5f65e455807ad06c67')) +paddle.fluid.contrib.load_persistables_for_inference (ArgSpec(args=['dirname', 'executor', 'program', 'lookup_table_var_name'], varargs=None, keywords=None, defaults=None), ('document', '59066bac9db0ac6ce414d05780b7333f')) +paddle.fluid.contrib.convert_dist_to_sparse_program (ArgSpec(args=['program'], varargs=None, keywords=None, defaults=None), ('document', '74c39c595dc70d6be2f16d8e462d282b')) +paddle.fluid.contrib.HDFSClient.__init__ (ArgSpec(args=['self', 'hadoop_home', 'configs'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.contrib.HDFSClient.delete (ArgSpec(args=['self', 'hdfs_path'], varargs=None, keywords=None, defaults=None), ('document', 'c3721aa2d4d9ef5a857dd47b2681c03e')) +paddle.fluid.contrib.HDFSClient.download (ArgSpec(args=['self', 'hdfs_path', 'local_path', 'overwrite', 'unzip'], varargs=None, keywords=None, defaults=(False, False)), ('document', 'ca55bde92184d3fd0f9f5c963b25e634')) +paddle.fluid.contrib.HDFSClient.is_dir (ArgSpec(args=['self', 'hdfs_path'], varargs=None, keywords=None, defaults=(None,)), ('document', '45bde1bae02605a205c8245b58b9156d')) +paddle.fluid.contrib.HDFSClient.is_exist (ArgSpec(args=['self', 'hdfs_path'], varargs=None, keywords=None, defaults=(None,)), ('document', 'be9c94bccff7ba0c1d95883ac62b5864')) +paddle.fluid.contrib.HDFSClient.ls (ArgSpec(args=['self', 'hdfs_path'], varargs=None, keywords=None, defaults=None), ('document', '808acac504870c7e46594b95674f8a86')) +paddle.fluid.contrib.HDFSClient.lsr (ArgSpec(args=['self', 'hdfs_path', 'only_file', 'sort'], varargs=None, keywords=None, defaults=(True, True)), ('document', 'fae835aa3354eb6a0434c0f9ba3c2747')) +paddle.fluid.contrib.HDFSClient.make_local_dirs (ArgSpec(args=['local_path'], varargs=None, keywords=None, defaults=None), ('document', 'e76b89c8e7f019b5da576c0026fcf689')) +paddle.fluid.contrib.HDFSClient.makedirs (ArgSpec(args=['self', 'hdfs_path'], varargs=None, keywords=None, defaults=None), ('document', '44d9972aae390aedf40aaea731a37e4b')) +paddle.fluid.contrib.HDFSClient.rename (ArgSpec(args=['self', 'hdfs_src_path', 'hdfs_dst_path', 'overwrite'], varargs=None, keywords=None, defaults=(False,)), ('document', '0eb133644d9a9f4da45bb39261ff0955')) +paddle.fluid.contrib.HDFSClient.upload (ArgSpec(args=['self', 'hdfs_path', 'local_path', 'overwrite', 'retry_times'], varargs=None, keywords=None, defaults=(False, 5)), ('document', '7d053b4bfd6dcfdd2c9dda0e0dbd9665')) +paddle.fluid.contrib.multi_download (ArgSpec(args=['client', 'hdfs_path', 'local_path', 'trainer_id', 'trainers', 'multi_processes'], varargs=None, keywords=None, defaults=(5,)), ('document', '100927be598ed8f9eaa1f3ef1b23568a')) +paddle.fluid.contrib.multi_upload (ArgSpec(args=['client', 'hdfs_path', 'local_path', 'multi_processes', 'overwrite', 'sync'], varargs=None, keywords=None, defaults=(5, False, True)), ('document', '183f34c83d30dbe16e09e8716c41958a')) +paddle.fluid.contrib.extend_with_decoupled_weight_decay (ArgSpec(args=['base_optimizer'], varargs=None, keywords=None, defaults=None), ('document', 'a1095dfd4ec725747f662d69cd7659d4')) +paddle.fluid.transpiler.DistributeTranspiler.__init__ (ArgSpec(args=['self', 'config'], varargs=None, keywords=None, defaults=(None,)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.transpiler.DistributeTranspiler.get_pserver_program (ArgSpec(args=['self', 'endpoint'], varargs=None, keywords=None, defaults=None), ('document', '292ab72977afbe58e6a3bde175452680')) +paddle.fluid.transpiler.DistributeTranspiler.get_pserver_programs (ArgSpec(args=['self', 'endpoint'], varargs=None, keywords=None, defaults=None), ('document', '78f4949aedf317666a89ca74b3748ba8')) +paddle.fluid.transpiler.DistributeTranspiler.get_startup_program (ArgSpec(args=['self', 'endpoint', 'pserver_program', 'startup_program'], varargs=None, keywords=None, defaults=(None, None)), ('document', 'd796fc0c8d51503b556fcf6dc15c4f0c')) +paddle.fluid.transpiler.DistributeTranspiler.get_trainer_program (ArgSpec(args=['self', 'wait_port'], varargs=None, keywords=None, defaults=(True,)), ('document', '736330e31a7a54abccc0c7fd9119d9ff')) +paddle.fluid.transpiler.DistributeTranspiler.transpile (ArgSpec(args=['self', 'trainer_id', 'program', 'pservers', 'trainers', 'sync_mode', 'startup_program', 'current_endpoint'], varargs=None, keywords=None, defaults=(None, '127.0.0.1:6174', 1, True, None, '127.0.0.1:6174')), ('document', '06ce55338dfe96311ad1078235ab3bf4')) +paddle.fluid.transpiler.memory_optimize (ArgSpec(args=['input_program', 'skip_opt_set', 'print_log', 'level', 'skip_grads'], varargs=None, keywords=None, defaults=(None, False, 0, False)), ('document', 'eda17d0f1639bc6ca215cecf87f588a4')) +paddle.fluid.transpiler.release_memory (ArgSpec(args=['input_program', 'skip_opt_set'], varargs=None, keywords=None, defaults=(None,)), ('document', 'ac4114d3df16264f1946deb3a8434a6f')) +paddle.fluid.transpiler.HashName.__init__ (ArgSpec(args=['self', 'pserver_endpoints'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.transpiler.HashName.dispatch (ArgSpec(args=['self', 'varlist'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.transpiler.HashName.reset (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.transpiler.RoundRobin.__init__ (ArgSpec(args=['self', 'pserver_endpoints'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.transpiler.RoundRobin.dispatch (ArgSpec(args=['self', 'varlist'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.transpiler.RoundRobin.reset (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) paddle.fluid.transpiler.DistributeTranspilerConfig.__init__ -paddle.fluid.nets.simple_img_conv_pool ArgSpec(args=['input', 'num_filters', 'filter_size', 'pool_size', 'pool_stride', 'pool_padding', 'pool_type', 'global_pooling', 'conv_stride', 'conv_padding', 'conv_dilation', 'conv_groups', 'param_attr', 'bias_attr', 'act', 'use_cudnn'], varargs=None, keywords=None, defaults=(0, 'max', False, 1, 0, 1, 1, None, None, None, True)) -paddle.fluid.nets.sequence_conv_pool ArgSpec(args=['input', 'num_filters', 'filter_size', 'param_attr', 'act', 'pool_type'], varargs=None, keywords=None, defaults=(None, 'sigmoid', 'max')) -paddle.fluid.nets.glu ArgSpec(args=['input', 'dim'], varargs=None, keywords=None, defaults=(-1,)) -paddle.fluid.nets.scaled_dot_product_attention ArgSpec(args=['queries', 'keys', 'values', 'num_heads', 'dropout_rate'], varargs=None, keywords=None, defaults=(1, 0.0)) -paddle.fluid.nets.img_conv_group ArgSpec(args=['input', 'conv_num_filter', 'pool_size', 'conv_padding', 'conv_filter_size', 'conv_act', 'param_attr', 'conv_with_batchnorm', 'conv_batchnorm_drop_rate', 'pool_stride', 'pool_type', 'use_cudnn'], varargs=None, keywords=None, defaults=(1, 3, None, None, False, 0.0, 1, 'max', True)) -paddle.fluid.optimizer.SGDOptimizer.__init__ ArgSpec(args=['self', 'learning_rate', 'regularization', 'name'], varargs=None, keywords=None, defaults=(None, None)) -paddle.fluid.optimizer.SGDOptimizer.minimize ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set'], varargs=None, keywords=None, defaults=(None, None, None)) -paddle.fluid.optimizer.MomentumOptimizer.__init__ ArgSpec(args=['self', 'learning_rate', 'momentum', 'use_nesterov', 'regularization', 'name'], varargs=None, keywords=None, defaults=(False, None, None)) -paddle.fluid.optimizer.MomentumOptimizer.minimize ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set'], varargs=None, keywords=None, defaults=(None, None, None)) -paddle.fluid.optimizer.AdagradOptimizer.__init__ ArgSpec(args=['self', 'learning_rate', 'epsilon', 'regularization', 'name'], varargs=None, keywords=None, defaults=(1e-06, None, None)) -paddle.fluid.optimizer.AdagradOptimizer.minimize ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set'], varargs=None, keywords=None, defaults=(None, None, None)) -paddle.fluid.optimizer.AdamOptimizer.__init__ ArgSpec(args=['self', 'learning_rate', 'beta1', 'beta2', 'epsilon', 'regularization', 'name'], varargs=None, keywords=None, defaults=(0.001, 0.9, 0.999, 1e-08, None, None)) -paddle.fluid.optimizer.AdamOptimizer.minimize ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set'], varargs=None, keywords=None, defaults=(None, None, None)) -paddle.fluid.optimizer.AdamaxOptimizer.__init__ ArgSpec(args=['self', 'learning_rate', 'beta1', 'beta2', 'epsilon', 'regularization', 'name'], varargs=None, keywords=None, defaults=(0.001, 0.9, 0.999, 1e-08, None, None)) -paddle.fluid.optimizer.AdamaxOptimizer.minimize ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set'], varargs=None, keywords=None, defaults=(None, None, None)) -paddle.fluid.optimizer.DecayedAdagradOptimizer.__init__ ArgSpec(args=['self', 'learning_rate', 'decay', 'epsilon', 'regularization', 'name'], varargs=None, keywords=None, defaults=(0.95, 1e-06, None, None)) -paddle.fluid.optimizer.DecayedAdagradOptimizer.minimize ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set'], varargs=None, keywords=None, defaults=(None, None, None)) -paddle.fluid.optimizer.FtrlOptimizer.__init__ ArgSpec(args=['self', 'learning_rate', 'l1', 'l2', 'lr_power', 'regularization', 'name'], varargs=None, keywords=None, defaults=(0.0, 0.0, -0.5, None, None)) -paddle.fluid.optimizer.FtrlOptimizer.minimize ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set'], varargs=None, keywords=None, defaults=(None, None, None)) -paddle.fluid.optimizer.RMSPropOptimizer.__init__ ArgSpec(args=['self', 'learning_rate', 'rho', 'epsilon', 'momentum', 'centered', 'regularization', 'name'], varargs=None, keywords=None, defaults=(0.95, 1e-06, 0.0, False, None, None)) -paddle.fluid.optimizer.RMSPropOptimizer.minimize ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set'], varargs=None, keywords=None, defaults=(None, None, None)) -paddle.fluid.optimizer.AdadeltaOptimizer.__init__ ArgSpec(args=['self', 'learning_rate', 'epsilon', 'rho', 'regularization', 'name'], varargs=None, keywords=None, defaults=(1e-06, 0.95, None, None)) -paddle.fluid.optimizer.AdadeltaOptimizer.minimize ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set'], varargs=None, keywords=None, defaults=(None, None, None)) -paddle.fluid.optimizer.ModelAverage.__init__ ArgSpec(args=['self', 'average_window_rate', 'min_average_window', 'max_average_window', 'regularization', 'name'], varargs=None, keywords=None, defaults=(10000, 10000, None, None)) -paddle.fluid.optimizer.ModelAverage.apply ArgSpec(args=[], varargs='args', keywords='kwds', defaults=None) -paddle.fluid.optimizer.ModelAverage.minimize ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set'], varargs=None, keywords=None, defaults=(None, None, None)) -paddle.fluid.optimizer.ModelAverage.restore ArgSpec(args=['self', 'executor'], varargs=None, keywords=None, defaults=None) -paddle.fluid.backward.append_backward ArgSpec(args=['loss', 'parameter_list', 'no_grad_set', 'callbacks'], varargs=None, keywords=None, defaults=(None, None, None)) -paddle.fluid.regularizer.L1DecayRegularizer.__init__ ArgSpec(args=['self', 'regularization_coeff'], varargs=None, keywords=None, defaults=(0.0,)) -paddle.fluid.regularizer.L2DecayRegularizer.__init__ ArgSpec(args=['self', 'regularization_coeff'], varargs=None, keywords=None, defaults=(0.0,)) +paddle.fluid.nets.simple_img_conv_pool (ArgSpec(args=['input', 'num_filters', 'filter_size', 'pool_size', 'pool_stride', 'pool_padding', 'pool_type', 'global_pooling', 'conv_stride', 'conv_padding', 'conv_dilation', 'conv_groups', 'param_attr', 'bias_attr', 'act', 'use_cudnn'], varargs=None, keywords=None, defaults=(0, 'max', False, 1, 0, 1, 1, None, None, None, True)), ('document', 'e0f67f35abf27f666f81003113b90244')) +paddle.fluid.nets.sequence_conv_pool (ArgSpec(args=['input', 'num_filters', 'filter_size', 'param_attr', 'act', 'pool_type', 'bias_attr'], varargs=None, keywords=None, defaults=(None, 'sigmoid', 'max', None)), ('document', '48c434dd7bb827f69d90e5135d77470f')) +paddle.fluid.nets.glu (ArgSpec(args=['input', 'dim'], varargs=None, keywords=None, defaults=(-1,)), ('document', '08c1c57e1db6b20bf87b264cb7cf3ca8')) +paddle.fluid.nets.scaled_dot_product_attention (ArgSpec(args=['queries', 'keys', 'values', 'num_heads', 'dropout_rate'], varargs=None, keywords=None, defaults=(1, 0.0)), ('document', '921714c9bfb351b41403418265393203')) +paddle.fluid.nets.img_conv_group (ArgSpec(args=['input', 'conv_num_filter', 'pool_size', 'conv_padding', 'conv_filter_size', 'conv_act', 'param_attr', 'conv_with_batchnorm', 'conv_batchnorm_drop_rate', 'pool_stride', 'pool_type', 'use_cudnn'], varargs=None, keywords=None, defaults=(1, 3, None, None, False, 0.0, 1, 'max', True)), ('document', '3802be78fbfb206dae64a2d9f8480970')) +paddle.fluid.optimizer.SGDOptimizer.__init__ (ArgSpec(args=['self', 'learning_rate', 'regularization', 'name'], varargs=None, keywords=None, defaults=(None, None)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.optimizer.SGDOptimizer.apply_gradients (ArgSpec(args=['self', 'params_grads'], varargs=None, keywords=None, defaults=None), ('document', 'bfe7305918552aaecfdaa22411dbe871')) +paddle.fluid.optimizer.SGDOptimizer.apply_optimize (ArgSpec(args=['self', 'loss', 'startup_program', 'params_grads'], varargs=None, keywords=None, defaults=None), ('document', '5c46d1926a40f1f873ffe9f37ac89dae')) +paddle.fluid.optimizer.SGDOptimizer.backward (ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set', 'callbacks'], varargs=None, keywords=None, defaults=(None, None, None, None)), ('document', 'ba3a113d0229ff7bc9d39bda0a6d947f')) +paddle.fluid.optimizer.SGDOptimizer.get_opti_var_name_list (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.optimizer.SGDOptimizer.minimize (ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set'], varargs=None, keywords=None, defaults=(None, None, None)), ('document', '35fd5d3330c97903528c7e0dacc7f6ea')) +paddle.fluid.optimizer.MomentumOptimizer.__init__ (ArgSpec(args=['self', 'learning_rate', 'momentum', 'use_nesterov', 'regularization', 'name'], varargs=None, keywords=None, defaults=(False, None, None)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.optimizer.MomentumOptimizer.apply_gradients (ArgSpec(args=['self', 'params_grads'], varargs=None, keywords=None, defaults=None), ('document', 'bfe7305918552aaecfdaa22411dbe871')) +paddle.fluid.optimizer.MomentumOptimizer.apply_optimize (ArgSpec(args=['self', 'loss', 'startup_program', 'params_grads'], varargs=None, keywords=None, defaults=None), ('document', '5c46d1926a40f1f873ffe9f37ac89dae')) +paddle.fluid.optimizer.MomentumOptimizer.backward (ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set', 'callbacks'], varargs=None, keywords=None, defaults=(None, None, None, None)), ('document', 'ba3a113d0229ff7bc9d39bda0a6d947f')) +paddle.fluid.optimizer.MomentumOptimizer.get_opti_var_name_list (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.optimizer.MomentumOptimizer.minimize (ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set'], varargs=None, keywords=None, defaults=(None, None, None)), ('document', '35fd5d3330c97903528c7e0dacc7f6ea')) +paddle.fluid.optimizer.AdagradOptimizer.__init__ (ArgSpec(args=['self', 'learning_rate', 'epsilon', 'regularization', 'name', 'initial_accumulator_value'], varargs=None, keywords=None, defaults=(1e-06, None, None, 0.0)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.optimizer.AdagradOptimizer.apply_gradients (ArgSpec(args=['self', 'params_grads'], varargs=None, keywords=None, defaults=None), ('document', 'bfe7305918552aaecfdaa22411dbe871')) +paddle.fluid.optimizer.AdagradOptimizer.apply_optimize (ArgSpec(args=['self', 'loss', 'startup_program', 'params_grads'], varargs=None, keywords=None, defaults=None), ('document', '5c46d1926a40f1f873ffe9f37ac89dae')) +paddle.fluid.optimizer.AdagradOptimizer.backward (ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set', 'callbacks'], varargs=None, keywords=None, defaults=(None, None, None, None)), ('document', 'ba3a113d0229ff7bc9d39bda0a6d947f')) +paddle.fluid.optimizer.AdagradOptimizer.get_opti_var_name_list (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.optimizer.AdagradOptimizer.minimize (ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set'], varargs=None, keywords=None, defaults=(None, None, None)), ('document', '35fd5d3330c97903528c7e0dacc7f6ea')) +paddle.fluid.optimizer.AdamOptimizer.__init__ (ArgSpec(args=['self', 'learning_rate', 'beta1', 'beta2', 'epsilon', 'regularization', 'name', 'lazy_mode'], varargs=None, keywords=None, defaults=(0.001, 0.9, 0.999, 1e-08, None, None, False)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.optimizer.AdamOptimizer.apply_gradients (ArgSpec(args=['self', 'params_grads'], varargs=None, keywords=None, defaults=None), ('document', 'bfe7305918552aaecfdaa22411dbe871')) +paddle.fluid.optimizer.AdamOptimizer.apply_optimize (ArgSpec(args=['self', 'loss', 'startup_program', 'params_grads'], varargs=None, keywords=None, defaults=None), ('document', '5c46d1926a40f1f873ffe9f37ac89dae')) +paddle.fluid.optimizer.AdamOptimizer.backward (ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set', 'callbacks'], varargs=None, keywords=None, defaults=(None, None, None, None)), ('document', 'ba3a113d0229ff7bc9d39bda0a6d947f')) +paddle.fluid.optimizer.AdamOptimizer.get_opti_var_name_list (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.optimizer.AdamOptimizer.minimize (ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set'], varargs=None, keywords=None, defaults=(None, None, None)), ('document', '35fd5d3330c97903528c7e0dacc7f6ea')) +paddle.fluid.optimizer.AdamaxOptimizer.__init__ (ArgSpec(args=['self', 'learning_rate', 'beta1', 'beta2', 'epsilon', 'regularization', 'name'], varargs=None, keywords=None, defaults=(0.001, 0.9, 0.999, 1e-08, None, None)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.optimizer.AdamaxOptimizer.apply_gradients (ArgSpec(args=['self', 'params_grads'], varargs=None, keywords=None, defaults=None), ('document', 'bfe7305918552aaecfdaa22411dbe871')) +paddle.fluid.optimizer.AdamaxOptimizer.apply_optimize (ArgSpec(args=['self', 'loss', 'startup_program', 'params_grads'], varargs=None, keywords=None, defaults=None), ('document', '5c46d1926a40f1f873ffe9f37ac89dae')) +paddle.fluid.optimizer.AdamaxOptimizer.backward (ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set', 'callbacks'], varargs=None, keywords=None, defaults=(None, None, None, None)), ('document', 'ba3a113d0229ff7bc9d39bda0a6d947f')) +paddle.fluid.optimizer.AdamaxOptimizer.get_opti_var_name_list (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.optimizer.AdamaxOptimizer.minimize (ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set'], varargs=None, keywords=None, defaults=(None, None, None)), ('document', '35fd5d3330c97903528c7e0dacc7f6ea')) +paddle.fluid.optimizer.DecayedAdagradOptimizer.__init__ (ArgSpec(args=['self', 'learning_rate', 'decay', 'epsilon', 'regularization', 'name'], varargs=None, keywords=None, defaults=(0.95, 1e-06, None, None)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.optimizer.DecayedAdagradOptimizer.apply_gradients (ArgSpec(args=['self', 'params_grads'], varargs=None, keywords=None, defaults=None), ('document', 'bfe7305918552aaecfdaa22411dbe871')) +paddle.fluid.optimizer.DecayedAdagradOptimizer.apply_optimize (ArgSpec(args=['self', 'loss', 'startup_program', 'params_grads'], varargs=None, keywords=None, defaults=None), ('document', '5c46d1926a40f1f873ffe9f37ac89dae')) +paddle.fluid.optimizer.DecayedAdagradOptimizer.backward (ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set', 'callbacks'], varargs=None, keywords=None, defaults=(None, None, None, None)), ('document', 'ba3a113d0229ff7bc9d39bda0a6d947f')) +paddle.fluid.optimizer.DecayedAdagradOptimizer.get_opti_var_name_list (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.optimizer.DecayedAdagradOptimizer.minimize (ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set'], varargs=None, keywords=None, defaults=(None, None, None)), ('document', '35fd5d3330c97903528c7e0dacc7f6ea')) +paddle.fluid.optimizer.FtrlOptimizer.__init__ (ArgSpec(args=['self', 'learning_rate', 'l1', 'l2', 'lr_power', 'regularization', 'name'], varargs=None, keywords=None, defaults=(0.0, 0.0, -0.5, None, None)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.optimizer.FtrlOptimizer.apply_gradients (ArgSpec(args=['self', 'params_grads'], varargs=None, keywords=None, defaults=None), ('document', 'bfe7305918552aaecfdaa22411dbe871')) +paddle.fluid.optimizer.FtrlOptimizer.apply_optimize (ArgSpec(args=['self', 'loss', 'startup_program', 'params_grads'], varargs=None, keywords=None, defaults=None), ('document', '5c46d1926a40f1f873ffe9f37ac89dae')) +paddle.fluid.optimizer.FtrlOptimizer.backward (ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set', 'callbacks'], varargs=None, keywords=None, defaults=(None, None, None, None)), ('document', 'ba3a113d0229ff7bc9d39bda0a6d947f')) +paddle.fluid.optimizer.FtrlOptimizer.get_opti_var_name_list (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.optimizer.FtrlOptimizer.minimize (ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set'], varargs=None, keywords=None, defaults=(None, None, None)), ('document', '35fd5d3330c97903528c7e0dacc7f6ea')) +paddle.fluid.optimizer.RMSPropOptimizer.__init__ (ArgSpec(args=['self', 'learning_rate', 'rho', 'epsilon', 'momentum', 'centered', 'regularization', 'name'], varargs=None, keywords=None, defaults=(0.95, 1e-06, 0.0, False, None, None)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.optimizer.RMSPropOptimizer.apply_gradients (ArgSpec(args=['self', 'params_grads'], varargs=None, keywords=None, defaults=None), ('document', 'bfe7305918552aaecfdaa22411dbe871')) +paddle.fluid.optimizer.RMSPropOptimizer.apply_optimize (ArgSpec(args=['self', 'loss', 'startup_program', 'params_grads'], varargs=None, keywords=None, defaults=None), ('document', '5c46d1926a40f1f873ffe9f37ac89dae')) +paddle.fluid.optimizer.RMSPropOptimizer.backward (ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set', 'callbacks'], varargs=None, keywords=None, defaults=(None, None, None, None)), ('document', 'ba3a113d0229ff7bc9d39bda0a6d947f')) +paddle.fluid.optimizer.RMSPropOptimizer.get_opti_var_name_list (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.optimizer.RMSPropOptimizer.minimize (ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set'], varargs=None, keywords=None, defaults=(None, None, None)), ('document', '35fd5d3330c97903528c7e0dacc7f6ea')) +paddle.fluid.optimizer.AdadeltaOptimizer.__init__ (ArgSpec(args=['self', 'learning_rate', 'epsilon', 'rho', 'regularization', 'name'], varargs=None, keywords=None, defaults=(1e-06, 0.95, None, None)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.optimizer.AdadeltaOptimizer.apply_gradients (ArgSpec(args=['self', 'params_grads'], varargs=None, keywords=None, defaults=None), ('document', 'bfe7305918552aaecfdaa22411dbe871')) +paddle.fluid.optimizer.AdadeltaOptimizer.apply_optimize (ArgSpec(args=['self', 'loss', 'startup_program', 'params_grads'], varargs=None, keywords=None, defaults=None), ('document', '5c46d1926a40f1f873ffe9f37ac89dae')) +paddle.fluid.optimizer.AdadeltaOptimizer.backward (ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set', 'callbacks'], varargs=None, keywords=None, defaults=(None, None, None, None)), ('document', 'ba3a113d0229ff7bc9d39bda0a6d947f')) +paddle.fluid.optimizer.AdadeltaOptimizer.get_opti_var_name_list (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.optimizer.AdadeltaOptimizer.minimize (ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set'], varargs=None, keywords=None, defaults=(None, None, None)), ('document', '35fd5d3330c97903528c7e0dacc7f6ea')) +paddle.fluid.optimizer.ModelAverage.__init__ (ArgSpec(args=['self', 'average_window_rate', 'min_average_window', 'max_average_window', 'regularization', 'name'], varargs=None, keywords=None, defaults=(10000, 10000, None, None)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.optimizer.ModelAverage.apply (ArgSpec(args=['self', 'executor', 'need_restore'], varargs=None, keywords=None, defaults=(True,)), ('document', '46234a5470590feb336346f70a3db715')) +paddle.fluid.optimizer.ModelAverage.apply_gradients (ArgSpec(args=['self', 'params_grads'], varargs=None, keywords=None, defaults=None), ('document', 'bfe7305918552aaecfdaa22411dbe871')) +paddle.fluid.optimizer.ModelAverage.apply_optimize (ArgSpec(args=['self', 'loss', 'startup_program', 'params_grads'], varargs=None, keywords=None, defaults=None), ('document', '5c46d1926a40f1f873ffe9f37ac89dae')) +paddle.fluid.optimizer.ModelAverage.backward (ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set', 'callbacks'], varargs=None, keywords=None, defaults=(None, None, None, None)), ('document', 'ba3a113d0229ff7bc9d39bda0a6d947f')) +paddle.fluid.optimizer.ModelAverage.get_opti_var_name_list (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.optimizer.ModelAverage.minimize (ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set'], varargs=None, keywords=None, defaults=(None, None, None)), ('document', '35fd5d3330c97903528c7e0dacc7f6ea')) +paddle.fluid.optimizer.ModelAverage.restore (ArgSpec(args=['self', 'executor'], varargs=None, keywords=None, defaults=None), ('document', '18db9c70be9c4dd466f9844457b21bfe')) +paddle.fluid.optimizer.LarsMomentumOptimizer.__init__ (ArgSpec(args=['self', 'learning_rate', 'momentum', 'lars_coeff', 'lars_weight_decay', 'regularization', 'name'], varargs=None, keywords=None, defaults=(0.001, 0.0005, None, None)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.optimizer.LarsMomentumOptimizer.apply_gradients (ArgSpec(args=['self', 'params_grads'], varargs=None, keywords=None, defaults=None), ('document', 'bfe7305918552aaecfdaa22411dbe871')) +paddle.fluid.optimizer.LarsMomentumOptimizer.apply_optimize (ArgSpec(args=['self', 'loss', 'startup_program', 'params_grads'], varargs=None, keywords=None, defaults=None), ('document', '5c46d1926a40f1f873ffe9f37ac89dae')) +paddle.fluid.optimizer.LarsMomentumOptimizer.backward (ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set', 'callbacks'], varargs=None, keywords=None, defaults=(None, None, None, None)), ('document', 'ba3a113d0229ff7bc9d39bda0a6d947f')) +paddle.fluid.optimizer.LarsMomentumOptimizer.get_opti_var_name_list (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.optimizer.LarsMomentumOptimizer.minimize (ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set'], varargs=None, keywords=None, defaults=(None, None, None)), ('document', '35fd5d3330c97903528c7e0dacc7f6ea')) +paddle.fluid.optimizer.DGCMomentumOptimizer.__init__ (ArgSpec(args=['self', 'learning_rate', 'momentum', 'rampup_begin_step', 'rampup_step', 'sparsity', 'use_nesterov', 'local_grad_clip_norm', 'num_trainers', 'regularization', 'name'], varargs=None, keywords=None, defaults=(1, [0.999], False, None, None, None, None)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.optimizer.DGCMomentumOptimizer.apply_gradients (ArgSpec(args=['self', 'params_grads'], varargs=None, keywords=None, defaults=None), ('document', 'bfe7305918552aaecfdaa22411dbe871')) +paddle.fluid.optimizer.DGCMomentumOptimizer.apply_optimize (ArgSpec(args=['self', 'loss', 'startup_program', 'params_grads'], varargs=None, keywords=None, defaults=None), ('document', '5c46d1926a40f1f873ffe9f37ac89dae')) +paddle.fluid.optimizer.DGCMomentumOptimizer.backward (ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set', 'callbacks'], varargs=None, keywords=None, defaults=(None, None, None, None)), ('document', 'ba3a113d0229ff7bc9d39bda0a6d947f')) +paddle.fluid.optimizer.DGCMomentumOptimizer.get_opti_var_name_list (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.optimizer.DGCMomentumOptimizer.minimize (ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set'], varargs=None, keywords=None, defaults=(None, None, None)), ('document', '35fd5d3330c97903528c7e0dacc7f6ea')) +paddle.fluid.backward.append_backward (ArgSpec(args=['loss', 'parameter_list', 'no_grad_set', 'callbacks'], varargs=None, keywords=None, defaults=(None, None, None)), ('document', '1a79bd7d10ae54ca763ec81bca36ba24')) +paddle.fluid.regularizer.L1DecayRegularizer.__init__ (ArgSpec(args=['self', 'regularization_coeff'], varargs=None, keywords=None, defaults=(0.0,)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.regularizer.L2DecayRegularizer.__init__ (ArgSpec(args=['self', 'regularization_coeff'], varargs=None, keywords=None, defaults=(0.0,)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) paddle.fluid.LoDTensor.__init__ 1. __init__(self: paddle.fluid.core.LoDTensor, arg0: List[List[int]]) -> None 2. __init__(self: paddle.fluid.core.LoDTensor) -> None paddle.fluid.LoDTensor.has_valid_recursive_sequence_lengths has_valid_recursive_sequence_lengths(self: paddle.fluid.core.LoDTensor) -> bool paddle.fluid.LoDTensor.lod lod(self: paddle.fluid.core.LoDTensor) -> List[List[int]] paddle.fluid.LoDTensor.recursive_sequence_lengths recursive_sequence_lengths(self: paddle.fluid.core.LoDTensor) -> List[List[int]] paddle.fluid.LoDTensor.set 1. set(self: paddle.fluid.core.Tensor, arg0: numpy.ndarray[float32], arg1: paddle::platform::CPUPlace) -> None 2. set(self: paddle.fluid.core.Tensor, arg0: numpy.ndarray[int32], arg1: paddle::platform::CPUPlace) -> None 3. set(self: paddle.fluid.core.Tensor, arg0: numpy.ndarray[float64], arg1: paddle::platform::CPUPlace) -> None 4. set(self: paddle.fluid.core.Tensor, arg0: numpy.ndarray[int64], arg1: paddle::platform::CPUPlace) -> None 5. set(self: paddle.fluid.core.Tensor, arg0: numpy.ndarray[bool], arg1: paddle::platform::CPUPlace) -> None 6. set(self: paddle.fluid.core.Tensor, arg0: numpy.ndarray[uint16], arg1: paddle::platform::CPUPlace) -> None 7. set(self: paddle.fluid.core.Tensor, arg0: numpy.ndarray[uint8], arg1: paddle::platform::CPUPlace) -> None 8. set(self: paddle.fluid.core.Tensor, arg0: numpy.ndarray[int8], arg1: paddle::platform::CPUPlace) -> None 9. set(self: paddle.fluid.core.Tensor, arg0: numpy.ndarray[float32], arg1: paddle::platform::CUDAPlace) -> None 10. set(self: paddle.fluid.core.Tensor, arg0: numpy.ndarray[int32], arg1: paddle::platform::CUDAPlace) -> None 11. set(self: paddle.fluid.core.Tensor, arg0: numpy.ndarray[float64], arg1: paddle::platform::CUDAPlace) -> None 12. set(self: paddle.fluid.core.Tensor, arg0: numpy.ndarray[int64], arg1: paddle::platform::CUDAPlace) -> None 13. set(self: paddle.fluid.core.Tensor, arg0: numpy.ndarray[bool], arg1: paddle::platform::CUDAPlace) -> None 14. set(self: paddle.fluid.core.Tensor, arg0: numpy.ndarray[uint16], arg1: paddle::platform::CUDAPlace) -> None 15. set(self: paddle.fluid.core.Tensor, arg0: numpy.ndarray[uint8], arg1: paddle::platform::CUDAPlace) -> None 16. set(self: paddle.fluid.core.Tensor, arg0: numpy.ndarray[int8], arg1: paddle::platform::CUDAPlace) -> None 17. set(self: paddle.fluid.core.Tensor, arg0: numpy.ndarray[float32], arg1: paddle::platform::CUDAPinnedPlace) -> None 18. set(self: paddle.fluid.core.Tensor, arg0: numpy.ndarray[int32], arg1: paddle::platform::CUDAPinnedPlace) -> None 19. set(self: paddle.fluid.core.Tensor, arg0: numpy.ndarray[float64], arg1: paddle::platform::CUDAPinnedPlace) -> None 20. set(self: paddle.fluid.core.Tensor, arg0: numpy.ndarray[int64], arg1: paddle::platform::CUDAPinnedPlace) -> None 21. set(self: paddle.fluid.core.Tensor, arg0: numpy.ndarray[bool], arg1: paddle::platform::CUDAPinnedPlace) -> None 22. set(self: paddle.fluid.core.Tensor, arg0: numpy.ndarray[uint16], arg1: paddle::platform::CUDAPinnedPlace) -> None 23. set(self: paddle.fluid.core.Tensor, arg0: numpy.ndarray[uint8], arg1: paddle::platform::CUDAPinnedPlace) -> None 24. set(self: paddle.fluid.core.Tensor, arg0: numpy.ndarray[int8], arg1: paddle::platform::CUDAPinnedPlace) -> None -paddle.fluid.LoDTensor.set_lod set_lod(self: paddle.fluid.core.LoDTensor, arg0: List[List[int]]) -> None -paddle.fluid.LoDTensor.set_recursive_sequence_lengths set_recursive_sequence_lengths(self: paddle.fluid.core.LoDTensor, arg0: List[List[int]]) -> None +paddle.fluid.LoDTensor.set_lod set_lod(self: paddle.fluid.core.LoDTensor, lod: List[List[int]]) -> None +paddle.fluid.LoDTensor.set_recursive_sequence_lengths set_recursive_sequence_lengths(self: paddle.fluid.core.LoDTensor, recursive_sequence_lengths: List[List[int]]) -> None paddle.fluid.LoDTensor.shape shape(self: paddle.fluid.core.Tensor) -> List[int] paddle.fluid.LoDTensorArray.__init__ __init__(self: paddle.fluid.core.LoDTensorArray) -> None -paddle.fluid.LoDTensorArray.append append(self: paddle.fluid.core.LoDTensorArray, arg0: paddle.fluid.core.LoDTensor) -> None +paddle.fluid.LoDTensorArray.append append(self: paddle.fluid.core.LoDTensorArray, tensor: paddle.fluid.core.LoDTensor) -> None paddle.fluid.CPUPlace.__init__ __init__(self: paddle.fluid.core.CPUPlace) -> None paddle.fluid.CUDAPlace.__init__ __init__(self: paddle.fluid.core.CUDAPlace, arg0: int) -> None paddle.fluid.CUDAPinnedPlace.__init__ __init__(self: paddle.fluid.core.CUDAPinnedPlace) -> None -paddle.fluid.ParamAttr.__init__ ArgSpec(args=['self', 'name', 'initializer', 'learning_rate', 'regularizer', 'trainable', 'gradient_clip', 'do_model_average'], varargs=None, keywords=None, defaults=(None, None, 1.0, None, True, None, False)) -paddle.fluid.WeightNormParamAttr.__init__ ArgSpec(args=['self', 'dim', 'name', 'initializer', 'learning_rate', 'regularizer', 'trainable', 'gradient_clip', 'do_model_average'], varargs=None, keywords=None, defaults=(None, None, None, 1.0, None, True, None, False)) -paddle.fluid.DataFeeder.__init__ ArgSpec(args=['self', 'feed_list', 'place', 'program'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.DataFeeder.decorate_reader ArgSpec(args=['self', 'reader', 'multi_devices', 'num_places', 'drop_last'], varargs=None, keywords=None, defaults=(None, True)) -paddle.fluid.DataFeeder.feed ArgSpec(args=['self', 'iterable'], varargs=None, keywords=None, defaults=None) -paddle.fluid.DataFeeder.feed_parallel ArgSpec(args=['self', 'iterable', 'num_places'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.clip.ErrorClipByValue.__init__ ArgSpec(args=['self', 'max', 'min'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.clip.GradientClipByValue.__init__ ArgSpec(args=['self', 'max', 'min'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.clip.GradientClipByNorm.__init__ ArgSpec(args=['self', 'clip_norm'], varargs=None, keywords=None, defaults=None) -paddle.fluid.clip.GradientClipByGlobalNorm.__init__ ArgSpec(args=['self', 'clip_norm', 'group_name'], varargs=None, keywords=None, defaults=('default_group',)) -paddle.fluid.profiler.cuda_profiler ArgSpec(args=[], varargs='args', keywords='kwds', defaults=None) -paddle.fluid.profiler.reset_profiler ArgSpec(args=[], varargs=None, keywords=None, defaults=None) -paddle.fluid.profiler.profiler ArgSpec(args=[], varargs='args', keywords='kwds', defaults=None) -paddle.fluid.profiler.start_profiler ArgSpec(args=['state'], varargs=None, keywords=None, defaults=None) -paddle.fluid.profiler.stop_profiler ArgSpec(args=['sorted_key', 'profile_path'], varargs=None, keywords=None, defaults=(None, '/tmp/profile')) -paddle.fluid.unique_name.generate ArgSpec(args=['key'], varargs=None, keywords=None, defaults=None) -paddle.fluid.unique_name.switch ArgSpec(args=['new_generator'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.unique_name.guard ArgSpec(args=[], varargs='args', keywords='kwds', defaults=None) -paddle.fluid.recordio_writer.convert_reader_to_recordio_file ArgSpec(args=['filename', 'reader_creator', 'feeder', 'compressor', 'max_num_records', 'feed_order'], varargs=None, keywords=None, defaults=(Compressor.Snappy, 1000, None)) -paddle.fluid.recordio_writer.convert_reader_to_recordio_files ArgSpec(args=['filename', 'batch_per_file', 'reader_creator', 'feeder', 'compressor', 'max_num_records', 'feed_order'], varargs=None, keywords=None, defaults=(Compressor.Snappy, 1000, None)) -paddle.fluid.Scope.__init__ __init__(self: paddle.fluid.core.Scope) -> None -paddle.fluid.Scope.drop_kids drop_kids(self: paddle.fluid.core.Scope) -> None -paddle.fluid.Scope.find_var find_var(self: paddle.fluid.core.Scope, arg0: unicode) -> paddle.fluid.core.Variable -paddle.fluid.Scope.new_scope new_scope(self: paddle.fluid.core.Scope) -> paddle.fluid.core.Scope -paddle.fluid.Scope.var var(self: paddle.fluid.core.Scope, arg0: unicode) -> paddle.fluid.core.Variable +paddle.fluid.ParamAttr.__init__ (ArgSpec(args=['self', 'name', 'initializer', 'learning_rate', 'regularizer', 'trainable', 'gradient_clip', 'do_model_average'], varargs=None, keywords=None, defaults=(None, None, 1.0, None, True, None, False)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.WeightNormParamAttr.__init__ (ArgSpec(args=['self', 'dim', 'name', 'initializer', 'learning_rate', 'regularizer', 'trainable', 'gradient_clip', 'do_model_average'], varargs=None, keywords=None, defaults=(None, None, None, 1.0, None, True, None, False)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.DataFeeder.__init__ (ArgSpec(args=['self', 'feed_list', 'place', 'program'], varargs=None, keywords=None, defaults=(None,)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.DataFeeder.decorate_reader (ArgSpec(args=['self', 'reader', 'multi_devices', 'num_places', 'drop_last'], varargs=None, keywords=None, defaults=(None, True)), ('document', 'f8f3df23c5633c614db781a91b81fb62')) +paddle.fluid.DataFeeder.feed (ArgSpec(args=['self', 'iterable'], varargs=None, keywords=None, defaults=None), ('document', '459e316301279dfd82001b46f0b8ffca')) +paddle.fluid.DataFeeder.feed_parallel (ArgSpec(args=['self', 'iterable', 'num_places'], varargs=None, keywords=None, defaults=(None,)), ('document', '543863d1f9d4853758adb613b8659e85')) +paddle.fluid.clip.ErrorClipByValue.__init__ (ArgSpec(args=['self', 'max', 'min'], varargs=None, keywords=None, defaults=(None,)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.clip.GradientClipByValue.__init__ (ArgSpec(args=['self', 'max', 'min'], varargs=None, keywords=None, defaults=(None,)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.clip.GradientClipByNorm.__init__ (ArgSpec(args=['self', 'clip_norm'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.clip.GradientClipByGlobalNorm.__init__ (ArgSpec(args=['self', 'clip_norm', 'group_name'], varargs=None, keywords=None, defaults=('default_group',)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.profiler.cuda_profiler (ArgSpec(args=['output_file', 'output_mode', 'config'], varargs=None, keywords=None, defaults=(None, None)), ('document', '2e2fb1cfc469a67f19fb578a2ed6be79')) +paddle.fluid.profiler.reset_profiler (ArgSpec(args=[], varargs=None, keywords=None, defaults=None), ('document', '397ce757fabbe5c622e0c3458c41fcd0')) +paddle.fluid.profiler.profiler (ArgSpec(args=['state', 'sorted_key', 'profile_path'], varargs=None, keywords=None, defaults=(None, '/tmp/profile')), ('document', 'bd3a07eeb68e384f4d2d416cb2e28d86')) +paddle.fluid.profiler.start_profiler (ArgSpec(args=['state'], varargs=None, keywords=None, defaults=None), ('document', '88da8fb6dbebaee2f7520188a09574f9')) +paddle.fluid.profiler.stop_profiler (ArgSpec(args=['sorted_key', 'profile_path'], varargs=None, keywords=None, defaults=(None, '/tmp/profile')), ('document', 'a7500e39dd033f1e64f562e909333a8a')) +paddle.fluid.unique_name.generate (ArgSpec(args=['key'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.unique_name.switch (ArgSpec(args=['new_generator'], varargs=None, keywords=None, defaults=(None,)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.unique_name.guard (ArgSpec(args=['new_generator'], varargs=None, keywords=None, defaults=(None,)), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.fluid.recordio_writer.convert_reader_to_recordio_file (ArgSpec(args=['filename', 'reader_creator', 'feeder', 'compressor', 'max_num_records', 'feed_order'], varargs=None, keywords=None, defaults=(Compressor.Snappy, 1000, None)), ('document', '65c7523e86f0c50bb729b01667f36310')) +paddle.fluid.recordio_writer.convert_reader_to_recordio_files (ArgSpec(args=['filename', 'batch_per_file', 'reader_creator', 'feeder', 'compressor', 'max_num_records', 'feed_order'], varargs=None, keywords=None, defaults=(Compressor.Snappy, 1000, None)), ('document', 'bc643f0f5f1b9db57ff0d8a57d379bd7')) +paddle.fluid.Scope Scope() -> paddle.fluid.core._Scope +paddle.fluid.install_check.run_check (ArgSpec(args=[], varargs=None, keywords=None, defaults=None), ('document', '66b7c84a17ed32fec2df9628367be2b9')) +paddle.reader.cache (ArgSpec(args=['reader'], varargs=None, keywords=None, defaults=None), ('document', '1676886070eb607cb608f7ba47be0d3c')) +paddle.reader.map_readers (ArgSpec(args=['func'], varargs='readers', keywords=None, defaults=None), ('document', '77cbadb09df588e21e5cc0819b69c87d')) +paddle.reader.buffered (ArgSpec(args=['reader', 'size'], varargs=None, keywords=None, defaults=None), ('document', '0d6186f109feceb99f60ec50a0a624cb')) +paddle.reader.compose (ArgSpec(args=[], varargs='readers', keywords='kwargs', defaults=None), ('document', '884291104e1c3f37f33aae44b7deeb0d')) +paddle.reader.chain (ArgSpec(args=[], varargs='readers', keywords=None, defaults=None), ('document', 'd22c34e379a53901ae67a6bca7f4def4')) +paddle.reader.shuffle (ArgSpec(args=['reader', 'buf_size'], varargs=None, keywords=None, defaults=None), ('document', 'e42ea6fee23ce26b23cb142cd1d6522d')) +paddle.reader.firstn (ArgSpec(args=['reader', 'n'], varargs=None, keywords=None, defaults=None), ('document', 'c5bb8f7dd4f917f1569a368aab5b8aad')) +paddle.reader.xmap_readers (ArgSpec(args=['mapper', 'reader', 'process_num', 'buffer_size', 'order'], varargs=None, keywords=None, defaults=(False,)), ('document', '9c804a42f8a4dbaa76b3c98e0ab7f796')) +paddle.reader.PipeReader.__init__ (ArgSpec(args=['self', 'command', 'bufsize', 'file_type'], varargs=None, keywords=None, defaults=(8192, 'plain')), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.reader.PipeReader.get_line (ArgSpec(args=['self', 'cut_lines', 'line_break'], varargs=None, keywords=None, defaults=(True, '\n')), ('document', '9621ae612e595b6c34eb3bb5f3eb1a45')) +paddle.reader.multiprocess_reader (ArgSpec(args=['readers', 'use_pipe', 'queue_size'], varargs=None, keywords=None, defaults=(True, 1000)), ('document', '7d8b3a96e592107c893d5d51ce968ba0')) +paddle.reader.Fake.__init__ (ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None), ('document', '6adf97f83acf6453d4a6a4b1070f3754')) +paddle.reader.creator.np_array (ArgSpec(args=['x'], varargs=None, keywords=None, defaults=None), ('document', '28d457fbc9a71efa4ac91a3be179cada')) +paddle.reader.creator.text_file (ArgSpec(args=['path'], varargs=None, keywords=None, defaults=None), ('document', 'f45fcb7add066c8e042c6774fc7c3db2')) +paddle.reader.creator.recordio (ArgSpec(args=['paths', 'buf_size'], varargs=None, keywords=None, defaults=(100,)), ('document', 'b4a94ee0e2cefb495619275c2f8c61d2')) diff --git a/paddle/fluid/CMakeLists.txt b/paddle/fluid/CMakeLists.txt index 48b36df6499e59fe742766b5f81fd30a9fb8b900..595454e90b9cd713fd2baed24538cf5fbc93934a 100644 --- a/paddle/fluid/CMakeLists.txt +++ b/paddle/fluid/CMakeLists.txt @@ -1,16 +1,12 @@ add_subdirectory(memory) add_subdirectory(platform) add_subdirectory(framework) +add_subdirectory(imperative) add_subdirectory(operators) add_subdirectory(string) - -if (NOT WIN32) -add_subdirectory(pybind) add_subdirectory(recordio) -endif(NOT WIN32) +add_subdirectory(pybind) -if(WITH_INFERENCE) - # NOTE: please add subdirectory inference at last. - add_subdirectory(inference) - add_subdirectory(train) -endif() +# NOTE: please add subdirectory inference at last. +add_subdirectory(inference) +add_subdirectory(train) diff --git a/paddle/fluid/framework/CMakeLists.txt b/paddle/fluid/framework/CMakeLists.txt index 844291140602a7a0aac9d9d40256deaf9d8a4c60..4e00630bb124c5e10a3b4e0e346326a45642fa3e 100644 --- a/paddle/fluid/framework/CMakeLists.txt +++ b/paddle/fluid/framework/CMakeLists.txt @@ -1,45 +1,40 @@ -# windows treat symbolic file as a real file, which is different with unix -# We create a hidden file and compile it instead of origin source file. +#windows treat symbolic file as a real file, which is different with unix +#We create a hidden file and compile it instead of origin source file. function(windows_symbolic TARGET) set(oneValueArgs "") - set(multiValueArgs SRCS DEPS) + set(multiValueArgs SRCS PATH) cmake_parse_arguments(windows_symbolic "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) + set(final_path ${CMAKE_CURRENT_SOURCE_DIR}/${windows_symbolic_PATH}) foreach(src ${windows_symbolic_SRCS}) - get_filename_component(src ${src} NAME_WE) - if (NOT EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${src}.cc OR NOT EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${src}.cu) - message(FATAL " ${src}.cc and ${src}.cu must exsits, and ${src}.cu must be symbolic file.") - endif() - - # only copy the xx.cu to .xx.cu when the content are modified - set(copy_flag 1) - if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/.${src}.cu) - file(READ ${CMAKE_CURRENT_SOURCE_DIR}/${src}.cc SOURCE_STR) - file(READ ${CMAKE_CURRENT_SOURCE_DIR}/.${src}.cu TARGET_STR) - if (SOURCE_STR STREQUAL TARGET_STR) - set(copy_flag 0) - endif() - endif() - if (copy_flag) - add_custom_command(OUTPUT .${src}.cu - COMMAND ${CMAKE_COMMAND} -E remove ${CMAKE_CURRENT_SOURCE_DIR}/.${src}.cu - COMMAND ${CMAKE_COMMAND} -E copy "${CMAKE_CURRENT_SOURCE_DIR}/${src}.cc" "${CMAKE_CURRENT_SOURCE_DIR}/.${src}.cu" - COMMENT "create hidden file of ${src}.cu") - endif(copy_flag) - add_custom_target(${TARGET} ALL DEPENDS .${src}.cu) + get_filename_component(src ${src} NAME_WE) + if (NOT EXISTS ${final_path}/${src}.cc OR NOT EXISTS ${final_path}/${src}.cu) + message(FATAL " ${src}.cc and ${src}.cu must exsits, and ${src}.cu must be symbolic file.") + endif() + + file(GENERATE OUTPUT ${final_path}/.${src}.cu INPUT ${final_path}/${src}.cc) + + add_custom_command(OUTPUT ${final_path}/.${src}.cu + COMMAND ${CMAKE_COMMAND} -E copy_if_different "${final_path}/${src}.cc" "${final_path}/.${src}.cu" + COMMENT "create hidden file of ${src}.cu") + add_custom_target(${TARGET} ALL DEPENDS .${src}.cu) endforeach() endfunction() add_subdirectory(ir) -if (NOT WIN32) add_subdirectory(details) -endif (NOT WIN32) -# ddim lib +add_subdirectory(fleet) +add_subdirectory(io) +#ddim lib proto_library(framework_proto SRCS framework.proto) +proto_library(data_feed_proto SRCS data_feed.proto) +proto_library(async_executor_proto SRCS data_feed.proto) +proto_library(trainer_desc_proto SRCS trainer_desc.proto data_feed.proto) -cc_library(ddim SRCS ddim.cc DEPS eigen3 boost) +cc_library(ddim SRCS ddim.cc DEPS eigen3 boost enforce) cc_test(ddim_test SRCS ddim_test.cc DEPS ddim) nv_test(dim_test SRCS dim_test.cu DEPS ddim) +cc_test(unroll_array_ops_test SRCS unroll_array_ops_test.cc) cc_library(data_type SRCS data_type.cc DEPS framework_proto ddim device_context) cc_test(data_type_test SRCS data_type_test.cc DEPS data_type place tensor) if(WITH_GPU) @@ -48,10 +43,10 @@ if(WITH_GPU) nv_library(tensor SRCS tensor.cc .tensor_util.cu DEPS place memory data_type device_context) add_dependencies(tensor tensor_util) else() - nv_library(tensor SRCS tensor.cc tensor_util.cu DEPS place memory data_type device_context) + nv_library(tensor SRCS tensor.cc tensor_util.cu DEPS place memory data_type device_context profiler) endif(WIN32) else() - cc_library(tensor SRCS tensor.cc tensor_util.cc DEPS place memory data_type device_context) + cc_library(tensor SRCS tensor.cc tensor_util.cc DEPS place memory data_type device_context profiler) endif() cc_test(tensor_test SRCS tensor_test.cc DEPS tensor) @@ -68,34 +63,38 @@ if(WITH_GPU) else() cc_test(mixed_vector_test SRCS mixed_vector_test.cc DEPS place memory device_context tensor) endif() -if (NOT WIN32) - cc_library(lod_tensor SRCS lod_tensor.cc DEPS ddim place tensor framework_proto recordio version) -else() - cc_library(lod_tensor SRCS lod_tensor.cc DEPS ddim place tensor framework_proto version) -endif (NOT WIN32) +cc_library(lod_tensor SRCS lod_tensor.cc DEPS ddim place tensor framework_proto recordio version) cc_test(lod_tensor_test SRCS lod_tensor_test.cc DEPS lod_tensor memory) nv_test(lod_tensor_gpu_test SRCS lod_tensor_test.cu DEPS lod_tensor) +cc_library(garbage_collector SRCS garbage_collector.cc DEPS device_context memory gflags glog) + cc_library(reader SRCS reader.cc DEPS lod_tensor ddim) cc_test(reader_test SRCS reader_test.cc DEPS reader) -cc_test(variable_test SRCS variable_test.cc) - cc_library(threadpool SRCS threadpool.cc DEPS enforce) cc_test(threadpool_test SRCS threadpool_test.cc DEPS threadpool) -cc_library(scope SRCS scope.cc DEPS glog threadpool) +cc_library(var_type_traits SRCS var_type_traits DEPS lod_tensor selected_rows framework_proto) +if (WITH_GPU) + target_link_libraries(var_type_traits dynload_cuda) +endif() +cc_test(var_type_traits_test SRCS var_type_traits_test.cc DEPS var_type_traits) + +cc_library(scope SRCS scope.cc DEPS glog threadpool xxhash var_type_traits) +cc_library(scope_pool SRCS scope_pool.cc DEPS scope) cc_test(scope_test SRCS scope_test.cc DEPS scope) +cc_test(variable_test SRCS variable_test.cc DEPS tensor var_type_traits) cc_library(data_device_transform SRCS data_device_transform.cc DEPS tensor) nv_test(data_device_transform_test SRCS data_device_transform_test.cu - DEPS operator op_registry device_context math_function) + DEPS operator op_registry device_context math_function scope) if(WITH_GPU) if (WIN32) - # windows treat symbolic file as a real file, which is different with unix - # We create a hidden file and compile it instead of origin source file. +#windows treat symbolic file as a real file, which is different with unix +#We create a hidden file and compile it instead of origin source file. windows_symbolic(hidden_file SRCS data_type_transform.cu) nv_library(data_type_transform SRCS .data_type_transform.cu DEPS tensor) add_dependencies(data_type_transform hidden_file) @@ -122,13 +121,10 @@ cc_test(op_proto_maker_test SRCS op_proto_maker_test.cc DEPS op_proto_maker) cc_library(op_info SRCS op_info.cc DEPS attribute framework_proto) cc_library(shape_inference SRCS shape_inference.cc DEPS ddim attribute device_context) -if (NOT WIN32) -cc_library(operator SRCS operator.cc DEPS op_info device_context tensor scope glog - shape_inference data_transform lod_tensor profiler) -else() +cc_library(transfer_scope_cache SRCS transfer_scope_cache.cc DEPS scope framework_proto device_context) +cc_library(op_kernel_type SRCS op_kernel_type.cc DEPS device_context place) cc_library(operator SRCS operator.cc DEPS op_info device_context tensor scope glog - shape_inference data_transform lod_tensor) -endif(NOT WIN32) + shape_inference data_transform lod_tensor profiler transfer_scope_cache op_kernel_type) cc_test(operator_test SRCS operator_test.cc DEPS operator op_registry device_context) @@ -137,51 +133,94 @@ cc_test(version_test SRCS version_test.cc DEPS version) cc_library(proto_desc SRCS var_desc.cc op_desc.cc block_desc.cc program_desc.cc DEPS shape_inference op_info operator glog version) -cc_library(op_registry SRCS op_registry.cc DEPS op_proto_maker op_info operator glog proto_desc) +cc_library(op_registry SRCS op_registry.cc DEPS op_proto_maker op_info operator glog proto_desc memory_optimize_helper) + nv_test(op_registry_test SRCS op_registry_test.cc DEPS op_registry) -if (NOT WIN32) -py_proto_compile(framework_py_proto SRCS framework.proto) -# Generate an empty __init__.py to make framework_py_proto as a valid python module. +py_proto_compile(framework_py_proto SRCS framework.proto data_feed.proto) +py_proto_compile(trainer_py_proto SRCS trainer_desc.proto data_feed.proto) +#Generate an empty \ + #__init__.py to make framework_py_proto as a valid python module. add_custom_target(framework_py_proto_init ALL COMMAND ${CMAKE_COMMAND} -E touch __init__.py) add_dependencies(framework_py_proto framework_py_proto_init) -add_custom_command(TARGET framework_py_proto POST_BUILD - COMMAND ${CMAKE_COMMAND} -E make_directory ${PADDLE_BINARY_DIR}/python/paddle/fluid/proto - COMMAND cp *.py ${PADDLE_BINARY_DIR}/python/paddle/fluid/proto/ - COMMENT "Copy generated python proto into directory paddle/fluid/proto." - WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}) +if (NOT WIN32) + add_custom_command(TARGET framework_py_proto POST_BUILD + COMMAND ${CMAKE_COMMAND} -E make_directory ${PADDLE_BINARY_DIR}/python/paddle/fluid/proto + COMMAND cp *.py ${PADDLE_BINARY_DIR}/python/paddle/fluid/proto/ + COMMENT "Copy generated python proto into directory paddle/fluid/proto." + WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}) +else(NOT WIN32) + string(REPLACE "/" "\\" proto_dstpath "${PADDLE_BINARY_DIR}/python/paddle/fluid/proto/") + add_custom_command(TARGET framework_py_proto POST_BUILD + COMMAND ${CMAKE_COMMAND} -E make_directory ${PADDLE_BINARY_DIR}/python/paddle/fluid/proto + COMMAND copy /Y *.py ${proto_dstpath} + COMMENT "Copy generated python proto into directory paddle/fluid/proto." + WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}) endif(NOT WIN32) cc_library(lod_rank_table SRCS lod_rank_table.cc DEPS lod_tensor) cc_library(feed_fetch_method SRCS feed_fetch_method.cc DEPS lod_tensor scope glog) +cc_library(variable_helper SRCS variable_helper.cc DEPS lod_tensor) -cc_library(naive_executor SRCS naive_executor.cc DEPS op_registry device_context scope framework_proto glog lod_rank_table feed_fetch_method graph_to_program_pass) +cc_library(naive_executor SRCS naive_executor.cc DEPS op_registry device_context scope framework_proto glog lod_rank_table feed_fetch_method graph_to_program_pass variable_helper) + +if(WITH_NGRAPH) + set(NGRAPH_EXE_DEPS ngraph_engine) +else() + set(NGRAPH_EXE_DEPS) +endif() +cc_library(executor_gc_helper SRCS executor_gc_helper.cc DEPS scope proto_desc operator garbage_collector) if(WITH_DISTRIBUTE) - cc_library(executor SRCS executor.cc DEPS op_registry device_context scope framework_proto glog lod_rank_table feed_fetch_method sendrecvop_grpc cares grpc++_unsecure grpc_unsecure gpr graph_to_program_pass) - set(DISTRIBUTE_COMPILE_FLAGS "-Wno-non-virtual-dtor -Wno-error=non-virtual-dtor -Wno-error=delete-non-virtual-dtor") - set_source_files_properties(executor.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) + cc_library(executor SRCS executor.cc multi_trainer.cc dataset_factory.cc + dist_multi_trainer.cc trainer_factory.cc trainer.cc data_feed_factory.cc + data_feed.cc device_worker.cc hogwild_worker.cc downpour_worker.cc + pull_dense_worker.cc device_worker_factory.cc data_set.cc DEPS op_registry + device_context scope framework_proto trainer_desc_proto glog fs shell fleet_wrapper lodtensor_printer + lod_rank_table feed_fetch_method sendrecvop_rpc ${GLOB_DISTRIBUTE_DEPS} + graph_to_program_pass variable_helper data_feed_proto ${NGRAPH_EXE_DEPS} timer) +set(DISTRIBUTE_COMPILE_FLAGS "-Wno-non-virtual-dtor -Wno-error=non-virtual-dtor -Wno-error=delete-non-virtual-dtor") +set_source_files_properties(executor.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) else() - cc_library(executor SRCS executor.cc DEPS op_registry device_context scope framework_proto glog lod_rank_table feed_fetch_method graph_to_program_pass) + cc_library(executor SRCS executor.cc multi_trainer.cc dataset_factory.cc + dist_multi_trainer.cc trainer_factory.cc trainer.cc data_feed_factory.cc + data_feed.cc device_worker.cc hogwild_worker.cc downpour_worker.cc + pull_dense_worker.cc device_worker_factory.cc data_set.cc DEPS op_registry + device_context scope framework_proto data_feed_proto trainer_desc_proto glog + lod_rank_table fs shell fleet_wrapper lodtensor_printer feed_fetch_method + graph_to_program_pass variable_helper ${NGRAPH_EXE_DEPS} timer data_feed_proto) cc_test(test_naive_executor SRCS naive_executor_test.cc DEPS naive_executor elementwise_add_op) endif() - -if (NOT WIN32) + +target_link_libraries(executor while_op_helper executor_gc_helper) + cc_library(parallel_executor SRCS parallel_executor.cc DEPS - threaded_ssa_graph_executor scope_buffered_ssa_graph_executor + threaded_ssa_graph_executor scope_buffered_ssa_graph_executor parallel_ssa_graph_executor async_ssa_graph_executor graph build_strategy - fast_threaded_ssa_graph_executor) -endif() # NOT WIN32 + fast_threaded_ssa_graph_executor variable_helper) +cc_library(async_executor SRCS async_executor.cc data_feed.cc data_feed_factory.cc + executor_thread_worker.cc multi_trainer.cc dist_multi_trainer.cc + trainer_factory.cc trainer.cc device_worker.cc hogwild_worker.cc + downpour_worker.cc pull_dense_worker.cc device_worker_factory.cc + data_set.cc dataset_factory.cc + DEPS op_registry device_context scope framework_proto + trainer_desc_proto glog lod_rank_table fleet_wrapper lodtensor_printer + feed_fetch_method graph_to_program_pass data_feed_proto + variable_helper timer fs shell) + + +cc_test(data_feed_test SRCS data_feed_test.cc DEPS async_executor) cc_library(prune SRCS prune.cc DEPS framework_proto) cc_test(prune_test SRCS prune_test.cc DEPS op_info prune recurrent_op device_context) cc_test(var_type_inference_test SRCS var_type_inference_test.cc DEPS op_registry proto_desc) +cc_test(inplace_op_inference_test SRCS inplace_op_inference_test.cc DEPS inplace_op_pass op_registry proto_desc op_info memory_optimize_helper pass_builder) cc_library(selected_rows SRCS selected_rows.cc DEPS tensor) cc_test(selected_rows_test SRCS selected_rows_test.cc DEPS selected_rows) -cc_test(op_kernel_type_test SRCS op_kernel_type_test.cc DEPS place device_context framework_proto) +cc_test(op_kernel_type_test SRCS op_kernel_type_test.cc DEPS place device_context framework_proto op_kernel_type) cc_test(cow_ptr_tests SRCS details/cow_ptr_test.cc) cc_test(tuple_test SRCS tuple_test.cc ) @@ -189,3 +228,27 @@ cc_test(tuple_test SRCS tuple_test.cc ) if (NOT WIN32) cc_test(rw_lock_test SRCS rw_lock_test.cc) endif (NOT WIN32) + +cc_library(dlpack_tensor SRCS dlpack_tensor.cc DEPS tensor dlpack) +cc_test(dlpack_tensor_test SRCS dlpack_tensor_test.cc DEPS dlpack_tensor glog) + +# Get the current working branch +execute_process( + COMMAND git rev-parse --abbrev-ref HEAD + WORKING_DIRECTORY ${CMAKE_SOURCE_DIR} + OUTPUT_VARIABLE PADDLE_BRANCH + OUTPUT_STRIP_TRAILING_WHITESPACE + ) + +# Get the latest abbreviated commit hash of the working branch +execute_process( + COMMAND git log -1 --format=%h + WORKING_DIRECTORY ${CMAKE_SOURCE_DIR} + OUTPUT_VARIABLE PADDLE_COMMIT + OUTPUT_STRIP_TRAILING_WHITESPACE + ) + +message(STATUS "commit: ${PADDLE_COMMIT}") +message(STATUS "branch: ${PADDLE_BRANCH}") + +configure_file(commit.h.in commit.h) diff --git a/paddle/fluid/framework/array.h b/paddle/fluid/framework/array.h index be9efcd74924a2050a2fd9ab83059590a1a2a2fd..b53082986882c80a85826f10d5766525f72c0a97 100644 --- a/paddle/fluid/framework/array.h +++ b/paddle/fluid/framework/array.h @@ -15,34 +15,123 @@ #pragma once #include -#include "paddle/fluid/platform/hostdevice.h" +#include "paddle/fluid/framework/unroll_array_ops.h" +#include "paddle/fluid/platform/enforce.h" namespace paddle { namespace framework { + template class Array { - static_assert(N > 0, "The size of array must be larger than 0"); - public: - HOSTDEVICE Array() {} + static constexpr size_t kSize = N; + + HOSTDEVICE inline Array() {} - HOSTDEVICE explicit Array(const T &val) { - for (size_t i = 0; i < N; ++i) data_[i] = val; + template + HOSTDEVICE inline explicit Array(const T &val, Args... args) { + static_assert(N == sizeof...(Args) + 1, "Invalid argument"); + UnrollVarArgsAssign::Run(data_, val, args...); } - HOSTDEVICE const T *Get() const { return data_; } + HOSTDEVICE inline void Fill(const T &val) { + UnrollFillConstant::Run(data_, val); + } - HOSTDEVICE T *GetMutable() { return data_; } + HOSTDEVICE inline const T *Get() const { return data_; } - HOSTDEVICE T &operator[](size_t index) { return data_[index]; } + HOSTDEVICE inline T *GetMutable() { return data_; } - HOSTDEVICE const T &operator[](size_t index) const { return data_[index]; } + HOSTDEVICE inline T &operator[](size_t i) { return *advance(data_, i); } + + // Writing "return data_[i]" would cause compilation warning/error: + // "array subscript is above array bound" in Python 35 CI. + // It seems that it is a false warning of GCC if we do not check the bounds + // of array index. But for better performance, we do not check in operator[] + // like what is in STL. If users want to check the bounds, use at() instead + HOSTDEVICE inline const T &operator[](size_t i) const { + return *advance(data_, i); + } + + HOSTDEVICE inline T &at(size_t i) { +#ifndef __CUDA_ARCH__ + PADDLE_ENFORCE_LT(i, N, "Array index out of bounds"); +#endif + return (*this)[i]; + } + + HOSTDEVICE inline const T &at(size_t i) const { +#ifndef __CUDA_ARCH__ + PADDLE_ENFORCE_LT(i, N, "Array index out of bounds"); +#endif + return (*this)[i]; + } HOSTDEVICE constexpr size_t size() const { return N; } + HOSTDEVICE inline bool operator==(const Array &other) const { + return UnrollCompare::Run(data_, other.data_); + } + + HOSTDEVICE inline bool operator!=(const Array &other) const { + return !(*this == other); + } + private: + template + HOSTDEVICE static inline U *advance(U *ptr, size_t i) { + return ptr + i; + } + T data_[N]; }; +template +class Array { + public: + static constexpr size_t kSize = 0; + + HOSTDEVICE inline Array() {} + + HOSTDEVICE inline void Fill(const T &val) {} + + HOSTDEVICE inline constexpr T *Get() const { return nullptr; } + + // Add constexpr to GetMutable() cause warning in MAC + HOSTDEVICE inline T *GetMutable() { return nullptr; } + + HOSTDEVICE inline T &operator[](size_t) { +#ifdef __CUDA_ARCH__ + static T obj(); + return obj; +#else + PADDLE_THROW("Array has no element"); +#endif + } + + HOSTDEVICE inline const T &operator[](size_t) const { +#ifdef __CUDA_ARCH__ + static const T obj(); + return obj; +#else + PADDLE_THROW("Array has no element"); +#endif + } + + HOSTDEVICE inline T &at(size_t i) { return (*this)[i]; } + + HOSTDEVICE inline const T &at(size_t i) const { return (*this)[i]; } + + HOSTDEVICE constexpr size_t size() const { return 0; } + + HOSTDEVICE constexpr bool operator==(const Array &other) const { + return true; + } + + HOSTDEVICE constexpr bool operator!=(const Array &other) const { + return false; + } +}; + } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/async_executor.cc b/paddle/fluid/framework/async_executor.cc new file mode 100644 index 0000000000000000000000000000000000000000..89153d82d078b53d8d5582f0a38d3dafe21cc7eb --- /dev/null +++ b/paddle/fluid/framework/async_executor.cc @@ -0,0 +1,171 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/async_executor.h" +#include "google/protobuf/io/zero_copy_stream_impl.h" +#include "google/protobuf/message.h" +#include "google/protobuf/text_format.h" + +#include "gflags/gflags.h" +#include "paddle/fluid/framework/data_feed_factory.h" +#include "paddle/fluid/framework/executor_thread_worker.h" +#include "paddle/fluid/framework/feed_fetch_method.h" +#include "paddle/fluid/framework/feed_fetch_type.h" +#include "paddle/fluid/framework/lod_rank_table.h" +#include "paddle/fluid/framework/lod_tensor_array.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/reader.h" +#include "paddle/fluid/framework/trainer.h" +#include "paddle/fluid/framework/trainer_desc.pb.h" +#include "paddle/fluid/framework/trainer_factory.h" +#include "paddle/fluid/inference/io.h" +#include "paddle/fluid/platform/place.h" +#include "paddle/fluid/pybind/pybind.h" + +namespace paddle { +namespace framework { +AsyncExecutor::AsyncExecutor(Scope* scope, const platform::Place& place) + : root_scope_(scope), place_(place) {} + +void AsyncExecutor::InitServer(const std::string& dist_desc, int index) { + fleet_ptr_ = FleetWrapper::GetInstance(); + fleet_ptr_->InitServer(dist_desc, index); +} + +void AsyncExecutor::InitWorker(const std::string& dist_desc, + const std::vector& host_sign_list, + int node_num, int index) { + fleet_ptr_ = FleetWrapper::GetInstance(); + fleet_ptr_->InitWorker(dist_desc, host_sign_list, node_num, index); +} + +uint64_t AsyncExecutor::StartServer() { return fleet_ptr_->RunServer(); } + +void AsyncExecutor::StopServer() { fleet_ptr_->StopServer(); } + +void AsyncExecutor::GatherServers(const std::vector& host_sign_list, + int node_num) { + fleet_ptr_->GatherServers(host_sign_list, node_num); +} + +// todo InitModel +void AsyncExecutor::InitModel() {} + +// todo SaveModel +void AsyncExecutor::SaveModel(const std::string& path) {} + +void AsyncExecutor::RunFromFile(const ProgramDesc& main_program, + const std::string& data_feed_desc_str, + const std::vector& filelist, + const int thread_num, + const std::vector& fetch_var_names, + const std::string& mode, const bool debug) { + std::vector threads; + + auto& block = main_program.Block(0); + for (auto var_name : fetch_var_names) { + auto var_desc = block.FindVar(var_name); + PADDLE_ENFORCE_NOT_NULL(var_desc, "%s is not found.", var_name); + auto shapes = var_desc->GetShape(); + PADDLE_ENFORCE(shapes[shapes.size() - 1] == 1, + "var %s: Fetched var has wrong shape, " + "only variables with the last dimension size 1 supported", + var_name); + } + + DataFeedDesc data_feed_desc; + google::protobuf::TextFormat::ParseFromString(data_feed_desc_str, + &data_feed_desc); + + actual_thread_num_ = thread_num; + int file_cnt = filelist.size(); + PADDLE_ENFORCE(file_cnt > 0, "File list cannot be empty"); + + if (actual_thread_num_ > file_cnt) { + VLOG(1) << "Thread num = " << thread_num << ", file num = " << file_cnt + << ". Changing thread_num = " << file_cnt; + actual_thread_num_ = file_cnt; + } + + /* + readerDesc: protobuf description for reader initlization + argument: class_name, batch_size, use_slot, queue_size, buffer_size, + padding_index + + reader: + 1) each thread has a reader, reader will read input data and + put it into input queue + 2) each reader has a Next() iterface, that can fetch an instance + from the input queue + */ + // todo: should be factory method for creating datafeed + std::vector> readers; + /* + PrepareReaders(readers, actual_thread_num_, data_feed_desc, filelist); +#ifdef PADDLE_WITH_PSLIB + PrepareDenseThread(mode); +#endif + */ + std::vector> workers; + workers.resize(actual_thread_num_); + for (auto& worker : workers) { +#ifdef PADDLE_WITH_PSLIB + if (mode == "mpi") { + worker.reset(new AsyncExecutorThreadWorker); + } else { + worker.reset(new ExecutorThreadWorker); + } +#else + worker.reset(new ExecutorThreadWorker); +#endif + } + + // prepare thread resource here + /* + for (int thidx = 0; thidx < actual_thread_num_; ++thidx) { + CreateThreads(workers[thidx].get(), main_program, readers[thidx], + fetch_var_names, root_scope_, thidx, debug); + } + */ + + // start executing ops in multiple threads + for (int thidx = 0; thidx < actual_thread_num_; ++thidx) { + if (debug) { + threads.push_back(std::thread(&ExecutorThreadWorker::TrainFilesWithTimer, + workers[thidx].get())); + } else { + threads.push_back( + std::thread(&ExecutorThreadWorker::TrainFiles, workers[thidx].get())); + } + } + + for (auto& th : threads) { + th.join(); + } + // TODO(guru4elephant): we don't need this + /* +#ifdef PADDLE_WITH_PSLIB + if (mode == "mpi") { + _pull_dense_thread->stop(); + } +#endif + */ + VLOG(3) << "start to run from files in async_executor"; + VLOG(3) << "Drop current scope kids"; + root_scope_->DropKids(); + return; +} + +} // end namespace framework +} // end namespace paddle diff --git a/paddle/fluid/framework/async_executor.h b/paddle/fluid/framework/async_executor.h new file mode 100644 index 0000000000000000000000000000000000000000..7b59e1b11ca577d4b03784db50d5fa6ed3d1f12b --- /dev/null +++ b/paddle/fluid/framework/async_executor.h @@ -0,0 +1,94 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include +#include // NOLINT +#include // local_random_engine +#include +#include +#include // NOLINT +#include +#include +#include "paddle/fluid/framework/data_feed.pb.h" +#include "paddle/fluid/framework/data_set.h" +#include "paddle/fluid/framework/executor.h" +#include "paddle/fluid/framework/executor_thread_worker.h" +#include "paddle/fluid/framework/fleet/fleet_wrapper.h" +#include "paddle/fluid/framework/program_desc.h" +#include "paddle/fluid/framework/scope.h" + +namespace paddle { +namespace framework { + +inline double current_realtime() { +#if !defined(_WIN32) + struct timespec tp; + clock_gettime(CLOCK_REALTIME, &tp); + return tp.tv_sec + tp.tv_nsec * 1e-9; +#else + return 0.0; +#endif +} + +inline std::default_random_engine& local_random_engine() { + struct engine_wrapper_t { + std::default_random_engine engine; + engine_wrapper_t() { + static std::atomic x(0); + std::seed_seq sseq = {x++, x++, x++, + static_cast(current_realtime() * 1000)}; + engine.seed(sseq); + } + }; + thread_local engine_wrapper_t r; + return r.engine; +} + +class AsyncExecutor { + public: + AsyncExecutor(Scope* scope, const platform::Place& place); + virtual ~AsyncExecutor() {} + void RunFromFile(const ProgramDesc& main_program, + const std::string& data_feed_desc_str, + const std::vector& filelist, + const int thread_num, + const std::vector& fetch_var_names, + const std::string& mode, const bool debug); + + // TODO(guru4elephant): make init server decoupled from executor + void InitServer(const std::string& dist_desc, int index); + void InitWorker(const std::string& dist_desc, + const std::vector& host_sign_list, int node_num, + int index); + uint64_t StartServer(); + void StopServer(); + void GatherServers(const std::vector& host_sign_list, int node_num); + void InitModel(); + void SaveModel(const std::string& path); + + public: + std::shared_ptr fleet_ptr_; + Scope* root_scope_; + platform::Place place_; + + private: + int actual_thread_num_; +}; + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/attribute.cc b/paddle/fluid/framework/attribute.cc index 0dcecb62dba971b48c4f11c0ef47494be40eeea0..fabf2abfc803b8838edb48aa01ab8896799c97ac 100644 --- a/paddle/fluid/framework/attribute.cc +++ b/paddle/fluid/framework/attribute.cc @@ -64,6 +64,13 @@ Attribute GetAttrValue(const proto::OpDesc::Attr& attr_desc) { case proto::AttrType::LONG: { return attr_desc.l(); } + case proto::AttrType::LONGS: { + std::vector val(attr_desc.longs_size()); + for (int i = 0; i < attr_desc.longs_size(); ++i) { + val[i] = attr_desc.longs(i); + } + return val; + } default: PADDLE_THROW("Unsupport attr type %d", attr_desc.type()); } diff --git a/paddle/fluid/framework/attribute.h b/paddle/fluid/framework/attribute.h index 14ca3e96209ed17f12e87fda8506806514698977..67054eccb3397ea40f0fb3e2ff2530ee1ea64736 100644 --- a/paddle/fluid/framework/attribute.h +++ b/paddle/fluid/framework/attribute.h @@ -26,6 +26,113 @@ limitations under the License. */ namespace paddle { namespace framework { + +template +struct ExtractAttribute { + explicit ExtractAttribute(const std::string& attr_name) + : attr_name_(attr_name) {} + + T* operator()(Attribute& attr) const { + T* attr_value = nullptr; + try { + attr_value = &boost::get(attr); + } catch (boost::bad_get& bad_get) { + PADDLE_THROW("Cannot get attribute %s by type %s, its type is %s", + attr_name_, paddle::platform::demangle(typeid(T).name()), + paddle::platform::demangle(attr.type().name())); + } + return attr_value; + } + + const std::string& attr_name_; +}; + +// special handle bool +// FIXME(yuyang18): Currently we cast bool into int in python binding. It is +// hard to change the logic there. In another way, we should correct handle +// if the user set `some_flag=1`. +// +// FIX ME anytime if there is a better solution. +template <> +struct ExtractAttribute { + explicit ExtractAttribute(const std::string& attr_name) + : attr_name_(attr_name) {} + + bool* operator()(Attribute& attr) const { + if (attr.type() == typeid(int)) { // NOLINT + int val = boost::get(attr); + attr = static_cast(val); + } else if (attr.type() == typeid(float)) { // NOLINT + float val = boost::get(attr); + attr = static_cast(val); + } + bool* attr_value = nullptr; + try { + attr_value = &boost::get(attr); + } catch (boost::bad_get& bad_get) { + PADDLE_THROW("Cannot get attribute %s by type bool, its type is %s", + attr_name_, paddle::platform::demangle(attr.type().name())); + } + return attr_value; + } + + const std::string& attr_name_; +}; + +template <> +struct ExtractAttribute { + explicit ExtractAttribute(const std::string& attr_name) + : attr_name_(attr_name) {} + + int64_t* operator()(Attribute& attr) const { + if (attr.type() == typeid(int)) { // NOLINT + int val = boost::get(attr); + attr = static_cast(val); + } else if (attr.type() == typeid(float)) { // NOLINT + int val = boost::get(attr); + attr = static_cast(val); + } + int64_t* attr_value = nullptr; + try { + attr_value = &boost::get(attr); + } catch (boost::bad_get& bad_get) { + PADDLE_THROW("Cannot get attribute %s by type int64_t, its type is %s", + attr_name_, paddle::platform::demangle(attr.type().name())); + } + return attr_value; + } + + const std::string& attr_name_; +}; + +template <> +struct ExtractAttribute> { + explicit ExtractAttribute(const std::string& attr_name) + : attr_name_(attr_name) {} + + std::vector* operator()(Attribute& attr) const { + if (attr.type() == typeid(std::vector)) { // NOLINT + std::vector val = boost::get>(attr); + std::vector vec(val.begin(), val.end()); + attr = vec; + } else if (attr.type() == typeid(std::vector)) { // NOLINT + std::vector val = boost::get>(attr); + std::vector vec(val.begin(), val.end()); + attr = vec; + } + std::vector* attr_value = nullptr; + try { + attr_value = &boost::get>(attr); + } catch (boost::bad_get& bad_get) { + PADDLE_THROW("Cannot get attribute %s by type int64_t, its type is %s", + attr_name_, paddle::platform::demangle(attr.type().name())); + } + return attr_value; + } + + const std::string& attr_name_; +}; + template inline proto::AttrType AttrTypeID() { Attribute tmp = T(); @@ -42,7 +149,11 @@ class AttrReader { inline const T& Get(const std::string& name) const { PADDLE_ENFORCE(attrs_.count(name) != 0, "%s should be in AttributeMap", name); - return boost::get(attrs_.at(name)); + + Attribute& attr = const_cast(attrs_.at(name)); + ExtractAttribute extract_attr(name); + T* attr_value = extract_attr(attr); + return *attr_value; } private: @@ -54,7 +165,7 @@ template class GreaterThanChecker { public: explicit GreaterThanChecker(T lower_bound) : lower_bound_(lower_bound) {} - void operator()(T& value) const { + void operator()(const T& value) const { PADDLE_ENFORCE(value > lower_bound_, "larger_than check fails."); } @@ -66,7 +177,7 @@ template class EqualGreaterThanChecker { public: explicit EqualGreaterThanChecker(T lower_bound) : lower_bound_(lower_bound) {} - void operator()(T& value) const { + void operator()(const T& value) const { PADDLE_ENFORCE_GE(value, lower_bound_, "equal_larger_than check fails."); } @@ -82,7 +193,7 @@ class DefaultValueSetter { public: explicit DefaultValueSetter(T default_value) : default_value_(default_value) {} - void operator()(T& value) const { value = default_value_; } + void operator()(T* value) const { *value = default_value_; } private: T default_value_; @@ -92,7 +203,7 @@ template class EnumInContainer { public: explicit EnumInContainer(const std::unordered_set& c) : container_(c) {} - void operator()(T& val) const { + void operator()(const T& val) const { PADDLE_ENFORCE(container_.find(val) != container_.end(), "Value %s is not in enum container %s", val, ContainerDebugString()); @@ -117,89 +228,12 @@ class EnumInContainer { std::unordered_set container_; }; -template -struct ExtractAttribute { - explicit ExtractAttribute(const std::string& attr_name) - : attr_name_(attr_name) {} - - T* operator()(Attribute& attr) const { - T* attr_value = nullptr; - try { - attr_value = &boost::get(attr); - } catch (boost::bad_get& bad_get) { - PADDLE_THROW("Cannot get attribute %s by type %s, its type is %s", - attr_name_, paddle::platform::demangle(typeid(T).name()), - paddle::platform::demangle(attr.type().name())); - } - return attr_value; - } - - const std::string& attr_name_; -}; - -// special handle bool -// FIXME(yuyang18): Currently we cast bool into int in python binding. It is -// hard to change the logic there. In another way, we should correct handle -// if the user set `some_flag=1`. -// -// FIX ME anytime if there is a better solution. -template <> -struct ExtractAttribute { - explicit ExtractAttribute(const std::string& attr_name) - : attr_name_(attr_name) {} - - bool* operator()(Attribute& attr) const { - if (attr.type() == typeid(int)) { // NOLINT - int val = boost::get(attr); - attr = static_cast(val); - } else if (attr.type() == typeid(float)) { // NOLINT - float val = boost::get(attr); - attr = static_cast(val); - } - bool* attr_value = nullptr; - try { - attr_value = &boost::get(attr); - } catch (boost::bad_get& bad_get) { - PADDLE_THROW("Cannot get attribute %s by type bool, its type is %s", - attr_name_, paddle::platform::demangle(attr.type().name())); - } - return attr_value; - } - - const std::string& attr_name_; -}; - -template <> -struct ExtractAttribute { - explicit ExtractAttribute(const std::string& attr_name) - : attr_name_(attr_name) {} - - int64_t* operator()(Attribute& attr) const { - if (attr.type() == typeid(int)) { // NOLINT - int val = boost::get(attr); - attr = static_cast(val); - } else if (attr.type() == typeid(float)) { // NOLINT - int val = boost::get(attr); - attr = static_cast(val); - } - int64_t* attr_value = nullptr; - try { - attr_value = &boost::get(attr); - } catch (boost::bad_get& bad_get) { - PADDLE_THROW("Cannot get attribute %s by type int64_t, its type is %s", - attr_name_, paddle::platform::demangle(attr.type().name())); - } - return attr_value; - } - - const std::string& attr_name_; -}; - // check whether a certain attribute fit its limits // an attribute can have more than one limits template class TypedAttrChecker { - typedef std::function ValueChecker; + typedef std::function DefaultValueChecker; + typedef std::function ValueChecker; public: explicit TypedAttrChecker(const std::string& attr_name) @@ -235,17 +269,17 @@ class TypedAttrChecker { return *this; } - void operator()(AttributeMap& attr_map) const { - if (!attr_map.count(attr_name_)) { + void operator()(AttributeMap* attr_map) const { + if (!attr_map->count(attr_name_)) { // user do not set this attr PADDLE_ENFORCE(!default_value_setter_.empty(), "Attribute '%s' is required!", attr_name_); // default_value_setter_ has no more than one element T val; - (default_value_setter_[0])(val); - attr_map[attr_name_] = val; + (default_value_setter_[0])(&val); + (*attr_map)[attr_name_] = val; } - Attribute& attr = attr_map.at(attr_name_); + Attribute& attr = attr_map->at(attr_name_); ExtractAttribute extract_attr(attr_name_); T* attr_value = extract_attr(attr); for (const auto& checker : value_checkers_) { @@ -256,12 +290,12 @@ class TypedAttrChecker { private: std::string attr_name_; std::vector value_checkers_; - std::vector default_value_setter_; + std::vector default_value_setter_; }; // check whether op's all attributes fit their own limits class OpAttrChecker { - typedef std::function AttrChecker; + typedef std::function AttrChecker; public: template @@ -271,7 +305,7 @@ class OpAttrChecker { return *(checker.target>()); } - void Check(AttributeMap& attr_map) const { + void Check(AttributeMap* attr_map) const { for (const auto& checker : attr_checkers_) { checker(attr_map); } diff --git a/paddle/fluid/framework/block_desc.cc b/paddle/fluid/framework/block_desc.cc index f537e4b9e569dd4c513ac0efde7240833bcf04b6..0b7aaf11746d1931e10ad7e5368d9e053092500e 100644 --- a/paddle/fluid/framework/block_desc.cc +++ b/paddle/fluid/framework/block_desc.cc @@ -13,7 +13,11 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/framework/block_desc.h" + #include +#include +#include + #include "paddle/fluid/framework/operator.h" #include "paddle/fluid/framework/program_desc.h" @@ -155,6 +159,16 @@ void BlockDesc::RemoveOp(size_t s, size_t e) { ops_.erase(ops_.begin() + s, ops_.begin() + e); } +void BlockDesc::RemoveOpInternal(const OpDesc *op_desc) { + // TODO(minqiyang): make this faster + for (auto it = ops_.begin(); it != ops_.end(); ++it) { + if (it->get() == op_desc) { + ops_.erase(it); + break; + } + } +} + std::vector BlockDesc::AllOps() const { std::vector res; for (const auto &op : ops_) { diff --git a/paddle/fluid/framework/block_desc.h b/paddle/fluid/framework/block_desc.h index 960ca39e1eadd3c064beb0e2c1342a406c4f0b6a..5c6e421516269a9b9865605400efa772f944a96f 100644 --- a/paddle/fluid/framework/block_desc.h +++ b/paddle/fluid/framework/block_desc.h @@ -93,6 +93,8 @@ class BlockDesc { */ void RemoveOp(size_t s, size_t e); + void RemoveOpInternal(const OpDesc *op_desc); + void RemoveVar(const std::string &name) { vars_.erase(name); } std::vector AllOps() const; diff --git a/paddle/fluid/framework/blocking_queue.h b/paddle/fluid/framework/blocking_queue.h index a19558c0ae59005bee575e8c469c7f95d8780ab1..cc5b4e8c4b8e114668f472ea2af9de96835720d0 100644 --- a/paddle/fluid/framework/blocking_queue.h +++ b/paddle/fluid/framework/blocking_queue.h @@ -33,6 +33,14 @@ class BlockingQueue { cv_.notify_one(); } + void Push(T &&item) { + { + std::lock_guard g(mutex_); + q_.emplace_back(std::move(item)); + } + cv_.notify_one(); + } + template void Extend(const U &items) { { @@ -44,6 +52,17 @@ class BlockingQueue { cv_.notify_all(); } + template + void Extend(U &&items) { + { + std::lock_guard g(mutex_); + for (auto &item : items) { + q_.emplace_back(std::move(item)); + } + } + cv_.notify_all(); + } + std::deque PopAll(size_t ms, bool *timeout) { auto time = std::chrono::system_clock::now() + std::chrono::milliseconds(ms); @@ -64,6 +83,18 @@ class BlockingQueue { return rc; } + void Pop(T *t) { + std::unique_lock lock(mutex_); + cv_.wait(lock, [=] { return !q_.empty(); }); + *t = std::move(q_.front()); + q_.pop_front(); + } + + size_t Size() { + std::lock_guard lock(mutex_); + return q_.size(); + } + private: std::mutex mutex_; std::condition_variable cv_; diff --git a/paddle/fluid/framework/commit.h.in b/paddle/fluid/framework/commit.h.in new file mode 100644 index 0000000000000000000000000000000000000000..3a33ece624443a99083ae29abb70254a5ac40a3d --- /dev/null +++ b/paddle/fluid/framework/commit.h.in @@ -0,0 +1,21 @@ +#pragma once + +#include + +namespace paddle { +namespace framework { + +static std::string paddle_commit() { + return "@PADDLE_COMMIT@"; +} + +static std::string paddle_compile_branch() { + return "@PADDLE_BRANCH@"; +} + +static std::string paddle_version() { + return "@PADDLE_VERSION@"; +} + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/data_device_transform_test.cu b/paddle/fluid/framework/data_device_transform_test.cu index f2c55e533a2747325b1b16fdada37945a8ed3c42..96a2f9250ff928fe58a5339a25c68c9db515522d 100644 --- a/paddle/fluid/framework/data_device_transform_test.cu +++ b/paddle/fluid/framework/data_device_transform_test.cu @@ -17,7 +17,8 @@ limitations under the License. */ #include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/framework/op_info.h" #include "paddle/fluid/framework/op_registry.h" -#include "paddle/fluid/operators/elementwise_op_function.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/operators/elementwise/elementwise_op_function.h" #include "paddle/fluid/operators/math/math_function.h" #include "paddle/fluid/platform/device_context.h" #include "paddle/fluid/platform/init.h" diff --git a/paddle/fluid/framework/data_feed.cc b/paddle/fluid/framework/data_feed.cc new file mode 100644 index 0000000000000000000000000000000000000000..365c80da34eb287f50d2f0dcbf3844001ab43ec8 --- /dev/null +++ b/paddle/fluid/framework/data_feed.cc @@ -0,0 +1,984 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#if defined _WIN32 || defined __APPLE__ +#else +#define _LINUX +#endif + +#include "paddle/fluid/framework/data_feed.h" +#ifdef _LINUX +#include +#endif +#include +#include "gflags/gflags.h" +#include "google/protobuf/io/zero_copy_stream_impl.h" +#include "google/protobuf/message.h" +#include "google/protobuf/text_format.h" +#include "io/fs.h" +#include "io/shell.h" +#include "paddle/fluid/framework/feed_fetch_method.h" +#include "paddle/fluid/framework/feed_fetch_type.h" +#include "paddle/fluid/platform/timer.h" + +namespace paddle { +namespace framework { + +void DataFeed::AddFeedVar(Variable* var, const std::string& name) { + CheckInit(); + for (size_t i = 0; i < use_slots_.size(); ++i) { + if (name == use_slots_[i]) { + feed_vec_[i] = var->GetMutable(); + } + } +} + +bool DataFeed::SetFileList(const std::vector& files) { + std::unique_lock lock(*mutex_for_pick_file_); + CheckInit(); + // Do not set finish_set_filelist_ flag, + // since a user may set file many times after init reader + filelist_.assign(files.begin(), files.end()); + + finish_set_filelist_ = true; + return true; +} + +void DataFeed::SetBatchSize(int batch_size) { + PADDLE_ENFORCE(batch_size > 0, "Illegal batch size: %d.", batch_size); + default_batch_size_ = batch_size; +} + +bool DataFeed::PickOneFile(std::string* filename) { + PADDLE_ENFORCE(mutex_for_pick_file_ != nullptr, + "should call SetFileListMutex before PickOneFile"); + PADDLE_ENFORCE(file_idx_ != nullptr, + "should call SetFileListIndex before PickOneFile"); + std::unique_lock lock(*mutex_for_pick_file_); + if (*file_idx_ == filelist_.size()) { + VLOG(3) << "DataFeed::PickOneFile no more file to pick"; + return false; + } + VLOG(3) << "file_idx_=" << *file_idx_; + *filename = filelist_[(*file_idx_)++]; + return true; +} + +void DataFeed::CheckInit() { + PADDLE_ENFORCE(finish_init_, "Initialization did not succeed."); +} + +void DataFeed::CheckSetFileList() { + PADDLE_ENFORCE(finish_set_filelist_, "Set filelist did not succeed."); +} + +void DataFeed::CheckStart() { + PADDLE_ENFORCE(finish_start_, "Datafeed has not started running yet."); +} + +template +void PrivateQueueDataFeed::SetQueueSize(int queue_size) { + PADDLE_ENFORCE(queue_size > 0, "Illegal queue size: %d.", queue_size); + queue_size_ = queue_size; + queue_ = std::unique_ptr>( + new paddle::operators::reader::BlockingQueue(queue_size_)); +} + +template +bool PrivateQueueDataFeed::Start() { + CheckSetFileList(); + read_thread_ = std::thread(&PrivateQueueDataFeed::ReadThread, this); + read_thread_.detach(); + + finish_start_ = true; + return true; +} + +template +void PrivateQueueDataFeed::ReadThread() { +#ifdef _LINUX + std::string filename; + while (PickOneFile(&filename)) { + int err_no = 0; + fp_ = fs_open_read(filename, &err_no, pipe_command_); + __fsetlocking(&*fp_, FSETLOCKING_BYCALLER); + T instance; + while (ParseOneInstanceFromPipe(&instance)) { + queue_->Send(instance); + } + } + queue_->Close(); +#endif +} + +template +int PrivateQueueDataFeed::Next() { +#ifdef _LINUX + CheckStart(); + int index = 0; + T instance; + T ins_vec; + while (index < default_batch_size_) { + if (!queue_->Receive(&instance)) { + break; + } + AddInstanceToInsVec(&ins_vec, instance, index++); + } + batch_size_ = index; + if (batch_size_ != 0) { + PutToFeedVec(ins_vec); + } + return batch_size_; +#else + return 0; +#endif +} + +// explicit instantiation +template class PrivateQueueDataFeed>; + +template +InMemoryDataFeed::InMemoryDataFeed() { + cur_channel_ = 0; + shuffled_ins_ = std::make_shared>(); + shuffled_ins_out_ = std::make_shared>(); + fleet_send_batch_size_ = 80000; // hard code here + memory_data_ = nullptr; + mutex_for_update_memory_data_ = nullptr; + this->file_idx_ = nullptr; + this->mutex_for_pick_file_ = nullptr; +} + +template +bool InMemoryDataFeed::Start() { +#ifdef _LINUX + DataFeed::CheckSetFileList(); + if (shuffled_ins_->Size() == 0 && shuffled_ins_out_->Size() == 0) { + FillMemoryDataToChannel(); + } +#endif + DataFeed::finish_start_ = true; + return true; +} + +template +int InMemoryDataFeed::Next() { +#ifdef _LINUX + DataFeed::CheckStart(); + std::shared_ptr> in_channel = nullptr; + std::shared_ptr> out_channel = nullptr; + if (cur_channel_ == 0) { + in_channel = shuffled_ins_; + out_channel = shuffled_ins_out_; + } else { + in_channel = shuffled_ins_out_; + out_channel = shuffled_ins_; + } + CHECK(in_channel != nullptr); + CHECK(out_channel != nullptr); + VLOG(3) << "in_channel size=" << in_channel->Size() + << ", out_channel size=" << out_channel->Size() + << ", thread_id=" << thread_id_; + int index = 0; + T instance; + T ins_vec; + while (index < DataFeed::default_batch_size_) { + if (in_channel->Size() == 0) { + break; + } + in_channel->Pop(&instance); + + AddInstanceToInsVec(&ins_vec, instance, index++); + out_channel->Push(std::move(instance)); + } + DataFeed::batch_size_ = index; + VLOG(3) << "batch_size_=" << DataFeed::batch_size_ + << ", thread_id=" << thread_id_; + if (DataFeed::batch_size_ != 0) { + PutToFeedVec(ins_vec); + } else { + cur_channel_ = 1 - cur_channel_; + } + return DataFeed::batch_size_; +#else + return 0; +#endif +} + +template +void InMemoryDataFeed::SetMemoryData(void* memory_data) { + memory_data_ = static_cast*>(memory_data); +} + +template +void InMemoryDataFeed::SetMemoryDataMutex(std::mutex* mutex) { + mutex_for_update_memory_data_ = mutex; +} + +template +void InMemoryDataFeed::SetThreadId(int thread_id) { + thread_id_ = thread_id; +} + +template +void InMemoryDataFeed::SetThreadNum(int thread_num) { + thread_num_ = thread_num; +} + +template +void InMemoryDataFeed::SetTrainerNum(int trainer_num) { + trainer_num_ = trainer_num; +} + +template +void InMemoryDataFeed::SetFleetSendBatchSize(int64_t size) { + fleet_send_batch_size_ = size; +} + +template +void InMemoryDataFeed::PutInsToChannel(const std::string& ins_str) { +#ifdef _LINUX + std::vector ins; + DeserializeIns(&ins, ins_str); + shuffled_ins_->Extend(std::move(ins)); + VLOG(3) << "PutInsToChannel put ins num=" << ins.size() + << " to channel, channel size=" << shuffled_ins_->Size() + << " thread_id=" << thread_id_; +#endif +} + +template +void InMemoryDataFeed::FillMemoryDataToChannel() { +#ifdef _LINUX + VLOG(3) << "FillMemoryDataToChannel, thread_id=" << thread_id_; + auto interval = GetMemoryDataInterval(); + VLOG(3) << "memory data size=" << memory_data_->size() + << ", fill data from [" << interval.first << ", " << interval.second + << "), thread_id=" << thread_id_; + for (int64_t i = interval.first; i < interval.second; ++i) { + T& t = (*memory_data_)[i]; + shuffled_ins_->Push(std::move(t)); + } +#endif +} + +template +void InMemoryDataFeed::FillChannelToMemoryData() { +#ifdef _LINUX + VLOG(3) << "FillChannelToMemoryData, thread_id=" << thread_id_; + std::vector local_vec; + std::shared_ptr> channel = nullptr; + std::shared_ptr> pre_channel = nullptr; + if (cur_channel_ == 0) { + channel = shuffled_ins_; + pre_channel = shuffled_ins_out_; + } else { + channel = shuffled_ins_out_; + pre_channel = shuffled_ins_; + } + CHECK(channel != nullptr); + CHECK(pre_channel != nullptr); + CHECK_EQ(pre_channel->Size(), 0); + local_vec.resize(channel->Size()); + for (int64_t i = 0; i < local_vec.size(); ++i) { + channel->Pop(&local_vec[i]); + } + VLOG(3) << "local_vec size=" << local_vec.size() + << ", thread_id=" << thread_id_; + { + std::lock_guard g(*mutex_for_update_memory_data_); + VLOG(3) << "before insert, memory_data_ size=" << memory_data_->size() + << ", thread_id=" << thread_id_; + memory_data_->insert(memory_data_->end(), local_vec.begin(), + local_vec.end()); + VLOG(3) << "after insert memory_data_ size=" << memory_data_->size() + << ", thread_id=" << thread_id_; + } + std::vector().swap(local_vec); +#endif +} + +template +void InMemoryDataFeed::LoadIntoMemory() { +#ifdef _LINUX + VLOG(3) << "LoadIntoMemory() begin, thread_id=" << thread_id_; + std::vector local_vec; + std::string filename; + while (DataFeed::PickOneFile(&filename)) { + VLOG(3) << "PickOneFile, filename=" << filename + << ", thread_id=" << thread_id_; + int err_no = 0; + PrivateQueueDataFeed::fp_ = + fs_open_read(filename, &err_no, PrivateQueueDataFeed::pipe_command_); + CHECK(PrivateQueueDataFeed::fp_ != nullptr); + __fsetlocking(&*PrivateQueueDataFeed::fp_, FSETLOCKING_BYCALLER); + T instance; + platform::Timer timeline; + timeline.Start(); + while (ParseOneInstanceFromPipe(&instance)) { + local_vec.push_back(instance); + } + timeline.Pause(); + VLOG(3) << "LoadIntoMemory() read all lines, file=" << filename + << ", cost time=" << timeline.ElapsedSec() + << " seconds, thread_id=" << thread_id_; + { + std::lock_guard lock(*mutex_for_update_memory_data_); + timeline.Start(); + memory_data_->insert(memory_data_->end(), + std::make_move_iterator(local_vec.begin()), + std::make_move_iterator(local_vec.end())); + timeline.Pause(); + VLOG(3) << "LoadIntoMemory() memory_data insert, cost time=" + << timeline.ElapsedSec() << " seconds, thread_id=" << thread_id_; + } + local_vec.clear(); + } + std::vector().swap(local_vec); + VLOG(3) << "LoadIntoMemory() end, thread_id=" << thread_id_; +#endif +} + +template +void InMemoryDataFeed::LocalShuffle() { +#ifdef _LINUX + VLOG(3) << "LocalShuffle() begin, thread_id=" << thread_id_; + FillMemoryDataToChannel(); + VLOG(3) << "LocalShuffle() end, thread_id=" << thread_id_; +#endif +} + +template +void InMemoryDataFeed::GlobalShuffle() { +#ifdef _LINUX + VLOG(3) << "GlobalShuffle() begin, thread_id=" << thread_id_; + auto fleet_ptr = FleetWrapper::GetInstance(); + std::vector> send_vec(trainer_num_); + std::vector send_index(trainer_num_); + uint64_t reserve_len = fleet_send_batch_size_ / trainer_num_; + for (auto& vec : send_vec) { + vec.reserve(reserve_len); + } + for (int i = 0; i < trainer_num_; ++i) { + send_index[i] = i; + } + std::vector> total_status; + auto interval = GetMemoryDataInterval(); + VLOG(3) << "global shuffle data from [" << interval.first << ", " + << interval.second << "), thread_id=" << thread_id_; + for (int64_t i = interval.first; i < interval.second; ++i) { + // if get ins id, can also use hash + // std::string ins_id = memory_data_[i].ins_id; + int64_t random_num = rand_r(&rand_seed); + int64_t node_id = random_num % trainer_num_; + send_vec[node_id].push_back(&((*memory_data_)[i])); + if (i % fleet_send_batch_size_ == 0 && i != 0) { + // shuffle the sequence of sending to avoid network timeout error + std::random_shuffle(send_index.begin(), send_index.end()); + for (int index = 0; index < send_index.size(); ++index) { + int j = send_index[index]; + std::string send_str; + SerializeIns(send_vec[j], &send_str); + VLOG(3) << "send str_length=" << send_str.length() + << ", ins num=" << send_vec[j].size() << " to node_id=" << j + << ", thread_id=" << thread_id_; + auto ret = fleet_ptr->SendClientToClientMsg(0, j, send_str); + VLOG(3) << "end send, thread_id=" << thread_id_; + send_vec[j].clear(); + total_status.push_back(std::move(ret)); + } + } + } + // shuffle the sequence of sending to avoid network timeout error + std::random_shuffle(send_index.begin(), send_index.end()); + for (int index = 0; index < send_index.size(); ++index) { + int j = send_index[index]; + if (send_vec[j].size() != 0) { + std::string send_str; + SerializeIns(send_vec[j], &send_str); + VLOG(3) << "send str_length=" << send_str.length() << " to node_id=" << j + << ", thread_id=" << thread_id_; + auto ret = fleet_ptr->SendClientToClientMsg(0, j, send_str); + VLOG(3) << "end send, thread_id=" << thread_id_; + total_status.push_back(std::move(ret)); + } + std::vector().swap(send_vec[j]); + } + for (auto& t : total_status) { + t.wait(); + } + VLOG(3) << "GlobalShuffle() end, thread_id=" << thread_id_; +#endif +} + +template +std::pair InMemoryDataFeed::GetMemoryDataInterval() { + int64_t start = 0; + int64_t end = 0; + int64_t size = memory_data_->size(); + for (int64_t i = 0; i <= static_cast(thread_id_); ++i) { + int64_t len = size / static_cast(thread_num_) + + (i < (size % static_cast(thread_num_))); + start = end; + end += len; + } + return std::make_pair(start, end); +} + +// explicit instantiation +template class InMemoryDataFeed>; + +void MultiSlotDataFeed::Init( + const paddle::framework::DataFeedDesc& data_feed_desc) { + finish_init_ = false; + finish_set_filelist_ = false; + finish_start_ = false; + + PADDLE_ENFORCE(data_feed_desc.has_multi_slot_desc(), + "Multi_slot_desc has not been set."); + paddle::framework::MultiSlotDesc multi_slot_desc = + data_feed_desc.multi_slot_desc(); + SetBatchSize(data_feed_desc.batch_size()); + SetQueueSize(data_feed_desc.batch_size()); + size_t all_slot_num = multi_slot_desc.slots_size(); + all_slots_.resize(all_slot_num); + all_slots_type_.resize(all_slot_num); + use_slots_index_.resize(all_slot_num); + use_slots_.clear(); + use_slots_is_dense_.clear(); + for (size_t i = 0; i < all_slot_num; ++i) { + const auto& slot = multi_slot_desc.slots(i); + all_slots_[i] = slot.name(); + all_slots_type_[i] = slot.type(); + use_slots_index_[i] = slot.is_used() ? use_slots_.size() : -1; + if (slot.is_used()) { + use_slots_.push_back(all_slots_[i]); + use_slots_is_dense_.push_back(slot.is_dense()); + std::vector local_shape; + if (slot.is_dense()) { + // for batch size holder if is_dense + if (slot.shape(0) > 0) { + local_shape.push_back(0); + } + } + for (size_t i = 0; i < slot.shape_size(); ++i) { + local_shape.push_back(slot.shape(i)); + } + use_slots_shape_.push_back(local_shape); + } + } + feed_vec_.resize(use_slots_.size()); + pipe_command_ = data_feed_desc.pipe_command(); + finish_init_ = true; +} + +void MultiSlotDataFeed::ReadThread() { +#ifdef _LINUX + std::string filename; + while (PickOneFile(&filename)) { + int err_no = 0; + fp_ = fs_open_read(filename, &err_no, pipe_command_); + CHECK(fp_ != nullptr); + __fsetlocking(&*fp_, FSETLOCKING_BYCALLER); + std::vector instance; + int ins_num = 0; + while (ParseOneInstanceFromPipe(&instance)) { + ins_num++; + queue_->Send(instance); + } + VLOG(3) << "filename: " << filename << " inst num: " << ins_num; + } + queue_->Close(); +#endif +} + +bool MultiSlotDataFeed::CheckFile(const char* filename) { +#ifdef _LINUX + CheckInit(); // get info of slots + std::ifstream fin(filename); + if (!fin.good()) { + VLOG(1) << "error: open file<" << filename << "> fail"; + return false; + } + std::string line; + int instance_cout = 0; + std::string all_slots_alias = ""; + for (const auto& alias : all_slots_) { + all_slots_alias += alias + " "; + } + std::string use_slots_alias = ""; + for (const auto& alias : use_slots_) { + use_slots_alias += alias + " "; + } + VLOG(3) << "total slots num: " << all_slots_.size(); + VLOG(3) << "total slots alias: " << all_slots_alias; + VLOG(3) << "used slots num: " << use_slots_.size(); + VLOG(3) << "used slots alias: " << use_slots_alias; + while (getline(fin, line)) { + ++instance_cout; + const char* str = line.c_str(); + char* endptr = const_cast(str); + int len = line.length(); + for (size_t i = 0; i < all_slots_.size(); ++i) { + int num = strtol(endptr, &endptr, 10); + if (num < 0) { + VLOG(0) << "error: the number of ids is a negative number: " << num; + VLOG(0) << "please check line<" << instance_cout << "> in file<" + << filename << ">"; + return false; + } else if (num == 0) { + VLOG(0) + << "error: the number of ids can not be zero, you need " + "padding it in data generator; or if there is something wrong" + " with the data, please check if the data contains unresolvable " + "characters."; + VLOG(0) << "please check line<" << instance_cout << "> in file<" + << filename << ">"; + return false; + } else if (errno == ERANGE || num > INT_MAX) { + VLOG(0) << "error: the number of ids greater than INT_MAX"; + VLOG(0) << "please check line<" << instance_cout << "> in file<" + << filename << ">"; + return false; + } + if (all_slots_type_[i] == "float") { + for (int i = 0; i < num; ++i) { + strtof(endptr, &endptr); + if (errno == ERANGE) { + VLOG(0) << "error: the value is out of the range of " + "representable values for float"; + VLOG(0) << "please check line<" << instance_cout << "> in file<" + << filename << ">"; + return false; + } + if (i + 1 != num && endptr - str == len) { + VLOG(0) << "error: there is a wrong with the number of ids."; + VLOG(0) << "please check line<" << instance_cout << "> in file<" + << filename << ">"; + return false; + } + } + } else if (all_slots_type_[i] == "uint64") { + for (int i = 0; i < num; ++i) { + strtoull(endptr, &endptr, 10); + if (errno == ERANGE) { + VLOG(0) << "error: the value is out of the range of " + "representable values for uint64_t"; + VLOG(0) << "please check line<" << instance_cout << "> in file<" + << filename << ">"; + return false; + } + if (i + 1 != num && endptr - str == len) { + VLOG(0) << "error: there is a wrong with the number of ids."; + VLOG(0) << "please check line<" << instance_cout << "> in file<" + << filename << ">"; + return false; + } + } + } else { + VLOG(0) << "error: this type<" << all_slots_type_[i] + << "> is not supported"; + return false; + } + } + // It may be added '\t' character to the end of the output of reduce + // task when processes data by Hadoop(when the output of the reduce + // task of Hadoop has only one field, it will add a '\t' at the end + // of the line by default, and you can use this option to avoid it: + // `-D mapred.textoutputformat.ignoreseparator=true`), which does + // not affect the correctness of the data. Therefore, it should be + // judged that the data is not normal when the end of each line of + // data contains characters which are not spaces. + while (endptr - str != len) { + if (!isspace(*(endptr++))) { + VLOG(0) + << "error: there is some extra characters at the end of the line."; + VLOG(0) << "please check line<" << instance_cout << "> in file<" + << filename << ">"; + return false; + } + } + } + VLOG(3) << "instances cout: " << instance_cout; + VLOG(3) << "The file format is correct"; +#endif + return true; +} + +bool MultiSlotDataFeed::ParseOneInstanceFromPipe( + std::vector* instance) { +#ifdef _LINUX + thread_local string::LineFileReader reader; + + if (!reader.getline(&*(fp_.get()))) { + return false; + } else { + int use_slots_num = use_slots_.size(); + instance->resize(use_slots_num); + + const char* str = reader.get(); + std::string line = std::string(str); + // VLOG(3) << line; + char* endptr = const_cast(str); + int pos = 0; + for (size_t i = 0; i < use_slots_index_.size(); ++i) { + int idx = use_slots_index_[i]; + int num = strtol(&str[pos], &endptr, 10); + PADDLE_ENFORCE( + num, + "The number of ids can not be zero, you need padding " + "it in data generator; or if there is something wrong with " + "the data, please check if the data contains unresolvable " + "characters.\nplease check this error line: %s", + str); + if (idx != -1) { + (*instance)[idx].Init(all_slots_type_[i]); + if ((*instance)[idx].GetType()[0] == 'f') { // float + for (int j = 0; j < num; ++j) { + float feasign = strtof(endptr, &endptr); + (*instance)[idx].AddValue(feasign); + } + } else if ((*instance)[idx].GetType()[0] == 'u') { // uint64 + for (int j = 0; j < num; ++j) { + uint64_t feasign = (uint64_t)strtoull(endptr, &endptr, 10); + (*instance)[idx].AddValue(feasign); + } + } + pos = endptr - str; + } else { + for (int j = 0; j <= num; ++j) { + // pos = line.find_first_of(' ', pos + 1); + while (line[pos + 1] != ' ') { + pos++; + } + } + } + } + return true; + } +#else + return true; +#endif +} + +bool MultiSlotDataFeed::ParseOneInstance(std::vector* instance) { +#ifdef _LINUX + std::string line; + if (getline(file_, line)) { + int use_slots_num = use_slots_.size(); + instance->resize(use_slots_num); + // parse line + const char* str = line.c_str(); + char* endptr = const_cast(str); + int pos = 0; + for (size_t i = 0; i < use_slots_index_.size(); ++i) { + int idx = use_slots_index_[i]; + int num = strtol(&str[pos], &endptr, 10); + PADDLE_ENFORCE( + num, + "The number of ids can not be zero, you need padding " + "it in data generator; or if there is something wrong with " + "the data, please check if the data contains unresolvable " + "characters.\nplease check this error line: %s", + str); + + if (idx != -1) { + (*instance)[idx].Init(all_slots_type_[i]); + if ((*instance)[idx].GetType()[0] == 'f') { // float + for (int j = 0; j < num; ++j) { + float feasign = strtof(endptr, &endptr); + (*instance)[idx].AddValue(feasign); + } + } else if ((*instance)[idx].GetType()[0] == 'u') { // uint64 + for (int j = 0; j < num; ++j) { + uint64_t feasign = (uint64_t)strtoull(endptr, &endptr, 10); + (*instance)[idx].AddValue(feasign); + } + } + pos = endptr - str; + } else { + for (int j = 0; j <= num; ++j) { + pos = line.find_first_of(' ', pos + 1); + } + } + } + } else { + return false; + } +#endif + return false; +} + +void MultiSlotDataFeed::AddInstanceToInsVec( + std::vector* ins_vec, + const std::vector& instance, int index) { +#ifdef _LINUX + if (index == 0) { + ins_vec->resize(instance.size()); + for (size_t i = 0; i < instance.size(); ++i) { + (*ins_vec)[i].Init(instance[i].GetType()); + (*ins_vec)[i].InitOffset(); + } + } + + for (size_t i = 0; i < instance.size(); ++i) { + (*ins_vec)[i].AddIns(instance[i]); + } +#endif +} + +void MultiSlotDataFeed::PutToFeedVec( + const std::vector& ins_vec) { +#ifdef _LINUX + for (size_t i = 0; i < use_slots_.size(); ++i) { + const auto& type = ins_vec[i].GetType(); + const auto& offset = ins_vec[i].GetOffset(); + int total_instance = static_cast(offset.back()); + + if (type[0] == 'f') { // float + const auto& feasign = ins_vec[i].GetFloatData(); + float* tensor_ptr = feed_vec_[i]->mutable_data( + {total_instance, 1}, platform::CPUPlace()); + memcpy(tensor_ptr, &feasign[0], total_instance * sizeof(float)); + } else if (type[0] == 'u') { // uint64 + // no uint64_t type in paddlepaddle + const auto& feasign = ins_vec[i].GetUint64Data(); + int64_t* tensor_ptr = feed_vec_[i]->mutable_data( + {total_instance, 1}, platform::CPUPlace()); + memcpy(tensor_ptr, &feasign[0], total_instance * sizeof(int64_t)); + } + + LoD data_lod{offset}; + feed_vec_[i]->set_lod(data_lod); + if (use_slots_is_dense_[i]) { + use_slots_shape_[i][0] = batch_size_; + feed_vec_[i]->Resize(framework::make_ddim(use_slots_shape_[i])); + } + } +#endif +} + +void MultiSlotInMemoryDataFeed::Init( + const paddle::framework::DataFeedDesc& data_feed_desc) { + finish_init_ = false; + finish_set_filelist_ = false; + finish_start_ = false; + + PADDLE_ENFORCE(data_feed_desc.has_multi_slot_desc(), + "Multi_slot_desc has not been set."); + paddle::framework::MultiSlotDesc multi_slot_desc = + data_feed_desc.multi_slot_desc(); + SetBatchSize(data_feed_desc.batch_size()); + SetQueueSize(data_feed_desc.batch_size()); + size_t all_slot_num = multi_slot_desc.slots_size(); + all_slots_.resize(all_slot_num); + all_slots_type_.resize(all_slot_num); + use_slots_index_.resize(all_slot_num); + use_slots_.clear(); + use_slots_is_dense_.clear(); + for (size_t i = 0; i < all_slot_num; ++i) { + const auto& slot = multi_slot_desc.slots(i); + all_slots_[i] = slot.name(); + all_slots_type_[i] = slot.type(); + use_slots_index_[i] = slot.is_used() ? use_slots_.size() : -1; + if (slot.is_used()) { + use_slots_.push_back(all_slots_[i]); + use_slots_is_dense_.push_back(slot.is_dense()); + std::vector local_shape; + if (slot.is_dense()) { + if (slot.shape(0) > 0) { + local_shape.push_back(0); + } + } + for (size_t i = 0; i < slot.shape_size(); ++i) { + local_shape.push_back(slot.shape(i)); + } + use_slots_shape_.push_back(local_shape); + } + } + feed_vec_.resize(use_slots_.size()); + pipe_command_ = data_feed_desc.pipe_command(); + finish_init_ = true; +} + +bool MultiSlotInMemoryDataFeed::ParseOneInstanceFromPipe( + std::vector* instance) { +#ifdef _LINUX + thread_local string::LineFileReader reader; + + if (!reader.getline(&*(fp_.get()))) { + return false; + } else { + int use_slots_num = use_slots_.size(); + instance->resize(use_slots_num); + + const char* str = reader.get(); + std::string line = std::string(str); + // VLOG(3) << line; + char* endptr = const_cast(str); + int pos = 0; + for (size_t i = 0; i < use_slots_index_.size(); ++i) { + int idx = use_slots_index_[i]; + int num = strtol(&str[pos], &endptr, 10); + PADDLE_ENFORCE( + num, + "The number of ids can not be zero, you need padding " + "it in data generator; or if there is something wrong with " + "the data, please check if the data contains unresolvable " + "characters.\nplease check this error line: %s", + str); + if (idx != -1) { + (*instance)[idx].Init(all_slots_type_[i]); + if ((*instance)[idx].GetType()[0] == 'f') { // float + for (int j = 0; j < num; ++j) { + float feasign = strtof(endptr, &endptr); + (*instance)[idx].AddValue(feasign); + } + } else if ((*instance)[idx].GetType()[0] == 'u') { // uint64 + for (int j = 0; j < num; ++j) { + uint64_t feasign = (uint64_t)strtoull(endptr, &endptr, 10); + (*instance)[idx].AddValue(feasign); + } + } + pos = endptr - str; + } else { + for (int j = 0; j <= num; ++j) { + // pos = line.find_first_of(' ', pos + 1); + while (line[pos + 1] != ' ') { + pos++; + } + } + } + } + return true; + } +#else + return false; +#endif +} + +bool MultiSlotInMemoryDataFeed::ParseOneInstance( + std::vector* instance) { +#ifdef _LINUX + std::string line; + if (getline(file_, line)) { + int use_slots_num = use_slots_.size(); + instance->resize(use_slots_num); + VLOG(3) << line; + // parse line + const char* str = line.c_str(); + char* endptr = const_cast(str); + int pos = 0; + for (size_t i = 0; i < use_slots_index_.size(); ++i) { + int idx = use_slots_index_[i]; + int num = strtol(&str[pos], &endptr, 10); + PADDLE_ENFORCE( + num, + "The number of ids can not be zero, you need padding " + "it in data generator; or if there is something wrong with " + "the data, please check if the data contains unresolvable " + "characters.\nplease check this error line: %s", + str); + + if (idx != -1) { + (*instance)[idx].Init(all_slots_type_[i]); + if ((*instance)[idx].GetType()[0] == 'f') { // float + for (int j = 0; j < num; ++j) { + float feasign = strtof(endptr, &endptr); + (*instance)[idx].AddValue(feasign); + } + } else if ((*instance)[idx].GetType()[0] == 'u') { // uint64 + for (int j = 0; j < num; ++j) { + uint64_t feasign = (uint64_t)strtoull(endptr, &endptr, 10); + (*instance)[idx].AddValue(feasign); + } + } + pos = endptr - str; + } else { + for (int j = 0; j <= num; ++j) { + pos = line.find_first_of(' ', pos + 1); + } + } + } + } else { + return false; + } +#endif + return false; +} + +void MultiSlotInMemoryDataFeed::AddInstanceToInsVec( + std::vector* ins_vec, + const std::vector& instance, int index) { +#ifdef _LINUX + if (index == 0) { + ins_vec->resize(instance.size()); + for (size_t i = 0; i < instance.size(); ++i) { + (*ins_vec)[i].Init(instance[i].GetType()); + (*ins_vec)[i].InitOffset(); + } + } + + for (size_t i = 0; i < instance.size(); ++i) { + (*ins_vec)[i].AddIns(instance[i]); + } +#endif +} + +void MultiSlotInMemoryDataFeed::PutToFeedVec( + const std::vector& ins_vec) { +#ifdef _LINUX + for (size_t i = 0; i < use_slots_.size(); ++i) { + const auto& type = ins_vec[i].GetType(); + const auto& offset = ins_vec[i].GetOffset(); + int total_instance = static_cast(offset.back()); + + if (type[0] == 'f') { // float + const auto& feasign = ins_vec[i].GetFloatData(); + float* tensor_ptr = feed_vec_[i]->mutable_data( + {total_instance, 1}, platform::CPUPlace()); + memcpy(tensor_ptr, &feasign[0], total_instance * sizeof(float)); + } else if (type[0] == 'u') { // uint64 + // no uint64_t type in paddlepaddle + const auto& feasign = ins_vec[i].GetUint64Data(); + int64_t* tensor_ptr = feed_vec_[i]->mutable_data( + {total_instance, 1}, platform::CPUPlace()); + memcpy(tensor_ptr, &feasign[0], total_instance * sizeof(int64_t)); + } + + LoD data_lod{offset}; + feed_vec_[i]->set_lod(data_lod); + if (use_slots_is_dense_[i]) { + use_slots_shape_[i][0] = batch_size_; + feed_vec_[i]->Resize(framework::make_ddim(use_slots_shape_[i])); + } + } +#endif +} + +// todo serialize ins in global shuffle +void MultiSlotInMemoryDataFeed::SerializeIns( + const std::vector*>& ins, std::string* str) { + auto fleet_ptr = FleetWrapper::GetInstance(); + fleet_ptr->Serialize(ins, str); +} +// todo deserialize ins in global shuffle +void MultiSlotInMemoryDataFeed::DeserializeIns( + std::vector>* ins, const std::string& str) { + auto fleet_ptr = FleetWrapper::GetInstance(); + fleet_ptr->Deserialize(ins, str); +} + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/data_feed.h b/paddle/fluid/framework/data_feed.h new file mode 100644 index 0000000000000000000000000000000000000000..d098c7858a98c644bd3cad78d3cf1e3b35ca026b --- /dev/null +++ b/paddle/fluid/framework/data_feed.h @@ -0,0 +1,383 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include // NOLINT +#include +#include // NOLINT +#include +#include +#include // NOLINT +#include +#include + +#include "paddle/fluid/framework/blocking_queue.h" +#include "paddle/fluid/framework/data_feed.pb.h" +#include "paddle/fluid/framework/fleet/fleet_wrapper.h" +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/reader.h" +#include "paddle/fluid/framework/variable.h" +#include "paddle/fluid/operators/reader/blocking_queue.h" +#include "paddle/fluid/string/string_helper.h" + +namespace paddle { +namespace framework { + +// DataFeed is the base virtual class for all ohther DataFeeds. +// It is used to read files and parse the data for subsequent trainer. +// Example: +// DataFeed* reader = +// paddle::framework::DataFeedFactory::CreateDataFeed(data_feed_name); +// reader->Init(data_feed_desc); // data_feed_desc is a protobuf object +// reader->SetFileList(filelist); +// const std::vector & use_slot_alias = +// reader->GetUseSlotAlias(); +// for (auto name: use_slot_alias){ // for binding memory +// reader->AddFeedVar(scope->Var(name), name); +// } +// reader->Start(); +// while (reader->Next()) { +// // trainer do something +// } +class DataFeed { + public: + DataFeed() { + mutex_for_pick_file_ = nullptr; + file_idx_ = nullptr; + } + virtual ~DataFeed() {} + virtual void Init(const paddle::framework::DataFeedDesc& data_feed_desc) = 0; + virtual bool CheckFile(const char* filename) { + PADDLE_THROW("This function(CheckFile) is not implemented."); + } + // Set filelist for DataFeed. + // Pay attention that it must init all readers before call this function. + // Otherwise, Init() function will init finish_set_filelist_ flag. + virtual bool SetFileList(const std::vector& files); + virtual bool Start() = 0; + + // The trainer calls the Next() function, and the DataFeed will load a new + // batch to the feed_vec. The return value of this function is the batch + // size of the current batch. + virtual int Next() = 0; + // Get all slots' alias which defined in protofile + virtual const std::vector& GetAllSlotAlias() { + return all_slots_; + } + // Get used slots' alias which defined in protofile + virtual const std::vector& GetUseSlotAlias() { + return use_slots_; + } + // This function is used for binding feed_vec memory + virtual void AddFeedVar(Variable* var, const std::string& name); + + // This function will do nothing at default + virtual void SetMemoryData(void* memory_data) {} + // This function will do nothing at default + virtual void SetMemoryDataMutex(std::mutex* mutex) {} + // This function will do nothing at default + virtual void SetThreadId(int thread_id) {} + // This function will do nothing at default + virtual void SetThreadNum(int thread_num) {} + // This function will do nothing at default + virtual void SetTrainerNum(int trainer_num) {} + // This function will do nothing at default + virtual void SetFleetSendBatchSize(int64_t size) {} + virtual void SetFileListMutex(std::mutex* mutex) { + mutex_for_pick_file_ = mutex; + } + virtual void SetFileListIndex(size_t* file_index) { file_idx_ = file_index; } + virtual void LoadIntoMemory() { + PADDLE_THROW("This function(LoadIntoMemory) is not implemented."); + } + virtual void LocalShuffle() { + PADDLE_THROW("This function(LocalShuffle) is not implemented."); + } + virtual void GlobalShuffle() { + PADDLE_THROW("This function(GlobalShuffle) is not implemented."); + } + // This function will do nothing at default + virtual void FillMemoryDataToChannel() {} + // This function will do nothing at default + virtual void FillChannelToMemoryData() {} + // This function will do nothing at default + virtual void PutInsToChannel(const std::string& ins_str) {} + + protected: + // The following three functions are used to check if it is executed in this + // order: + // Init() -> SetFileList() -> Start() -> Next() + virtual void CheckInit(); + virtual void CheckSetFileList(); + virtual void CheckStart(); + virtual void SetBatchSize( + int batch); // batch size will be set in Init() function + // This function is used to pick one file from the global filelist(thread + // safe). + virtual bool PickOneFile(std::string* filename); + + std::vector filelist_; + size_t* file_idx_; + std::mutex* mutex_for_pick_file_; + + // the alias of used slots, and its order is determined by + // data_feed_desc(proto object) + std::vector use_slots_; + std::vector use_slots_is_dense_; + + // the alias of all slots, and its order is determined by data_feed_desc(proto + // object) + std::vector all_slots_; + std::vector all_slots_type_; + std::vector> use_slots_shape_; + std::vector + use_slots_index_; // -1: not used; >=0: the index of use_slots_ + + // The data read by DataFeed will be stored here + std::vector feed_vec_; + + // the batch size defined by user + int default_batch_size_; + // current batch size + int batch_size_; + + bool finish_init_; + bool finish_set_filelist_; + bool finish_start_; + std::string pipe_command_; +}; + +// PrivateQueueDataFeed is the base virtual class for ohther DataFeeds. +// It use a read-thread to read file and parse data to a private-queue +// (thread level), and get data from this queue when trainer call Next(). +template +class PrivateQueueDataFeed : public DataFeed { + public: + PrivateQueueDataFeed() {} + virtual ~PrivateQueueDataFeed() {} + virtual void Init(const paddle::framework::DataFeedDesc& data_feed_desc) = 0; + virtual bool Start(); + virtual int Next(); + + protected: + // The thread implementation function for reading file and parse. + virtual void ReadThread(); + // This function is used to set private-queue size, and the most + // efficient when the queue size is close to the batch size. + virtual void SetQueueSize(int queue_size); + // The reading and parsing method called in the ReadThread. + virtual bool ParseOneInstance(T* instance) = 0; + virtual bool ParseOneInstanceFromPipe(T* instance) = 0; + // This function is used to put instance to vec_ins + virtual void AddInstanceToInsVec(T* vec_ins, const T& instance, + int index) = 0; + // This function is used to put ins_vec to feed_vec + virtual void PutToFeedVec(const T& ins_vec) = 0; + + // The thread for read files + std::thread read_thread_; + // using ifstream one line and one line parse is faster + // than using fread one buffer and one buffer parse. + // for a 601M real data: + // ifstream one line and one line parse: 6034 ms + // fread one buffer and one buffer parse: 7097 ms + std::ifstream file_; + std::shared_ptr fp_; + size_t queue_size_; + string::LineFileReader reader_; + // The queue for store parsed data + std::unique_ptr> queue_; +}; + +template +class InMemoryDataFeed : public PrivateQueueDataFeed { + public: + InMemoryDataFeed(); + virtual ~InMemoryDataFeed() {} + virtual void Init(const paddle::framework::DataFeedDesc& data_feed_desc) = 0; + virtual bool Start(); + virtual int Next(); + virtual void SetMemoryData(void* memory_data); + virtual void SetMemoryDataMutex(std::mutex* mutex); + virtual void SetThreadId(int thread_id); + virtual void SetThreadNum(int thread_num); + virtual void SetTrainerNum(int trainer_num); + virtual void SetFleetSendBatchSize(int64_t size); + virtual void PutInsToChannel(const std::string& ins_str); + virtual void FillMemoryDataToChannel(); + virtual void FillChannelToMemoryData(); + virtual void LoadIntoMemory(); + virtual void LocalShuffle(); + virtual void GlobalShuffle(); + + protected: + virtual void AddInstanceToInsVec(T* vec_ins, const T& instance, + int index) = 0; + virtual bool ParseOneInstance(T* instance) = 0; + virtual bool ParseOneInstanceFromPipe(T* instance) = 0; + virtual void PutToFeedVec(const T& ins_vec) = 0; + virtual void SerializeIns(const std::vector& ins, std::string* str) = 0; + virtual void DeserializeIns(std::vector* ins, const std::string& str) = 0; + virtual std::pair GetMemoryDataInterval(); + + int thread_id_; + int thread_num_; + int trainer_num_; + uint32_t rand_seed; + std::vector* memory_data_; + std::mutex* mutex_for_update_memory_data_; + // when read ins, we put ins from one channel to the other, + // and when finish reading, we set cur_channel = 1 - cur_channel, + // so if cur_channel=0, all data are in shuffled_ins_, else shuffled_ins_out_ + int cur_channel_; + std::shared_ptr> shuffled_ins_; + std::shared_ptr> shuffled_ins_out_; + int64_t fleet_send_batch_size_; +}; + +// This class define the data type of instance(ins_vec) in MultiSlotDataFeed +class MultiSlotType { + public: + MultiSlotType() {} + ~MultiSlotType() {} + void Init(const std::string& type) { + CheckType(type); + if (type_[0] == 'f') { + float_feasign_.clear(); + } else if (type_[0] == 'u') { + uint64_feasign_.clear(); + } + type_ = type; + } + void InitOffset() { + offset_.resize(1); + // LoDTensor' lod is counted from 0, the size of lod + // is one size larger than the size of data. + offset_[0] = 0; + } + const std::vector& GetOffset() const { return offset_; } + std::vector& MutableOffset() { return offset_; } + void AddValue(const float v) { + CheckFloat(); + float_feasign_.push_back(v); + } + void AddValue(const uint64_t v) { + CheckUint64(); + uint64_feasign_.push_back(v); + } + void AddIns(const MultiSlotType& ins) { + if (ins.GetType()[0] == 'f') { // float + CheckFloat(); + auto& vec = ins.GetFloatData(); + offset_.push_back(offset_.back() + vec.size()); + float_feasign_.insert(float_feasign_.end(), vec.begin(), vec.end()); + } else if (ins.GetType()[0] == 'u') { // uint64 + CheckUint64(); + auto& vec = ins.GetUint64Data(); + offset_.push_back(offset_.back() + vec.size()); + uint64_feasign_.insert(uint64_feasign_.end(), vec.begin(), vec.end()); + } + } + const std::vector& GetFloatData() const { return float_feasign_; } + std::vector& MutableFloatData() { return float_feasign_; } + const std::vector& GetUint64Data() const { return uint64_feasign_; } + std::vector& MutableUint64Data() { return uint64_feasign_; } + const std::string& GetType() const { return type_; } + std::string& MutableType() { return type_; } + + std::string DebugString() { + std::stringstream ss; + ss << "\ntype: " << type_ << "\n"; + ss << "offset: "; + ss << "["; + for (const size_t& i : offset_) { + ss << offset_[i] << ","; + } + ss << "]\ndata: ["; + if (type_[0] == 'f') { + for (const float& i : float_feasign_) { + ss << i << ","; + } + } else { + for (const uint64_t& i : uint64_feasign_) { + ss << i << ","; + } + } + ss << "]\n"; + return ss.str(); + } + + private: + void CheckType(const std::string& type) const { + PADDLE_ENFORCE((type == "uint64") || (type == "float"), + "There is no this type<%s>.", type); + } + void CheckFloat() const { + PADDLE_ENFORCE(type_[0] == 'f', "Add %s value to float slot.", type_); + } + void CheckUint64() const { + PADDLE_ENFORCE(type_[0] == 'u', "Add %s value to uint64 slot.", type_); + } + std::vector float_feasign_; + std::vector uint64_feasign_; + std::string type_; + std::vector offset_; +}; + +// This DataFeed is used to feed multi-slot type data. +// The format of multi-slot type data: +// [n feasign_0 feasign_1 ... feasign_n]* +class MultiSlotDataFeed + : public PrivateQueueDataFeed> { + public: + MultiSlotDataFeed() {} + virtual ~MultiSlotDataFeed() {} + virtual void Init(const paddle::framework::DataFeedDesc& data_feed_desc); + virtual bool CheckFile(const char* filename); + // virtual void ReadThread(); + + protected: + virtual void ReadThread(); + virtual void AddInstanceToInsVec(std::vector* vec_ins, + const std::vector& instance, + int index); + virtual bool ParseOneInstance(std::vector* instance); + virtual bool ParseOneInstanceFromPipe(std::vector* instance); + virtual void PutToFeedVec(const std::vector& ins_vec); +}; + +class MultiSlotInMemoryDataFeed + : public InMemoryDataFeed> { + public: + MultiSlotInMemoryDataFeed() {} + virtual ~MultiSlotInMemoryDataFeed() {} + virtual void Init(const paddle::framework::DataFeedDesc& data_feed_desc); + + protected: + virtual void AddInstanceToInsVec(std::vector* vec_ins, + const std::vector& instance, + int index); + virtual bool ParseOneInstance(std::vector* instance); + virtual bool ParseOneInstanceFromPipe(std::vector* instance); + virtual void PutToFeedVec(const std::vector& ins_vec); + virtual void SerializeIns(const std::vector*>& ins, + std::string* str); + virtual void DeserializeIns(std::vector>* ins, + const std::string& str); +}; + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/data_feed.proto b/paddle/fluid/framework/data_feed.proto new file mode 100644 index 0000000000000000000000000000000000000000..03996e0e20a1729ee300a5ad37abc325876930b7 --- /dev/null +++ b/paddle/fluid/framework/data_feed.proto @@ -0,0 +1,33 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ +syntax = "proto2"; +package paddle.framework; + +message Slot { + required string name = 1; + required string type = 2; + optional bool is_dense = 3 [ default = false ]; + optional bool is_used = 4 [ default = false ]; + repeated int32 shape = 5; // we can define N-D Tensor +} + +message MultiSlotDesc { repeated Slot slots = 1; } + +message DataFeedDesc { + optional string name = 1; + optional int32 batch_size = 2 [ default = 32 ]; + optional MultiSlotDesc multi_slot_desc = 3; + optional string pipe_command = 4; + optional int32 thread_num = 5; +} diff --git a/paddle/fluid/framework/data_feed_factory.cc b/paddle/fluid/framework/data_feed_factory.cc new file mode 100644 index 0000000000000000000000000000000000000000..201d6c0d0b96469afbee1c3262e549d9d4e512dd --- /dev/null +++ b/paddle/fluid/framework/data_feed_factory.cc @@ -0,0 +1,68 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/data_feed_factory.h" +#include +#include +#include + +#include "paddle/fluid/framework/data_feed.h" + +namespace paddle { +namespace framework { +typedef std::shared_ptr (*Createdata_feedFunction)(); +typedef std::unordered_map data_feedMap; +data_feedMap g_data_feed_map; + +#define REGISTER_DATAFEED_CLASS(data_feed_class) \ + namespace { \ + std::shared_ptr Creator_##data_feed_class() { \ + return std::shared_ptr(new data_feed_class); \ + } \ + class __Registerer_##data_feed_class { \ + public: \ + __Registerer_##data_feed_class() { \ + g_data_feed_map[#data_feed_class] = &Creator_##data_feed_class; \ + } \ + }; \ + __Registerer_##data_feed_class g_registerer_##data_feed_class; \ + } // namespace + +std::string DataFeedFactory::DataFeedTypeList() { + std::string data_feed_types; + for (auto iter = g_data_feed_map.begin(); iter != g_data_feed_map.end(); + ++iter) { + if (iter != g_data_feed_map.begin()) { + data_feed_types += ", "; + } + data_feed_types += iter->first; + } + return data_feed_types; +} + +std::shared_ptr DataFeedFactory::CreateDataFeed( + std::string data_feed_class) { + if (g_data_feed_map.count(data_feed_class) < 1) { + LOG(WARNING) << "Your DataFeed " << data_feed_class + << "is not supported currently"; + LOG(WARNING) << "Supported DataFeed: " << DataFeedTypeList(); + exit(-1); + } + return g_data_feed_map[data_feed_class](); +} + +REGISTER_DATAFEED_CLASS(MultiSlotDataFeed); +REGISTER_DATAFEED_CLASS(MultiSlotInMemoryDataFeed); +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/data_feed_factory.h b/paddle/fluid/framework/data_feed_factory.h new file mode 100644 index 0000000000000000000000000000000000000000..13678edb0b8d084a0b3016d93f6e1bc32ce0169a --- /dev/null +++ b/paddle/fluid/framework/data_feed_factory.h @@ -0,0 +1,29 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include "paddle/fluid/framework/data_feed.h" + +namespace paddle { +namespace framework { +class DataFeedFactory { + public: + static std::string DataFeedTypeList(); + static std::shared_ptr CreateDataFeed(std::string data_feed_class); +}; +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/data_feed_test.cc b/paddle/fluid/framework/data_feed_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..e1d6246862155509569b25b1fd552c04dcf455df --- /dev/null +++ b/paddle/fluid/framework/data_feed_test.cc @@ -0,0 +1,330 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/data_feed.h" +#include +#include // NOLINT +#include +#include +#include +#include // NOLINT +#include +#include // NOLINT +#include +#include +#include "google/protobuf/io/zero_copy_stream_impl.h" +#include "google/protobuf/text_format.h" +#include "gtest/gtest.h" +#include "paddle/fluid/framework/data_feed_factory.h" +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/scope.h" + +paddle::framework::DataFeedDesc load_datafeed_param_from_file( + const char* filename) { + paddle::framework::DataFeedDesc data_feed_desc; + int file_descriptor = open(filename, O_RDONLY); + PADDLE_ENFORCE(file_descriptor != -1, "Can not open %s.", filename); + google::protobuf::io::FileInputStream fileInput(file_descriptor); + google::protobuf::TextFormat::Parse(&fileInput, &data_feed_desc); + close(file_descriptor); + return data_feed_desc; +} + +const std::vector load_filelist_from_file(const char* filename) { + std::vector filelist; + std::ifstream fin(filename); + PADDLE_ENFORCE(fin.good(), "Can not open %s.", filename); + std::string line; + while (getline(fin, line)) { + filelist.push_back(line); + } + fin.close(); + return filelist; +} + +void GenerateFileForTest(const char* protofile, const char* filelist) { + std::ofstream w_protofile(protofile); + w_protofile << "name: \"MultiSlotDataFeed\"\n" + "batch_size: 2\n" + "multi_slot_desc {\n" + " slots {\n" + " name: \"uint64_sparse_slot\"\n" + " type: \"uint64\"\n" + " is_dense: false\n" + " is_used: true\n" + " }\n" + " slots {\n" + " name: \"float_sparse_slot\"\n" + " type: \"float\"\n" + " is_dense: false\n" + " is_used: true\n" + " }\n" + " slots {\n" + " name: \"uint64_dense_slot\"\n" + " type: \"uint64\"\n" + " is_dense: true\n" + " is_used: true\n" + " }\n" + " slots {\n" + " name: \"float_dense_slot\"\n" + " type: \"float\"\n" + " is_dense: true\n" + " is_used: true\n" + " }\n" + " slots {\n" + " name: \"not_used_slot\"\n" + " type: \"uint64\"\n" + " is_dense: false\n" + " is_used: false\n" + " }\n" + "}"; + w_protofile.close(); + std::ofstream w_filelist(filelist); + int total_file = 4; + for (int i = 0; i < total_file; ++i) { + std::string filename = "TestMultiSlotDataFeed.data." + std::to_string(i); + w_filelist << filename; + if (i + 1 != total_file) { + w_filelist << std::endl; + } + std::ofstream w_datafile(filename.c_str()); + w_datafile << "3 3978 620 82 1 1926.08 1 1926 1 6.02 1 1996\n" + "2 1300 2983353 1 985.211 1 8 1 0.618 1 12\n" + "1 19260827 2 3.14 2.718 1 27 1 2.236 1 28\n"; + w_datafile.close(); + } + w_filelist.close(); +} + +class MultiTypeSet { + public: + MultiTypeSet() { + uint64_set_.clear(); + float_set_.clear(); + } + ~MultiTypeSet() {} + void AddValue(uint64_t v) { uint64_set_.insert(v); } + void AddValue(float v) { float_set_.insert(v); } + const std::set& GetUint64Set() const { return uint64_set_; } + const std::set& GetFloatSet() const { return float_set_; } + + private: + std::set uint64_set_; + std::set float_set_; +}; + +void GetElemSetFromReader(std::vector* reader_elem_set, + const paddle::framework::DataFeedDesc& data_feed_desc, + const std::vector& filelist, + const int thread_num) { + int used_slot_num = 0; + for (auto i = 0; i < data_feed_desc.multi_slot_desc().slots_size(); ++i) { + if (data_feed_desc.multi_slot_desc().slots(i).is_used()) { + ++used_slot_num; + } + } + reader_elem_set->resize(used_slot_num); + std::vector threads; + std::vector> readers; + readers.resize(thread_num); + for (int i = 0; i < thread_num; ++i) { + readers[i] = paddle::framework::DataFeedFactory::CreateDataFeed( + data_feed_desc.name()); + readers[i]->Init(data_feed_desc); + } + readers[0]->SetFileList(filelist); + std::mutex mu; + for (int idx = 0; idx < thread_num; ++idx) { + threads.emplace_back(std::thread([&, idx] { + std::unique_ptr scope( + new paddle::framework::Scope()); + const auto& multi_slot_desc = data_feed_desc.multi_slot_desc(); + std::map + lodtensor_targets; + for (int i = 0; i < multi_slot_desc.slots_size(); ++i) { + const auto& slot = multi_slot_desc.slots(i); + if (slot.is_used()) { + const auto& name = slot.name(); + readers[idx]->AddFeedVar(scope->Var(name), name); + lodtensor_targets[name] = + &scope->FindVar(name)->Get(); + } + } + readers[idx]->Start(); + while (readers[idx]->Next()) { + int index = 0; + for (int k = 0; k < multi_slot_desc.slots_size(); ++k) { + const auto& slot = multi_slot_desc.slots(k); + if (!slot.is_used()) { + continue; + } + const paddle::framework::LoDTensor* tens = + lodtensor_targets[slot.name()]; + if (slot.is_dense()) { // dense branch + if (slot.type() == "uint64") { + const int64_t* data = tens->data(); + int batch_size = tens->dims()[0]; + int dim = tens->dims()[1]; + for (int i = 0; i < batch_size; ++i) { + for (int j = 0; j < dim; ++j) { + std::lock_guard lock(mu); + (*reader_elem_set)[index].AddValue( + (uint64_t)data[i * dim + j]); + } + } + } else if (slot.type() == "float") { + const float* data = tens->data(); + int batch_size = tens->dims()[0]; + int dim = tens->dims()[1]; + for (int i = 0; i < batch_size; ++i) { + for (int j = 0; j < dim; ++j) { + std::lock_guard lock(mu); + (*reader_elem_set)[index].AddValue(data[i * dim + j]); + } + } + } else { + PADDLE_THROW("Error type in proto file."); + } + } else { // sparse branch + if (slot.type() == "uint64") { + const int64_t* data = tens->data(); + for (size_t i = 0; i < tens->NumElements(); ++i) { + std::pair element = tens->lod_element(0, i); + for (size_t j = element.first; j < element.second; ++j) { + std::lock_guard lock(mu); + (*reader_elem_set)[index].AddValue((uint64_t)data[j]); + } + } + } else if (slot.type() == "float") { + const float* data = tens->data(); + for (size_t i = 0; i < tens->NumElements(); ++i) { + std::pair element = tens->lod_element(0, i); + for (size_t j = element.first; j < element.second; ++j) { + std::lock_guard lock(mu); + (*reader_elem_set)[index].AddValue(data[j]); + } + } + } else { + PADDLE_THROW("Error type in proto file."); + } + } // end sparse branch + ++index; + } // end slots loop + } // end while Next() + })); // end anonymous function + } + for (auto& th : threads) { + th.join(); + } +} + +void CheckIsUnorderedSame(const std::vector& s1, + const std::vector& s2) { + EXPECT_EQ(s1.size(), s2.size()); + for (size_t i = 0; i < s1.size(); ++i) { + // check for uint64 + const std::set& uint64_s1 = s1[i].GetUint64Set(); + const std::set& uint64_s2 = s2[i].GetUint64Set(); + EXPECT_EQ(uint64_s1.size(), uint64_s2.size()); + auto uint64_it1 = uint64_s1.begin(); + auto uint64_it2 = uint64_s2.begin(); + while (uint64_it1 != uint64_s1.end()) { + EXPECT_EQ(*uint64_it1, *uint64_it2); + ++uint64_it1; + ++uint64_it2; + } + // check for float + const std::set& float_s1 = s1[i].GetFloatSet(); + const std::set& float_s2 = s2[i].GetFloatSet(); + EXPECT_EQ(float_s1.size(), float_s2.size()); + auto float_it1 = float_s1.begin(); + auto float_it2 = float_s2.begin(); + while (float_it1 != float_s1.end()) { + EXPECT_EQ(*float_it1, *float_it2); + ++float_it1; + ++float_it2; + } + } +} + +void GetElemSetFromFile(std::vector* file_elem_set, + const paddle::framework::DataFeedDesc& data_feed_desc, + const std::vector& filelist) { + int used_slot_num = 0; + for (auto i = 0; i < data_feed_desc.multi_slot_desc().slots_size(); ++i) { + if (data_feed_desc.multi_slot_desc().slots(i).is_used()) { + ++used_slot_num; + } + } + file_elem_set->resize(used_slot_num); + for (const auto& file : filelist) { + std::ifstream fin(file.c_str()); + PADDLE_ENFORCE(fin.good(), "Can not open %s.", file.c_str()); + while (1) { + bool end_flag = false; + int index = 0; + for (auto i = 0; i < data_feed_desc.multi_slot_desc().slots_size(); ++i) { + int num; + if (fin >> num) { + auto slot = data_feed_desc.multi_slot_desc().slots(i); + auto type = slot.type(); + if (type == "uint64") { + while (num--) { + uint64_t feasign; + fin >> feasign; + if (slot.is_used()) { + (*file_elem_set)[index].AddValue(feasign); + } + } + } else if (type == "float") { + while (num--) { + float feasign; + fin >> feasign; + if (slot.is_used()) { + (*file_elem_set)[index].AddValue(feasign); + } + } + } else { + PADDLE_THROW("Error type in proto file."); + } + if (slot.is_used()) { + ++index; + } + } else { + end_flag = true; + break; + } + } + if (end_flag) { + break; + } + } + fin.close(); + } +} + +TEST(DataFeed, MultiSlotUnitTest) { + const char* protofile = "data_feed_desc.prototxt"; + const char* filelist_name = "filelist.txt"; + GenerateFileForTest(protofile, filelist_name); + const std::vector filelist = + load_filelist_from_file(filelist_name); + paddle::framework::DataFeedDesc data_feed_desc = + load_datafeed_param_from_file(protofile); + std::vector reader_elem_set; + std::vector file_elem_set; + // GetElemSetFromReader(&reader_elem_set, data_feed_desc, filelist, 4); + // GetElemSetFromFile(&file_elem_set, data_feed_desc, filelist); + // CheckIsUnorderedSame(reader_elem_set, file_elem_set); +} diff --git a/paddle/fluid/framework/data_layout_transform.cc b/paddle/fluid/framework/data_layout_transform.cc index c9e3a8ac1d1e5228725bff49ecc6d91e640dfe57..72c50518af08b9c1b2f97e6864e5836e806c77fc 100644 --- a/paddle/fluid/framework/data_layout_transform.cc +++ b/paddle/fluid/framework/data_layout_transform.cc @@ -85,7 +85,7 @@ void TransDataLayout(const OpKernelType& kernel_type_for_var, out->mutable_data(expected_kernel_type.place_, in.type()); framework::VisitDataType( - framework::ToDataType(in.type()), + in.type(), CastDataLayout(pool.Get(expected_kernel_type.place_), axis, in, out)); out->set_layout(expected_kernel_type.data_layout_); @@ -101,7 +101,7 @@ void* GetDataFromTensor(const Tensor& tensor, mkldnn::memory::data_type type) { case mkldnn::memory::data_type::f32: return platform::to_void_cast(tensor.data()); case mkldnn::memory::data_type::s8: - return platform::to_void_cast(tensor.data()); + return platform::to_void_cast(tensor.data()); case mkldnn::memory::data_type::u8: return platform::to_void_cast(tensor.data()); case mkldnn::memory::data_type::s16: @@ -144,26 +144,29 @@ void TransDataLayoutFromMKLDNN(const OpKernelType& kernel_type_for_var, memory::data_type in_type = ToMKLDNNDataType(in.type()); PADDLE_ENFORCE(in_type != memory::data_type::data_undef, - "Input tensor type is not supported: ", in.type().name()); + "Input tensor type is not supported: %s", in.type()); memory::data_type out_type = in_type; auto in_format = platform::MKLDNNFormatForSize(in_tz.size(), in.format()); auto out_format = platform::MKLDNNFormatForSize(in_tz.size(), ToMKLDNNFormat(out_layout)); - void* in_data = GetDataFromTensor(in, in_type); - // output tensor has the same dims as input. Reorder don't change dims out->Resize(in.dims()); - auto out_data = out->mutable_data(expected_kernel_type.place_, in.type()); - - auto in_memory = memory({{{in_tz}, in_type, in_format}, cpu_engine}, in_data); - auto out_memory = - memory({{{out_tz}, out_type, out_format}, cpu_engine}, out_data); + if (in_format != out_format) { + void* in_data = GetDataFromTensor(in, in_type); + auto out_data = out->mutable_data(expected_kernel_type.place_, in.type()); - platform::Reorder(in_memory, out_memory); + auto in_memory = + memory({{{in_tz}, in_type, in_format}, cpu_engine}, in_data); + auto out_memory = + memory({{{out_tz}, out_type, out_format}, cpu_engine}, out_data); + platform::Reorder(in_memory, out_memory); + } else { + out->ShareDataWith(in); + } out->set_layout(out_layout); // reset format since the out tensor will be feed to non-MKLDNN OPkernel out->set_format(memory::format::format_undef); diff --git a/paddle/fluid/framework/data_layout_transform.h b/paddle/fluid/framework/data_layout_transform.h index 90bb206ec6b698bc23ad1a5c9609a25186ec6de8..2479de4fd46802148af09d34b627a8804276cacf 100644 --- a/paddle/fluid/framework/data_layout_transform.h +++ b/paddle/fluid/framework/data_layout_transform.h @@ -50,14 +50,14 @@ inline DataLayout ToPaddleLayout(const MKLDNNFormat& format) { } } -inline MKLDNNDataType ToMKLDNNDataType(const std::type_index type) { - static const std::map dict{ - {std::type_index(typeid(float)), MKLDNNDataType::f32}, // NOLINT - {std::type_index(typeid(char)), MKLDNNDataType::s8}, // NOLINT - {std::type_index(typeid(unsigned char)), MKLDNNDataType::u8}, - {std::type_index(typeid(int16_t)), MKLDNNDataType::s16}, - {std::type_index(typeid(int32_t)), MKLDNNDataType::s32}}; - auto iter = dict.find(type); +inline MKLDNNDataType ToMKLDNNDataType(proto::VarType::Type type) { + static std::unordered_map dict{ + {DataTypeTrait::DataType, MKLDNNDataType::f32}, + {DataTypeTrait::DataType, MKLDNNDataType::s8}, + {DataTypeTrait::DataType, MKLDNNDataType::u8}, + {DataTypeTrait::DataType, MKLDNNDataType::s16}, + {DataTypeTrait::DataType, MKLDNNDataType::s32}}; + auto iter = dict.find(static_cast(type)); if (iter != dict.end()) return iter->second; return MKLDNNDataType::data_undef; } diff --git a/paddle/fluid/framework/data_set.cc b/paddle/fluid/framework/data_set.cc new file mode 100644 index 0000000000000000000000000000000000000000..a3b7b1e454ecec9da766b9b156c31b1317bb9d35 --- /dev/null +++ b/paddle/fluid/framework/data_set.cc @@ -0,0 +1,281 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. */ + +#include "paddle/fluid/framework/data_set.h" +#include +#include "google/protobuf/io/zero_copy_stream_impl.h" +#include "google/protobuf/message.h" +#include "google/protobuf/text_format.h" +#include "paddle/fluid/framework/data_feed_factory.h" +#include "paddle/fluid/framework/io/fs.h" +#include "paddle/fluid/platform/timer.h" + +#if defined _WIN32 || defined __APPLE__ +#else +#define _LINUX +#endif + +namespace paddle { +namespace framework { + +// constructor +template +DatasetImpl::DatasetImpl() { + thread_num_ = 1; + trainer_num_ = 1; + file_idx_ = 0; +} + +// set filelist, file_idx_ will reset to zero. +template +void DatasetImpl::SetFileList(const std::vector& filelist) { + VLOG(3) << "filelist size: " << filelist.size(); + filelist_ = filelist; + file_idx_ = 0; +} + +// set expect thread num. actually it may change +template +void DatasetImpl::SetThreadNum(int thread_num) { + VLOG(3) << "SetThreadNum thread_num=" << thread_num; + thread_num_ = thread_num; +} + +// if you run distributed, and want to do global shuffle, +// set this before global shuffle. +// be sure you call CreateReaders before SetTrainerNum +template +void DatasetImpl::SetTrainerNum(int trainer_num) { + trainer_num_ = trainer_num; + // should inform reader of trainer_num directly + for (auto reader : readers_) { + reader->SetTrainerNum(trainer_num); + } +} + +// if you run distributed, and want to do global shuffle, +// set this before global shuffle. +// be sure you call CreateReaders before SetFleetSendBatchSize +template +void DatasetImpl::SetFleetSendBatchSize(int64_t size) { + fleet_send_batch_size_ = size; + for (auto reader : readers_) { + reader->SetFleetSendBatchSize(size); + } +} + +template +void DatasetImpl::SetHdfsConfig(const std::string& fs_name, + const std::string& fs_ugi) { + fs_name_ = fs_name; + fs_ugi_ = fs_ugi; + std::string cmd = std::string("hadoop fs"); + cmd += " -D fs.default.name=" + fs_name; + cmd += " -D hadoop.job.ugi=" + fs_ugi; + paddle::framework::hdfs_set_command(cmd); +} + +template +void DatasetImpl::SetDataFeedDesc(const std::string& data_feed_desc_str) { + google::protobuf::TextFormat::ParseFromString(data_feed_desc_str, + &data_feed_desc_); +} + +// readers_.size() may not be equal to thread_num_, +// it changes when filelist_.size() < thread_num_ +template +std::vector>& +DatasetImpl::GetReaders() { + return readers_; +} + +// if sent message between workers, should first call this function +template +void DatasetImpl::RegisterClientToClientMsgHandler() { + auto fleet_ptr = FleetWrapper::GetInstance(); + VLOG(3) << "RegisterClientToClientMsgHandler"; + fleet_ptr->RegisterClientToClientMsgHandler( + 0, [this](int msg_type, int client_id, const std::string& msg) -> int { + return this->ReceiveFromClient(msg_type, client_id, msg); + }); + VLOG(3) << "RegisterClientToClientMsgHandler done"; +} + +// load data into memory, Dataset hold this memory, +// which will later be fed into readers' channel +template +void DatasetImpl::LoadIntoMemory() { + VLOG(3) << "DatasetImpl::LoadIntoMemory() begin"; + platform::Timer timeline; + timeline.Start(); + if (readers_.size() == 0) { + CreateReaders(); + } + std::vector load_threads; + for (int64_t i = 0; i < thread_num_; ++i) { + load_threads.push_back(std::thread( + &paddle::framework::DataFeed::LoadIntoMemory, readers_[i].get())); + } + for (std::thread& t : load_threads) { + t.join(); + } + timeline.Pause(); + VLOG(3) << "DatasetImpl::LoadIntoMemory() end" + << ", memory data size=" << memory_data_.size() + << ", cost time=" << timeline.ElapsedSec() << " seconds"; +} + +// release memory data +template +void DatasetImpl::ReleaseMemory() { + VLOG(3) << "DatasetImpl::ReleaseMemory() begin"; + std::vector().swap(memory_data_); + VLOG(3) << "DatasetImpl::ReleaseMemory() end"; +} + +// do local shuffle +template +void DatasetImpl::LocalShuffle() { + VLOG(3) << "DatasetImpl::LocalShuffle() begin"; + platform::Timer timeline; + timeline.Start(); + if (readers_.size() == 0) { + CreateReaders(); + } + // if it is not InMemory, memory_data_ is empty + std::random_shuffle(memory_data_.begin(), memory_data_.end()); + + std::vector local_shuffle_threads; + for (int64_t i = 0; i < thread_num_; ++i) { + local_shuffle_threads.push_back(std::thread( + &paddle::framework::DataFeed::LocalShuffle, readers_[i].get())); + } + for (std::thread& t : local_shuffle_threads) { + t.join(); + } + std::vector().swap(memory_data_); + timeline.Pause(); + VLOG(3) << "DatasetImpl::LocalShuffle() end, cost time=" + << timeline.ElapsedSec() << " seconds"; +} + +template +void DatasetImpl::GlobalShuffle() { + VLOG(3) << "DatasetImpl::GlobalShuffle() begin"; + platform::Timer timeline; + timeline.Start(); + if (readers_.size() == 0) { + CreateReaders(); + } + // if it is not InMemory, memory_data_ is empty + std::random_shuffle(memory_data_.begin(), memory_data_.end()); + VLOG(3) << "start global shuffle threads"; + std::vector global_shuffle_threads; + for (int i = 0; i < thread_num_; ++i) { + global_shuffle_threads.push_back(std::thread( + &paddle::framework::DataFeed::GlobalShuffle, readers_[i].get())); + } + for (std::thread& t : global_shuffle_threads) { + t.join(); + } + std::vector().swap(memory_data_); + timeline.Pause(); + VLOG(3) << "DatasetImpl::GlobalShuffle() end, cost time=" + << timeline.ElapsedSec() << " seconds"; +} + +template +void DatasetImpl::CreateReaders() { + VLOG(3) << "Calling CreateReaders()"; + CHECK(thread_num_ > 0) << "thread_num should > 0"; + int file_cnt = filelist_.size(); + int memory_data_size = memory_data_.size(); + if (memory_data_size != 0 && thread_num_ > memory_data_size) { + VLOG(3) << "Dataset thread num = " << thread_num_ + << ", memory data size = " << memory_data_size + << ". Changing Dataset thread num = " << memory_data_size; + thread_num_ = memory_data_size; + } else if (file_cnt != 0 && thread_num_ > file_cnt) { + VLOG(3) << "Dataset thread num = " << thread_num_ + << ", file num = " << file_cnt + << ". Changing Dataset thread num = " << file_cnt; + thread_num_ = file_cnt; + } + VLOG(3) << "thread_num in Readers: " << thread_num_; + VLOG(3) << "readers size: " << readers_.size(); + VLOG(3) << "Filelist size in readers: " << filelist_.size(); + if (readers_.size() != 0) { + return; + } + VLOG(3) << "data feed class name: " << data_feed_desc_.name(); + for (int i = 0; i < thread_num_; ++i) { + readers_.push_back(DataFeedFactory::CreateDataFeed(data_feed_desc_.name())); + readers_.back()->Init(data_feed_desc_); + readers_.back()->SetMemoryData(&memory_data_); + readers_.back()->SetMemoryDataMutex(&mutex_for_update_memory_data_); + readers_.back()->SetThreadId(i); + readers_.back()->SetThreadNum(thread_num_); + readers_.back()->SetTrainerNum(trainer_num_); + readers_.back()->SetFileListMutex(&mutex_for_pick_file_); + readers_.back()->SetFileListIndex(&file_idx_); + readers_.back()->SetFileList(filelist_); + } +} + +template +void DatasetImpl::DestroyReaders() { + VLOG(3) << "Calling DestroyReaders()"; + // clear memory_data_ before fill it + // because if LoadIntoMemory but no Shuffle, + // memory_data_ has empty data which has been std::move to channel + if (memory_data_.size() != 0) { + std::vector().swap(memory_data_); + } + std::vector fill_threads; + for (int i = 0; i < thread_num_; ++i) { + fill_threads.push_back( + std::thread(&paddle::framework::DataFeed::FillChannelToMemoryData, + readers_[i].get())); + } + for (std::thread& t : fill_threads) { + t.join(); + } + std::vector>().swap(readers_); + VLOG(3) << "readers size: " << readers_.size(); + // if memory_data_ is empty, which means it's not InMemory mode, + // so the next epoch should read all data again + if (memory_data_.size() == 0) { + file_idx_ = 0; + } +} + +template +int DatasetImpl::ReceiveFromClient(int msg_type, int client_id, + const std::string& msg) { +#ifdef _LINUX + VLOG(3) << "ReceiveFromClient msg_type=" << msg_type + << ", client_id=" << client_id << ", msg length=" << msg.length(); + auto fleet_ptr = FleetWrapper::GetInstance(); + int64_t index = rand_r(&rand_seed) % thread_num_; + VLOG(3) << "ramdom index=" << index; + readers_[index]->PutInsToChannel(msg); +#endif + return 0; +} + +// explicit instantiation +template class DatasetImpl>; + +} // end namespace framework +} // end namespace paddle diff --git a/paddle/fluid/framework/data_set.h b/paddle/fluid/framework/data_set.h new file mode 100644 index 0000000000000000000000000000000000000000..bbe0f937abfa635b126062059abfcfb70adb996e --- /dev/null +++ b/paddle/fluid/framework/data_set.h @@ -0,0 +1,157 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. */ + +#pragma once + +#include +#include +#include // NOLINT +#include +#include // NOLINT +#include +#include + +#include "paddle/fluid/framework/data_feed.h" + +namespace paddle { +namespace framework { + +// Dataset is a abstract class, which defines user interfaces +// Example Usage: +// Dataset* dataset = DatasetFactory::CreateDataset("InMemoryDataset") +// dataset->SetFileList(std::vector{"a.txt", "b.txt"}) +// dataset->SetThreadNum(1) +// dataset->CreateReaders(); +// dataset->SetDataFeedDesc(your_data_feed_desc); +// dataset->LoadIntoMemory(); +// dataset->SetTrainerNum(2); +// dataset->GlobalShuffle(); +class Dataset { + public: + Dataset() {} + virtual ~Dataset() {} + // set file list + virtual void SetFileList(const std::vector& filelist) = 0; + // set readers' num + virtual void SetThreadNum(int thread_num) = 0; + // set workers' num + virtual void SetTrainerNum(int trainer_num) = 0; + // set fleet send batch size + virtual void SetFleetSendBatchSize(int64_t size) = 0; + // set fs name and ugi + virtual void SetHdfsConfig(const std::string& fs_name, + const std::string& fs_ugi) = 0; + // set data fedd desc, which contains: + // data feed name, batch size, slots + virtual void SetDataFeedDesc(const std::string& data_feed_desc_str) = 0; + // get file list + virtual const std::vector& GetFileList() = 0; + // get thread num + virtual int GetThreadNum() = 0; + // get worker num + virtual int GetTrainerNum() = 0; + // get fleet send batch size + virtual int64_t GetFleetSendBatchSize() = 0; + // get hdfs config + virtual std::pair GetHdfsConfig() = 0; + // get data fedd desc + virtual const paddle::framework::DataFeedDesc& GetDataFeedDesc() = 0; + // get readers, the reader num depend both on thread num + // and filelist size + virtual std::vector>& + GetReaders() = 0; + // register message handler between workers + virtual void RegisterClientToClientMsgHandler() = 0; + // load all data into memory + virtual void LoadIntoMemory() = 0; + // release all memory data + virtual void ReleaseMemory() = 0; + // local shuffle data + virtual void LocalShuffle() = 0; + // global shuffle data + virtual void GlobalShuffle() = 0; + // create readers + virtual void CreateReaders() = 0; + // destroy readers + virtual void DestroyReaders() = 0; + + protected: + virtual int ReceiveFromClient(int msg_type, int client_id, + const std::string& msg) = 0; +}; + +// DatasetImpl is the implementation of Dataset, +// it holds memory data if user calls load_into_memory +template +class DatasetImpl : public Dataset { + public: + DatasetImpl(); + virtual ~DatasetImpl() {} + + virtual void SetFileList(const std::vector& filelist); + virtual void SetThreadNum(int thread_num); + virtual void SetTrainerNum(int trainer_num); + virtual void SetFleetSendBatchSize(int64_t size); + virtual void SetHdfsConfig(const std::string& fs_name, + const std::string& fs_ugi); + virtual void SetDataFeedDesc(const std::string& data_feed_desc_str); + + virtual const std::vector& GetFileList() { return filelist_; } + virtual int GetThreadNum() { return thread_num_; } + virtual int GetTrainerNum() { return trainer_num_; } + virtual int64_t GetFleetSendBatchSize() { return fleet_send_batch_size_; } + virtual std::pair GetHdfsConfig() { + return std::make_pair(fs_name_, fs_ugi_); + } + virtual const paddle::framework::DataFeedDesc& GetDataFeedDesc() { + return data_feed_desc_; + } + virtual std::vector>& + GetReaders(); + + virtual void RegisterClientToClientMsgHandler(); + virtual void LoadIntoMemory(); + virtual void ReleaseMemory(); + virtual void LocalShuffle(); + virtual void GlobalShuffle(); + virtual void CreateReaders(); + virtual void DestroyReaders(); + + protected: + virtual int ReceiveFromClient(int msg_type, int client_id, + const std::string& msg); + std::vector> readers_; + std::vector memory_data_; + std::mutex mutex_for_update_memory_data_; + int thread_num_; + paddle::framework::DataFeedDesc data_feed_desc_; + int trainer_num_; + std::vector filelist_; + size_t file_idx_; + std::mutex mutex_for_pick_file_; + std::string fs_name_; + std::string fs_ugi_; + unsigned int rand_seed; + int64_t fleet_send_batch_size_; +}; + +// use std::vector as data type +class MultiSlotDataset : public DatasetImpl> { + public: + MultiSlotDataset() {} + virtual ~MultiSlotDataset() {} +}; + +} // end namespace framework +} // end namespace paddle diff --git a/paddle/fluid/framework/data_type.cc b/paddle/fluid/framework/data_type.cc index 28f3da88fa18021f6b71e458fdb467be86d4dbf0..a0248cf3c75690fb9ec3fcc22596af245d042d80 100644 --- a/paddle/fluid/framework/data_type.cc +++ b/paddle/fluid/framework/data_type.cc @@ -26,7 +26,7 @@ struct DataTypeMap { std::unordered_map cpp_to_proto_; std::unordered_map proto_to_cpp_; std::unordered_map proto_to_str_; - std::unordered_map cpp_to_size_; + std::unordered_map proto_to_size_; }; static DataTypeMap* InitDataTypeMap(); @@ -45,7 +45,7 @@ static inline void RegisterType(DataTypeMap* map, map->proto_to_cpp_.emplace(static_cast(proto_type), typeid(T)); map->cpp_to_proto_.emplace(typeid(T), proto_type); map->proto_to_str_.emplace(static_cast(proto_type), name); - map->cpp_to_size_.emplace(typeid(T), sizeof(T)); + map->proto_to_size_.emplace(static_cast(proto_type), sizeof(T)); } static DataTypeMap* InitDataTypeMap() { @@ -54,17 +54,7 @@ static DataTypeMap* InitDataTypeMap() { #define RegType(cc_type, proto_type) \ RegisterType(retv, proto_type, #cc_type) - // NOTE: Add your customize type here. - RegType(float16, proto::VarType::FP16); - RegType(float, proto::VarType::FP32); - RegType(double, proto::VarType::FP64); - RegType(int, proto::VarType::INT32); - RegType(int64_t, proto::VarType::INT64); - RegType(bool, proto::VarType::BOOL); - RegType(size_t, proto::VarType::SIZE_T); - RegType(int16_t, proto::VarType::INT16); - RegType(uint8_t, proto::VarType::UINT8); - RegType(int8_t, proto::VarType::INT8); + _ForEachDataType_(RegType); #undef RegType return retv; @@ -96,12 +86,12 @@ std::string DataTypeToString(const proto::VarType::Type type) { static_cast(type)); } -size_t SizeOfType(std::type_index type) { - auto it = gDataTypeMap().cpp_to_size_.find(type); - if (it != gDataTypeMap().cpp_to_size_.end()) { +size_t SizeOfType(proto::VarType::Type type) { + auto it = gDataTypeMap().proto_to_size_.find(static_cast(type)); + if (it != gDataTypeMap().proto_to_size_.end()) { return it->second; } - PADDLE_THROW("Not support %s as tensor type", type.name()); + PADDLE_THROW("Not support %s as tensor type", DataTypeToString(type)); } } // namespace framework diff --git a/paddle/fluid/framework/data_type.h b/paddle/fluid/framework/data_type.h index d5be43b33edab7871e1bba930a4fc6cd1e293825..76df78ea5e17c7eaf1e8ce7a7dc2282a5a4ed579 100644 --- a/paddle/fluid/framework/data_type.h +++ b/paddle/fluid/framework/data_type.h @@ -22,46 +22,59 @@ limitations under the License. */ namespace paddle { namespace framework { +template +struct DataTypeTrait {}; + +// Stub handle for void +template <> +struct DataTypeTrait { + constexpr static auto DataType = proto::VarType::RAW; +}; + +#define _ForEachDataTypeHelper_(callback, cpp_type, proto_type) \ + callback(cpp_type, ::paddle::framework::proto::VarType::proto_type); + +#define _ForEachDataType_(callback) \ + _ForEachDataTypeHelper_(callback, float, FP32); \ + _ForEachDataTypeHelper_(callback, ::paddle::platform::float16, FP16); \ + _ForEachDataTypeHelper_(callback, double, FP64); \ + _ForEachDataTypeHelper_(callback, int, INT32); \ + _ForEachDataTypeHelper_(callback, int64_t, INT64); \ + _ForEachDataTypeHelper_(callback, bool, BOOL); \ + _ForEachDataTypeHelper_(callback, uint8_t, UINT8); \ + _ForEachDataTypeHelper_(callback, int16_t, INT16); \ + _ForEachDataTypeHelper_(callback, int8_t, INT8) + +#define DefineDataTypeTrait(cpp_type, proto_type) \ + template <> \ + struct DataTypeTrait { \ + constexpr static auto DataType = proto_type; \ + } + +_ForEachDataType_(DefineDataTypeTrait); + +#undef DefineDataTypeTrait + extern proto::VarType::Type ToDataType(std::type_index type); extern std::type_index ToTypeIndex(proto::VarType::Type type); template inline void VisitDataType(proto::VarType::Type type, Visitor visitor) { - switch (type) { - case proto::VarType::FP16: - visitor.template apply(); - break; - case proto::VarType::FP32: - visitor.template apply(); - break; - case proto::VarType::FP64: - visitor.template apply(); - break; - case proto::VarType::INT32: - visitor.template apply(); - break; - case proto::VarType::INT64: - visitor.template apply(); - break; - case proto::VarType::BOOL: - visitor.template apply(); - break; - case proto::VarType::UINT8: - visitor.template apply(); - break; - case proto::VarType::INT16: - visitor.template apply(); - break; - case proto::VarType::INT8: - visitor.template apply(); - break; - default: - PADDLE_THROW("Not supported %d", type); - } +#define VisitDataTypeCallback(cpp_type, proto_type) \ + do { \ + if (type == proto_type) { \ + visitor.template apply(); \ + return; \ + } \ + } while (0) + + _ForEachDataType_(VisitDataTypeCallback); +#undef VisitDataTypeCallback + PADDLE_THROW("Not supported %d", type); } extern std::string DataTypeToString(const proto::VarType::Type type); -extern size_t SizeOfType(std::type_index type); +extern size_t SizeOfType(proto::VarType::Type type); inline std::ostream& operator<<(std::ostream& out, const proto::VarType::Type& type) { out << DataTypeToString(type); diff --git a/paddle/fluid/framework/data_type_test.cc b/paddle/fluid/framework/data_type_test.cc index 54c41c55ba63c0b2001cfcb6a9e94fbb0036d437..2a380201f297f42dd82a6809bef9a72660066819 100644 --- a/paddle/fluid/framework/data_type_test.cc +++ b/paddle/fluid/framework/data_type_test.cc @@ -26,15 +26,15 @@ TEST(DataType, float16) { Tensor tensor; CPUPlace cpu; - tensor.mutable_data(cpu, f::ToTypeIndex(dtype)); + tensor.mutable_data(cpu, dtype); // test fp16 tensor - EXPECT_EQ(tensor.type(), std::type_index(typeid(float16))); + EXPECT_EQ(tensor.type(), f::ToDataType(typeid(float16))); // test fp16 size - EXPECT_EQ(f::SizeOfType(f::ToTypeIndex(dtype)), 2u); + EXPECT_EQ(f::SizeOfType(dtype), 2u); // test debug info - std::string type = "float16"; + std::string type = "::paddle::platform::float16"; EXPECT_STREQ(f::DataTypeToString(dtype).c_str(), type.c_str()); } diff --git a/paddle/fluid/framework/dataset_factory.cc b/paddle/fluid/framework/dataset_factory.cc new file mode 100644 index 0000000000000000000000000000000000000000..60be4cf9a43c01666c94018b7339da5f3ba797e5 --- /dev/null +++ b/paddle/fluid/framework/dataset_factory.cc @@ -0,0 +1,66 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/dataset_factory.h" +#include +#include +#include + +#include "paddle/fluid/framework/data_set.h" + +namespace paddle { +namespace framework { +typedef std::shared_ptr (*CreateDatasetFunction)(); +typedef std::unordered_map datasetMap; +datasetMap g_dataset_map; + +#define REGISTER_DATASET_CLASS(dataset_class) \ + namespace { \ + std::shared_ptr Creator_##dataset_class() { \ + return std::shared_ptr(new dataset_class); \ + } \ + class __Registerer_##dataset_class { \ + public: \ + __Registerer_##dataset_class() { \ + g_dataset_map[#dataset_class] = &Creator_##dataset_class; \ + } \ + }; \ + __Registerer_##dataset_class g_registerer_##dataset_class; \ + } // namespace + +std::string DatasetFactory::DatasetTypeList() { + std::string dataset_types; + for (auto iter = g_dataset_map.begin(); iter != g_dataset_map.end(); ++iter) { + if (iter != g_dataset_map.begin()) { + dataset_types += ", "; + } + dataset_types += iter->first; + } + return dataset_types; +} + +std::shared_ptr DatasetFactory::CreateDataset( + std::string dataset_class) { + if (g_dataset_map.count(dataset_class) < 1) { + LOG(WARNING) << "Your Dataset " << dataset_class + << "is not supported currently"; + LOG(WARNING) << "Supported Dataset: " << DatasetTypeList(); + exit(-1); + } + return g_dataset_map[dataset_class](); +} + +REGISTER_DATASET_CLASS(MultiSlotDataset); +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/dataset_factory.h b/paddle/fluid/framework/dataset_factory.h new file mode 100644 index 0000000000000000000000000000000000000000..2894b69f8faca4b261347ed3b55e965ff8ee53fa --- /dev/null +++ b/paddle/fluid/framework/dataset_factory.h @@ -0,0 +1,29 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include "paddle/fluid/framework/data_set.h" + +namespace paddle { +namespace framework { +class DatasetFactory { + public: + static std::string DatasetTypeList(); + static std::shared_ptr CreateDataset(std::string dataset_class); +}; +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ddim.cc b/paddle/fluid/framework/ddim.cc index 05e423b8a52962d47a6615d48243444374b470e3..e7a6df57e538164969bc101ced4b91de8f75ca56 100644 --- a/paddle/fluid/framework/ddim.cc +++ b/paddle/fluid/framework/ddim.cc @@ -18,312 +18,159 @@ limitations under the License. */ namespace paddle { namespace framework { -/// @cond HIDDEN - -template -Dim make_dim(const int64_t* d) { - return Dim(*d, make_dim(d + 1)); -} - -template <> -Dim<0> make_dim<0>(const int64_t* d) { - return Dim<0>(*d); -} - -void make_ddim(DDim& ddim, const int64_t* dims, int n) { - switch (n) { - case 0: - ddim = make_dim<0>(dims); - break; - case 1: - ddim = make_dim<1>(dims); - break; - case 2: - ddim = make_dim<2>(dims); - break; - case 3: - ddim = make_dim<3>(dims); - break; - case 4: - ddim = make_dim<4>(dims); - break; - case 5: - ddim = make_dim<5>(dims); - break; - case 6: - ddim = make_dim<6>(dims); - break; - case 7: - ddim = make_dim<7>(dims); - break; - case 8: - ddim = make_dim<8>(dims); - break; - case 9: - ddim = make_dim<9>(dims); - break; - default: - PADDLE_THROW("Dynamic dimensions must have between [1, 9] dimensions."); - } -} - -/// @endcond - DDim make_ddim(std::initializer_list dims) { - DDim result(make_dim(0)); - make_ddim(result, dims.begin(), dims.size()); - return result; + return DDim(dims.begin(), dims.size()); } DDim make_ddim(const std::vector& dims) { - DDim result(make_dim(0)); - make_ddim(result, &dims[0], dims.size()); - return result; + return DDim(dims.data(), dims.size()); } DDim make_ddim(const std::vector& dims) { - std::vector res(dims.size()); - std::transform(dims.begin(), dims.end(), res.begin(), - [](int d) { return static_cast(d); }); - return make_ddim(res); + return DDim(dims.data(), dims.size()); } -/// @cond HIDDEN -// XXX For some reason, putting this in an anonymous namespace causes errors -class DynamicMutableIndexer : public boost::static_visitor { - public: - explicit DynamicMutableIndexer(int idx) : idx_(idx) {} +struct DDimEqualityVisitor { + explicit DDimEqualityVisitor(const int64_t* d) : d_(d) {} template - int64_t& operator()(Dim& dim) const { - return dim[idx_]; + inline bool operator()(const Dim& self) const { + return UnrollCompare::Run(self.Get(), d_); } - private: - int idx_; + const int64_t* d_; }; -class DynamicConstIndexer : public boost::static_visitor { - public: - explicit DynamicConstIndexer(int idx) : idx_(idx) {} - - template - int64_t operator()(const Dim& dim) const { - return dim[idx_]; - } - - private: - int idx_; -}; - -/// @endcond - -int64_t& DDim::operator[](int idx) { - return boost::apply_visitor(DynamicMutableIndexer(idx), var); +bool DDim::operator==(const DDim& d) const { + return size() == d.size() && + this->apply_visitor(DDimEqualityVisitor(d.Get())); } -int64_t DDim::operator[](int idx) const { - return boost::apply_visitor(DynamicConstIndexer(idx), var); -} +bool DDim::operator!=(const DDim& d) const { return !(*this == d); } -int DDim::size() const { return arity(*this); } +struct DDimPlusVisitor { + explicit DDimPlusVisitor(const int64_t* d1, const int64_t* d2) + : d1_(d1), d2_(d2) {} -bool DDim::operator==(DDim d) const { - if (var.which() != d.getVar().which()) { - return false; - } else { - std::vector v1 = vectorize(*this); - std::vector v2 = vectorize(d); - - for (unsigned int i = 0; i < v1.size(); i++) { - if (v1[i] != v2[i]) { - return false; - } - } - - return true; + template + inline void operator()(Dim& self) const { + UnrollAdd::Run(d1_, d2_, self.GetMutable()); } -} - -bool DDim::operator!=(DDim d) const { return !(*this == d); } - -DDim DDim::operator+(DDim d) const { - std::vector v1 = vectorize(*this); - std::vector v2 = vectorize(d); - - std::vector v3; - assert(v1.size() == v2.size()); - - for (unsigned int i = 0; i < v1.size(); i++) { - v3.push_back(v1[i] + v2[i]); - } + const int64_t* d1_; + const int64_t* d2_; +}; - return make_ddim(v3); +DDim DDim::operator+(const DDim& d) const { + PADDLE_ENFORCE(size() == d.size()); + DDim ret; + ret.rank_ = rank_; + ret.apply_visitor(DDimPlusVisitor(Get(), d.Get())); + return ret; } -DDim DDim::operator*(DDim d) const { - std::vector v1 = vectorize(*this); - std::vector v2 = vectorize(d); +struct DDimMulVisitor { + explicit DDimMulVisitor(const int64_t* d1, const int64_t* d2) + : d1_(d1), d2_(d2) {} - std::vector v3; - - assert(v1.size() == v2.size()); - - for (unsigned int i = 0; i < v1.size(); i++) { - v3.push_back(v1[i] * v2[i]); + template + inline void operator()(Dim& self) const { + UnrollMul::Run(d1_, d2_, self.GetMutable()); } - return make_ddim(v3); + const int64_t* d1_; + const int64_t* d2_; +}; + +DDim DDim::operator*(const DDim& d) const { + PADDLE_ENFORCE(size() == d.size()); + DDim ret; + ret.rank_ = rank_; + ret.apply_visitor(DDimMulVisitor(Get(), d.Get())); + return ret; } int64_t get(const DDim& ddim, int idx) { return ddim[idx]; } -void set(DDim& ddim, int idx, int value) { ddim[idx] = value; } - -/// @cond HIDDEN -struct VectorizeVisitor : public boost::static_visitor<> { - std::vector& vector; - - explicit VectorizeVisitor(std::vector& v) : vector(v) {} - - template - void operator()(const T& t) { - vector.push_back(t.head); - this->operator()(t.tail); - } - - void operator()(const Dim<0>& t) {} -}; -/// @endcond +void set(DDim& ddim, int idx, int value) { ddim[idx] = value; } // NOLINT std::vector vectorize(const DDim& ddim) { - std::vector result; - VectorizeVisitor visitor(result); - boost::apply_visitor(visitor, ddim); + std::vector result(DDim::kMaxRank); + dynamic_dim_assign(ddim.Get(), result.data(), ddim.size()); + result.resize(ddim.size()); return result; } // NOTE: framework::vectorize converts to type int64_t // which does not fit cudnn inputs. std::vector vectorize2int(const DDim& ddim) { - std::vector temp = vectorize(ddim); - std::vector result(temp.begin(), temp.end()); + std::vector result(DDim::kMaxRank); + dynamic_dim_assign(ddim.Get(), result.data(), ddim.size()); + result.resize(ddim.size()); return result; } -struct ProductVisitor : public boost::static_visitor { +struct ProductVisitor { template - int64_t operator()(const Dim& dim) { + inline int64_t operator()(const Dim& dim) { return product(dim); } }; int64_t product(const DDim& ddim) { - ProductVisitor visitor; - return boost::apply_visitor(visitor, ddim); + return ddim.apply_visitor(ProductVisitor()); } -struct SliceVectorizeVisitor : public boost::static_visitor<> { - std::vector& vector; - int begin; - int end; - - SliceVectorizeVisitor(std::vector& v, int b, int e) - : vector(v), begin(b), end(e) { - PADDLE_ENFORCE(begin < end, - "Begin index must be less than end index in ddim slice."); - PADDLE_ENFORCE(begin >= 0, - "Begin index can't be less than zero in ddim slice."); - } - - template - void operator()(const Dim& dim) { - if (begin == 0) { - vector.push_back(dim.head); - } else { - --begin; - } - --end; - if (end > 0) { - this->operator()(dim.tail); - } - } - - void operator()(const Dim<0>& dim) { - PADDLE_ENFORCE(end == 0, "End index in ddim slice is out of bound."); - } -}; - DDim slice_ddim(const DDim& dim, int begin, int end) { - std::vector vec; - vec.reserve(end - begin); - SliceVectorizeVisitor visitor(vec, begin, end); - boost::apply_visitor(visitor, dim); - return make_ddim(vec); + PADDLE_ENFORCE(begin >= 0 && end <= dim.size(), + "[begin(%d), end(%d)) must be inside [0, %d) in ddim slice.", + begin, end, dim.size()); + // Constructor of DDim would check whether end - begin is valid + return DDim(dim.Get() + begin, end - begin); } -/// \cond HIDDEN - -struct ArityVisitor : boost::static_visitor { - template - int operator()(Dim) const { - return D; - } -}; - -/// \endcond - -int arity(const DDim& d) { return boost::apply_visitor(ArityVisitor(), d); } +int arity(const DDim& d) { return d.size(); } -/// \cond HIDDEN - -struct DDimPrinter : boost::static_visitor { +struct DDimPrinter { std::ostream& os; explicit DDimPrinter(std::ostream& os_) : os(os_) {} - template - void operator()(const T& t) { + template + void operator()(const Dim& t) { os << t; } }; -/// \endcond - std::ostream& operator<<(std::ostream& os, const DDim& ddim) { - DDimPrinter printer(os); - boost::apply_visitor(printer, ddim); + ddim.apply_visitor(DDimPrinter(os)); return os; } -DDim::DDim(std::initializer_list init_list) { - *this = make_ddim(init_list); -} - DDim flatten_to_2d(const DDim& src, int num_col_dims) { - int rank = src.size(); - return make_ddim({product(slice_ddim(src, 0, num_col_dims)), - product(slice_ddim(src, num_col_dims, rank))}); + return DDim({product(slice_ddim(src, 0, num_col_dims)), + product(slice_ddim(src, num_col_dims, src.size()))}); } -DDim flatten_to_1d(const DDim& src) { return make_ddim({product(src)}); } +DDim flatten_to_1d(const DDim& src) { return DDim({product(src)}); } DDim stride(const DDim& ddim) { - std::vector strides(ddim.size()); + DDim strides; + strides.rank_ = ddim.size(); strides[ddim.size() - 1] = 1; for (int i = ddim.size() - 2; i >= 0; --i) { strides[i] = strides[i + 1] * ddim[i + 1]; } - return framework::make_ddim(strides); + return strides; } -DDim stride_numel(const framework::DDim& ddim) { - std::vector strides(ddim.size()); +DDim stride_numel(const DDim& ddim) { + DDim strides; + strides.rank_ = ddim.size(); strides[ddim.size() - 1] = ddim[ddim.size() - 1]; for (int i = ddim.size() - 2; i >= 0; --i) { strides[i] = strides[i + 1] * ddim[i]; } - return framework::make_ddim(strides); + return strides; } } // namespace framework diff --git a/paddle/fluid/framework/ddim.h b/paddle/fluid/framework/ddim.h index f05b5ee3faee856a41f1376e5952710b550e7c42..31a41dab2a1f1d6bad9fe697c5d367f32e219160 100644 --- a/paddle/fluid/framework/ddim.h +++ b/paddle/fluid/framework/ddim.h @@ -18,62 +18,145 @@ limitations under the License. */ #include #include #include "paddle/fluid/framework/dim.h" -#include "paddle/fluid/platform/enforce.h" -#include "paddle/fluid/platform/variant.h" namespace paddle { namespace framework { +#define PADDLE_VISIT_DDIM_BASE(rank, callback) \ + case (rank): { \ + constexpr auto kRank = (rank); \ + return (callback); \ + } + +#define PADDLE_VISIT_DDIM(rank, callback) \ + switch (rank) { \ + PADDLE_VISIT_DDIM_BASE(0, callback); \ + PADDLE_VISIT_DDIM_BASE(1, callback); \ + PADDLE_VISIT_DDIM_BASE(2, callback); \ + PADDLE_VISIT_DDIM_BASE(3, callback); \ + PADDLE_VISIT_DDIM_BASE(4, callback); \ + PADDLE_VISIT_DDIM_BASE(5, callback); \ + PADDLE_VISIT_DDIM_BASE(6, callback); \ + PADDLE_VISIT_DDIM_BASE(7, callback); \ + PADDLE_VISIT_DDIM_BASE(8, callback); \ + PADDLE_VISIT_DDIM_BASE(9, callback); \ + default: \ + PADDLE_THROW("Invalid rank %d", rank); \ + } + +template +inline void dynamic_dim_assign(const T1* in, T2* out, int n) { + PADDLE_VISIT_DDIM(n, (static_dim_assign(in, out))); +} + /** * \brief A dynamically sized dimension. * * The number of dimensions must be between [1, 9]. */ -struct DDim { - typedef boost::variant, Dim<1>, Dim<2>, Dim<3>, Dim<4>, Dim<5>, Dim<6>, - Dim<7>, Dim<8>, Dim<9>> - DDimVar; - DDimVar var; +class DDim { + public: + constexpr static int kMaxRank = 9; + + DDim() : rank_(1) { dim_[0] = 0; } - DDim() : var(Dim<1>()) {} + DDim(const DDim& ddim) : dim_() { CopyFrom(ddim); } + + DDim(const int* d, int n) : rank_(n) { + dynamic_dim_assign(d, dim_.GetMutable(), n); + } + + DDim(const int64_t* d, int n) : rank_(n) { + dynamic_dim_assign(d, dim_.GetMutable(), n); + } template - explicit DDim(const Dim& in) : var(in) {} + /*implicit*/ DDim(const Dim& in) : rank_(D) { // NOLINT + UnsafeCast() = in; + } + + /*implicit*/ DDim(std::initializer_list init_list) + : DDim(init_list.begin(), init_list.size()) {} - /*implicit*/ DDim(std::initializer_list init_list); + inline DDim& operator=(const DDim& ddim) { return CopyFrom(ddim); } template - DDim& operator=(const Dim& in) { - var = in; + inline DDim& operator=(const Dim& dim) { + rank_ = D; + UnsafeCast() = dim; return *this; } - int64_t& operator[](int idx); - int64_t operator[](int idx) const; + inline int64_t& operator[](int idx) { return dim_[idx]; } + + inline int64_t operator[](int idx) const { return dim_[idx]; } + + inline int64_t& at(int idx) { + PADDLE_ENFORCE(idx >= 0 && idx < rank_, "Invalid idx %d", idx); + return dim_[idx]; + } + + inline int64_t at(int idx) const { + PADDLE_ENFORCE(idx >= 0 && idx < rank_, "Invalid idx %d", idx); + return dim_[idx]; + } template - typename Visitor::result_type apply_visitor(Visitor& visitor) { - return var.apply_visitor(visitor); + typename std::result_of&)>::type apply_visitor( + Visitor&& visitor) { + PADDLE_VISIT_DDIM(rank_, visitor(UnsafeCast())); } template - typename Visitor::result_type apply_visitor(Visitor& visitor) const { - return var.apply_visitor(visitor); + typename std::result_of&)>::type apply_visitor( + Visitor&& visitor) const { + PADDLE_VISIT_DDIM(rank_, visitor(UnsafeCast())); } - DDimVar getVar() { return var; } + bool operator==(const DDim& d) const; + + bool operator!=(const DDim& d) const; + + DDim operator+(const DDim& d) const; - bool operator==(DDim d) const; + DDim operator*(const DDim& d) const; - bool operator!=(DDim d) const; + inline const int64_t* Get() const { return dim_.Get(); } - DDim operator+(DDim d) const; + inline int64_t* GetMutable() { return dim_.GetMutable(); } - DDim operator*(DDim d) const; + inline int size() const { return rank_; } + + private: + template + inline Dim& UnsafeCast() { + static_assert(D >= 0 && D <= kMaxRank, "Invalid rank"); + auto* p = static_cast(&dim_); + return *reinterpret_cast*>(p); + } + + template + inline const Dim& UnsafeCast() const { + static_assert(D >= 0 && D <= kMaxRank, "Invalid rank"); + auto* p = static_cast(&dim_); + return *reinterpret_cast*>(p); + } - int size() const; + inline DDim& CopyFrom(const DDim& ddim) { + PADDLE_VISIT_DDIM(ddim.rank_, (*this = ddim.UnsafeCast())); + } + + friend DDim stride(const DDim& ddim); + friend DDim stride_numel(const DDim& ddim); + + private: + Dim dim_; + int rank_; }; +#undef PADDLE_VISIT_DDIM_BASE +#undef PADDLE_VISIT_DDIM + /** * \brief Make a DDim from std::vector * @@ -92,7 +175,7 @@ DDim make_ddim(const std::vector& dims); DDim make_ddim(std::initializer_list dims); int64_t get(const DDim& dim, int idx); -void set(DDim& dim, int idx, int val); +void set(DDim& dim, int idx, int val); // NOLINT std::vector vectorize(const DDim& ddim); std::vector vectorize2int(const DDim& ddim); @@ -129,12 +212,3 @@ DDim stride(const DDim& ddim); DDim stride_numel(const DDim& ddim); } // namespace framework } // namespace paddle - -namespace boost { - -template -T get(const paddle::framework::DDim& in) { - return boost::get(in.var); -} - -} // namespace boost diff --git a/paddle/fluid/framework/details/CMakeLists.txt b/paddle/fluid/framework/details/CMakeLists.txt index e0a3ef5a9c6c53c42ebea1a41cac0d18a77781b2..2c1f3ae638cf95c3ab49219909fe3b1f22137099 100644 --- a/paddle/fluid/framework/details/CMakeLists.txt +++ b/paddle/fluid/framework/details/CMakeLists.txt @@ -1,50 +1,107 @@ cc_library(var_handle SRCS var_handle.cc DEPS place framework_proto node) cc_library(op_handle_base SRCS op_handle_base.cc DEPS var_handle device_context lod_tensor) +cc_library(op_graph_view SRCS op_graph_view.cc DEPS op_handle_base) cc_library(scale_loss_grad_op_handle SRCS scale_loss_grad_op_handle.cc DEPS op_handle_base scope lod_tensor ddim memory) cc_library(fetch_op_handle SRCS fetch_op_handle.cc DEPS op_handle_base scope lod_tensor ddim memory) cc_library(computation_op_handle SRCS computation_op_handle.cc DEPS framework_proto scope place operator op_registry) cc_library(rpc_op_handle SRCS rpc_op_handle.cc DEPS framework_proto scope place operator op_registry) +cc_library(fetch_barrier_op_handle SRCS fetch_barrier_op_handle.cc DEPS framework_proto scope place operator op_registry) cc_library(multi_devices_helper SRCS multi_devices_helper.cc DEPS graph graph_helper) cc_library(multi_devices_graph_print_pass SRCS multi_devices_graph_print_pass.cc DEPS multi_devices_helper) cc_library(multi_devices_graph_check_pass SRCS multi_devices_graph_check_pass.cc DEPS multi_devices_helper) +cc_library(alloc_continuous_space_for_grad_pass SRCS alloc_continuous_space_for_grad_pass.cc DEPS graph graph_helper) +cc_library(fuse_adam_op_pass SRCS fuse_adam_op_pass.cc fuse_optimizer_op_pass.cc DEPS graph graph_helper) +cc_library(fuse_sgd_op_pass SRCS fuse_sgd_op_pass.cc fuse_optimizer_op_pass.cc DEPS graph graph_helper) + cc_library(variable_visitor SRCS variable_visitor.cc DEPS lod_tensor selected_rows) +if(WITH_DISTRIBUTE) + if(NOT WITH_GRPC) + set(DISTRIBUTE_COMPILE_FLAGS "-Wno-non-virtual-dtor -Wno-error=non-virtual-dtor -Wno-error=delete-non-virtual-dtor") + set_source_files_properties(reduce_op_handle.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) + endif() +endif() + if(WITH_GPU) + set(dgc_deps "") + if(NOT WIN32) + set(dgc_deps dgc) + endif() nv_library(all_reduce_op_handle SRCS all_reduce_op_handle.cc DEPS op_handle_base scope lod_tensor ddim memory + dynload_cuda variable_visitor ${dgc_deps}) + nv_library(fused_all_reduce_op_handle SRCS fused_all_reduce_op_handle.cc DEPS op_handle_base scope lod_tensor ddim memory dynload_cuda variable_visitor) - nv_library(reduce_op_handle SRCS reduce_op_handle.cc DEPS op_handle_base variable_visitor scope ddim dynload_cuda) + if(WITH_DISTRIBUTE) + nv_library(reduce_op_handle SRCS reduce_op_handle.cc DEPS op_handle_base variable_visitor scope + ddim dynload_cuda selected_rows_functor sendrecvop_rpc) + else() + nv_library(reduce_op_handle SRCS reduce_op_handle.cc DEPS op_handle_base variable_visitor scope + ddim dynload_cuda selected_rows_functor) + endif() nv_library(broadcast_op_handle SRCS broadcast_op_handle.cc DEPS op_handle_base scope ddim memory variable_visitor dynload_cuda) + nv_library(fused_broadcast_op_handle SRCS fused_broadcast_op_handle.cc DEPS broadcast_op_handle) else() cc_library(all_reduce_op_handle SRCS all_reduce_op_handle.cc DEPS op_handle_base scope lod_tensor ddim memory variable_visitor) - cc_library(reduce_op_handle SRCS reduce_op_handle.cc DEPS op_handle_base variable_visitor scope ddim) + cc_library(fused_all_reduce_op_handle SRCS fused_all_reduce_op_handle.cc DEPS op_handle_base scope lod_tensor ddim memory + variable_visitor) + if(WITH_DISTRIBUTE) + cc_library(reduce_op_handle SRCS reduce_op_handle.cc DEPS op_handle_base variable_visitor scope + ddim selected_rows_functor sendrecvop_rpc) + else() + cc_library(reduce_op_handle SRCS reduce_op_handle.cc DEPS op_handle_base variable_visitor scope + ddim selected_rows_functor) + endif() cc_library(broadcast_op_handle SRCS broadcast_op_handle.cc DEPS op_handle_base scope ddim memory variable_visitor) + cc_library(fused_broadcast_op_handle SRCS fused_broadcast_op_handle.cc DEPS broadcast_op_handle) endif() -cc_library(data_balance_op_handle SRCS data_balance_op_handle.cc DEPS op_handle_base scope lod_tensor) cc_library(gather_op_handle SRCS gather_op_handle.cc DEPS op_handle_base scope ddim memory variable_visitor) -cc_library(fuse_vars_op_handle SRCS fuse_vars_op_handle.cc DEPS op_handle_base scope) if(WITH_GPU) - cc_library(reference_count_pass SRCS reference_count_pass.cc DEPS computation_op_handle scale_loss_grad_op_handle rpc_op_handle - all_reduce_op_handle reduce_op_handle broadcast_op_handle data_balance_op_handle graph graph_helper pass) +cc_library(memory_optimize_helper SRCS memory_optimize_helper.cc DEPS graph graph_helper gpu_info) +else() +cc_library(memory_optimize_helper SRCS memory_optimize_helper.cc DEPS graph graph_helper cpu_info) endif() +cc_library(memory_optimize_pass SRCS memory_optimize_pass.cc DEPS memory_optimize_helper pass) +cc_library(inplace_op_pass SRCS inplace_op_pass.cc DEPS memory_optimize_pass op_info) +cc_library(modify_op_lock_and_record_event_pass SRCS modify_op_lock_and_record_event_pass.cc DEPS computation_op_handle op_graph_view multi_devices_helper) +cc_library(reference_count_pass_helper SRCS reference_count_pass_helper.cc DEPS garbage_collector computation_op_handle) +cc_library(eager_deletion_op_handle SRCS eager_deletion_op_handle.cc DEPS lod_tensor selected_rows reference_count_pass_helper) +cc_library(while_op_eager_deletion_pass SRCS while_op_eager_deletion_pass.cc DEPS while_op_helper graph_helper pass computation_op_handle) +cc_library(eager_deletion_pass SRCS eager_deletion_pass.cc DEPS computation_op_handle eager_deletion_op_handle graph graph_helper pass while_op_eager_deletion_pass) +cc_library(reference_count_pass SRCS reference_count_pass.cc DEPS computation_op_handle graph graph_helper pass op_graph_view reference_count_pass_helper) + +cc_library(sequential_execution_pass SRCS sequential_execution_pass.cc DEPS graph graph_helper pass) +cc_library(all_reduce_deps_pass SRCS all_reduce_deps_pass.cc DEPS graph graph_helper pass) + cc_library(multi_devices_graph_pass SRCS multi_devices_graph_pass.cc DEPS multi_devices_helper computation_op_handle - scale_loss_grad_op_handle rpc_op_handle all_reduce_op_handle reduce_op_handle broadcast_op_handle data_balance_op_handle) + scale_loss_grad_op_handle rpc_op_handle fetch_barrier_op_handle all_reduce_op_handle reduce_op_handle broadcast_op_handle fused_broadcast_op_handle) -if(WITH_GPU) - cc_library(ssa_graph_executor SRCS ssa_graph_executor.cc DEPS graph framework_proto reference_count_pass) -else() - cc_library(ssa_graph_executor SRCS ssa_graph_executor.cc DEPS graph framework_proto) +cc_library(fuse_all_reduce_op_pass SRCS fuse_all_reduce_op_pass.cc DEPS graph graph_helper fused_all_reduce_op_handle) + +set(SSA_GRAPH_EXECUTOR_DEPS graph framework_proto sequential_execution_pass modify_op_lock_and_record_event_pass all_reduce_deps_pass reference_count_pass eager_deletion_pass memory_optimize_pass inplace_op_pass) +if (WITH_GPU) + list(APPEND SSA_GRAPH_EXECUTOR_DEPS reference_count_pass) endif() +cc_test(memory_optimize_helper_test SRCS memory_optimize_helper_test.cc memory_optimize_helper.cc DEPS framework_proto graph graph_helper op_registry) +cc_library(ssa_graph_executor SRCS ssa_graph_executor.cc DEPS ${SSA_GRAPH_EXECUTOR_DEPS}) cc_library(threaded_ssa_graph_executor SRCS threaded_ssa_graph_executor.cc DEPS fetch_op_handle ssa_graph_executor scope simple_threadpool device_context) +cc_library(parallel_ssa_graph_executor SRCS parallel_ssa_graph_executor.cc DEPS threaded_ssa_graph_executor) + +set(ASYNC_SSA_GRAPH_EXECUTOR_DEPS threaded_ssa_graph_executor) +if(WITH_DISTRIBUTE) + list(APPEND ASYNC_SSA_GRAPH_EXECUTOR_DEPS communicator) +endif() +cc_library(async_ssa_graph_executor SRCS async_ssa_graph_executor.cc DEPS ${ASYNC_SSA_GRAPH_EXECUTOR_DEPS}) + cc_test(broadcast_op_test SRCS broadcast_op_handle_test.cc DEPS var_handle op_handle_base scope ddim memory device_context broadcast_op_handle) cc_test(gather_op_test SRCS gather_op_handle_test.cc DEPS var_handle op_handle_base scope ddim memory @@ -54,8 +111,13 @@ cc_library(scope_buffered_ssa_graph_executor SRCS scope_buffered_ssa_graph_execu # device_context reduce_op_handle ) cc_library(fast_threaded_ssa_graph_executor SRCS fast_threaded_ssa_graph_executor.cc DEPS fetch_op_handle ssa_graph_executor scope simple_threadpool device_context) +cc_test(fused_broadcast_op_test SRCS fused_broadcast_op_handle_test.cc DEPS fused_broadcast_op_handle) cc_library(build_strategy SRCS build_strategy.cc DEPS graph_viz_pass multi_devices_graph_pass multi_devices_graph_print_pass multi_devices_graph_check_pass - fuse_elewise_add_act_pass) + fuse_elewise_add_act_pass multi_batch_merge_pass + fuse_relu_depthwise_conv_pass + memory_optimize_pass lock_free_optimize_pass + alloc_continuous_space_for_grad_pass fuse_all_reduce_op_pass + fuse_adam_op_pass fuse_sgd_op_pass) diff --git a/paddle/fluid/framework/details/all_reduce_deps_pass.cc b/paddle/fluid/framework/details/all_reduce_deps_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..c44793cd11d22b29b4b3422a047d81fe26624982 --- /dev/null +++ b/paddle/fluid/framework/details/all_reduce_deps_pass.cc @@ -0,0 +1,202 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include +#include +#include +#include +#include +#include + +#include "paddle/fluid/framework/details/all_reduce_op_handle.h" +#include "paddle/fluid/framework/details/container_cast.h" +#include "paddle/fluid/framework/details/multi_devices_helper.h" +#include "paddle/fluid/framework/details/op_graph_view.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_helper.h" +#include "paddle/fluid/framework/ir/pass.h" +#include "paddle/fluid/framework/op_proto_maker.h" + +namespace paddle { +namespace framework { +namespace details { + +class AllReduceDepsPass : public ir::Pass { + protected: + void ApplyImpl(ir::Graph* graph) const override { + std::vector all_reduce_op_handles = + GetSortedAllReduceOps(*graph); + + for (size_t i = 1; i < all_reduce_op_handles.size(); ++i) { + auto* dep_var = new DummyVarHandle(graph->CreateControlDepVar()); + graph->Get(kGraphDepVars).emplace(dep_var); + all_reduce_op_handles[i - 1]->AddOutput(dep_var); + all_reduce_op_handles[i]->AddInput(dep_var); + } + + if (VLOG_IS_ON(10)) { + DebugString(*graph, all_reduce_op_handles); + } + } + + std::vector GetSortedAllReduceOps( + const ir::Graph& graph) const { + std::vector all_reduce_op_handles; + std::unordered_map pending_ops; + std::unordered_set ready_ops; + std::unordered_set next_ready_ops; + + auto op_handles = ir::FilterByNodeWrapper(graph); + size_t num_of_ops = op_handles.size(); + for (OpHandleBase* op : op_handles) { + size_t not_ready_vars = op->NotReadyInputSize(); + if (not_ready_vars) { + pending_ops.insert({op, not_ready_vars}); + } else { + ready_ops.insert(op); + } + } + + GetSortedAllReduceOps(ready_ops, &all_reduce_op_handles); + + size_t has_run_ops = ready_ops.size(); + while (has_run_ops != num_of_ops) { + for (auto* op : ready_ops) { + for (auto& ready_var : op->Outputs()) { + for (auto* pend_op : ready_var->PendingOps()) { + auto& deps = --pending_ops[pend_op]; + if (deps == 0) { + next_ready_ops.insert(pend_op); + } + } + } + } + + PADDLE_ENFORCE_NE(next_ready_ops.size(), 0, "There maybe have a cycle."); + ready_ops.clear(); + std::swap(ready_ops, next_ready_ops); + GetSortedAllReduceOps(ready_ops, &all_reduce_op_handles); + has_run_ops += ready_ops.size(); + } + return all_reduce_op_handles; + } + + void GetSortedAllReduceOps( + const std::unordered_set& ready_ops, + std::vector* all_reduce_op_handles) const { + std::vector current_all_reduce_op_handles; + for (auto& op_handle : ready_ops) { + auto all_reduce_op_handle = dynamic_cast(op_handle); + if (all_reduce_op_handle) { + current_all_reduce_op_handles.emplace_back(all_reduce_op_handle); + } + } + + // NOTE(zcd): For distributed training, it is important to keep the order of + // allReduce on each node consistent. Otherwise, hang may occur. + // Sort the current_all_reduce_op_handles according to the name of input. + sort(current_all_reduce_op_handles.begin(), + current_all_reduce_op_handles.end(), + [](const AllReduceOpHandle* left, + const AllReduceOpHandle* right) -> bool { + auto left_in_vars = DynamicCast(left->Inputs()); + auto right_in_vars = DynamicCast(right->Inputs()); + PADDLE_ENFORCE_GT(left_in_vars.size(), 0); + PADDLE_ENFORCE_EQ(left_in_vars.size(), right_in_vars.size()); + return left_in_vars[0]->Name() > right_in_vars[0]->Name(); + }); + + all_reduce_op_handles->insert(all_reduce_op_handles->end(), + current_all_reduce_op_handles.begin(), + current_all_reduce_op_handles.end()); + } + + void DebugString( + const ir::Graph& graph, + const std::vector& all_reduce_op_handles) const { + // get vars order + std::map> vars = + GetSoredGradientsFromStaleProgram(graph); + std::stringstream out; + size_t grads_of_stale_program = 0; + out << "Get Order From kStaleProgramOpDescs: "; + for (auto& var : vars) { + out << "Order " << var.first << " ["; + for (auto& var_name : var.second) { + out << var_name << ", "; + ++grads_of_stale_program; + } + out << "], "; + } + VLOG(10) << out.str(); + + std::stringstream out2; + out2 << "Get Order From Topological order: "; + for (auto& op : all_reduce_op_handles) { + bool find_valid_input = false; + for (auto& in_var : op->Inputs()) { + if (dynamic_cast(in_var)) { + out2 << in_var->Name() << ", "; + find_valid_input = true; + break; + } + } + PADDLE_ENFORCE(find_valid_input, "Doesn't find valid input."); + } + VLOG(10) << out2.str(); + if (grads_of_stale_program != all_reduce_op_handles.size()) { + VLOG(10) + << "The gradients number of stale program and graph is not equal."; + } + } + + std::map> GetSoredGradientsFromStaleProgram( + const ir::Graph& graph) const { + std::map> vars; + auto ops = graph.Get>(kStaleProgramOpDescs); + int order = 0; + for (auto* op_desc : ops) { + try { + bool is_bk_op = + static_cast(boost::get(op_desc->GetAttr( + OpProtoAndCheckerMaker::OpRoleAttrName())) & + static_cast(OpRole::kBackward)); + if (!is_bk_op) continue; + + auto backward_vars = + boost::get>(op_desc->GetNullableAttr( + OpProtoAndCheckerMaker::OpRoleVarAttrName())); + if (backward_vars.empty()) continue; + + PADDLE_ENFORCE_EQ(backward_vars.size() % 2, 0); + for (size_t i = 1; i < backward_vars.size(); i += 2) { + vars[order].emplace_back(backward_vars[i]); + VLOG(1) << "get parameter and gradient: " << backward_vars[i - 1] + << ", " << backward_vars[i]; + } + order++; + } catch (boost::bad_get e) { + } + } + return vars; + } +}; +} // namespace details +} // namespace framework +} // namespace paddle + +REGISTER_PASS(all_reduce_deps_pass, + paddle::framework::details::AllReduceDepsPass) + .RequireGraphAttr(paddle::framework::details::kStaleProgramOpDescs); diff --git a/paddle/fluid/framework/details/all_reduce_op_handle.cc b/paddle/fluid/framework/details/all_reduce_op_handle.cc index 7c5f5bd80a937bf1a1c891155764833d7b21c5c2..61276efedeeca76a8818c15ddab73b3c53725c4b 100644 --- a/paddle/fluid/framework/details/all_reduce_op_handle.cc +++ b/paddle/fluid/framework/details/all_reduce_op_handle.cc @@ -11,32 +11,52 @@ // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. // See the License for the specific language governing permissions and // limitations under the License. -#include - #include "paddle/fluid/framework/details/all_reduce_op_handle.h" +#include #include "paddle/fluid/framework/details/container_cast.h" #include "paddle/fluid/framework/details/reduce_and_gather.h" #include "paddle/fluid/framework/details/variable_visitor.h" +#include "paddle/fluid/framework/operator.h" + +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) +#include "dgc/dgc.h" +#endif + +#include "paddle/fluid/platform/gpu_info.h" #include "paddle/fluid/platform/profiler.h" +// asynchronous nccl allreduce or synchronous issue: +// https://github.com/PaddlePaddle/Paddle/issues/15049 +DEFINE_bool( + sync_nccl_allreduce, true, + "If set true, will call `cudaStreamSynchronize(nccl_stream)`" + "after allreduce, this mode can get better performance in some scenarios."); + namespace paddle { namespace framework { namespace details { -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) AllReduceOpHandle::AllReduceOpHandle(ir::Node *node, const std::vector &local_scopes, const std::vector &places, - const platform::NCCLContextMap *ctxs) + const platform::NCCLContextMap *ctxs, + bool is_encoded, int nranks) : OpHandleBase(node), local_scopes_(local_scopes), places_(places), - nccl_ctxs_(ctxs) { + nccl_ctxs_(ctxs), + is_encoded_(is_encoded), + nranks_(nranks) { if (nccl_ctxs_) { for (auto &p : places_) { - this->dev_ctxes_[p] = nccl_ctxs_->DevCtx(p); + this->SetDeviceContext(p, nccl_ctxs_->DevCtx(p)); } } + // TODO(gongwb) :polish them! + if (is_encoded) { + VLOG(1) << "Use dgc allreduce mode"; + } } #else AllReduceOpHandle::AllReduceOpHandle(ir::Node *node, @@ -45,96 +65,292 @@ AllReduceOpHandle::AllReduceOpHandle(ir::Node *node, : OpHandleBase(node), local_scopes_(local_scopes), places_(places) {} #endif +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) +void AllReduceOpHandle::RunImplEncoded() { + platform::RecordEvent record_event(Name()); + + WaitInputVarGenerated(); + + auto in_var_handles = DynamicCast(this->Inputs()); + auto out_var_handles = DynamicCast(this->Outputs()); + PADDLE_ENFORCE_EQ( + in_var_handles.size(), places_.size(), + "The NoDummyInputSize should be equal to the number of places."); + PADDLE_ENFORCE_EQ( + in_var_handles.size(), out_var_handles.size(), + "The NoDummyInputSize and NoDummyOutputSize should be equal."); + + std::vector ins; + std::vector outs; + int k = -1; + for (size_t i = 0; i < local_scopes_.size(); ++i) { + auto &local_scope = + local_scopes_[i]->FindVar(kLocalExecScopeName)->Get(); + auto original_name = + paddle::framework::GradOriginalVarName(in_var_handles[i]->name()); + auto encode_var_name = original_name + g_dgc_encoded; + auto *in_var = local_scope->FindVar(encode_var_name); + PADDLE_ENFORCE_NOT_NULL(in_var, "%s should not be null", encode_var_name); + auto &in = in_var->Get(); + ins.emplace_back(&in); + + auto *out = local_scope->FindVar(out_var_handles[i]->name()) + ->GetMutable(); + outs.emplace_back(out); + + if (k < 0) { + k = GetKValue(in_var_handles[i]->name()); + } + } + + PADDLE_ENFORCE(platform::is_gpu_place(ins[0]->place())); + PADDLE_ENFORCE(platform::is_gpu_place(outs[0]->place())); + PADDLE_ENFORCE(nccl_ctxs_, "nccl_ctxs should not be nullptr."); + + int dtype = -1; + size_t in_numel = 0; + size_t out_numel = 0; + PADDLE_ENFORCE(nranks_ > 1); + std::vector> all_reduce_calls; + + for (size_t i = 0; i < local_scopes_.size(); ++i) { + auto &place = places_[i]; + auto &in = *ins[i]; + void *in_tensor_buf = const_cast(in.data()); + + auto &out = *outs[i]; + float *out_tensor_buf = out.data(); + + dtype = (dtype == -1) ? platform::ToNCCLDataType(in.type()) : dtype; + in_numel = (in_numel == 0) ? static_cast(in.numel()) : in_numel; + PADDLE_ENFORCE(in_numel % 2 == 0); + PADDLE_ENFORCE(in_numel / 2 == static_cast(k)); + out_numel = (out_numel == 0) ? static_cast(out.numel()) : out_numel; + + int dev_id = boost::get(place).device; + auto &nccl_ctx = nccl_ctxs_->at(dev_id); + auto stream = nccl_ctx.stream(); + auto comm = nccl_ctx.comm_; + + auto &allocator = + platform::DeviceTemporaryAllocator::Instance().Get(place, stream); + int encode_size = 2 * k * sizeof(int); + // dgc use ncclAllGather to get all the encoded data + // so the buffer need nranks. + int buf_size = nranks_ * encode_size; + auto tmp_ious_data = allocator.Allocate(buf_size); + void *gather_buff = reinterpret_cast(tmp_ious_data->ptr()); + + VLOG(10) << "in_numel:" << in_numel << ", out_numel:" << out_numel + << ", nranks:" << nranks_ << ", gather_buf size:" << buf_size + << ", k:" << k << ", place:" << place << ", dtype:" << dtype; + + all_reduce_calls.emplace_back([=] { + PADDLE_ENFORCE(paddle::communication::dgc::sparseAllGReduce( + in_tensor_buf, gather_buff, k, out_tensor_buf, out_numel, comm, + stream)); + }); + } + + this->RunAndRecordEvent([&] { + if (all_reduce_calls.size() == 1UL) { + // Do not use NCCLGroup when manage NCCL by per thread per device + all_reduce_calls[0](); + } else { + platform::NCCLGroupGuard guard; + for (auto &call : all_reduce_calls) { + call(); + } + } + }); + + if (FLAGS_sync_nccl_allreduce) { + for (auto &p : places_) { + int dev_id = boost::get(p).device; + auto &nccl_ctx = nccl_ctxs_->at(dev_id); + auto stream = nccl_ctx.stream(); + cudaError_t e_sync = cudaStreamSynchronize(stream); + if (e_sync != 0) { + LOG(FATAL) << "cudaStreamSynchronize " << cudaGetErrorString(e_sync); + } + + cudaError_t e_get = cudaGetLastError(); + if (e_get != 0) { + LOG(FATAL) << "cudaGetLastError " << cudaGetErrorString(e_get) + << " errno:" << e_get; + } + } + } +} + +int AllReduceOpHandle::GetKValue(const std::string &grad_name) { + auto original_name = paddle::framework::GradOriginalVarName(grad_name); + auto var_name = original_name + g_dgc_k; + PADDLE_ENFORCE(local_scopes_.size() > 0); + + auto *scope = local_scopes_[0]; + auto &local_scope = scope->FindVar(kLocalExecScopeName)->Get(); + auto var = local_scope->FindVar(var_name); + PADDLE_ENFORCE_NOT_NULL(var); + auto tensor = var->Get().data(); + return *tensor; +} +#endif + +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) +bool AllReduceOpHandle::IsEncoded() { + if (!is_encoded_) { + return false; + } + auto counter_name = g_dgc_counter_name; + auto step_name = g_dgc_rampup_begin_step; + PADDLE_ENFORCE(local_scopes_.size() > 0); + + auto *scope = local_scopes_[0]; + auto &local_scope = scope->FindVar(kLocalExecScopeName)->Get(); + auto count_var = local_scope->FindVar(counter_name); + auto step_var = local_scope->FindVar(step_name); + if (count_var == nullptr || step_var == nullptr) { + PADDLE_THROW("not find count_var:%s or step_var:%s", counter_name, + step_var); + } + + float count = *count_var->Get().data(); + float step = *step_var->Get().data(); + if (static_cast(count) < static_cast(step)) { + VLOG(10) << "in all_reduce currentstep:" << count + << " < rampup_begin_step:" << step + << " so not use sparse all reduce"; + return false; + } + + return true; +} +#else +bool AllReduceOpHandle::IsEncoded() { return false; } +#endif + void AllReduceOpHandle::RunImpl() { - platform::RecordEvent record_event(Name(), dev_ctxes_.begin()->second); - - if (NoDummyInputSize() == 1) { - return; // No need to all reduce when GPU count = 1; - } else { - // Wait input done - WaitInputVarGenerated(); - auto in_var_handles = DynamicCast(this->Inputs()); - auto out_var_handles = DynamicCast(this->Outputs()); - PADDLE_ENFORCE_EQ( - in_var_handles.size(), places_.size(), - "The NoDummyInputSize should be equal to the number of places."); - PADDLE_ENFORCE_EQ( - in_var_handles.size(), out_var_handles.size(), - "The NoDummyInputSize and NoDummyOutputSize should be equal."); - - std::vector lod_tensors; + if (!IsEncoded()) { + RunImplNormal(); + return; + } + +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + RunImplEncoded(); +#else + PADDLE_THROW("Not compiled with CUDA"); +#endif +} + +void AllReduceOpHandle::RunImplNormal() { + platform::RecordEvent record_event(Name()); + + WaitInputVarGenerated(); + + auto in_var_handles = DynamicCast(this->Inputs()); + auto out_var_handles = DynamicCast(this->Outputs()); + PADDLE_ENFORCE_EQ( + in_var_handles.size(), places_.size(), + "The NoDummyInputSize should be equal to the number of places."); + PADDLE_ENFORCE_EQ( + in_var_handles.size(), out_var_handles.size(), + "The NoDummyInputSize and NoDummyOutputSize should be equal."); + + std::vector lod_tensors; + for (size_t i = 0; i < local_scopes_.size(); ++i) { + auto *s = local_scopes_[i]; + auto &local_scope = *s->FindVar(kLocalExecScopeName)->Get(); + auto &lod_tensor = + local_scope.FindVar(in_var_handles[i]->name())->Get(); + lod_tensors.emplace_back(&lod_tensor); + VLOG(10) << "place:" << i << ", input_name:" << in_var_handles[i]->name() + << ", out_name:" << out_var_handles[i]->name(); + PADDLE_ENFORCE_EQ(in_var_handles[i]->name(), out_var_handles[i]->name(), + "The name of input and output should be equal."); + } + + if (platform::is_gpu_place(lod_tensors[0]->place())) { +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + PADDLE_ENFORCE(nccl_ctxs_, "nccl_ctxs should not be nullptr."); + int dtype = -1; + size_t numel = 0; + std::vector> all_reduce_calls; for (size_t i = 0; i < local_scopes_.size(); ++i) { - auto *s = local_scopes_[i]; - auto &local_scope = *s->FindVar(kLocalExecScopeName)->Get(); - auto &lod_tensor = - local_scope.FindVar(in_var_handles[i]->name_)->Get(); - lod_tensors.emplace_back(&lod_tensor); - PADDLE_ENFORCE_EQ(in_var_handles[i]->name_, out_var_handles[i]->name_, - "The name of input and output should be equal."); - } + auto &p = places_[i]; + auto &lod_tensor = *lod_tensors[i]; + void *buffer = const_cast(lod_tensor.data()); - if (platform::is_gpu_place(lod_tensors[0]->place())) { -#ifdef PADDLE_WITH_CUDA - PADDLE_ENFORCE(nccl_ctxs_, "nccl_ctxs should not be nullptr."); - int dtype = -1; - size_t numel = 0; - std::vector> all_reduce_calls; - for (size_t i = 0; i < local_scopes_.size(); ++i) { - auto &p = places_[i]; - auto &lod_tensor = *lod_tensors[i]; - void *buffer = const_cast(lod_tensor.data()); - - if (dtype == -1) { - dtype = platform::ToNCCLDataType(lod_tensor.type()); - } + if (dtype == -1) { + dtype = platform::ToNCCLDataType(lod_tensor.type()); + } - if (numel == 0) { - numel = static_cast(lod_tensor.numel()); + if (numel == 0) { + numel = static_cast(lod_tensor.numel()); + } + + int dev_id = boost::get(p).device; + auto &nccl_ctx = nccl_ctxs_->at(dev_id); + auto stream = nccl_ctx.stream(); + auto comm = nccl_ctx.comm_; + + VLOG(10) << "before all reduce buffer:" << buffer << ", numel:" << numel + << ", dev_id:" << dev_id << ", dtype:" << dtype + << ", place:" << p; + + all_reduce_calls.emplace_back([=] { + PADDLE_ENFORCE(platform::dynload::ncclAllReduce( + buffer, buffer, numel, static_cast(dtype), ncclSum, + comm, stream)); + }); + } + this->RunAndRecordEvent([&] { + if (all_reduce_calls.size() == 1UL) { + // Do not use NCCLGroup when manage NCCL by per thread per device + all_reduce_calls[0](); + } else { + platform::NCCLGroupGuard guard; + for (auto &call : all_reduce_calls) { + call(); } + } + }); + if (FLAGS_sync_nccl_allreduce) { + for (auto &p : places_) { int dev_id = boost::get(p).device; auto &nccl_ctx = nccl_ctxs_->at(dev_id); auto stream = nccl_ctx.stream(); - auto comm = nccl_ctx.comm_; - all_reduce_calls.emplace_back([=] { - PADDLE_ENFORCE(platform::dynload::ncclAllReduce( - buffer, buffer, numel, static_cast(dtype), - ncclSum, comm, stream)); - }); + cudaStreamSynchronize(stream); } - this->RunAndRecordEvent([&] { - platform::NCCLGroupGuard guard; - for (auto &call : all_reduce_calls) { - call(); - } - }); + } + #else - PADDLE_THROW("Not compiled with CUDA"); + PADDLE_THROW("Not compiled with CUDA"); #endif - } else { // Special handle CPU only Operator's gradient. Like CRF - auto &trg = *this->local_scopes_[0] - ->FindVar(kLocalExecScopeName) - ->Get() - ->FindVar(out_var_handles[0]->name_) - ->GetMutable(); - - // Reduce All Tensor to trg in CPU - ReduceLoDTensor func(lod_tensors, &trg); - VisitDataType(ToDataType(lod_tensors[0]->type()), func); - - for (size_t i = 1; i < local_scopes_.size(); ++i) { - auto &scope = - *local_scopes_[i]->FindVar(kLocalExecScopeName)->Get(); - auto &p = places_[i]; - auto *var = scope.FindVar(out_var_handles[i]->name_); - auto *dev_ctx = dev_ctxes_[p]; - - RunAndRecordEvent(p, [&trg, var, dev_ctx, p] { - auto &tensor_gpu = *var->GetMutable(); - auto &tensor_cpu = trg; - TensorCopy(tensor_cpu, p, *dev_ctx, &tensor_gpu); - }); - } + } else { // Special handle CPU only Operator's gradient. Like CRF + auto &trg = *this->local_scopes_[0] + ->FindVar(kLocalExecScopeName) + ->Get() + ->FindVar(out_var_handles[0]->name()) + ->GetMutable(); + + // Reduce All Tensor to trg in CPU + ReduceLoDTensor func(lod_tensors, &trg); + VisitDataType(lod_tensors[0]->type(), func); + + for (size_t i = 1; i < local_scopes_.size(); ++i) { + auto &scope = + *local_scopes_[i]->FindVar(kLocalExecScopeName)->Get(); + auto &p = places_[i]; + auto *var = scope.FindVar(out_var_handles[i]->name()); + auto *dev_ctx = dev_ctxes_.at(p); + + RunAndRecordEvent(p, [&trg, var, dev_ctx, p] { + auto &tensor_gpu = *var->GetMutable(); + auto &tensor_cpu = trg; + TensorCopy(tensor_cpu, p, *dev_ctx, &tensor_gpu); + }); } } } diff --git a/paddle/fluid/framework/details/all_reduce_op_handle.h b/paddle/fluid/framework/details/all_reduce_op_handle.h index f6ef3a1367b91b6abf8ce74a91f73056efd0f84e..ca75186f6ceed3e48fe9326e85738d91bde0ca70 100644 --- a/paddle/fluid/framework/details/all_reduce_op_handle.h +++ b/paddle/fluid/framework/details/all_reduce_op_handle.h @@ -20,7 +20,7 @@ #include "paddle/fluid/framework/details/op_handle_base.h" #include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/framework/scope.h" -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) #include "paddle/fluid/platform/nccl_helper.h" #endif @@ -28,11 +28,19 @@ namespace paddle { namespace framework { namespace details { +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) +constexpr char g_dgc_counter_name[] = "__g_dgc_counter__"; +constexpr char g_dgc_rampup_begin_step[] = "__g_rampup_begin_step__"; +constexpr char g_dgc_encoded[] = "__dgc_encoded__"; +constexpr char g_dgc_k[] = "__dgc_k__"; +#endif + struct AllReduceOpHandle : public OpHandleBase { -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) AllReduceOpHandle(ir::Node *node, const std::vector &local_scopes, const std::vector &places, - const platform::NCCLContextMap *ctxs); + const platform::NCCLContextMap *ctxs, + bool is_encoded = false, int nranks = -1); #else AllReduceOpHandle(ir::Node *node, const std::vector &local_scopes, const std::vector &places); @@ -49,9 +57,15 @@ struct AllReduceOpHandle : public OpHandleBase { private: std::vector local_scopes_; std::vector places_; -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + void RunImplEncoded(); const platform::NCCLContextMap *nccl_ctxs_; + bool is_encoded_{false}; + int nranks_{-1}; + int GetKValue(const std::string &grad_name); #endif + void RunImplNormal(); + bool IsEncoded(); }; } // namespace details diff --git a/paddle/fluid/framework/details/alloc_continuous_space_for_grad_pass.cc b/paddle/fluid/framework/details/alloc_continuous_space_for_grad_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..58ec427859e9f0ec4d29cc419f5bfe382e245852 --- /dev/null +++ b/paddle/fluid/framework/details/alloc_continuous_space_for_grad_pass.cc @@ -0,0 +1,411 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/details/alloc_continuous_space_for_grad_pass.h" +#include +#include +#include +#include +#include +#include "paddle/fluid/framework/details/build_strategy.h" +#include "paddle/fluid/framework/details/multi_devices_helper.h" +#include "paddle/fluid/framework/ir/graph_helper.h" +#include "paddle/fluid/framework/op_registry.h" + +DEFINE_uint64(fuse_parameter_memory_size, 0, // 0 KB + "fuse_parameter_memory_size is up limited memory size " + "of one group parameters' gradient which is the input " + "of communication calling(e.g NCCLAllReduce). " + "The default value is 0, it means that " + "not set group according to memory_size."); +DEFINE_int32( + fuse_parameter_groups_size, 3, + "fuse_parameter_groups_size is the size of one group parameters' gradient. " + "The default value is a experimental result. If the " + "fuse_parameter_groups_size is 1, it means that the groups size is " + "the number of parameters' gradient. If the fuse_parameter_groups_size is " + "-1, it means that there are only one group. The default value is 3, it is " + "an experimental value."); + +namespace paddle { +namespace framework { +namespace details { +// SetFuseParameterGroupsSize and SetFuseParameterMemorySize are used in unit +// test, because it is invalid that seting 'FLAGS_fuse_parameter_memory_size' +// and 'FLAGS_fuse_parameter_groups_size' in unit test. +void SetFuseParameterGroupsSize(int group_size) { + FLAGS_fuse_parameter_groups_size = group_size; +} + +int GetFuseParameterGroupsSize() { return FLAGS_fuse_parameter_groups_size; } + +void SetFuseParameterMemorySize(uint64_t memory_size) { + FLAGS_fuse_parameter_memory_size = memory_size; +} + +uint64_t GetFuseParameterMemorySize() { + return FLAGS_fuse_parameter_memory_size; +} + +static const char kUnKnow[] = "@UNKNOW@"; +static framework::proto::VarType::Type kDefaultDtype = + framework::proto::VarType::Type::VarType_Type_BOOL; + +void AllocContinuousSpaceForGradPass::ApplyImpl(ir::Graph *graph) const { + ir::Graph &result = *graph; + + auto &places = Get>(kPlaces); + auto &local_scopes = Get>(kLocalScopes); + + ResetAttribute(kParamsAndGrads, &result); + ResetAttribute(kGroupGradsAndParams, &result); + + // NOTE: The operator nodes should be in topology order. + std::vector topo_nodes = ir::TopologySortOperations(result); + auto ¶ms_grads = result.Get(kParamsAndGrads); + for (auto &node : topo_nodes) { + RecordParamsAndGrads(node, ¶ms_grads); + } + + if (params_grads.size() == 0) { + VLOG(10) << "Doesn't find gradients"; + return; + } + + std::unordered_map vars; + for (ir::Node *node : result.Nodes()) { + if (node->IsVar() && node->Var()) { + // Note: The graph may have the same name node. For example, parameter + // is the input of operator and it also is the output of optimizer; + vars.emplace(node->Var()->Name(), node); + } + } + + auto &group_grads_params = + result.Get(kGroupGradsAndParams); + + // Note: the order of params_grads may be changed by SetGroupGradsAndParams. + SetGroupGradsAndParams(vars, params_grads, &group_grads_params); + + params_grads.clear(); + for (auto &group_p_g : group_grads_params) { + params_grads.insert(params_grads.begin(), group_p_g.begin(), + group_p_g.end()); + } + for (auto &p_g : params_grads) { + std::swap(p_g.first, p_g.second); + } + + // Set Gradients as Persistable to prevent this var becoming reusable. + auto dtype = kDefaultDtype; + for (auto &p_g : params_grads) { + // Get gradient var + auto iter = vars.find(p_g.second); + PADDLE_ENFORCE(iter != vars.end(), "%s is not found.", p_g.second); + iter->second->Var()->SetPersistable(true); + + PADDLE_ENFORCE(IsSupportedVarType(iter->second->Var()->GetType())); + + // Get Dtype + auto ele_dtype = iter->second->Var()->GetDataType(); + if (dtype == kDefaultDtype) { + dtype = ele_dtype; + PADDLE_ENFORCE_NE(ele_dtype, kDefaultDtype, + "The data type should not be bool."); + } + PADDLE_ENFORCE_EQ(ele_dtype, dtype, + "The data type of input is not consistent."); + } + + // Create a FusedVarsSet to avoid duplicating names for fused_var in other + // pass. + if (!result.Has(kFusedVars)) { + result.Set(kFusedVars, new FusedVars); + } + // the kFusedGrads is used be fuse_optimizer_op_pass. + result.Set(kFusedGrads, new FusedGrads); + + // the fused_var_name should be unique, so it appends + // params_grads.begin()->second. + auto fused_var_name = std::string(kFusedVarNamePrefix) + "@GRAD@" + + params_grads.begin()->second; + result.Get(kFusedGrads) = fused_var_name; + auto &fused_var_set = result.Get(kFusedVars); + PADDLE_ENFORCE_EQ(fused_var_set.count(fused_var_name), 0, + "%s is duplicate in FusedVars.", fused_var_name); + fused_var_set.insert(fused_var_name); + + InitFusedVarsAndAllocSpaceForVars(places, local_scopes, vars, fused_var_name, + params_grads); +} + +template +void AllocContinuousSpaceForGradPass::ResetAttribute( + const std::string &attr_name, ir::Graph *graph) const { + if (graph->Has(attr_name)) { + VLOG(10) << attr_name << " is reset."; + graph->Erase(attr_name); + } + graph->Set(attr_name, new AttrType); +} + +void AllocContinuousSpaceForGradPass::SetGroupGradsAndParams( + const std::unordered_map &var_nodes, + const ParamsAndGrads ¶ms_grads, + GroupGradsAndParams *group_grads_params) const { + SetGroupAccordingToLayers(var_nodes, params_grads, group_grads_params); + SetGroupAccordingToMemorySize(var_nodes, group_grads_params); + SetGroupAccordingToGroupSize(var_nodes, group_grads_params); +} + +void AllocContinuousSpaceForGradPass::SetGroupAccordingToLayers( + const std::unordered_map &var_nodes, + const ParamsAndGrads ¶ms_grads, + GroupGradsAndParams *group_grads_params) const { + std::unordered_map> layer_params; + + for (size_t i = 0; i < params_grads.size(); ++i) { + auto pos = params_grads[i].first.find_first_of("."); + if (pos == std::string::npos) { + layer_params[std::string(kUnKnow)].emplace_back(i); + } else { + layer_params[params_grads[i].first.substr(0, pos)].emplace_back(i); + } + } + + group_grads_params->reserve(layer_params.size()); + for (size_t i = 0; i < params_grads.size(); ++i) { + auto pos = params_grads[i].first.find_first_of("."); + std::string key = kUnKnow; + if (pos != std::string::npos) { + key = params_grads[i].first.substr(0, pos); + } + auto iter = layer_params.find(key); + if (iter == layer_params.end()) continue; + + group_grads_params->emplace_back(); + auto &local_group_grads_params = group_grads_params->back(); + for (auto &idx : iter->second) { + local_group_grads_params.emplace_back( + std::make_pair(params_grads[idx].second, params_grads[idx].first)); + } + layer_params.erase(iter); + } + + VLOG(10) << "SetGroupAccordingToLayers: "; + for (size_t i = 0; i < group_grads_params->size(); ++i) { + VLOG(10) << "group " << i; + std::stringstream out; + for (auto &p_g : group_grads_params->at(i)) { + out << "(" << p_g.second << ", " << p_g.first << "), "; + } + VLOG(10) << out.str(); + } +} + +void AllocContinuousSpaceForGradPass::SetGroupAccordingToMemorySize( + const std::unordered_map &var_nodes, + GroupGradsAndParams *group_grads_params) const { + const uint64_t group_memory_size = GetFuseParameterMemorySize(); + if (group_memory_size == 0) { + return; + } + GroupGradsAndParams local_group_grads_params; + size_t j = 0; + while (j < group_grads_params->size()) { + local_group_grads_params.emplace_back(); + auto &group_p_g = local_group_grads_params.back(); + size_t local_group_memory_size = 0; + while (j < group_grads_params->size()) { + std::for_each( + group_grads_params->at(j).begin(), group_grads_params->at(j).end(), + [&local_group_memory_size, + &var_nodes](const std::pair &g_p) { + auto iter = var_nodes.find(g_p.second); + PADDLE_ENFORCE(iter != var_nodes.end(), "%s is not found.", + g_p.second); + auto shape = iter->second->Var()->GetShape(); + size_t size = + framework::SizeOfType(iter->second->Var()->GetDataType()); + std::for_each(shape.begin(), shape.end(), + [&size](const int64_t &n) { size *= n; }); + local_group_memory_size += size; + }); + group_p_g.insert(group_p_g.end(), group_grads_params->at(j).begin(), + group_grads_params->at(j).end()); + ++j; + if (local_group_memory_size >= group_memory_size) { + break; + } + } + } + + std::swap(*group_grads_params, local_group_grads_params); + + VLOG(10) << string::Sprintf("SetGroupAccordingToMemorySize(memory_size: %d):", + group_memory_size); + for (size_t i = 0; i < group_grads_params->size(); ++i) { + VLOG(10) << "group " << i; + std::stringstream out; + for (auto &g_p : group_grads_params->at(i)) { + auto iter = var_nodes.find(g_p.second); + PADDLE_ENFORCE(iter != var_nodes.end(), "%s is not found.", g_p.second); + auto shape = iter->second->Var()->GetShape(); + size_t size = framework::SizeOfType(iter->second->Var()->GetDataType()); + std::for_each(shape.begin(), shape.end(), + [&size](const int64_t &n) { size *= n; }); + out << string::Sprintf("(%s(%d), %s)", g_p.second, size, g_p.first); + } + VLOG(10) << out.str(); + } +} + +void AllocContinuousSpaceForGradPass::SetGroupAccordingToGroupSize( + const std::unordered_map &var_nodes, + GroupGradsAndParams *group_grads_params) const { + if (GetFuseParameterGroupsSize() == 1) { + return; + } + const int group_size = GetFuseParameterGroupsSize() == -1 + ? static_cast(group_grads_params->size()) + : GetFuseParameterGroupsSize(); + PADDLE_ENFORCE_GT(group_size, 1); + size_t groups = (group_grads_params->size() + group_size - 1) / group_size; + GroupGradsAndParams local_group_grads_params; + local_group_grads_params.reserve(groups); + + size_t j = 0; + for (size_t i = 0; i < groups; ++i) { + local_group_grads_params.emplace_back(); + auto &group_p_g = local_group_grads_params.back(); + group_p_g.reserve(group_size); + while (j < group_grads_params->size()) { + group_p_g.insert(group_p_g.end(), group_grads_params->at(j).begin(), + group_grads_params->at(j).end()); + ++j; + if (j % group_size == 0) break; + } + } + std::swap(*group_grads_params, local_group_grads_params); + + VLOG(10) << string::Sprintf("SetGroupAccordingToGroupSize(group_size: %d):", + group_size); + for (size_t i = 0; i < group_grads_params->size(); ++i) { + VLOG(10) << "group " << i; + std::stringstream out; + for (auto &p_g : group_grads_params->at(i)) { + out << "(" << p_g.second << ", " << p_g.first << "), "; + } + VLOG(10) << out.str(); + } +} + +bool AllocContinuousSpaceForGradPass::IsSupportedVarType( + const proto::VarType::Type &type) const { + // Current only support LOD_TENSOR. + return type == proto::VarType::LOD_TENSOR; +} + +void AllocContinuousSpaceForGradPass::RecordParamsAndGrads( + ir::Node *node, ParamsAndGrads *params_grads) const { + try { + bool is_bk_op = + static_cast(boost::get(node->Op()->GetAttr( + OpProtoAndCheckerMaker::OpRoleAttrName())) & + static_cast(OpRole::kBackward)); + if (!is_bk_op) return; + + // Currently, we assume that once gradient is generated, it can be + // broadcast, and each gradient is only broadcast once. + auto backward_vars = + boost::get>(node->Op()->GetNullableAttr( + OpProtoAndCheckerMaker::OpRoleVarAttrName())); + PADDLE_ENFORCE_EQ(backward_vars.size() % 2, static_cast(0)); + + for (size_t i = 0; i < backward_vars.size(); i += 2) { + VLOG(10) << "Trainable parameter: " << backward_vars[i] + << ", gradient: " << backward_vars[i + 1]; + + params_grads->emplace_back(std::make_pair(backward_vars[i] /*param*/, + backward_vars[i + 1] /*grad*/)); + } + } catch (boost::bad_get e) { + } +} + +void AllocContinuousSpaceForGradPass::InitFusedVarsAndAllocSpaceForVars( + const std::vector &places, + const std::vector &local_scopes, + const std::unordered_map &vars, + const std::string &fused_var_name, + const ParamsAndGrads ¶ms_grads) const { + // Init Gradients and FusedVars + VLOG(10) << "Init FusedVars and Gradients."; + for (auto it = local_scopes.rbegin(); it != local_scopes.rend(); ++it) { + auto &scope = *it; + + PADDLE_ENFORCE(scope->FindVar(fused_var_name) == nullptr, + "%s has existed in scope.", fused_var_name); + scope->Var(fused_var_name)->GetMutable(); + + for (auto &p_g : params_grads) { + auto iter = vars.find(p_g.second); + PADDLE_ENFORCE(iter != vars.end()); + PADDLE_ENFORCE_NOT_NULL(iter->second->Var()); + PADDLE_ENFORCE_EQ(iter->second->Var()->GetType(), + proto::VarType::LOD_TENSOR); + scope->Var(p_g.second)->GetMutable(); + } + } + + // Alloc continuous space for vars. + std::vector grads_name; + std::vector params_name; + grads_name.reserve(params_grads.size()); + params_name.reserve(params_grads.size()); + for (auto &p_g : params_grads) { + params_name.emplace_back(p_g.first); + grads_name.emplace_back(p_g.second); + } + framework::ProgramDesc program_desc; + AppendAllocSpaceForVarsOp(params_name, grads_name, fused_var_name, + program_desc.MutableBlock(0)); + + for (size_t i = 0; i < local_scopes.size(); ++i) { + for (auto &op_desc : program_desc.Block(0).AllOps()) { + auto op = OpRegistry::CreateOp(*op_desc); + op->Run(*local_scopes[i], places[i]); + } + } +} + +void AllocContinuousSpaceForGradPass::AppendAllocSpaceForVarsOp( + const std::vector ¶ms_name, + const std::vector &grads_name, + const std::string &fused_var_name, BlockDesc *global_block) const { + auto op_desc = global_block->AppendOp(); + op_desc->SetType("alloc_continuous_space"); + op_desc->SetInput("Input", params_name); + op_desc->SetOutput("Output", grads_name); + op_desc->SetOutput("FusedOutput", {fused_var_name}); +} + +} // namespace details +} // namespace framework +} // namespace paddle + +REGISTER_PASS(alloc_continuous_space_for_grad_pass, + paddle::framework::details::AllocContinuousSpaceForGradPass) + .RequirePassAttr(paddle::framework::details::kPlaces) + .RequirePassAttr(paddle::framework::details::kLocalScopes); diff --git a/paddle/fluid/framework/details/alloc_continuous_space_for_grad_pass.h b/paddle/fluid/framework/details/alloc_continuous_space_for_grad_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..e6d56f17cc4ef7e07500aae8067211a7b9ac04b0 --- /dev/null +++ b/paddle/fluid/framework/details/alloc_continuous_space_for_grad_pass.h @@ -0,0 +1,79 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. +#pragma once +#include +#include +#include +#include +#include +#include "paddle/fluid/framework/details/build_strategy.h" +#include "paddle/fluid/framework/details/multi_devices_helper.h" +#include "paddle/fluid/framework/ir/graph_helper.h" + +namespace paddle { +namespace framework { +namespace details { + +void SetFuseParameterGroupsSize(int group_size); +int GetFuseParameterGroupsSize(); + +void SetFuseParameterMemorySize(uint64_t memory_size); +uint64_t GetFuseParameterMemorySize(); + +class AllocContinuousSpaceForGradPass : public ir::Pass { + protected: + void ApplyImpl(ir::Graph *graph) const override; + + template + void ResetAttribute(const std::string &attr_name, ir::Graph *graph) const; + + void SetGroupGradsAndParams( + const std::unordered_map &var_nodes, + const ParamsAndGrads ¶ms_grads, + GroupGradsAndParams *group_grads_params) const; + + void SetGroupAccordingToLayers( + const std::unordered_map &var_nodes, + const ParamsAndGrads ¶ms_grads, + GroupGradsAndParams *group_grads_params) const; + + void SetGroupAccordingToMemorySize( + const std::unordered_map &var_nodes, + GroupGradsAndParams *group_grads_params) const; + + void SetGroupAccordingToGroupSize( + const std::unordered_map &var_nodes, + GroupGradsAndParams *group_grads_params) const; + + private: + bool IsSupportedVarType(const proto::VarType::Type &type) const; + + void RecordParamsAndGrads(ir::Node *node, ParamsAndGrads *params_grads) const; + + void InitFusedVarsAndAllocSpaceForVars( + const std::vector &places, + const std::vector &local_scopes, + const std::unordered_map &vars, + const std::string &fused_var_name, + const ParamsAndGrads ¶ms_grads) const; + + void AppendAllocSpaceForVarsOp(const std::vector ¶ms_name, + const std::vector &grads_name, + const std::string &fused_var_name, + BlockDesc *global_block) const; +}; + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/async_ssa_graph_executor.cc b/paddle/fluid/framework/details/async_ssa_graph_executor.cc new file mode 100644 index 0000000000000000000000000000000000000000..e9aad5d264d1745662848d1ba313b573d0974cb7 --- /dev/null +++ b/paddle/fluid/framework/details/async_ssa_graph_executor.cc @@ -0,0 +1,203 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/details/async_ssa_graph_executor.h" + +#include "paddle/fluid/framework/variable_helper.h" + +#ifdef PADDLE_WITH_DISTRIBUTE +#include "paddle/fluid/operators/distributed/communicator.h" +#endif + +namespace paddle { +namespace framework { +namespace details { + +inline void NewTempScopeAndInitVars(const std::vector &var_infos, + Scope *scope) { + VLOG(3) << "NewTempScopeAndInitVars"; + Scope &local_scope = scope->NewScope(); + *scope->Var(details::kLocalExecScopeName)->GetMutable() = + &local_scope; + + for (auto &info : var_infos) { + if (scope->FindVar(info.name_) != nullptr) { + continue; + } + + if (info.persistable_) { // Persistable + InitializeVariable(scope->Var(info.name_), info.type_); + } else { + InitializeVariable(local_scope.Var(info.name_), info.type_); + } + } +} + +// get RpcContext and remote send and recv op +void ProcessGraph(std::vector graphs, Scope *scope) { +#ifdef PADDLE_WITH_DISTRIBUTE + using RpcCtxMap = operators::distributed::RpcCtxMap; + VLOG(3) << "ProcessGraph"; + RpcCtxMap send_varname_to_ctx; + RpcCtxMap recv_varname_to_ctx; + for (auto i = 0; i < graphs.size(); ++i) { + std::vector nodes_to_delete; + for (auto &node : graphs[i]->Nodes()) { + VLOG(3) << "node name " << node->Name(); + if (node && node->IsOp()) { + if (node->Name() == "send") { + auto send_var_name = node->Op()->Input("X")[0]; + auto send_varnames = boost::get>( + node->Op()->GetNullableAttr("send_varnames")); + auto epmap = boost::get>( + node->Op()->GetNullableAttr("epmap")); + auto height_section = boost::get>( + node->Op()->GetNullableAttr("sections")); + send_varname_to_ctx[send_var_name] = + operators::distributed::RpcContext(send_var_name, send_varnames, + epmap, height_section); + VLOG(3) << "find and init an send op: " + << send_varname_to_ctx[send_var_name]; + } else if (node->Name() == "recv") { + auto recv_var_name = node->Op()->Output("Out")[0]; + auto recv_varnames = boost::get>( + node->Op()->GetNullableAttr("recv_varnames")); + auto epmap = boost::get>( + node->Op()->GetNullableAttr("epmap")); + recv_varname_to_ctx[recv_var_name] = + operators::distributed::RpcContext(recv_var_name, recv_varnames, + epmap, {}); + nodes_to_delete.push_back(node); + VLOG(3) << "find and remove an recv op: " + << recv_varname_to_ctx[recv_var_name]; + } + } + } + } + // init communicator here + if (send_varname_to_ctx.size() > 0) { + VLOG(3) << "this is distribute mode, will use communicator"; + operators::distributed::Communicator::Init(send_varname_to_ctx, + recv_varname_to_ctx, scope); + operators::distributed::Communicator::GetInstance()->Start(); + } +#endif +} + +AsyncSSAGraphExecutor::AsyncSSAGraphExecutor( + const ExecutionStrategy &strategy, const std::vector &local_scopes, + const std::vector &places, std::vector graphs) + : strategy_(std::move(strategy)), + local_scopes_(std::move(local_scopes)), + pool_(places.size() >= 2 ? new ::ThreadPool(places.size()) : nullptr), + places_(std::move(places)), + graphs_(std::move(graphs)) { + VLOG(3) << "build AsyncSSAGraphExecutor"; + PADDLE_ENFORCE_EQ(places_.size(), local_scopes_.size()); + + // set the correct size of thread pool to each device. + strategy_.num_threads_ = strategy_.num_threads_ < places_.size() + ? 1UL + : strategy_.num_threads_ / places_.size(); + VLOG(1) << "set num_threads: " << strategy_.num_threads_ + << " to run the operators of the graph on each device."; + for (size_t i = 0; i < places.size(); ++i) { + executors_.emplace_back(new details::ThreadedSSAGraphExecutor( + strategy_, {local_scopes_[i]}, {places_[i]}, graphs_[i])); + } + + for (auto &node : graphs_[0]->Nodes()) { + if (node->IsVar() && !node->IsCtrlVar() && node->Var()) { + var_infos_.emplace_back(); + var_infos_.back().name_ = node->Var()->Name(); + var_infos_.back().type_ = node->Var()->GetType(); + var_infos_.back().persistable_ = node->Var()->Persistable(); + } + } + for (auto *scope : local_scopes_) { + NewTempScopeAndInitVars(var_infos_, scope); + } + ProcessGraph(graphs_, local_scopes_[0]); +} + +void AsyncSSAGraphExecutor::StartOffPythonTrainLoop() { + VLOG(3) << "StartOffPythonTrainLoop size = " << places_.size(); + for (size_t i = 1; i < places_.size(); ++i) { + auto call = [this, i]() -> void { + VLOG(3) << "start off python thread " << i; + try { + while (true) { + executors_[i]->Run({}); + } + } catch (...) { + exception_holder_.Catch(std::current_exception()); + VLOG(3) << "get exception type = " << exception_holder_.Type(); + } + VLOG(3) << "thread " << i << " exited!"; + }; + run_futures_.emplace_back(pool_->enqueue(std::move(call))); + } +} + +void AsyncSSAGraphExecutor::HandleException() { + if (exception_holder_.IsCaught()) { + for (auto &f : run_futures_) { + VLOG(3) << "wait future"; + f.wait(); + } + VLOG(3) << "caught exception " << exception_holder_.Type() + << ", rethrow it"; + run_futures_.clear(); + exception_holder_.ReThrow(); + } +} + +FeedFetchList AsyncSSAGraphExecutor::Run( + const std::vector &fetch_tensors) { + // init once + if (run_futures_.size() == 0 && places_.size() > 1) { + exception_holder_.Clear(); + StartOffPythonTrainLoop(); + } + + if (places_.size() == 1) { + exception_holder_.Clear(); + } else { + HandleException(); + } + + FeedFetchList fetch_data; + fetch_data.reserve(fetch_tensors.size()); + + try { + fetch_data = executors_[0]->Run(fetch_tensors); + } catch (...) { + exception_holder_.Catch(std::current_exception()); + } + + HandleException(); + + FeedFetchList ret; + for (size_t fetch_idx = 0; fetch_idx < fetch_tensors.size(); ++fetch_idx) { + std::vector lodtensor_ptrs; + lodtensor_ptrs.push_back(&fetch_data.at(fetch_idx)); + ret.emplace_back(); + ret.back().MergeLoDTensor(lodtensor_ptrs, platform::CPUPlace()); + } + return ret; +} + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/async_ssa_graph_executor.h b/paddle/fluid/framework/details/async_ssa_graph_executor.h new file mode 100644 index 0000000000000000000000000000000000000000..6aaf8f9a165f2eae3a64874e60084e4d9bdbc182 --- /dev/null +++ b/paddle/fluid/framework/details/async_ssa_graph_executor.h @@ -0,0 +1,65 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include + +#include "ThreadPool.h" +#include "paddle/fluid/framework/details/threaded_ssa_graph_executor.h" + +namespace paddle { +namespace framework { +namespace details { + +struct VarInfo { + std::string name_; + proto::VarType::Type type_; + bool persistable_; +}; + +class AsyncSSAGraphExecutor : public SSAGraphExecutor { + public: + AsyncSSAGraphExecutor(const ExecutionStrategy &strategy, + const std::vector &local_scopes, + const std::vector &places, + std::vector graphs); + ~AsyncSSAGraphExecutor() final = default; + const ir::Graph &Graph() const override { return *graphs_[0]; } + + FeedFetchList Run(const std::vector &fetch_tensors) override; + + private: + void StartOffPythonTrainLoop(); + void HandleException(); + + private: + ExecutionStrategy strategy_; + std::vector local_scopes_; + std::unique_ptr<::ThreadPool> pool_{nullptr}; + std::vector places_; + std::vector graphs_; + + std::vector> executors_; + ExceptionHolder exception_holder_; + std::vector> run_futures_; + std::vector var_infos_; +}; + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/broadcast_op_handle.cc b/paddle/fluid/framework/details/broadcast_op_handle.cc index 4fdab5cd94358d08eac7f8b041bf16d09042f0bd..752c932a215bad53f47f19f143a8008b66617a51 100644 --- a/paddle/fluid/framework/details/broadcast_op_handle.cc +++ b/paddle/fluid/framework/details/broadcast_op_handle.cc @@ -22,25 +22,22 @@ namespace framework { namespace details { void BroadcastOpHandle::RunImpl() { - platform::RecordEvent record_event(Name(), dev_ctxes_.begin()->second); + platform::RecordEvent record_event(Name()); if (places_.size() == 1) return; // The input and output may have dummy vars. - VarHandle *in_var_handle; - { - auto in_var_handles = DynamicCast(inputs_); - PADDLE_ENFORCE_EQ(in_var_handles.size(), 1, - "The number of input should be one."); - in_var_handle = in_var_handles[0]; - } - + auto in_var_handles = DynamicCast(inputs_); auto out_var_handles = DynamicCast(outputs_); + PADDLE_ENFORCE_EQ(in_var_handles.size(), 1UL, + "The number of input should be one."); PADDLE_ENFORCE_EQ( out_var_handles.size(), places_.size(), "The number of output should equal to the number of places."); + VarHandle *in_var_handle = in_var_handles[0]; + WaitInputVarGenerated(); std::vector var_scopes; @@ -48,21 +45,32 @@ void BroadcastOpHandle::RunImpl() { var_scopes.emplace_back(s->FindVar(kLocalExecScopeName)->Get()); } + BroadcastOneVar(*in_var_handle, out_var_handles, var_scopes); +} + +void BroadcastOpHandle::BroadcastOneVar( + const VarHandle &in_var_handle, + const std::vector &out_var_handles, + const std::vector &var_scopes) { auto *in_var = - var_scopes.at(in_var_handle->scope_idx_)->FindVar(in_var_handle->name_); + var_scopes.at(in_var_handle.scope_idx())->FindVar(in_var_handle.name()); PADDLE_ENFORCE_NOT_NULL(in_var); Tensor &in_tensor = VariableVisitor::GetMutableTensor(in_var); + if (UNLIKELY(!in_tensor.IsInitialized())) { + VLOG(3) << "in var " << in_var_handle.name() << "not inited, return!"; + return; + } - InitOutputValue(*in_var_handle, out_var_handles); + InitOutputValue(in_var_handle, out_var_handles); if (platform::is_cpu_place(in_tensor.place())) { for (auto *out_var_handle : out_var_handles) { - if (out_var_handle->IsTheSameVar(*in_var_handle)) { + if (out_var_handle->IsTheSameVar(in_var_handle)) { continue; } - auto &out_p = out_var_handle->place_; - auto *out_var = var_scopes.at(out_var_handle->scope_idx_) - ->FindVar(out_var_handle->name_); + auto &out_p = out_var_handle->place(); + auto *out_var = var_scopes.at(out_var_handle->scope_idx()) + ->FindVar(out_var_handle->name()); RunAndRecordEvent(out_p, [in_tensor, out_var] { paddle::framework::TensorCopy( @@ -71,7 +79,7 @@ void BroadcastOpHandle::RunImpl() { }); } } else { -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) VarHandle *out_handle = nullptr; int root_id = boost::get(in_tensor.place()).device; std::vector> broadcast_calls; @@ -80,11 +88,11 @@ void BroadcastOpHandle::RunImpl() { size_t numel = static_cast(in_tensor.numel()); for (auto out_var_handle : out_var_handles) { - Variable *out_var = var_scopes.at(out_var_handle->scope_idx_) - ->FindVar(out_var_handle->name_); + Variable *out_var = var_scopes.at(out_var_handle->scope_idx()) + ->FindVar(out_var_handle->name()); int dst_id = - boost::get(out_var_handle->place_).device; + boost::get(out_var_handle->place()).device; auto &nccl_ctx = nccl_ctxs_->at(dst_id); @@ -95,7 +103,7 @@ void BroadcastOpHandle::RunImpl() { } else { send_recv_buffer = VariableVisitor::GetMutableTensor(out_var) .Resize(in_tensor.dims()) - .mutable_data(out_var_handle->place_); + .mutable_data(out_var_handle->place()); } broadcast_calls.emplace_back( @@ -114,12 +122,12 @@ void BroadcastOpHandle::RunImpl() { } } - if (!out_handle->IsTheSameVar(*in_var_handle)) { - auto out_var = var_scopes.at(in_var_handle->scope_idx_) - ->FindVar(out_var_handles[0]->name_); + if (!out_handle->IsTheSameVar(in_var_handle)) { + auto out_var = var_scopes.at(in_var_handle.scope_idx()) + ->FindVar(out_var_handles[0]->name()); paddle::framework::TensorCopy( - in_tensor, in_var_handle->place_, - *(dev_ctxes_.at(in_var_handle->place_)), + in_tensor, in_var_handle.place(), + *(dev_ctxes_.at(in_var_handle.place())), &VariableVisitor::GetMutableTensor(out_var)); } }); @@ -137,7 +145,7 @@ void BroadcastOpHandle::InitOutputValue( var_scopes.emplace_back(s->FindVar(kLocalExecScopeName)->Get()); } auto *in_var = - var_scopes.at(in_var_handle.scope_idx_)->FindVar(in_var_handle.name_); + var_scopes.at(in_var_handle.scope_idx())->FindVar(in_var_handle.name()); Tensor &in_tensor = VariableVisitor::GetMutableTensor(in_var); @@ -147,9 +155,9 @@ void BroadcastOpHandle::InitOutputValue( if (out_var_handle->IsTheSameVar(in_var_handle)) { continue; } - auto t_out_p = out_var_handle->place_; - auto *out_var = var_scopes.at(out_var_handle->scope_idx_) - ->FindVar(out_var_handle->name_); + auto t_out_p = out_var_handle->place(); + auto *out_var = var_scopes.at(out_var_handle->scope_idx()) + ->FindVar(out_var_handle->name()); PADDLE_ENFORCE_NOT_NULL(out_var); if (is_gpu_place(in_tensor.place())) { PADDLE_ENFORCE(platform::is_gpu_place(t_out_p), diff --git a/paddle/fluid/framework/details/broadcast_op_handle.h b/paddle/fluid/framework/details/broadcast_op_handle.h index fe4e733e43417977df324fde808f52b228a27d19..0b4d33513506d41a63db8316abaa5cd0458ff352 100644 --- a/paddle/fluid/framework/details/broadcast_op_handle.h +++ b/paddle/fluid/framework/details/broadcast_op_handle.h @@ -24,7 +24,7 @@ #include "paddle/fluid/framework/selected_rows.h" #include "paddle/fluid/platform/device_context.h" -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) #include "paddle/fluid/platform/nccl_helper.h" #endif @@ -34,7 +34,7 @@ namespace details { struct BroadcastOpHandle : public OpHandleBase { public: -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) BroadcastOpHandle(ir::Node *node, const std::vector &local_scopes, const std::vector &places, const platform::NCCLContextMap *nccl_ctxs) @@ -44,7 +44,8 @@ struct BroadcastOpHandle : public OpHandleBase { nccl_ctxs_(nccl_ctxs) { if (nccl_ctxs_) { for (auto &p_ctx : nccl_ctxs_->contexts_) { - dev_ctxes_[platform::CUDAPlace(p_ctx.first)] = p_ctx.second.ctx_.get(); + this->SetDeviceContext(platform::CUDAPlace(p_ctx.first), + p_ctx.second.ctx_.get()); } } } @@ -56,15 +57,18 @@ struct BroadcastOpHandle : public OpHandleBase { std::string Name() const override; - bool IsMultiDeviceTransfer() override { return false; }; + bool IsMultiDeviceTransfer() override { return true; }; protected: void RunImpl() override; - private: + void BroadcastOneVar(const VarHandle &in_var_handle, + const std::vector &out_var_handles, + const std::vector &var_scopes); + std::vector local_scopes_; std::vector places_; -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) const platform::NCCLContextMap *nccl_ctxs_; #endif diff --git a/paddle/fluid/framework/details/broadcast_op_handle_test.cc b/paddle/fluid/framework/details/broadcast_op_handle_test.cc index ab7412a19fbd13fa39dbae9af528d158cc9ddbd0..650de5a48de6b1fdab120cdeda563a169fd1a1c1 100644 --- a/paddle/fluid/framework/details/broadcast_op_handle_test.cc +++ b/paddle/fluid/framework/details/broadcast_op_handle_test.cc @@ -12,232 +12,12 @@ // See the License for the specific language governing permissions and // limitations under the License. -#include "paddle/fluid/framework/details/broadcast_op_handle.h" -#include "gtest/gtest.h" - -#include "paddle/fluid/platform/device_context.h" +#include "paddle/fluid/framework/details/broadcast_op_handle_test.h" namespace paddle { namespace framework { namespace details { -namespace f = paddle::framework; -namespace p = paddle::platform; - -// test data amount -const f::DDim kDims = {20, 20}; - -struct TestBroadcastOpHandle { - std::vector> ctxs_; - std::vector local_scopes_; - std::vector param_scopes_; - Scope g_scope_; - std::unique_ptr op_handle_; - std::vector> vars_; - std::vector gpu_list_; - bool use_gpu_; -#ifdef PADDLE_WITH_CUDA - std::unique_ptr nccl_ctxs_; -#endif - - void WaitAll() { - for (size_t j = 0; j < ctxs_.size(); ++j) { - ctxs_[j]->Wait(); - } -#ifdef PADDLE_WITH_CUDA - if (nccl_ctxs_) { - nccl_ctxs_->WaitAll(); - } -#endif - } - - void InitCtxOnGpu(bool use_gpu) { - use_gpu_ = use_gpu; - if (use_gpu_) { -#ifdef PADDLE_WITH_CUDA - int count = p::GetCUDADeviceCount(); - if (count <= 1) { - LOG(WARNING) << "Cannot test multi-gpu Broadcast, because the CUDA " - "device count is " - << count; - exit(0); - } - for (int i = 0; i < count; ++i) { - auto p = p::CUDAPlace(i); - gpu_list_.push_back(p); - ctxs_.emplace_back(new p::CUDADeviceContext(p)); - } - nccl_ctxs_.reset(new platform::NCCLContextMap(gpu_list_)); -#else - PADDLE_THROW("CUDA is not support."); -#endif - } else { - int count = 8; - for (int i = 0; i < count; ++i) { - auto p = p::CPUPlace(); - gpu_list_.push_back(p); - ctxs_.emplace_back(new p::CPUDeviceContext(p)); - } -#ifdef PADDLE_WITH_CUDA - nccl_ctxs_.reset(nullptr); -#endif - } - } - - void InitBroadcastOp(size_t input_scope_idx) { - for (size_t j = 0; j < gpu_list_.size(); ++j) { - local_scopes_.push_back(&(g_scope_.NewScope())); - Scope& local_scope = local_scopes_.back()->NewScope(); - *local_scopes_.back() - ->Var(details::kLocalExecScopeName) - ->GetMutable() = &local_scope; - local_scope.Var("out"); - param_scopes_.emplace_back(&local_scope); - } - param_scopes_[input_scope_idx]->Var("input"); - - std::unique_ptr n = - ir::CreateNodeForTest("node0", ir::Node::Type::kOperation); - if (use_gpu_) { -#ifdef PADDLE_WITH_CUDA - op_handle_.reset(new BroadcastOpHandle(n.get(), local_scopes_, gpu_list_, - nccl_ctxs_.get())); -#else - PADDLE_THROW("CUDA is not support."); -#endif - } else { -#ifdef PADDLE_WITH_CUDA - op_handle_.reset(new BroadcastOpHandle(n.get(), local_scopes_, gpu_list_, - nccl_ctxs_.get())); -#else - op_handle_.reset( - new BroadcastOpHandle(n.get(), local_scopes_, gpu_list_)); -#endif - } - - std::unique_ptr v = - ir::CreateNodeForTest("node1", ir::Node::Type::kVariable); - auto* in_var_handle = new VarHandle(v.get(), 1, input_scope_idx, "input", - gpu_list_[input_scope_idx]); - vars_.emplace_back(in_var_handle); - op_handle_->AddInput(in_var_handle); - - // add dummy var - - std::unique_ptr v2 = - ir::CreateNodeForTest("node2", ir::Node::Type::kVariable); - vars_.emplace_back(new DummyVarHandle(v2.get())); - DummyVarHandle* dummy_var_handle = - static_cast(vars_.back().get()); - dummy_var_handle->ClearGeneratedOp(); - op_handle_->AddInput(dummy_var_handle); - - for (size_t j = 0; j < gpu_list_.size(); ++j) { - if (!use_gpu_) { - op_handle_->SetDeviceContext(gpu_list_[j], ctxs_[j].get()); - } - std::unique_ptr v3 = - ir::CreateNodeForTest("node3", ir::Node::Type::kVariable); - VarHandle* out_var_handle = - new VarHandle(v3.get(), 2, j, "out", gpu_list_[j]); - vars_.emplace_back(out_var_handle); - op_handle_->AddOutput(out_var_handle); - } - - // add dummy var - std::unique_ptr v4 = - ir::CreateNodeForTest("node4", ir::Node::Type::kVariable); - vars_.emplace_back(new DummyVarHandle(v4.get())); - DummyVarHandle* out_dummy_var_handle = - static_cast(vars_.back().get()); - out_dummy_var_handle->ClearGeneratedOp(); - op_handle_->AddOutput(out_dummy_var_handle); - } - - void TestBroadcastLodTensor(size_t input_scope_idx) { - auto in_var = param_scopes_[input_scope_idx]->FindVar("input"); - PADDLE_ENFORCE_NOT_NULL(in_var); - auto in_lod_tensor = in_var->GetMutable(); - in_lod_tensor->mutable_data(kDims, gpu_list_[input_scope_idx]); - - std::vector send_vector(static_cast(f::product(kDims))); - for (size_t k = 0; k < send_vector.size(); ++k) { - send_vector[k] = k; - } - f::LoD lod{{0, 10, 20}}; - paddle::framework::TensorFromVector( - send_vector, *(ctxs_[input_scope_idx]), in_lod_tensor); - in_lod_tensor->set_lod(lod); - in_lod_tensor->Resize(kDims); - - op_handle_->Run(false); - - WaitAll(); - - p::CPUPlace cpu_place; - for (size_t j = 0; j < gpu_list_.size(); ++j) { - auto out_var = param_scopes_[j]->FindVar("out"); - PADDLE_ENFORCE_NOT_NULL(out_var); - auto out_tensor = out_var->Get(); - PADDLE_ENFORCE_EQ(out_tensor.lod(), lod, "lod is not equal."); - - f::Tensor result_tensor; - f::TensorCopySync(out_tensor, cpu_place, &result_tensor); - float* ct = result_tensor.mutable_data(cpu_place); - - for (int64_t i = 0; i < f::product(kDims); ++i) { - ASSERT_NEAR(ct[i], send_vector[i], 1e-5); - } - } - } - - void TestBroadcastSelectedRows(size_t input_scope_idx) { - auto in_var = param_scopes_[input_scope_idx]->FindVar("input"); - PADDLE_ENFORCE_NOT_NULL(in_var); - auto in_selected_rows = in_var->GetMutable(); - auto value = in_selected_rows->mutable_value(); - value->mutable_data(kDims, gpu_list_[input_scope_idx]); - int height = static_cast(kDims[0]) * 2; - std::vector rows{0, 1, 2, 3, 3, 0, 14, 7, 3, 1, - 2, 4, 6, 3, 1, 1, 1, 1, 3, 7}; - in_selected_rows->set_height(height); - in_selected_rows->set_rows(rows); - - std::vector send_vector(static_cast(f::product(kDims))); - for (size_t k = 0; k < send_vector.size(); ++k) { - send_vector[k] = k; - } - paddle::framework::TensorFromVector( - send_vector, *(ctxs_[input_scope_idx]), value); - - op_handle_->Run(false); - - WaitAll(); - - p::CPUPlace cpu_place; - for (size_t j = 0; j < gpu_list_.size(); ++j) { - auto out_var = param_scopes_[j]->FindVar("out"); - PADDLE_ENFORCE_NOT_NULL(out_var); - auto& out_select_rows = out_var->Get(); - auto rt = out_select_rows.value(); - - PADDLE_ENFORCE_EQ(out_select_rows.height(), height, - "height is not equal."); - for (size_t k = 0; k < out_select_rows.rows().size(); ++k) { - PADDLE_ENFORCE_EQ(out_select_rows.rows()[k], rows[k]); - } - - f::Tensor result_tensor; - f::TensorCopySync(rt, cpu_place, &result_tensor); - float* ct = result_tensor.data(); - - for (int64_t i = 0; i < f::product(kDims); ++i) { - ASSERT_NEAR(ct[i], send_vector[i], 1e-5); - } - } - } -}; - TEST(BroadcastTester, TestCPUBroadcastTestLodTensor) { TestBroadcastOpHandle test_op; size_t input_scope_idx = 0; diff --git a/paddle/fluid/framework/details/broadcast_op_handle_test.h b/paddle/fluid/framework/details/broadcast_op_handle_test.h new file mode 100644 index 0000000000000000000000000000000000000000..df3b3cc9ca012eabc428a7fb4c3af9be5b1c5bd5 --- /dev/null +++ b/paddle/fluid/framework/details/broadcast_op_handle_test.h @@ -0,0 +1,273 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include + +#include "gtest/gtest.h" +#include "paddle/fluid/framework/details/broadcast_op_handle.h" + +#include "paddle/fluid/platform/device_context.h" + +namespace paddle { +namespace framework { +namespace details { + +namespace f = paddle::framework; +namespace p = paddle::platform; + +// test data amount +const f::DDim kDims = {20, 20}; + +struct TestBroadcastOpHandle { + std::vector> ctxs_; + std::vector local_scopes_; + std::vector param_scopes_; + Scope g_scope_; + OpHandleBase* op_handle_; + std::vector vars_; + std::vector> nodes_; + std::vector place_list_; + bool use_gpu_; +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + std::unique_ptr nccl_ctxs_; +#endif + + void WaitAll() { + for (size_t j = 0; j < ctxs_.size(); ++j) { + ctxs_[j]->Wait(); + } +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + if (nccl_ctxs_) { + nccl_ctxs_->WaitAll(); + } +#endif + } + + void InitCtxOnGpu(bool use_gpu) { + use_gpu_ = use_gpu; + if (use_gpu_) { +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + int count = p::GetCUDADeviceCount(); + if (count <= 1) { + LOG(WARNING) << "Cannot test multi-gpu Broadcast, because the CUDA " + "device count is " + << count; + exit(0); + } + for (int i = 0; i < count; ++i) { + auto p = p::CUDAPlace(i); + place_list_.push_back(p); + ctxs_.emplace_back(new p::CUDADeviceContext(p)); + } + nccl_ctxs_.reset(new platform::NCCLContextMap(place_list_)); +#else + PADDLE_THROW("CUDA is not support."); +#endif + } else { + int count = 8; + for (int i = 0; i < count; ++i) { + auto p = p::CPUPlace(); + place_list_.push_back(p); + ctxs_.emplace_back(new p::CPUDeviceContext(p)); + } +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + nccl_ctxs_.reset(nullptr); +#endif + } + } + + void InitBroadcastOp(size_t input_scope_idx) { + nodes_.clear(); + for (size_t j = 0; j < place_list_.size(); ++j) { + local_scopes_.push_back(&(g_scope_.NewScope())); + Scope& local_scope = local_scopes_.back()->NewScope(); + *local_scopes_.back() + ->Var(details::kLocalExecScopeName) + ->GetMutable() = &local_scope; + local_scope.Var("out"); + param_scopes_.emplace_back(&local_scope); + } + param_scopes_[input_scope_idx]->Var("input"); + + nodes_.emplace_back( + ir::CreateNodeForTest("node0", ir::Node::Type::kOperation)); + if (use_gpu_) { +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + op_handle_ = new BroadcastOpHandle(nodes_.back().get(), local_scopes_, + place_list_, nccl_ctxs_.get()); +#else + PADDLE_THROW("CUDA is not support."); +#endif + } else { +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + op_handle_ = new BroadcastOpHandle(nodes_.back().get(), local_scopes_, + place_list_, nccl_ctxs_.get()); +#else + op_handle_ = new BroadcastOpHandle(nodes_.back().get(), local_scopes_, + place_list_); +#endif + } + + nodes_.emplace_back( + ir::CreateNodeForTest("node1", ir::Node::Type::kVariable)); + auto* in_var_handle = new VarHandle(nodes_.back().get(), 1, input_scope_idx, + "input", place_list_[input_scope_idx]); + vars_.emplace_back(in_var_handle); + op_handle_->AddInput(in_var_handle); + + // add dummy var + + nodes_.emplace_back( + ir::CreateNodeForTest("node2", ir::Node::Type::kVariable)); + vars_.emplace_back(new DummyVarHandle(nodes_.back().get())); + DummyVarHandle* dummy_var_handle = + static_cast(vars_.back()); + dummy_var_handle->ClearGeneratedOp(); + op_handle_->AddInput(dummy_var_handle); + + for (size_t j = 0; j < place_list_.size(); ++j) { + if (!use_gpu_) { + op_handle_->SetDeviceContext(place_list_[j], ctxs_[j].get()); + } + nodes_.emplace_back( + ir::CreateNodeForTest("node3", ir::Node::Type::kVariable)); + VarHandle* out_var_handle = + new VarHandle(nodes_.back().get(), 2, j, "out", place_list_[j]); + vars_.emplace_back(out_var_handle); + op_handle_->AddOutput(out_var_handle); + } + + // add dummy var + nodes_.emplace_back( + ir::CreateNodeForTest("node4", ir::Node::Type::kVariable)); + vars_.emplace_back(new DummyVarHandle(nodes_.back().get())); + DummyVarHandle* out_dummy_var_handle = + static_cast(vars_.back()); + out_dummy_var_handle->ClearGeneratedOp(); + op_handle_->AddOutput(out_dummy_var_handle); + } + + std::vector InitLoDTensor(const std::string& varname, + size_t input_scope_idx, const f::LoD& lod, + float val_scalar = 0.0) { + auto var = param_scopes_[input_scope_idx]->FindVar(varname); + + PADDLE_ENFORCE_NOT_NULL(var); + auto lod_tensor = var->GetMutable(); + std::vector send_vector(static_cast(f::product(kDims))); + for (size_t k = 0; k < send_vector.size(); ++k) { + send_vector[k] = k + val_scalar; + } + paddle::framework::TensorFromVector( + send_vector, *(ctxs_[input_scope_idx]), lod_tensor); + lod_tensor->set_lod(lod); + lod_tensor->Resize(kDims); + return send_vector; + } + + std::vector InitSelectedRows(const std::string& varname, + size_t input_scope_idx, + const std::vector& rows, + int height, float value_scalar = 0.0) { + std::vector send_vector(static_cast(f::product(kDims))); + for (size_t k = 0; k < send_vector.size(); ++k) { + send_vector[k] = k + value_scalar; + } + + auto var = param_scopes_[input_scope_idx]->FindVar(varname); + PADDLE_ENFORCE_NOT_NULL(var); + auto selected_rows = var->GetMutable(); + auto value = selected_rows->mutable_value(); + value->mutable_data(kDims, place_list_[input_scope_idx]); + selected_rows->set_height(height); + selected_rows->set_rows(rows); + + paddle::framework::TensorFromVector( + send_vector, *(ctxs_[input_scope_idx]), value); + + return send_vector; + } + + void SelectedRowsEqual(const std::string& varname, int input_scope_idx, + const std::vector& send_vector, + const std::vector& rows, int height) { + auto var = param_scopes_[input_scope_idx]->FindVar(varname); + PADDLE_ENFORCE_NOT_NULL(var); + auto& selected_rows = var->Get(); + auto rt = selected_rows.value(); + PADDLE_ENFORCE_EQ(selected_rows.height(), height, "height is not equal."); + + for (size_t k = 0; k < selected_rows.rows().size(); ++k) { + PADDLE_ENFORCE_EQ(selected_rows.rows()[k], rows[k]); + } + + p::CPUPlace cpu_place; + f::Tensor result_tensor; + f::TensorCopySync(rt, cpu_place, &result_tensor); + float* ct = result_tensor.data(); + + for (int64_t i = 0; i < f::product(kDims); ++i) { + ASSERT_NEAR(ct[i], send_vector[i], 1e-5); + } + } + + void LoDTensorEqual(const std::string& varname, + const std::vector& send_vec, const f::LoD& lod, + framework::Scope* scope) { + p::CPUPlace cpu_place; + auto var = scope->FindVar(varname); + PADDLE_ENFORCE_NOT_NULL(var); + auto tensor = var->Get(); + PADDLE_ENFORCE_EQ(tensor.lod(), lod, "lod is not equal."); + f::Tensor result_tensor; + f::TensorCopySync(tensor, cpu_place, &result_tensor); + float* ct = result_tensor.mutable_data(cpu_place); + for (int64_t k = 0; k < f::product(kDims); ++k) { + ASSERT_NEAR(ct[k], send_vec[k], 1e-5); + } + } + + void TestBroadcastLodTensor(size_t input_scope_idx) { + f::LoD lod{{0, 10, 20}}; + auto send_vector = InitLoDTensor("input", input_scope_idx, lod); + + op_handle_->Run(false); + + WaitAll(); + for (size_t j = 0; j < place_list_.size(); ++j) { + LoDTensorEqual("out", send_vector, lod, param_scopes_[j]); + } + } + + void TestBroadcastSelectedRows(size_t input_scope_idx) { + std::vector rows{0, 1, 2, 3, 3, 0, 14, 7, 3, 1, + 2, 4, 6, 3, 1, 1, 1, 1, 3, 7}; + int height = static_cast(kDims[0] * 2); + auto send_vector = InitSelectedRows("input", input_scope_idx, rows, height); + + op_handle_->Run(false); + + WaitAll(); + for (size_t j = 0; j < place_list_.size(); ++j) { + SelectedRowsEqual("out", input_scope_idx, send_vector, rows, height); + } + } +}; + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/build_strategy.cc b/paddle/fluid/framework/details/build_strategy.cc index 6a6b497fa897e3882995688bf36704b1d77ea962..20cfa75292cf52a01bf794a2714deaac1e821f50 100644 --- a/paddle/fluid/framework/details/build_strategy.cc +++ b/paddle/fluid/framework/details/build_strategy.cc @@ -14,15 +14,33 @@ limitations under the License. */ #include "paddle/fluid/framework/details/build_strategy.h" -#include "paddle/fluid/framework/details/multi_devices_graph_check_pass.h" +#include +#include +#include +#include "paddle/fluid/framework/details/memory_optimize_helper.h" +#include "paddle/fluid/framework/details/multi_devices_graph_pass.h" #include "paddle/fluid/framework/details/multi_devices_graph_print_pass.h" +#include "paddle/fluid/framework/details/reduce_op_handle.h" +#include "paddle/fluid/framework/details/sequential_execution_pass.h" #include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_helper.h" +#include "paddle/fluid/framework/ir/graph_to_program_pass.h" #include "paddle/fluid/framework/ir/graph_viz_pass.h" namespace paddle { namespace framework { namespace details { +static inline bool SeqOnlyAllReduceOps(const BuildStrategy &strategy) { + // Should fix the allreduce op order if scheduling + // them in multiple threads or processes to avoid hang. + // NOTE: ParallelGraph would execute this pass on each graph, so + // don't need to append it here. + return (!strategy.enable_sequential_execution_ && + strategy.num_trainers_ > 1) && + !strategy.enable_parallel_graph_; +} + class ParallelExecutorPassBuilder : public ir::PassBuilder { public: explicit ParallelExecutorPassBuilder(const BuildStrategy &strategy) @@ -35,92 +53,293 @@ class ParallelExecutorPassBuilder : public ir::PassBuilder { viz_pass->Set("graph_viz_path", new std::string(graph_path)); } + if (strategy_.enable_sequential_execution_) { + VLOG(10) << "Add sequential_execution_pass"; + AppendPass("sequential_execution_pass"); + } + // Add op fusion. - if (strategy.fuse_elewise_add_act_ops_) { - auto fuse_elewise_add_act_pass = AppendPass("fuse_elewise_add_act_pass"); - // Add a graph viz pass to record a graph. - if (!strategy.debug_graphviz_path_.empty()) { - auto viz_pass = AppendPass("graph_viz_pass"); - const std::string graph_path = string::Sprintf( - "%s%s", strategy.debug_graphviz_path_.c_str(), "_fused_graph"); - viz_pass->Set("graph_viz_path", - new std::string(graph_path)); + if (strategy.sync_batch_norm_) { + AppendPass("sync_batch_norm_pass"); + } + + // Add op fusion. + if (strategy.fuse_relu_depthwise_conv_) { + VLOG(10) << "Add fuse_relu_depthwise_conv_pass"; + AppendPass("fuse_relu_depthwise_conv_pass"); + } + + // NOTE(dzhwinter): A note for automatical inplace. + // 1. modify program desc passes should put + // before inplace pass. + // 2. manually configured inplace should put + // before inplace_pass + + // Add automatically inplace. + if (strategy_.enable_inplace_) { + VLOG(10) << "Add inplace_pass"; + AppendPass("inplace_pass"); + } + + if (strategy_.fuse_elewise_add_act_ops_) { + VLOG(10) << "Add fuse_elewise_add_act_pass"; + AppendPass("fuse_elewise_add_act_pass"); + } + + // for single card training, fuse_all_reduce_ops is unnecessary. + // alloc_continuous_space_for_grad_pass should be before of MultiDevPass. + if (strategy_.fuse_all_reduce_ops_) { + VLOG(10) << "Add alloc_continuous_space_for_grad_pass"; + AppendPass("alloc_continuous_space_for_grad_pass"); + } + + if (strategy_.fuse_all_optimizer_ops_) { + if (strategy_.reduce_ == BuildStrategy::ReduceStrategy::kReduce || + strategy_.is_distribution_) { + VLOG(3) + << "Currently, fuse_all_optimizer_ops only works under AllReduce " + "mode."; + strategy_.fuse_all_optimizer_ops_ = false; + } else { + VLOG(10) << "Add alloc_continuous_space_for_grad_pass"; + AppendPass("alloc_continuous_space_for_grad_pass"); + // NOTE: fuse_all_xx_ops will count the number of xx operator first, + // if the number is zero, fuse_all_reduce_ops will do nothing. + // Currently, only one type of optimization algorithm can be fused. + VLOG(10) << "Add fuse_adam_op_pass"; + AppendPass("fuse_adam_op_pass"); + VLOG(10) << "Add fuse_sgd_op_pass"; + AppendPass("fuse_sgd_op_pass"); } } - // Convert graph to run on multi-devices. - auto multi_devices_pass = AppendPass("multi_devices_pass"); - multi_devices_pass->SetNotOwned("strategy", - &strategy_); + // Add a graph viz pass to record a graph. + if (!strategy.debug_graphviz_path_.empty()) { + auto viz_pass = AppendPass("graph_viz_pass"); + const std::string graph_path = string::Sprintf( + "%s%s", strategy_.debug_graphviz_path_.c_str(), "_fused_graph"); + viz_pass->Set("graph_viz_path", new std::string(graph_path)); + } + + CollectiveContext *context = CollectiveContext::GetInstance(); + context->endpoints_ = strategy_.trainers_endpoints_; + context->trainer_id_ = strategy_.trainer_id_; + PADDLE_ENFORCE(strategy_.trainer_id_ >= 0, "trainer_id_ >= 0"); + if (strategy_.trainer_id_ > 0 && strategy_.trainers_endpoints_.size() > 0) { + PADDLE_ENFORCE((unsigned)(strategy_.trainer_id_) < + strategy_.trainers_endpoints_.size(), + "trainer_id_ < endpoints_ size"); + } + VLOG(1) << "CollectiveContext:" << context->String(); + + // NOTE(dzh): memory optimize should be a runtime pass. + // However, after multi_devices_pass, VarHandle, OpHandle is + // the de-fact IR, any reuse on Graph is meaningless. + // A side-effect of that, memory optimize cannot forsee the fetched vars + // , so fetchlist should be set persistable before call the Run interface. + if (strategy_.memory_optimize_) { + VLOG(10) << "Add memory_optimize_pass"; + AppendPass("memory_optimize_pass"); + } + + // runtime_context_cache pass should be the last pass to enable the attr of + // all original and fused operators. But no operators can be enabled this + // attr if putting it after MultiDevPass. + if (strategy_.cache_runtime_context_) { + VLOG(10) << "Add runtime_context_cache_pass"; + AppendPass("runtime_context_cache_pass"); + } + + if (strategy_.cache_expected_kernel_) { + VLOG(10) << "Add expected_kernel_cache_pass"; + AppendPass("expected_kernel_cache_pass"); + } + + AppendMultiDevPass(strategy_); + + if (strategy_.fuse_all_reduce_ops_) { + // NOTE: fuse_all_reduce_ops will count the number of all_reduce operator + // first, if the number is zero, fuse_all_reduce_ops will do nothing. + VLOG(10) << "Add fuse_all_reduce_op_pass"; + AppendPass("fuse_all_reduce_op_pass"); + } // Add a graph print pass to record a graph with device info. if (!strategy_.debug_graphviz_path_.empty()) { auto multi_devices_print_pass = AppendPass("multi_devices_print_pass"); - multi_devices_print_pass->SetNotOwned( - "debug_graphviz_path", &strategy_.debug_graphviz_path_); + const std::string graph_path = + string::Sprintf("%s%s", strategy_.debug_graphviz_path_.c_str(), + "_multi_devices_graph"); + multi_devices_print_pass->Set(kGraphvizPath, + new std::string(graph_path)); multi_devices_print_pass->Set( "graph_printer", new details::GraphvizSSAGraphPrinter); } + // experimental shows that the program will be faster if append + // all_reduce_deps_pass here. + if (!strategy_.enable_parallel_graph_ && + (SeqOnlyAllReduceOps(strategy_) || + strategy.reduce_ == BuildStrategy::ReduceStrategy::kAllReduce)) { + VLOG(10) << "Add all_reduce_deps_pass"; + AppendPass("all_reduce_deps_pass"); + } + + if (strategy_.remove_unnecessary_lock_) { + VLOG(10) << "Add modify_op_lock_and_record_event_pass"; + AppendPass("modify_op_lock_and_record_event_pass"); + } + // Verify that the graph is correct for multi-device executor. AppendPass("multi_devices_check_pass"); } + // Convert graph to run on multi-devices. + void AppendMultiDevPass(const BuildStrategy &strategy) { + ir::Pass *multi_devices_pass = nullptr; + + if (strategy_.async_mode_) { + multi_devices_pass = AppendPass("async_multi_devices_pass").get(); + } else if (strategy_.is_distribution_) { + VLOG(10) + << "Add dist_multi_devices_pass, multi device parameter server mode"; + multi_devices_pass = AppendPass("dist_multi_devices_pass").get(); + } else { + if (strategy.reduce_ == BuildStrategy::ReduceStrategy::kAllReduce) { + VLOG(10) << "Add all_reduce_mode_multi_devices_pass"; + multi_devices_pass = + AppendPass("all_reduce_mode_multi_devices_pass").get(); + } else if (strategy.reduce_ == BuildStrategy::ReduceStrategy::kReduce) { + VLOG(10) << "Add reduce_mode_multi_devices_pass"; + multi_devices_pass = AppendPass("reduce_mode_multi_devices_pass").get(); + } else { + PADDLE_THROW("Unknown reduce strategy."); + } + } + multi_devices_pass->SetNotOwned("strategy", + &strategy_); + } + private: BuildStrategy strategy_; }; -std::shared_ptr BuildStrategy::CreatePassesFromStrategy() - const { +std::shared_ptr BuildStrategy::CreatePassesFromStrategy( + bool finalize_strategy) const { + if (is_finalized_) { + return pass_builder_; + } pass_builder_.reset(new ParallelExecutorPassBuilder(*this)); + if (finalize_strategy) { + is_finalized_ = true; + } return pass_builder_; } -std::unique_ptr BuildStrategy::Apply( - const ProgramDesc &main_program, const std::vector &places, - const std::string &loss_var_name, - const std::unordered_set ¶m_names, - const std::vector &local_scopes, -#ifdef PADDLE_WITH_CUDA - const bool use_cuda, platform::NCCLContextMap *nccl_ctxs) const { +bool BuildStrategy::IsMultiDevPass(const std::string &pass_name) const { + return framework::details::MultiDevSSAGraphBuilder().count(pass_name) > 0; +} + +ir::Graph *BuildStrategy::Apply(ir::Graph *graph, + const std::vector &places, + const std::string &loss_var_name, + const std::vector &local_scopes, + const size_t &nranks, +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + const bool use_cuda, + platform::NCCLContextMap *nccl_ctxs) const { #else - const bool use_cuda) const { + const bool use_cuda) const { #endif - // Create a default one if not initialized by user. - if (!pass_builder_) { - CreatePassesFromStrategy(); - } - - std::unique_ptr graph(new ir::Graph(main_program)); + VLOG(3) << "apply all passes"; + // Create a default one if not finalized by user. + CreatePassesFromStrategy(false); for (std::shared_ptr &pass : pass_builder_->AllPasses()) { - if (pass->Type() == "multi_devices_pass") { - pass->Erase("places"); - pass->SetNotOwned>("places", &places); - pass->Erase("loss_var_name"); - pass->SetNotOwned("loss_var_name", &loss_var_name); - pass->Erase("params"); - pass->SetNotOwned>("params", - ¶m_names); - pass->Erase("local_scopes"); - pass->SetNotOwned>("local_scopes", + VLOG(3) << "BuildStrategy::Apply pass:" << pass->Type(); + if (IsMultiDevPass(pass->Type())) { + pass->Erase(kPlaces); + pass->SetNotOwned>(kPlaces, &places); + pass->Erase(kLossVarName); + pass->SetNotOwned(kLossVarName, &loss_var_name); + pass->Erase(kLocalScopes); + pass->SetNotOwned>(kLocalScopes, &local_scopes); -#ifdef PADDLE_WITH_CUDA + pass->Erase(kNRanks); + pass->Set(kNRanks, new size_t(nranks)); + +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) platform::NCCLContextMap *nctx = use_cuda ? nccl_ctxs : nullptr; - pass->Erase("nccl_ctxs"); - pass->SetNotOwned("nccl_ctxs", nctx); + pass->Erase(kNCCLCtxs); + pass->SetNotOwned(kNCCLCtxs, nctx); +#endif + } else if (pass->Type() == "alloc_continuous_space_for_grad_pass" || + pass->Type() == "fuse_adam_op_pass" || + pass->Type() == "fuse_sgd_op_pass" || + pass->Type() == "fuse_all_reduce_op_pass") { + pass->Erase(kPlaces); + pass->SetNotOwned>(kPlaces, &places); + pass->Erase(kLocalScopes); + pass->SetNotOwned>(kLocalScopes, + &local_scopes); + if (pass->Type() == "fuse_all_reduce_op_pass") { +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + platform::NCCLContextMap *nctx = use_cuda ? nccl_ctxs : nullptr; + pass->Erase(kNCCLCtxs); + pass->SetNotOwned(kNCCLCtxs, nctx); #endif + } + } else if (pass->Type() == "alloc_continuous_space_for_grad_pass") { + pass->Erase(kPlaces); + pass->SetNotOwned>(kPlaces, &places); + pass->Erase(kLocalScopes); + pass->SetNotOwned>(kLocalScopes, + &local_scopes); + } else if (pass->Type() == "sequential_execution_pass") { + LOG(INFO) << "set enable_sequential_execution:" + << enable_sequential_execution_; + } else if (pass->Type() == "all_reduce_deps_pass") { + LOG(INFO) << "SeqOnlyAllReduceOps:" << SeqOnlyAllReduceOps(*this) + << ", num_trainers:" << num_trainers_; + } else if (pass->Type() == "fuse_relu_depthwise_conv_pass") { + if (!use_cuda) { + LOG(WARNING) << "fuse_relu_depthwise_conv_pass is only supported on " + "GPU, skipped."; + continue; + } } - graph = pass->Apply(std::move(graph)); + VLOG(3) << "Start Apply Pass " << pass->Type(); + graph = pass->Apply(graph); + VLOG(3) << "Finish Apply Pass " << pass->Type(); } + VLOG(3) << "All Passes Applied"; return graph; } + } // namespace details } // namespace framework } // namespace paddle +USE_PASS(sync_batch_norm_pass); +USE_PASS(fuse_relu_depthwise_conv_pass); USE_PASS(fuse_elewise_add_act_pass); USE_PASS(graph_viz_pass); -USE_PASS(multi_devices_pass); +USE_PASS(multi_batch_merge_pass); +USE_PASS(reduce_mode_multi_devices_pass); +USE_PASS(all_reduce_mode_multi_devices_pass); +USE_PASS(dist_multi_devices_pass); USE_PASS(multi_devices_check_pass); USE_PASS(multi_devices_print_pass); +USE_PASS(memory_optimize_pass); +USE_PASS(sequential_execution_pass); +USE_PASS(all_reduce_deps_pass); +USE_PASS(modify_op_lock_and_record_event_pass); +USE_PASS(inplace_pass); +USE_PASS(lock_free_optimize_pass); +USE_PASS(alloc_continuous_space_for_grad_pass); +USE_PASS(graph_to_program_pass); +USE_PASS(fuse_adam_op_pass); +USE_PASS(fuse_sgd_op_pass); +USE_PASS(fuse_all_reduce_op_pass); +USE_PASS(runtime_context_cache_pass); +USE_PASS(expected_kernel_cache_pass); diff --git a/paddle/fluid/framework/details/build_strategy.h b/paddle/fluid/framework/details/build_strategy.h index 02c4bea16916d58a6d0fce8918f8fceb9ff9356e..b1601cfbcd5e9c66f1bbecd1f6fe10bc279cea26 100644 --- a/paddle/fluid/framework/details/build_strategy.h +++ b/paddle/fluid/framework/details/build_strategy.h @@ -14,16 +14,17 @@ #pragma once +#include #include +#include #include - #include "paddle/fluid/framework/ir/pass_builder.h" #include "paddle/fluid/framework/program_desc.h" #include "paddle/fluid/framework/scope.h" #include "paddle/fluid/platform/device_context.h" #include "paddle/fluid/platform/enforce.h" -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) #include "paddle/fluid/platform/nccl_helper.h" #endif @@ -60,37 +61,94 @@ struct BuildStrategy { kCustomized = 2, }; + enum class OptimizeStrategy { + // To be Implemented,bruteforce, recursive compute unused var names. + kBruteForce = 0, + kControlFlowGraph = 1, // use cfg_graph algorithm, faster speed. + }; + ReduceStrategy reduce_{ReduceStrategy::kAllReduce}; GradientScaleStrategy gradient_scale_{GradientScaleStrategy::kCoeffNumDevice}; + OptimizeStrategy strategy_{OptimizeStrategy::kControlFlowGraph}; std::string debug_graphviz_path_{""}; bool fuse_elewise_add_act_ops_{false}; - bool enable_data_balance_{false}; + bool fuse_all_optimizer_ops_{false}; + + bool fuse_all_reduce_ops_{false}; + + bool fuse_relu_depthwise_conv_{false}; + + bool sync_batch_norm_{false}; + + // FIXME(liuwei1031) disable memory_optimzie and enable_inplace in 1.4 + // to open them by default, we need to solve the fetch variable issue + bool memory_optimize_{false}; + + bool enable_inplace_{false}; + + bool enable_sequential_execution_{false}; + + // NOTE(zcd): In reduce mode, fusing broadcast ops may make the program + // faster. Because fusing broadcast OP equals delaying the execution of all + // broadcast Ops, in this case, all nccl streams are used only for reduce + // operations for a period of time. + bool fuse_broadcast_ops_{false}; + + // FIXME(zcd): is_distribution_ is a temporary field, because in pserver mode, + // num_trainers is 1, so the current fields of build_strategy doesn't tell if + // it's distributed model. + bool is_distribution_{false}; + bool async_mode_{false}; + int num_trainers_{1}; + int trainer_id_{0}; + std::vector trainers_endpoints_; + bool remove_unnecessary_lock_{true}; + + bool cache_runtime_context_{false}; + bool cache_expected_kernel_{true}; + + // NOTE: + // Before you add new options, think if it's a general strategy that works + // with other strategy. If not, the strategy should be created through + // CreatePassesFromStrategy and the pass can be managed separately. // User normally doesn't need to call this API. // The PassBuilder allows for more customized insert, remove of passes // from python side. // A new PassBuilder is created based on configs defined above and // passes are owned by the PassBuilder. - std::shared_ptr CreatePassesFromStrategy() const; + std::shared_ptr CreatePassesFromStrategy( + bool finalize_strategy) const; + + bool IsFinalized() const { return is_finalized_; } + + bool IsMultiDevPass(const std::string &pass_name) const; // Apply the passes built by the pass_builder_. The passes will be // applied to the Program and output an ir::Graph. - std::unique_ptr Apply( - const ProgramDesc &main_program, - const std::vector &places, - const std::string &loss_var_name, - const std::unordered_set ¶m_names, - const std::vector &local_scopes, -#ifdef PADDLE_WITH_CUDA - const bool use_cuda, platform::NCCLContextMap *nccl_ctxs) const; + ir::Graph *Apply(ir::Graph *graph, const std::vector &places, + const std::string &loss_var_name, + const std::vector &local_scopes, + const size_t &nranks, +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + const bool use_cuda, + platform::NCCLContextMap *nccl_ctxs) const; #else - const bool use_cuda) const; + const bool use_cuda) const; #endif + // If set true, ParallelExecutor would build the main_program into multiple + // graphs, + // each of the graphs would run with one device. This approach can achieve + // better performance + // on some scenarios. + mutable bool enable_parallel_graph_ = false; + private: + mutable bool is_finalized_ = false; mutable std::shared_ptr pass_builder_; }; diff --git a/paddle/fluid/framework/details/computation_op_handle.cc b/paddle/fluid/framework/details/computation_op_handle.cc index b6282debdb4eb6b1f29c39e54ac4f3e2296838da..7beb8c8de9fc49aebc66ca44de8736240aabbc30 100644 --- a/paddle/fluid/framework/details/computation_op_handle.cc +++ b/paddle/fluid/framework/details/computation_op_handle.cc @@ -20,24 +20,32 @@ namespace paddle { namespace framework { namespace details { ComputationOpHandle::ComputationOpHandle(ir::Node *node, Scope *scope, - platform::Place place) + platform::Place place, + size_t scope_idx) : OpHandleBase(node), op_(framework::OpRegistry::CreateOp(*node->Op())), scope_(scope), - place_(place) {} + place_(place), + scope_idx_(scope_idx) {} void ComputationOpHandle::RunImpl() { WaitInputVarGenerated(place_); - this->RunAndRecordEvent([this] { + auto run_func = [this]() { op_->Run(*scope_->FindVar(kLocalExecScopeName)->Get(), place_); - }); + }; + + if (is_lock_and_record_event_free_) { + run_func(); + } else { + this->RunAndRecordEvent(run_func); + } } bool ComputationOpHandle::NeedWait(VarHandleBase *in_var) { bool need_wait = in_var && in_var->GeneratedOp() && - in_var->GeneratedOp()->DeviceContext(place_) != dev_ctxes_[place_]; + in_var->GeneratedOp()->DeviceContext(place_) != dev_ctxes_.at(place_); return need_wait; } diff --git a/paddle/fluid/framework/details/computation_op_handle.h b/paddle/fluid/framework/details/computation_op_handle.h index e98f1ab148db083ac63a1afd43e334fbfae62539..e98b16e6b3a07bfa0994295306e3bfa9e4174834 100644 --- a/paddle/fluid/framework/details/computation_op_handle.h +++ b/paddle/fluid/framework/details/computation_op_handle.h @@ -14,6 +14,7 @@ #pragma once +#include #include #include @@ -26,9 +27,12 @@ namespace paddle { namespace framework { namespace details { -struct ComputationOpHandle : public OpHandleBase { +class ComputationOpHandle : public OpHandleBase { public: - ComputationOpHandle(ir::Node *node, Scope *scope, platform::Place place); + ComputationOpHandle(ir::Node *node, Scope *scope, platform::Place place, + size_t scope_idx); + + OperatorBase *GetOp() { return op_.get(); } std::string Name() const override; @@ -36,6 +40,10 @@ struct ComputationOpHandle : public OpHandleBase { const platform::Place &GetPlace() const { return place_; } + void SetLockAndRecordEventFree(bool b) { is_lock_and_record_event_free_ = b; } + + size_t GetScopeIdx() const { return scope_idx_; } + protected: void RunImpl() override; @@ -45,6 +53,8 @@ struct ComputationOpHandle : public OpHandleBase { std::unique_ptr op_; Scope *scope_; platform::Place place_; + size_t scope_idx_; + bool is_lock_and_record_event_free_{false}; }; } // namespace details } // namespace framework diff --git a/paddle/fluid/framework/details/data_balance_op_handle.cc b/paddle/fluid/framework/details/data_balance_op_handle.cc deleted file mode 100644 index 525d24322442ef4dd6e8c24212af61c908959b87..0000000000000000000000000000000000000000 --- a/paddle/fluid/framework/details/data_balance_op_handle.cc +++ /dev/null @@ -1,154 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include "paddle/fluid/framework/details/data_balance_op_handle.h" -#include -#include "paddle/fluid/framework/details/container_cast.h" - -namespace paddle { -namespace framework { -namespace details { - -#ifdef PADDLE_WITH_CUDA -DataBalanceOpHandle::DataBalanceOpHandle( - ir::Node *node, const std::vector &local_scopes, - const std::vector &places, - const platform::NCCLContextMap *ctxs) - : OpHandleBase(node), local_scopes_(local_scopes), places_(places) { - if (ctxs) { - for (auto &p : places_) { - this->dev_ctxes_[p] = ctxs->DevCtx(p); - } - } -} -#else -DataBalanceOpHandle::DataBalanceOpHandle( - ir::Node *node, const std::vector &local_scopes, - const std::vector &places) - : OpHandleBase(node), local_scopes_(local_scopes), places_(places) {} -#endif - -std::string DataBalanceOpHandle::Name() const { return "data balance"; } - -std::vector> DataBalanceOpHandle::GetBalancePlan( - const std::vector &device_sizes) { - int device_num = device_sizes.size(); - int total_size = 0; - int empty_num = 0; - std::vector> size_device_vec; - size_device_vec.reserve(device_num); - for (int i = 0; i < device_num; ++i) { - if (device_sizes[i] == 0) { - ++empty_num; - } - total_size += device_sizes[i]; - size_device_vec.push_back({{device_sizes[i], i}}); - } - std::vector> res; - if (empty_num == 0) { - // No need to do data balance. - return res; - } - if (total_size < device_num) { - // No enough data. - PADDLE_THROW_EOF(); - } - std::sort(size_device_vec.begin(), size_device_vec.end(), - [](const std::array &a, const std::array &b) { - return a[0] > b[0]; - }); - int expected_device_size = total_size / device_num; - int src_idx = 0; - for (int dst_idx = device_num - empty_num; dst_idx < device_num; ++dst_idx) { - if (size_device_vec[src_idx][0] <= expected_device_size) { - ++src_idx; - PADDLE_ENFORCE_LT( - src_idx, device_num - empty_num, - "In current srategy an empty tensor should not be copy source."); - } - size_device_vec[src_idx][0] -= expected_device_size; - size_device_vec[dst_idx][0] += expected_device_size; - res.push_back({{size_device_vec[src_idx][1], size_device_vec[dst_idx][1], - expected_device_size}}); - } - return res; -} - -void DataBalanceOpHandle::RunImpl() { - PADDLE_ENFORCE_GT(places_.size(), 1, - "Data balance can only be enabled when the number of " - "places to run larger than 1."); - auto in_var_handles = DynamicCast(inputs_); - auto out_var_handles = DynamicCast(outputs_); - PADDLE_ENFORCE(in_var_handles.size() % places_.size() == 0); - PADDLE_ENFORCE_EQ( - in_var_handles.size(), out_var_handles.size(), - "The NoDummyInputSize and NoDummyOutputSize should be equal."); - int data_num = in_var_handles.size() / places_.size(); - WaitInputVarGenerated(); - std::vector> lod_tensors(data_num); - std::vector device_sizes; - for (int i = 0; i < static_cast(in_var_handles.size()); ++i) { - PADDLE_ENFORCE_EQ(in_var_handles[i]->name_, out_var_handles[i]->name_, - "The name of input and output should be equal."); - int place_idx = i / data_num; - int data_idx = i % data_num; - auto *local_scope = - local_scopes_[place_idx]->FindVar(kLocalExecScopeName)->Get(); - auto *tensor_var = local_scope->FindVar(in_var_handles[i]->name_); - PADDLE_ENFORCE(tensor_var->IsType()); - auto *tensor = tensor_var->GetMutable(); - lod_tensors[data_idx].push_back(tensor); - int ins_size = - tensor->lod().empty() ? tensor->dims()[0] : tensor->NumElements(); - if (data_idx == 0) { - device_sizes.emplace_back(ins_size); - } else { - PADDLE_ENFORCE_EQ( - ins_size, device_sizes.at(place_idx), - "All data on the same device shall have the same batch size."); - } - } - const auto &balance_plan = GetBalancePlan(device_sizes); - - for (const auto &trans : balance_plan) { - for (int data_idx = 0; data_idx < data_num; ++data_idx) { - LoDTensor *src_tensor = lod_tensors[data_idx][trans[0]]; - LoDTensor *dst_tensor = lod_tensors[data_idx][trans[1]]; - int trans_ins_size = trans[2]; - LoD src_lod = src_tensor->lod(); - int src_ins_size = - src_lod.empty() ? src_tensor->dims()[0] : src_tensor->NumElements(); - int cut_point = src_ins_size - trans_ins_size; - if (!src_lod.empty()) { - for (auto &level : src_lod) { - cut_point = level[cut_point]; - } - } - TensorCopySync(src_tensor->Slice(cut_point, src_tensor->dims()[0]), - dst_tensor->place(), dst_tensor); - src_tensor->ShareDataWith(src_tensor->Slice(0, cut_point)); - if (!src_lod.empty()) { - dst_tensor->set_lod(SliceInLevel( - src_lod, 0, src_ins_size - trans_ins_size, src_ins_size)); - src_tensor->set_lod( - SliceInLevel(src_lod, 0, 0, src_ins_size - trans_ins_size)); - } - } - } -} - -} // namespace details -} // namespace framework -} // namespace paddle diff --git a/paddle/fluid/framework/details/data_balance_op_handle.h b/paddle/fluid/framework/details/data_balance_op_handle.h deleted file mode 100644 index 0462fb6ec713eb977f420a9cb485c0273e782496..0000000000000000000000000000000000000000 --- a/paddle/fluid/framework/details/data_balance_op_handle.h +++ /dev/null @@ -1,59 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#pragma once - -#include -#include -#include "paddle/fluid/framework/details/op_handle_base.h" -#include "paddle/fluid/framework/lod_tensor.h" -#include "paddle/fluid/framework/scope.h" -#ifdef PADDLE_WITH_CUDA -#include "paddle/fluid/platform/nccl_helper.h" -#endif - -namespace paddle { -namespace framework { -namespace details { - -struct DataBalanceOpHandle : public OpHandleBase { - public: -#ifdef PADDLE_WITH_CUDA - DataBalanceOpHandle(ir::Node *node, const std::vector &local_scopes, - const std::vector &places, - const platform::NCCLContextMap *ctxs); -#else - DataBalanceOpHandle(ir::Node *node, const std::vector &local_scopes, - const std::vector &places); -#endif - - std::string Name() const override; - - bool IsMultiDeviceTransfer() override { return false; }; - - protected: - void RunImpl() override; - - private: - // std::vector<(src_dev_id, dst_dev_id, trans_size)> - std::vector> GetBalancePlan( - const std::vector &batch_size_per_device); - - const std::vector local_scopes_; - const std::vector places_; -}; - -} // namespace details -} // namespace framework -} // namespace paddle diff --git a/paddle/fluid/framework/details/eager_deletion_op_handle.cc b/paddle/fluid/framework/details/eager_deletion_op_handle.cc new file mode 100644 index 0000000000000000000000000000000000000000..dbc90737f2286db6e74d3271f39d004c25e4a949 --- /dev/null +++ b/paddle/fluid/framework/details/eager_deletion_op_handle.cc @@ -0,0 +1,132 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include +#include + +#include "paddle/fluid/framework/details/eager_deletion_op_handle.h" +#include "paddle/fluid/framework/lod_tensor_array.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/framework/selected_rows.h" +#ifdef PADDLE_WITH_CUDA +#include "paddle/fluid/platform/cuda_device_guard.h" +#endif + +namespace paddle { +namespace framework { +namespace details { + +EagerDeletionOpHandle::EagerDeletionOpHandle( + ir::Node *node, const Scope *scope, const platform::Place &place, + const std::unordered_set &var_names, GarbageCollector *gc, + AtomicReferenceCountMap *ref_cnts) + : OpHandleBase(node), + scope_(scope), + var_names_(var_names), + gc_(gc), + ref_cnts_(ref_cnts) { +#ifdef PADDLE_WITH_CUDA + if (platform::is_gpu_place(place)) { + dev_ctx_ = reinterpret_cast( + platform::DeviceContextPool::Instance().Get(place)); + if (dynamic_cast(gc_)) { + platform::CUDADeviceGuard guard( + boost::get(place).device); + PADDLE_ENFORCE(cudaEventCreateWithFlags(&event_, cudaEventDisableTiming)); + PADDLE_ENFORCE_NOT_NULL(event_); + } + } +#endif + PADDLE_ENFORCE(!var_names_.empty(), "Var names cannot be empty"); +} + +EagerDeletionOpHandle::~EagerDeletionOpHandle() { +#ifdef PADDLE_WITH_CUDA + if (event_) { + auto gpu_place = boost::get(dev_ctx_->GetPlace()); + platform::CUDADeviceGuard guard(gpu_place.device); + PADDLE_ENFORCE(cudaEventDestroy(event_)); + } +#endif +} + +std::string EagerDeletionOpHandle::Name() const { return "eager_deletion"; } + +void EagerDeletionOpHandle::RunImpl() { + Scope *exec_scope = nullptr; + std::deque> garbages; + for (auto &name : var_names_) { + auto it = ref_cnts_->find(name); + // Reference count has not decreased to 0 + if (it == ref_cnts_->end() || it->second.fetch_sub(1) != 1) { + continue; + } + + if (!exec_scope) { + exec_scope = scope_->FindVar(kLocalExecScopeName)->Get(); + } + + // Var not found + auto *var = exec_scope->FindVar(name); + if (var == nullptr) { + continue; + } + + VLOG(2) << "Erase variable " << name; + + if (var->IsType()) { + garbages.emplace_back(var->GetMutable()->MoveMemoryHolder()); + } else if (var->IsType()) { + garbages.emplace_back( + var->GetMutable()->mutable_value()->MoveMemoryHolder()); + } else if (var->IsType()) { + auto *tensor_arr = var->GetMutable(); + for (auto &t : *tensor_arr) { + garbages.emplace_back(t.MoveMemoryHolder()); + } + } else { + PADDLE_THROW("Type %s of %s is not supported eager deletion", + framework::ToTypeName(var->Type()), name); + } + } + + if (!garbages.empty()) { + ClearGarbages(&garbages); + } +} + +void EagerDeletionOpHandle::ClearGarbages( + std::deque> *garbages) { +#ifdef PADDLE_WITH_CUDA + if (event_) { + auto compute_stream = dev_ctx_->stream(); + auto callback_stream = + reinterpret_cast(gc_)->stream(); + auto callback_func = [=]() { + PADDLE_ENFORCE(cudaEventRecord(event_, compute_stream)); + PADDLE_ENFORCE(cudaStreamWaitEvent(callback_stream, event_, 0)); + }; + gc_->Add(std::move(*garbages), callback_func); + } else { +#endif + gc_->Add(std::move(*garbages)); +#ifdef PADDLE_WITH_CUDA + } +#endif +} + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/eager_deletion_op_handle.h b/paddle/fluid/framework/details/eager_deletion_op_handle.h new file mode 100644 index 0000000000000000000000000000000000000000..64867afad5b70a2ba31e5cb315daffcf433b5935 --- /dev/null +++ b/paddle/fluid/framework/details/eager_deletion_op_handle.h @@ -0,0 +1,58 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include "paddle/fluid/framework/details/op_handle_base.h" +#include "paddle/fluid/framework/details/reference_count_pass_helper.h" + +namespace paddle { +namespace framework { +class Scope; + +namespace details { + +class EagerDeletionOpHandle : public OpHandleBase { + public: + EagerDeletionOpHandle(ir::Node *node, const Scope *scope, + const platform::Place &place, + const std::unordered_set &var_names, + GarbageCollector *gc, + AtomicReferenceCountMap *ref_cnts); + + ~EagerDeletionOpHandle(); + + std::string Name() const override; + + protected: + void RunImpl() override; + + private: + void ClearGarbages(std::deque> *garbages); + + const Scope *scope_; + std::unordered_set var_names_; + GarbageCollector *gc_; // not own + AtomicReferenceCountMap *ref_cnts_; // not own +#ifdef PADDLE_WITH_CUDA + platform::CUDADeviceContext *dev_ctx_{nullptr}; + cudaEvent_t event_{nullptr}; +#endif +}; + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/eager_deletion_pass.cc b/paddle/fluid/framework/details/eager_deletion_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..622a59b4c2e24c420da00cac2cce82ca365077e8 --- /dev/null +++ b/paddle/fluid/framework/details/eager_deletion_pass.cc @@ -0,0 +1,255 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include +#include +#include +#include +#include + +#include "paddle/fluid/framework/details/computation_op_handle.h" +#include "paddle/fluid/framework/details/eager_deletion_op_handle.h" +#include "paddle/fluid/framework/details/multi_devices_helper.h" +#include "paddle/fluid/framework/garbage_collector.h" +#include "paddle/fluid/framework/ir/graph_helper.h" + +namespace paddle { +namespace framework { +namespace details { + +// op -> variables which can be deleted after op runs +using OpToVarNameSetMap = + std::unordered_map>; + +// Check whether the variable is LoDTensor based on static VarDesc info +static bool IsLoDTensor(VarDesc *var) { + return var->Proto()->type().type() == proto::VarType::LOD_TENSOR; +} + +// Get memory size of LoDTensor +static int64_t GetMemorySize( + const std::unordered_map> &vars, + const std::string &var_name) { + auto *var_desc = TryGetLatestVarDesc(vars.at(var_name)); + PADDLE_ENFORCE_NOT_NULL(var_desc); + PADDLE_ENFORCE(IsLoDTensor(var_desc)); + auto dims = var_desc->GetShape(); + return SizeOfType(var_desc->GetDataType()) * + std::accumulate(dims.begin(), dims.end(), static_cast(1), + std::multiplies()); +} + +// Split all variables in the graph into LoDTensor and Non-LoDTensor (e.g. +// SelectedRows, LoDTensorArray) +// Since partial GC is based on static analysis of memory size of each variable +// So we should skip SelectedRows and LoDTensorArray here +static void SplitIntoLoDTensorAndNonLoDTensorVars( + const OpToVarNameSetMap &m, const GraphVars &vars, + OpToVarNameSetMap *lod_tensors, OpToVarNameSetMap *other_vars) { + lod_tensors->clear(); + other_vars->clear(); + + for (auto &op_vars_pair : m) { + for (auto &var_name : op_vars_pair.second) { + auto *var_desc = TryGetLatestVarDesc( + vars[op_vars_pair.first->GetScopeIdx()].at(var_name)); + if (IsLoDTensor(var_desc)) { + (*lod_tensors)[op_vars_pair.first].insert(var_name); + } else { + (*other_vars)[op_vars_pair.first].insert(var_name); + } + } + } +} + +struct GCVarInfo { + GCVarInfo(const std::string &name, int64_t memory_size, + ComputationOpHandle *op, size_t scope_idx) + : name_(name), + memory_size_(memory_size), + op_(op), + scope_idx_(scope_idx) {} + + std::string name_; // variable name + int64_t memory_size_; // memory size + ComputationOpHandle *op_; // op after which the variable could be deleted + size_t scope_idx_; // scope index where the variable locates + + int64_t AbsMemorySize() const { return std::abs(memory_size_); } +}; + +// Delete delete_lod_tensor_only is not used currently +static OpToVarNameSetMap ShrinkGCVars( + const OpToVarNameSetMap &m, const GraphVars &vars, + const std::vector &places, double fraction_of_memory_size, + bool delete_lod_tensor_only = false) { + // Do not perform gc when fraction_of_memory_size = 0 + if (fraction_of_memory_size <= 0.0) return {}; + + /** + * Step 1: Split all variables into LoDTensor and Non-LoDTensor. + * We can only calculate memory size of LoDTensors + */ + OpToVarNameSetMap lod_tensors, other_vars; + SplitIntoLoDTensorAndNonLoDTensorVars(m, vars, &lod_tensors, &other_vars); + + // Perform complete gc when fraction_of_memory_size >= 1 + if (fraction_of_memory_size >= 1.0) { + return delete_lod_tensor_only ? lod_tensors : m; + } + + /** + * Step 2: build GCVarInfos, and calculate total memory sizes of each device + */ + + // place -> variable info (name, memory size, place, scope_idx) + std::map> place_to_vars; + + // place -> total memory sizes + std::map place_to_size; + for (auto &op_vars_pair : lod_tensors) { + auto *op = op_vars_pair.first; + auto &var_names = op_vars_pair.second; + auto scope_idx = op->GetScopeIdx(); + auto &place = places[scope_idx]; + + for (auto &var_name : var_names) { + auto var_size = GetMemorySize(vars[scope_idx], var_name); + GCVarInfo var_info(var_name, var_size, op, scope_idx); + place_to_size[place] += var_info.AbsMemorySize(); + place_to_vars[place].emplace_back(std::move(var_info)); + } + } + + /** + * Step 3: sort GCVarInfos, and only delete the largest variables. + */ + OpToVarNameSetMap partial_vars; + for (auto &place_to_var_pair : place_to_vars) { + auto &place = place_to_var_pair.first; + auto &gc_vars = place_to_var_pair.second; + std::sort(gc_vars.begin(), gc_vars.end(), + [](const GCVarInfo &var1, const GCVarInfo &var2) { + return var1.AbsMemorySize() > var2.AbsMemorySize(); + }); + + int64_t accumulated_size = 0; + int64_t size_threshold = + static_cast(fraction_of_memory_size * place_to_size[place]); + for (size_t i = 0; i < gc_vars.size() && accumulated_size < size_threshold; + ++i) { + partial_vars[gc_vars[i].op_].insert(gc_vars[i].name_); + accumulated_size += gc_vars[i].AbsMemorySize(); + } + } + + /** + * Step 4: Combine other vars (SelectedRows, LoDTensorArray) + */ + if (!delete_lod_tensor_only) { + for (auto &op_vars_pair : other_vars) { + partial_vars[op_vars_pair.first].insert(op_vars_pair.second.begin(), + op_vars_pair.second.end()); + } + } + + return partial_vars; +} + +class EagerDeletionPass : public ir::Pass { + protected: + void ApplyImpl(ir::Graph *graph) const override; +}; + +void EagerDeletionPass::ApplyImpl(ir::Graph *graph) const { + auto &ref_cnts = + Get>(kRuntimeReferenceCount); + PADDLE_ENFORCE(ref_cnts.empty(), + "kRuntimeReferenceCount should be initialized here!"); + + const auto &vars = graph->Get(kGraphVars); + ref_cnts.resize(vars.size()); + + const auto &last_live_ops = + Get>(kLastLiveOpsOfVars); + const auto &gcs = Get(kGarbageCollector); + const auto &places = Get>(kAllPlaces); + + // a reverse map of last_live_ops + // i.e., last op --> variable names which can be deleted. + OpToVarNameSetMap op_vars_map; + for (auto &var_ops_map : last_live_ops) { + for (auto &var_ops_pair : var_ops_map) { + const std::string &var_name = var_ops_pair.first; + for (auto *op : var_ops_pair.second) { + op_vars_map[op].insert(var_name); + } + } + } + + double memory_fraction = framework::GetEagerDeletionMemoryFraction(); + + op_vars_map = ShrinkGCVars(op_vars_map, vars, places, memory_fraction); + + for (auto &pair : op_vars_map) { + auto *op = pair.first; + auto &var_names = pair.second; + + auto *eager_deletion_node = + graph->CreateEmptyNode("eager_deletion", ir::Node::Type::kOperation); + auto *eager_deletion_op = new EagerDeletionOpHandle( + eager_deletion_node, op->GetScope(), op->GetPlace(), var_names, + gcs.at(places[op->GetScopeIdx()]).get(), + &(ref_cnts[op->GetScopeIdx()])); + + auto it = std::find_if( + op->Outputs().begin(), op->Outputs().end(), [](VarHandleBase *var) { + return dynamic_cast(var) != nullptr; + }); + + if (it != op->Outputs().end()) { + eager_deletion_op->AddInput(*it); + } else { + auto *dep_var = new DummyVarHandle(graph->CreateControlDepVar()); + graph->Get(kGraphDepVars).emplace(dep_var); + op->AddOutput(dep_var); + eager_deletion_op->AddInput(dep_var); + } + + auto *dummy_leaf = new DummyVarHandle(graph->CreateControlDepVar()); + graph->Get(kGraphDepVars).emplace(dummy_leaf); + eager_deletion_op->AddOutput(dummy_leaf); + } + + VLOG(10) << "FLAGS_memory_fraction_of_eager_deletion = " << memory_fraction; + VLOG(10) << "Create " << op_vars_map.size() << " EagerDeletionOpHandle(s)"; + + auto while_op_eager_deletion_pass = + ir::PassRegistry::Instance().Get("while_op_eager_deletion_pass"); + while_op_eager_deletion_pass->Apply(graph); +} + +} // namespace details +} // namespace framework +} // namespace paddle + +REGISTER_PASS(eager_deletion_pass, + paddle::framework::details::EagerDeletionPass) + .RequirePassAttr(paddle::framework::details::kRuntimeReferenceCount) + .RequirePassAttr(paddle::framework::details::kLastLiveOpsOfVars) + .RequirePassAttr(paddle::framework::details::kAllPlaces) + .RequirePassAttr(paddle::framework::details::kGarbageCollector); + +USE_PASS(while_op_eager_deletion_pass); diff --git a/paddle/fluid/framework/details/exception_holder.h b/paddle/fluid/framework/details/exception_holder.h index c97b364de1ecae21e97351196389615187932b5e..f8fd395bd9cc1e569bf7789e6a3adc63b00716ac 100644 --- a/paddle/fluid/framework/details/exception_holder.h +++ b/paddle/fluid/framework/details/exception_holder.h @@ -14,6 +14,9 @@ #pragma once +#include +#include + #include "glog/logging.h" #include "paddle/fluid/platform/enforce.h" @@ -30,6 +33,8 @@ class ExceptionHolder { Catch(exp); } catch (platform::EnforceNotMet exp) { Catch(exp); + } catch (std::exception& ex) { + LOG(FATAL) << "std::exception caught, " << ex.what(); } catch (...) { LOG(FATAL) << "Unknown exception caught"; } @@ -62,6 +67,21 @@ class ExceptionHolder { ClearImpl(); } + std::string Type() { + std::lock_guard lock(mu_); + switch (type_) { + case kNone: + return "None"; + case kEnforceNotMet: { + return "EnforceNotMet"; + } + case kEOF: { + return "EOF"; + } + } + return "unknown"; + } + private: void ClearImpl() { exception_.reset(); diff --git a/paddle/fluid/framework/details/execution_strategy.h b/paddle/fluid/framework/details/execution_strategy.h index 5183be878eb49cccc68603c3fdd8023be5578036..6a8d99f900cf29d5e579a3c9dd5739d2122b7deb 100644 --- a/paddle/fluid/framework/details/execution_strategy.h +++ b/paddle/fluid/framework/details/execution_strategy.h @@ -13,6 +13,7 @@ // limitations under the License. #pragma once +#include // for size_t namespace paddle { namespace framework { @@ -24,8 +25,14 @@ struct ExecutionStrategy { size_t num_threads_{0}; bool use_cuda_{true}; bool allow_op_delay_{false}; - size_t num_iteration_per_drop_scope_{100}; + // If we set this to 1, we will delete all variables when finish a batch. and + // this will loss 15%+ performance. + // Please be aware about this parameters. + size_t num_iteration_per_drop_scope_{1}; ExecutorType type_{kDefault}; + bool dry_run_{false}; + size_t num_iteration_per_run_{1}; // only use with async_ssa_graph_executor + // and pyreader with data queue }; } // namespace details diff --git a/paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.cc b/paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.cc index 6e22fedf1c39428528c00cce4c9a4460dfb95cb3..3e805bd5b480241954960f92a72514723c3a8bb7 100644 --- a/paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.cc +++ b/paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.cc @@ -12,10 +12,13 @@ // See the License for the specific language governing permissions and // limitations under the License. #include "paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.h" +#include #include +#include #include #include "paddle/fluid/framework/details/fetch_op_handle.h" #include "paddle/fluid/framework/details/multi_devices_helper.h" +#include "paddle/fluid/framework/ir/graph_helper.h" namespace paddle { namespace framework { @@ -23,22 +26,20 @@ namespace details { FastThreadedSSAGraphExecutor::FastThreadedSSAGraphExecutor( const ExecutionStrategy &strategy, const std::vector &local_scopes, - const std::vector &places, - std::unique_ptr &&graph) + const std::vector &places, ir::Graph *graph) : strategy_(strategy), local_scopes_(local_scopes), places_(places), - graph_(std::move(graph)), - pool_(strategy.num_threads_ + - 1), // add one more thread for generate op_deps - fetch_ctxs_(places) { - auto &ops = graph_->Get("ops"); - - for (auto &op : ops) { + graph_(graph), + fetch_ctxs_(places), + pool_(strategy.num_threads_), + // add one more thread for generate op_deps + prepare_pool_(1) { + for (auto &op : ir::FilterByNodeWrapper(*graph_)) { int dep = static_cast(op->NotReadyInputSize()); - op_deps_.emplace(op.get(), dep); + op_deps_.emplace(op, dep); if (dep == 0) { - bootstrap_ops_.emplace_back(op.get()); + bootstrap_ops_.emplace_back(op); } } @@ -54,13 +55,14 @@ FeedFetchList FastThreadedSSAGraphExecutor::Run( paddle::framework::FeedFetchList fetches; fetches.resize(fetch_tensors.size()); std::unordered_map> fetched_vars; - std::vector> fetch_ops; + std::vector fetch_ops; + std::vector ready_fetch_ops; for (auto &fetch_var_name : fetch_tensors) { - for (auto &var_map : graph_->Get("vars")) { + for (auto &var_map : graph_->Get(details::kGraphVars)) { auto it = var_map.find(fetch_var_name); if (it != var_map.end()) { - fetched_vars[fetch_var_name].push_back(it->second.rbegin()->get()); + fetched_vars[fetch_var_name].push_back(*it->second.rbegin()); } } } @@ -69,8 +71,9 @@ FeedFetchList FastThreadedSSAGraphExecutor::Run( auto &var_name = fetch_tensors[i]; auto fetched_var_it = fetched_vars.find(var_name); PADDLE_ENFORCE(fetched_var_it != fetched_vars.end(), - "Cannot find fetched variable.(Perhaps the main_program " - "is not set to ParallelExecutor)"); + "Cannot find fetched variable(%s).(Perhaps the main_program " + "is not set to ParallelExecutor)", + var_name); auto &vars = fetched_var_it->second; @@ -87,18 +90,24 @@ FeedFetchList FastThreadedSSAGraphExecutor::Run( op->AddInput(var); } - (*op_deps)[op] = static_cast(op->NotReadyInputSize()); + int dep = static_cast(op->NotReadyInputSize()); + (*op_deps)[op] = dep; + if (dep == 0) { + ready_fetch_ops.emplace_back(op); + } } size_t num_complete = 0; remaining_ = 0; - BlockingQueue complete_q; + auto complete_q = std::make_shared>(); for (auto op : bootstrap_ops_) { - RunOpAsync(op_deps.get(), op, &complete_q); + RunOpAsync(op_deps.get(), op, complete_q); + } + for (auto op : ready_fetch_ops) { + RunOpAsync(op_deps.get(), op, complete_q); } - while (num_complete != op_deps->size()) { - size_t num_comp = complete_q.Pop(); + size_t num_comp = complete_q->Pop(); if (num_comp == -1UL) { int remaining = 0; while (true) { @@ -107,27 +116,34 @@ FeedFetchList FastThreadedSSAGraphExecutor::Run( break; } for (int i = 0; i < remaining; ++i) { - complete_q.Pop(); + complete_q->Pop(); } } - exception_.ReThrow(); + if (exception_.IsCaught()) { + ClearFetchOp(graph_, &fetch_ops); + exception_.ReThrow(); + } } num_complete += num_comp; } // Wait FetchOps. - ClearFetchOp(graph_.get(), &fetch_ops); + ClearFetchOp(graph_, &fetch_ops); return fetches; } + void FastThreadedSSAGraphExecutor::RunOpAsync( std::unordered_map> *op_deps, - OpHandleBase *op, BlockingQueue *complete_q) { + OpHandleBase *op, + const std::shared_ptr> &complete_q) { ++remaining_; this->pool_.enqueue([=] { OpHandleBase *op_to_run = op; size_t complete = 0; while (op_to_run != nullptr) { try { - op_to_run->Run(strategy_.use_cuda_); + if (LIKELY(!strategy_.dry_run_)) { + op_to_run->Run(strategy_.use_cuda_); + } ++complete; } catch (...) { exception_.Catch(std::current_exception()); @@ -144,7 +160,7 @@ void FastThreadedSSAGraphExecutor::RunOpAsync( if (op_to_run == nullptr) { op_to_run = pending_op; } else { - this->RunOpAsync(op_deps, pending_op, complete_q); + RunOpAsync(op_deps, pending_op, complete_q); } } } @@ -155,9 +171,8 @@ void FastThreadedSSAGraphExecutor::RunOpAsync( }); } void FastThreadedSSAGraphExecutor::PrepareAtomicOpDeps() { - atomic_op_deps_ = pool_.enqueue([&] { - std::unordered_map> *op_deps = - new std::unordered_map>; + atomic_op_deps_ = prepare_pool_.enqueue([&] { + auto *op_deps = new std::unordered_map>; for (auto &pair : op_deps_) { (*op_deps)[pair.first] = pair.second; } diff --git a/paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.h b/paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.h index dad3a231cba6402f57ba654a9ac5fb520b9c8f04..f6d5160e75cc3f48c5129dae05eec4ec82d83ae5 100644 --- a/paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.h +++ b/paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.h @@ -13,9 +13,11 @@ // limitations under the License. #pragma once +#include +#include #include +#include #include -#include "ThreadPool.h" #include "paddle/fluid/framework/blocking_queue.h" #include "paddle/fluid/framework/details/exception_holder.h" #include "paddle/fluid/framework/details/execution_strategy.h" @@ -32,32 +34,37 @@ class FastThreadedSSAGraphExecutor : public SSAGraphExecutor { FastThreadedSSAGraphExecutor(const ExecutionStrategy &strategy, const std::vector &local_scopes, const std::vector &places, - std::unique_ptr &&graph); + ir::Graph *graph); FeedFetchList Run(const std::vector &fetch_tensors) override; const ir::Graph &Graph() const override; private: + // Note(zcd): the ThreadPool should be placed last so that ThreadPool should + // be destroyed first. ExecutionStrategy strategy_; std::vector local_scopes_; std::vector places_; - std::unique_ptr graph_; + ir::Graph *graph_; std::unordered_map op_deps_; std::vector bootstrap_ops_; - ::ThreadPool pool_; platform::DeviceContextPool fetch_ctxs_; std::atomic remaining_; - void RunOpAsync(std::unordered_map> *op_deps, - OpHandleBase *op, BlockingQueue *complete_q); - - void PrepareAtomicOpDeps(); - std::future< std::unique_ptr>>> atomic_op_deps_; ExceptionHolder exception_; + + ::ThreadPool pool_; + ::ThreadPool prepare_pool_; + + void RunOpAsync(std::unordered_map> *op_deps, + OpHandleBase *op, + const std::shared_ptr> &complete_q); + + void PrepareAtomicOpDeps(); }; } // namespace details } // namespace framework diff --git a/paddle/fluid/framework/details/fetch_barrier_op_handle.cc b/paddle/fluid/framework/details/fetch_barrier_op_handle.cc new file mode 100644 index 0000000000000000000000000000000000000000..019ecfbb61028537692c8fdeb874c6c490f75430 --- /dev/null +++ b/paddle/fluid/framework/details/fetch_barrier_op_handle.cc @@ -0,0 +1,66 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/details/fetch_barrier_op_handle.h" + +#include + +namespace paddle { +namespace framework { +namespace details { +FetchBarrierOpHandle::FetchBarrierOpHandle( + ir::Node *node, const std::vector &local_scopes, + const std::vector &places) + // fetch_barrier op always run on place0, but output on all places. + : OpHandleBase(node), + op_(framework::OpRegistry::CreateOp(*node->Op())), + local_scopes_(local_scopes), + places_(places), + run_scope_(local_scopes[0]), + place_(places[0]) { + for (auto &p : places) { + this->SetDeviceContext(p, platform::DeviceContextPool::Instance().Get(p)); + } +} + +bool FetchBarrierOpHandle::IsMultiDeviceTransfer() { + // override IsMultiDeviceTransfer to return true + return true; +} + +void FetchBarrierOpHandle::RunImpl() { + WaitInputVarGenerated(place_); + + auto run_func = [this]() { + op_->Run(*run_scope_->FindVar(kLocalExecScopeName)->Get(), place_); + }; + + if (is_lock_and_record_event_free_) { + run_func(); + } else { + this->RunAndRecordEvent(run_func); + } +} + +bool FetchBarrierOpHandle::NeedWait(VarHandleBase *in_var) { + bool need_wait = + in_var && in_var->GeneratedOp() && + in_var->GeneratedOp()->DeviceContext(place_) != dev_ctxes_.at(place_); + return need_wait; +} + +std::string FetchBarrierOpHandle::Name() const { return op_->Type(); } +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/fetch_barrier_op_handle.h b/paddle/fluid/framework/details/fetch_barrier_op_handle.h new file mode 100644 index 0000000000000000000000000000000000000000..b4d12785e0345c887f179bc53c8446dc1438f889 --- /dev/null +++ b/paddle/fluid/framework/details/fetch_barrier_op_handle.h @@ -0,0 +1,61 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include + +#include "paddle/fluid/framework/details/op_handle_base.h" +#include "paddle/fluid/framework/feed_fetch_type.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/platform/device_context.h" + +namespace paddle { +namespace framework { +namespace details { + +// **NOTE**: fetch_barrier op is special it outputs all recved variables on +// all places if there are multiple places, must init with +// multiple dev_ctxes_ !!!! + +struct FetchBarrierOpHandle : public OpHandleBase { + public: + FetchBarrierOpHandle(ir::Node *node, const std::vector &local_scopes, + const std::vector &places); + + bool IsMultiDeviceTransfer() override; + + std::string Name() const override; + + protected: + void RunImpl() override; + + bool NeedWait(VarHandleBase *in_var) override; + + private: + std::unique_ptr op_; + std::vector local_scopes_; + std::vector places_; + Scope *run_scope_; + platform::Place place_; + + bool is_lock_and_record_event_free_{false}; +}; + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/fetch_op_handle.cc b/paddle/fluid/framework/details/fetch_op_handle.cc index fe18b2060c5cd7e157374da53c5a985f70545ab7..6c8b8937ebe646042f71cb58cfbc2d32426a4e3c 100644 --- a/paddle/fluid/framework/details/fetch_op_handle.cc +++ b/paddle/fluid/framework/details/fetch_op_handle.cc @@ -13,9 +13,9 @@ // limitations under the License. #include "paddle/fluid/framework/details/fetch_op_handle.h" - #include #include +#include "paddle/fluid/platform/profiler.h" namespace paddle { namespace framework { @@ -28,11 +28,7 @@ FetchOpHandle::FetchOpHandle(ir::Node *node, FeedFetchList *data, size_t offset, offset_(offset), local_scopes_(local_scopes) {} -FetchOpHandle::~FetchOpHandle() { - for (auto *input_var : inputs_) { - input_var->RemoveOutput(this, this->Node()); - } -} +FetchOpHandle::~FetchOpHandle() {} void FetchOpHandle::RecordWaitEventOnCtx(platform::DeviceContext *waited_ctx) { PADDLE_THROW("Nobody should wait FetchOp. Unexpceted Error"); @@ -48,6 +44,7 @@ void FetchOpHandle::WaitAndMergeCPUTensors() const { } void FetchOpHandle::RunImpl() { + platform::RecordEvent record_event(Name()); WaitInputVarGenerated(platform::CPUPlace()); tensors_.resize(inputs_.size()); @@ -56,17 +53,18 @@ void FetchOpHandle::RunImpl() { for (size_t i = 0; i < inputs_.size(); ++i) { auto *var_handle = static_cast(inputs_[i]); - auto &scope = scopes.at(var_handle->scope_idx_); + auto &scope = scopes.at(var_handle->scope_idx()); auto *var = scope->FindVar(kLocalExecScopeName) ->Get() - ->FindVar(var_handle->name_); + ->FindVar(var_handle->name()); PADDLE_ENFORCE_NOT_NULL(var, "Cannot find variable %s in execution scope", - var_handle->name_); + var_handle->name()); auto &t = var->Get(); if (platform::is_gpu_place(t.place())) { #ifdef PADDLE_WITH_CUDA - TensorCopySync(t, cpu, &tensors_[i]); + TensorCopy(t, cpu, *dev_ctxes_.at(t.place()), &tensors_[i]); + dev_ctxes_.at(t.place())->Wait(); #endif } else { tensors_[i].ShareDataWith(t); @@ -86,6 +84,8 @@ void FetchOpHandle::WaitInputVarGenerated(const platform::Place &place) { } } +bool FetchOpHandle::IsMultiDeviceTransfer() { return true; } + std::string FetchOpHandle::Name() const { return "Fetch"; } } // namespace details diff --git a/paddle/fluid/framework/details/fetch_op_handle.h b/paddle/fluid/framework/details/fetch_op_handle.h index 6ce42f92d7f1e81eeafd1eb5c28ce3564a5ffebc..dbb7f4f6582f6e0f0b9b5702533852d12da1051c 100644 --- a/paddle/fluid/framework/details/fetch_op_handle.h +++ b/paddle/fluid/framework/details/fetch_op_handle.h @@ -39,6 +39,8 @@ struct FetchOpHandle : public OpHandleBase { std::string Name() const override; + bool IsMultiDeviceTransfer() override; + protected: void RunImpl() override; diff --git a/paddle/fluid/framework/details/fuse_adam_op_pass.cc b/paddle/fluid/framework/details/fuse_adam_op_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..0ef75e319244e2ccc63dfa3f93f0cd764cf67633 --- /dev/null +++ b/paddle/fluid/framework/details/fuse_adam_op_pass.cc @@ -0,0 +1,199 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/details/fuse_adam_op_pass.h" +#include +#include "paddle/fluid/framework/ir/graph_helper.h" +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace framework { +namespace details { + +const std::string FuseAdamOpPass::GetOpType() const { return "adam"; } + +const std::vector FuseAdamOpPass::GetAuxiliaryVarNames() const { + return {"Param", "Moment1", "Moment2", "Beta1Pow", "Beta2Pow"}; +} + +void FuseAdamOpPass::FuseOptimizerOps( + const std::unordered_map> + &aux_var_set, + const std::unordered_map &fused_vars_name, + const std::vector &adam_ops, ir::Graph *graph) const { + FuseAdamOps(aux_var_set, fused_vars_name, adam_ops, graph); + FuseScaleOps(aux_var_set.at("Beta1Pow"), fused_vars_name.at("Beta1Pow"), + adam_ops, graph); + FuseScaleOps(aux_var_set.at("Beta2Pow"), fused_vars_name.at("Beta2Pow"), + adam_ops, graph); +} + +void FuseAdamOpPass::FuseAdamOps( + const std::unordered_map> &vars_set, + const std::unordered_map &fused_vars_name, + const std::vector &adam_ops, ir::Graph *graph) const { + PADDLE_ENFORCE_GT(adam_ops.size(), static_cast(0)); + + // Check attributions + // NOTE: If new attribution is added, the following code maybe need change. + int op_role = boost::get( + adam_ops[0]->Op()->GetAttr(OpProtoAndCheckerMaker::OpRoleAttrName())); + float beta1 = boost::get(adam_ops[0]->Op()->GetAttr("beta1")); + float beta2 = boost::get(adam_ops[0]->Op()->GetAttr("beta2")); + float epsilon = boost::get(adam_ops[0]->Op()->GetAttr("epsilon")); + bool lazy_mode = boost::get(adam_ops[0]->Op()->GetAttr("lazy_mode")); + int64_t min_row_size_to_use_multithread = boost::get( + adam_ops[0]->Op()->GetAttr("min_row_size_to_use_multithread")); + for (auto &adam_op : adam_ops) { + PADDLE_ENFORCE_EQ(beta1, + boost::get(adam_op->Op()->GetAttr("beta1"))); + PADDLE_ENFORCE_EQ(beta2, + boost::get(adam_op->Op()->GetAttr("beta2"))); + PADDLE_ENFORCE_EQ(epsilon, + boost::get(adam_op->Op()->GetAttr("epsilon"))); + PADDLE_ENFORCE_EQ(lazy_mode, + boost::get(adam_op->Op()->GetAttr("lazy_mode"))); + PADDLE_ENFORCE_EQ(min_row_size_to_use_multithread, + boost::get(adam_op->Op()->GetAttr( + "min_row_size_to_use_multithread"))); + PADDLE_ENFORCE_EQ(op_role, boost::get(adam_op->Op()->GetAttr( + OpProtoAndCheckerMaker::OpRoleAttrName()))); + } + + // NOTE: fused_var is only exist in scope, so the graph doesn't have fused_var + // node. + + VLOG(10) << "Insert adam to graph "; + OpDesc adam_desc(adam_ops[0]->Op()->Block()); + adam_desc.SetType("adam"); + adam_desc.SetInput("Param", {fused_vars_name.at("Param")}); + adam_desc.SetInput("Grad", {fused_vars_name.at("Grad")}); + adam_desc.SetInput("Moment1", {fused_vars_name.at("Moment1")}); + adam_desc.SetInput("Moment2", {fused_vars_name.at("Moment2")}); + // TODO(zcd): The LearningRate, Beta1Pow, Beta2Pow should be equal. + adam_desc.SetInput("LearningRate", adam_ops[0]->Op()->Input("LearningRate")); + adam_desc.SetInput("Beta1Pow", adam_ops[0]->Op()->Input("Beta1Pow")); + adam_desc.SetInput("Beta2Pow", adam_ops[0]->Op()->Input("Beta2Pow")); + + adam_desc.SetOutput("ParamOut", {fused_vars_name.at("Param")}); + adam_desc.SetOutput("Moment1Out", {fused_vars_name.at("Moment1")}); + adam_desc.SetOutput("Moment2Out", {fused_vars_name.at("Moment2")}); + adam_desc.SetAttr("beta1", beta1); + adam_desc.SetAttr("beta2", beta2); + adam_desc.SetAttr("epsilon", epsilon); + adam_desc.SetAttr("lazy_mode", lazy_mode); + adam_desc.SetAttr("min_row_size_to_use_multithread", + min_row_size_to_use_multithread); + adam_desc.SetAttr(OpProtoAndCheckerMaker::OpRoleAttrName(), op_role); + + auto adam_node = graph->CreateOpNode(&adam_desc); + + InserInputAndOutputForOptOps(adam_ops, adam_node); +} + +void FuseAdamOpPass::FuseScaleOps(const std::vector &beta_name, + const std::string &fused_var_name, + const std::vector &adam_ops, + ir::Graph *graph) const { + PADDLE_ENFORCE_EQ(beta_name.size(), adam_ops.size()); + const std::string scale_op_name = "scale"; + + // Get the scale_ops of dealing the adam's beta var. + std::vector scale_ops; + scale_ops.reserve(beta_name.size()); + for (size_t i = 0; i < adam_ops.size(); ++i) { + auto &beta_1_pow_name = beta_name[i]; + auto beta_pow_iter = std::find_if( + adam_ops[i]->inputs.begin(), adam_ops[i]->inputs.end(), + [&beta_name, &beta_1_pow_name](ir::Node *var_node) -> bool { + return var_node->Var() && var_node->Var()->Name() == beta_1_pow_name; + }); + PADDLE_ENFORCE(beta_pow_iter != adam_ops[i]->inputs.end()); + + auto beta_pow_node = *beta_pow_iter; + auto scale_op_iter = std::find_if( + beta_pow_node->outputs.begin(), beta_pow_node->outputs.end(), + [&scale_op_name](ir::Node *op_node) -> bool { + return op_node->Op() && op_node->Op()->Type() == scale_op_name; + }); + PADDLE_ENFORCE(scale_op_iter != beta_pow_node->outputs.end()); + + scale_ops.emplace_back(*scale_op_iter); + } + PADDLE_ENFORCE_EQ(scale_ops.size(), beta_name.size()); + + // Check attributions + // NOTE: If new attribution is added, the following code maybe need change. + int op_role = boost::get( + scale_ops[0]->Op()->GetAttr(OpProtoAndCheckerMaker::OpRoleAttrName())); + float scale = boost::get(scale_ops[0]->Op()->GetAttr("scale")); + float bias = boost::get(scale_ops[0]->Op()->GetAttr("bias")); + bool bias_after_scale = + boost::get(scale_ops[0]->Op()->GetAttr("bias_after_scale")); + for (auto &scale_op : scale_ops) { + PADDLE_ENFORCE_EQ(scale, + boost::get(scale_op->Op()->GetAttr("scale"))); + PADDLE_ENFORCE_EQ(bias, boost::get(scale_op->Op()->GetAttr("bias"))); + PADDLE_ENFORCE_EQ( + bias_after_scale, + boost::get(scale_op->Op()->GetAttr("bias_after_scale"))); + PADDLE_ENFORCE_EQ(op_role, boost::get(scale_op->Op()->GetAttr( + OpProtoAndCheckerMaker::OpRoleAttrName()))); + } + + // NOTE: fused_var is only exist in scope, so the graph doesn't have fused_var + // node. + + VLOG(10) << "Insert fused scale to graph."; + OpDesc scale_desc(scale_ops[0]->Op()->Block()); + scale_desc.SetType("scale"); + scale_desc.SetInput("X", {fused_var_name}); + scale_desc.SetOutput("Out", {fused_var_name}); + scale_desc.SetAttr("scale", scale); + scale_desc.SetAttr("bias", bias); + scale_desc.SetAttr("bias_after_scale", bias_after_scale); + scale_desc.SetAttr(OpProtoAndCheckerMaker::OpRoleAttrName(), op_role); + auto scale_node = graph->CreateOpNode(&scale_desc); + + for (auto scale_op : scale_ops) { + // set inputs + scale_node->inputs.insert(scale_node->inputs.begin(), + scale_op->inputs.begin(), scale_op->inputs.end()); + for (auto &input : scale_op->inputs) { + std::replace(input->outputs.begin(), input->outputs.end(), scale_op, + scale_node); + } + // set outputs + scale_node->outputs.insert(scale_node->outputs.begin(), + scale_op->outputs.begin(), + scale_op->outputs.end()); + for (auto &output : scale_op->outputs) { + std::replace(output->inputs.begin(), output->inputs.end(), scale_op, + scale_node); + } + } + + // Delete scale_ops + for (auto &scale_op : scale_ops) { + graph->RemoveNode(scale_op); + } +} + +} // namespace details +} // namespace framework +} // namespace paddle + +REGISTER_PASS(fuse_adam_op_pass, paddle::framework::details::FuseAdamOpPass) + .RequirePassAttr(paddle::framework::details::kPlaces) + .RequirePassAttr(paddle::framework::details::kLocalScopes); diff --git a/paddle/fluid/framework/details/fuse_adam_op_pass.h b/paddle/fluid/framework/details/fuse_adam_op_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..5866c37552e26d9b14fa946e119f20121ecf7cb2 --- /dev/null +++ b/paddle/fluid/framework/details/fuse_adam_op_pass.h @@ -0,0 +1,55 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include "paddle/fluid/framework/details/build_strategy.h" +#include "paddle/fluid/framework/details/fuse_optimizer_op_pass.h" +#include "paddle/fluid/framework/details/multi_devices_helper.h" +#include "paddle/fluid/framework/ir/graph.h" + +namespace paddle { +namespace framework { +namespace details { + +class FuseAdamOpPass : public FuseOptimizerOpPass { + private: + virtual const std::string GetOpType() const; + + virtual const std::vector GetAuxiliaryVarNames() const; + + // Fuse Adam Ops and Scale Ops which are used to update "Beta1Pow", "Beta2Pow" + virtual void FuseOptimizerOps( + const std::unordered_map> &vars_set, + const std::unordered_map &fused_vars_name, + const std::vector &adam_ops, ir::Graph *graph) const; + + void FuseAdamOps( + const std::unordered_map> &vars_set, + const std::unordered_map &fused_vars_name, + const std::vector &adam_ops, ir::Graph *graph) const; + + void FuseScaleOps(const std::vector &aux_var_set, + const std::string &fused_var_name, + const std::vector &adam_ops, + ir::Graph *graph) const; +}; + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/fuse_all_reduce_op_pass.cc b/paddle/fluid/framework/details/fuse_all_reduce_op_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..31efd78ad3dbed73d7993bac47694c9d6d742343 --- /dev/null +++ b/paddle/fluid/framework/details/fuse_all_reduce_op_pass.cc @@ -0,0 +1,193 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include +#include + +#include "paddle/fluid/framework/details/all_reduce_op_handle.h" +#include "paddle/fluid/framework/details/container_cast.h" +#include "paddle/fluid/framework/details/fused_all_reduce_op_handle.h" +#include "paddle/fluid/framework/details/multi_devices_helper.h" +#include "paddle/fluid/framework/ir/graph_helper.h" + +namespace paddle { +namespace framework { +namespace details { + +class FuseAllReduceOpPass : public ir::Pass { + protected: + void ApplyImpl(ir::Graph *graph) const override { + ir::Graph &result = *graph; + + auto &places = Get>(kPlaces); + auto &local_scopes = Get>(kLocalScopes); +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + auto *nccl_ctxs = &Get(kNCCLCtxs); +#endif + + std::unordered_set grads; + auto ¶ms_grads = result.Get(kParamsAndGrads); + size_t num_of_all_reduce = params_grads.size(); + grads.reserve(num_of_all_reduce); + for (auto p_g : params_grads) { + grads.insert(p_g.second); + } + + size_t num_place = places.size(); + std::unordered_map all_reduce_ops; + all_reduce_ops.reserve(grads.size()); + for (auto &node : result.Nodes()) { + if (node->IsOp()) { + PADDLE_ENFORCE(node->IsWrappedBy()); + auto *all_reduce_op_handle = + dynamic_cast(&node->Wrapper()); + if (all_reduce_op_handle) { + auto inputs = DynamicCast(all_reduce_op_handle->Inputs()); + PADDLE_ENFORCE_EQ(inputs.size(), num_place); + // The inputs' name should be the same. + auto &grad_name = inputs[0]->name(); + for (size_t i = 1; i < inputs.size(); ++i) { + PADDLE_ENFORCE_EQ(inputs[i]->name(), grad_name, + "The input name should be the same."); + } + PADDLE_ENFORCE_NE(grads.count(grad_name), static_cast(0)); + all_reduce_ops.emplace(grad_name, node); + } + } + } + + VLOG(10) << "Find all_reduce_ops: " << all_reduce_ops.size(); + if (all_reduce_ops.size() == 0) { + return; + } + + PADDLE_ENFORCE_EQ(all_reduce_ops.size(), grads.size(), + "The number of all_reduce OpHandle is not equal to the " + "number of grads. Maybe some gradients are sparse type, " + "it is not supported currently."); + VLOG(10) << "Insert fused_all_reduce"; + + auto &group_grads_params = + graph->Get(kGroupGradsAndParams); + + for (auto &group_g_p : group_grads_params) { + size_t group_size = group_g_p.size(); + PADDLE_ENFORCE_GT(group_size, static_cast(0)); + std::vector group_all_reduce_ops; + group_all_reduce_ops.reserve(group_size); + for (auto &g_p : group_g_p) { + group_all_reduce_ops.emplace_back(all_reduce_ops.at(g_p.first)); + } +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + InsertFusedAllReduce(places, local_scopes, group_size, + group_all_reduce_ops, nccl_ctxs, &result); +#else + InsertFusedAllReduce(places, local_scopes, group_size, + group_all_reduce_ops, &result); +#endif + } + } + + void InsertFusedAllReduce(const std::vector &places, + const std::vector &local_scopes, + const size_t num_of_all_reduce, + const std::vector &all_reduce_ops, +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + const platform::NCCLContextMap *nccl_ctxs, +#endif + ir::Graph *result) const { + std::vector inputs; + std::vector outputs; + for (auto &op : all_reduce_ops) { + auto &op_handle = op->Wrapper(); + inputs.insert(inputs.end(), op_handle.Inputs().begin(), + op_handle.Inputs().end()); + // Remove output + for_each(op_handle.Inputs().begin(), op_handle.Inputs().end(), + [&op_handle](VarHandleBase *var_handle) { + var_handle->RemoveOutput(&op_handle, op_handle.Node()); + }); + + outputs.insert(outputs.end(), op_handle.Outputs().begin(), + op_handle.Outputs().end()); + // Remove Input + for_each( + op_handle.Outputs().begin(), op_handle.Outputs().end(), + [](VarHandleBase *var_handle) { var_handle->ClearGeneratedOp(); }); + + result->RemoveNode(op_handle.Node()); + } + +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + CreateFusedAllReduceOp(inputs, outputs, num_of_all_reduce, places, + local_scopes, nccl_ctxs, result); +#else + CreateFusedAllReduceOp(inputs, outputs, num_of_all_reduce, places, + local_scopes, result); +#endif + } + + private: + void CreateFusedAllReduceOp(const std::vector &inputs, + const std::vector &outputs, + const size_t num_of_all_reduce, + const std::vector &places, + const std::vector &local_scopes, +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + const platform::NCCLContextMap *nccl_ctxs, +#endif + ir::Graph *result) const { +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + auto *op_handle = new FusedAllReduceOpHandle( + result->CreateEmptyNode("fused_all_reduce", ir::Node::Type::kOperation), + local_scopes, places, num_of_all_reduce, nccl_ctxs); +#else + auto *op_handle = new FusedAllReduceOpHandle( + result->CreateEmptyNode("fused_all_reduce", ir::Node::Type::kOperation), + local_scopes, places, num_of_all_reduce); +#endif + + for (auto in : inputs) { + op_handle->AddInput(in); + } + + for (auto out : outputs) { + op_handle->AddOutput(out); + } + +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + if (!nccl_ctxs) { + SetCommunicationContext(places, op_handle); + } +#else + SetCommunicationContext(places, op_handle); +#endif + } + + void SetCommunicationContext(const std::vector &places, + FusedAllReduceOpHandle *op_handle) const { + for (size_t i = 0; i < places.size(); ++i) { + op_handle->SetDeviceContext( + places[i], platform::DeviceContextPool::Instance().Get(places[i])); + } + } +}; + +} // namespace details +} // namespace framework +} // namespace paddle + +REGISTER_PASS(fuse_all_reduce_op_pass, + paddle::framework::details::FuseAllReduceOpPass); diff --git a/paddle/fluid/framework/details/fuse_optimizer_op_pass.cc b/paddle/fluid/framework/details/fuse_optimizer_op_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..b49f095d428a017dd1a3bed2788a048af9afa6bb --- /dev/null +++ b/paddle/fluid/framework/details/fuse_optimizer_op_pass.cc @@ -0,0 +1,240 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/details/fuse_optimizer_op_pass.h" +#include +#include +#include "paddle/fluid/framework/ir/graph_helper.h" +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace framework { +namespace details { + +void FuseOptimizerOpPass::ApplyImpl(ir::Graph *graph) const { + ir::Graph &result = *graph; + + auto &places = Get>(kPlaces); + auto &local_scopes = Get>(kLocalScopes); + + const std::string fuse_op_type = GetOpType(); + const std::vector aux_var_names = GetAuxiliaryVarNames(); + + // Step 1: Get the specified op and auxiliary variables. + std::vector topo_nodes = ir::TopologySortOperations(result); + std::unordered_map> aux_var_set; + std::vector opt_ops; + for (auto &node : topo_nodes) { + GetSpecifiedOpsAndVars(fuse_op_type, aux_var_names, node, &opt_ops, + &aux_var_set); + } + + VLOG(10) << "Find " << fuse_op_type << " operators: " << opt_ops.size(); + if (opt_ops.size() == 0) { + return; + } + + if (result.Has(kFusedOptType)) { + VLOG(10) + << "Currently only support fusing one type optimizer op. Has fused " + << result.Get(kFusedOptType); + return; + } else { + result.Set(kFusedOptType, new FusedOptType); + } + result.Get(kFusedOptType) = fuse_op_type; + + // Step 2: Insert fused_var_name to FusedVars, and the FusedVars need be + // initialized in scopes before execution. + if (!result.Has(kFusedVars)) { + result.Set(kFusedVars, new FusedVars); + } + std::unordered_map fused_vars_name; + fused_vars_name.reserve(aux_var_names.size() + 1); + auto &fused_var_set = result.Get(kFusedVars); + const std::string prefix(kFusedVarNamePrefix); + // NOTE: the fused_var_name should be unique. + for (auto &var_name : aux_var_names) { + auto fused_var_name = prefix + "_" + fuse_op_type + "_" + var_name + "_" + + aux_var_set[var_name][0]; + VLOG(10) << fused_var_name; + fused_vars_name.emplace(var_name, fused_var_name); + PADDLE_ENFORCE_EQ(fused_var_set.count(fused_var_name), 0); + fused_var_set.insert(fused_var_name); + } + + // Step 3: Get the fused Gradient's name + auto ¶ms_grads = result.Get(kParamsAndGrads); + if (!result.Has(kFusedGrads)) { + PADDLE_THROW( + "The alloc_continuous_space_for_grad_pass should be called before this " + "pass."); + } + auto &fused_grad = result.Get(kFusedGrads); + auto &fused_vars = result.Get(kFusedVars); + auto iter = std::find(fused_vars.begin(), fused_vars.end(), fused_grad); + PADDLE_ENFORCE(iter != fused_vars.end(), "Not find the fused_grad."); + fused_vars_name.emplace("Grad", fused_grad); + + // Step 4: Sort the parameters and auxiliary variables according + // to parameters' name to make variables' name correspond correctly. + PADDLE_ENFORCE(result.Has(kParamsAndGrads), "Does't find kParamsAndGrads."); + PADDLE_ENFORCE_EQ(params_grads.size(), aux_var_set.begin()->second.size(), + "The size of params_grads and aux_var_set are not equal."); + SortParametersAndAuxVars(params_grads, &aux_var_set, &opt_ops); + + // Step 5: Alloc continuous space for Parameters and AuxiliaryVar(e.g. + // Moment1, Moment2, Beta1Pow, Beta2Pow) of all the optimizer ops separately. + InitFusedVarsAndAllocSpaceForVars(places, local_scopes, aux_var_names, + aux_var_set, fused_vars_name); + + // Step 6: Fuse optimizer Ops and Scale Ops + FuseOptimizerOps(aux_var_set, fused_vars_name, opt_ops, &result); + + // Step 7: Remove optimizer Ops + for (auto &opt_op : opt_ops) { + graph->RemoveNode(opt_op); + } +} + +void FuseOptimizerOpPass::InitFusedVarsAndAllocSpaceForVars( + const std::vector &places, + const std::vector &local_scopes, + const std::vector &aux_var_names, + const std::unordered_map> + &aux_var_set, + const std::unordered_map &fused_vars_name) const { + VLOG(10) << "Init FusedVars."; + // Alloc parameters and auxiliary vars in the respective scope. + size_t idx = local_scopes.size(); + for (auto iter = local_scopes.rbegin(); iter != local_scopes.rend(); + ++iter, --idx) { + auto &scope = *iter; + for (auto &var_name : aux_var_names) { + auto fused_var_name = fused_vars_name.at(var_name); + VLOG(10) << "Init " << fused_var_name; + PADDLE_ENFORCE(scope->FindVar(fused_var_name) == nullptr, + "%s has exist in scope[%d]", fused_var_name, idx); + scope->Var(fused_var_name)->GetMutable(); + } + } + + ProgramDesc program_desc; + auto *global_block = program_desc.MutableBlock(0); + for (auto &var_name : aux_var_names) { + AppendAllocContinuousSpace(aux_var_set.at(var_name), + fused_vars_name.at(var_name), true, + global_block); + } + + for (size_t i = 0; i < local_scopes.size(); ++i) { + for (auto &op_desc : global_block->AllOps()) { + auto op = OpRegistry::CreateOp(*op_desc); + op->Run(*local_scopes[i], places[i]); + } + } +} + +void FuseOptimizerOpPass::SortParametersAndAuxVars( + const std::vector> ¶ms_grads, + std::unordered_map> *aux_vars_set, + std::vector *ops) const { + PADDLE_ENFORCE_NE(aux_vars_set->count("Param"), static_cast(0)); + auto ¶m_vec = aux_vars_set->at("Param"); + + std::vector param_sort_idx; + param_sort_idx.reserve(param_vec.size()); + + for (auto &p_g : params_grads) { + auto iter = std::find(param_vec.begin(), param_vec.end(), p_g.first); + PADDLE_ENFORCE(iter != param_vec.end()); + auto idx = std::distance(param_vec.begin(), iter); + param_sort_idx.emplace_back(idx); + } + + for (auto &aux_vars : *aux_vars_set) { + std::vector sorted_vars; + sorted_vars.reserve(aux_vars.second.size()); + for (size_t i = 0; i < aux_vars.second.size(); ++i) { + sorted_vars.emplace_back(aux_vars.second.at(param_sort_idx[i])); + } + std::swap(aux_vars.second, sorted_vars); + + std::stringstream out; + for (auto &var_name : aux_vars.second) { + out << var_name << " "; + } + VLOG(10) << aux_vars.first << ": " << out.str(); + } + + std::vector sorted_ops; + sorted_ops.reserve(ops->size()); + for (size_t i = 0; i < ops->size(); ++i) { + sorted_ops.emplace_back(ops->at(param_sort_idx[i])); + } + std::swap(*ops, sorted_ops); +} + +void FuseOptimizerOpPass::GetSpecifiedOpsAndVars( + const std::string &op_type, const std::vector &aux_vars_name, + ir::Node *node, std::vector *ops, + std::unordered_map> *aux_args_name) + const { + if (node->Op()->Type() != op_type) return; + + for (auto &var_n : aux_vars_name) { + auto arg_names = node->Op()->Input(var_n); + PADDLE_ENFORCE_EQ(arg_names.size(), static_cast(1)); + (*aux_args_name)[var_n].emplace_back(arg_names[0]); + VLOG(10) << var_n << ", " << arg_names[0]; + } + ops->emplace_back(node); +} + +void FuseOptimizerOpPass::AppendAllocContinuousSpace( + const std::vector &args, const std::string &out_arg, + bool copy_data, BlockDesc *global_block) const { + auto op_desc = global_block->AppendOp(); + op_desc->SetType("alloc_continuous_space"); + op_desc->SetInput("Input", args); + op_desc->SetOutput("Output", args); + op_desc->SetOutput("FusedOutput", {out_arg}); + op_desc->SetAttr("copy_data", copy_data); + op_desc->SetAttr("check_name", true); +} + +void FuseOptimizerOpPass::InserInputAndOutputForOptOps( + const std::vector &opt_ops, ir::Node *opt_node) const { + std::unordered_set inputs; + std::unordered_set outputs; + for (auto opt_op : opt_ops) { + // set inputs + inputs.insert(opt_op->inputs.begin(), opt_op->inputs.end()); + for (auto &input : opt_op->inputs) { + replace(input->outputs.begin(), input->outputs.end(), opt_op, opt_node); + } + // set outputs + outputs.insert(opt_op->outputs.begin(), opt_op->outputs.end()); + for (auto &output : opt_op->outputs) { + replace(output->inputs.begin(), output->inputs.end(), opt_op, opt_node); + } + } + opt_node->inputs.insert(opt_node->inputs.begin(), inputs.begin(), + inputs.end()); + opt_node->outputs.insert(opt_node->outputs.begin(), outputs.begin(), + outputs.end()); +} +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/fuse_optimizer_op_pass.h b/paddle/fluid/framework/details/fuse_optimizer_op_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..0240f1594d7ef9d855eb6e96e8e8a32ee1d957ba --- /dev/null +++ b/paddle/fluid/framework/details/fuse_optimizer_op_pass.h @@ -0,0 +1,75 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include +#include "paddle/fluid/framework/details/build_strategy.h" +#include "paddle/fluid/framework/details/multi_devices_helper.h" +#include "paddle/fluid/framework/ir/graph.h" + +namespace paddle { +namespace framework { +namespace details { + +class FuseOptimizerOpPass : public ir::Pass { + protected: + void ApplyImpl(ir::Graph *graph) const override; + + protected: + virtual void SortParametersAndAuxVars( + const std::vector> ¶ms_grads, + std::unordered_map> *aux_var_set, + std::vector *ops) const; + + void InserInputAndOutputForOptOps(const std::vector &opt_ops, + ir::Node *opt_node) const; + + private: + virtual const std::string GetOpType() const = 0; + + virtual const std::vector GetAuxiliaryVarNames() const = 0; + + virtual void FuseOptimizerOps( + const std::unordered_map> &vars_set, + const std::unordered_map &fused_vars_name, + const std::vector &adam_ops, ir::Graph *graph) const = 0; + + void GetSpecifiedOpsAndVars( + const std::string &op_type, const std::vector &aux_vars_name, + ir::Node *node, std::vector *ops, + std::unordered_map> *aux_args_name) + const; + + void AppendAllocContinuousSpace(const std::vector &args, + const std::string &out_arg, bool copy_data, + BlockDesc *global_block) const; + + void InitFusedVarsAndAllocSpaceForVars( + const std::vector &places, + const std::vector &local_scopes, + const std::vector &aux_var_names, + const std::unordered_map> + &aux_var_set, + const std::unordered_map &fused_vars_name) + const; +}; + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/fuse_sgd_op_pass.cc b/paddle/fluid/framework/details/fuse_sgd_op_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..f91c21e3cc869de1a6d67146eb99f27a2ca5497c --- /dev/null +++ b/paddle/fluid/framework/details/fuse_sgd_op_pass.cc @@ -0,0 +1,74 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/details/fuse_sgd_op_pass.h" +#include +#include "paddle/fluid/framework/ir/graph_helper.h" +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace framework { +namespace details { + +const std::string FuseSgdOpPass::GetOpType() const { return "sgd"; } + +const std::vector FuseSgdOpPass::GetAuxiliaryVarNames() const { + return {"Param"}; +} + +void FuseSgdOpPass::FuseOptimizerOps( + const std::unordered_map> + &aux_var_set, + const std::unordered_map &fused_vars_name, + const std::vector &sgd_ops, ir::Graph *graph) const { + FuseSgdOps(aux_var_set, fused_vars_name, sgd_ops, graph); +} + +void FuseSgdOpPass::FuseSgdOps( + const std::unordered_map> &vars_set, + const std::unordered_map &fused_vars_name, + const std::vector &sgd_ops, ir::Graph *graph) const { + PADDLE_ENFORCE_GT(sgd_ops.size(), static_cast(0)); + + // NOTE: fused_var is only exist in scope, so the graph doesn't have fused_var + // node. + + int op_role = boost::get( + sgd_ops[0]->Op()->GetAttr(OpProtoAndCheckerMaker::OpRoleAttrName())); + VLOG(10) << "Insert sgd to graph "; + // Add fused scale + OpDesc Sgd_desc(sgd_ops[0]->Op()->Block()); + Sgd_desc.SetType("sgd"); + Sgd_desc.SetInput("Param", {fused_vars_name.at("Param")}); + Sgd_desc.SetInput("Grad", {fused_vars_name.at("Grad")}); + Sgd_desc.SetOutput("ParamOut", {fused_vars_name.at("Param")}); + + // TODO(zcd): The LearningRate, Beta1Pow, Beta2Pow should be equal. + Sgd_desc.SetInput("LearningRate", sgd_ops[0]->Op()->Input("LearningRate")); + + // NOTE: multi_devices_pass requires that every op should have a role. + Sgd_desc.SetAttr(OpProtoAndCheckerMaker::OpRoleAttrName(), op_role); + + auto sgd_node = graph->CreateOpNode(&Sgd_desc); + + InserInputAndOutputForOptOps(sgd_ops, sgd_node); +} + +} // namespace details +} // namespace framework +} // namespace paddle + +REGISTER_PASS(fuse_sgd_op_pass, paddle::framework::details::FuseSgdOpPass) + .RequirePassAttr(paddle::framework::details::kPlaces) + .RequirePassAttr(paddle::framework::details::kLocalScopes); diff --git a/paddle/fluid/framework/details/fuse_sgd_op_pass.h b/paddle/fluid/framework/details/fuse_sgd_op_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..b3aa6a203b726a5a1540ce533c0305d7f579d4a9 --- /dev/null +++ b/paddle/fluid/framework/details/fuse_sgd_op_pass.h @@ -0,0 +1,50 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include "paddle/fluid/framework/details/build_strategy.h" +#include "paddle/fluid/framework/details/fuse_optimizer_op_pass.h" +#include "paddle/fluid/framework/details/multi_devices_helper.h" +#include "paddle/fluid/framework/ir/graph.h" + +namespace paddle { +namespace framework { +namespace details { + +class FuseSgdOpPass : public FuseOptimizerOpPass { + private: + virtual const std::string GetOpType() const; + + virtual const std::vector GetAuxiliaryVarNames() const; + + // Fuse Sgd Ops + virtual void FuseOptimizerOps( + const std::unordered_map> &vars_set, + const std::unordered_map &fused_vars_name, + const std::vector &sgd_ops, ir::Graph *graph) const; + + void FuseSgdOps( + const std::unordered_map> &vars_set, + const std::unordered_map &fused_vars_name, + const std::vector &sgd_ops, ir::Graph *graph) const; +}; + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/fuse_vars_op_handle.cc b/paddle/fluid/framework/details/fuse_vars_op_handle.cc deleted file mode 100644 index 018c9bff71e553d8a3641f06f10b350453676b24..0000000000000000000000000000000000000000 --- a/paddle/fluid/framework/details/fuse_vars_op_handle.cc +++ /dev/null @@ -1,51 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include "paddle/fluid/framework/details/fuse_vars_op_handle.h" - -namespace paddle { -namespace framework { -namespace details { - -void FuseVarsOpHandle::RunImpl() { - WaitInputVarGenerated(place_); - - auto in_var_handles = DynamicCast(this->Inputs()); - auto out_var_handles = DynamicCast(this->Outputs()); - PADDLE_ENFORCE_EQ(in_var_handles.size(), 0); - PADDLE_ENFORCE_EQ(out_var_handles.size() - 1, inputs_numel_.size(), ""); - - auto scope = local_scope_->FindVar(kLocalExecScopeName)->Get(); - - auto out_var_handle = out_var_handles[0]; - auto out_var = scope->Var(out_var_handle->name_); - - auto out_tensor = out_var->GetMutable(); - out_tensor->Resize({total_numel_}).mutable_data(this->place_, type_); - - int64_t s = 0; - for (size_t i = 1; i < out_var_handles.size(); ++i) { - auto out_name = out_var_handles[i]->name_; - auto out_t = scope->Var(out_name)->GetMutable(); - auto numel = this->inputs_numel_.at(out_name); - out_t->ShareDataWith(out_tensor->Slice(s, s + numel)); - s += numel; - } - this->RunAndRecordEvent([] {}); -} - -std::string FuseVarsOpHandle::Name() const { return "fuse vars"; } -} // namespace details -} // namespace framework -} // namespace paddle diff --git a/paddle/fluid/framework/details/fuse_vars_op_handle.h b/paddle/fluid/framework/details/fuse_vars_op_handle.h deleted file mode 100644 index 3f360c510a4fdc0caaeb15d862b217ef41b8ea6e..0000000000000000000000000000000000000000 --- a/paddle/fluid/framework/details/fuse_vars_op_handle.h +++ /dev/null @@ -1,65 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#pragma once - -#include -#include -#include - -#include "paddle/fluid/framework/details/container_cast.h" -#include "paddle/fluid/framework/details/op_handle_base.h" -#include "paddle/fluid/framework/lod_tensor.h" -#include "paddle/fluid/framework/scope.h" -#include "paddle/fluid/platform/device_context.h" - -namespace paddle { -namespace framework { -namespace details { - -struct FuseVarsOpHandle : public OpHandleBase { - public: - FuseVarsOpHandle(ir::Node *node, Scope *local_scope, - const platform::Place &place, - const std::unordered_map &inputs_numel, - const std::type_index &var_type) - : OpHandleBase(node), - local_scope_(local_scope), - place_(place), - inputs_numel_(inputs_numel), - type_(var_type) { - total_numel_ = 0; - for (auto in_numel : inputs_numel) { - PADDLE_ENFORCE_GT(in_numel.second, 0); - total_numel_ += in_numel.second; - } - } - - std::string Name() const override; - - bool IsMultiDeviceTransfer() override { return false; }; - - protected: - void RunImpl() override; - - private: - Scope *local_scope_; - const platform::Place place_; - const std::unordered_map inputs_numel_; - const std::type_index type_; - int64_t total_numel_; -}; -} // namespace details -} // namespace framework -} // namespace paddle diff --git a/paddle/fluid/framework/details/fused_all_reduce_op_handle.cc b/paddle/fluid/framework/details/fused_all_reduce_op_handle.cc new file mode 100644 index 0000000000000000000000000000000000000000..a57d670f118f2eb0bdcbeb7ed080729e4f9e4f2b --- /dev/null +++ b/paddle/fluid/framework/details/fused_all_reduce_op_handle.cc @@ -0,0 +1,266 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. +#include "paddle/fluid/framework/details/fused_all_reduce_op_handle.h" +#include +#include +#include "paddle/fluid/framework/details/container_cast.h" +#include "paddle/fluid/framework/details/reduce_and_gather.h" +#include "paddle/fluid/framework/details/variable_visitor.h" +#include "paddle/fluid/platform/profiler.h" + +DEFINE_bool(skip_fused_all_reduce_check, false, ""); +namespace paddle { +namespace framework { +namespace details { + +// Note(zcd): Addresses should be aligned, otherwise, the results may have +// diff. +static size_t Alignment(size_t size, const platform::Place &place) { + // Allow to allocate the minimum chunk size is 4 KB. + size_t alignment = 1 << 12; + if (platform::is_gpu_place(place)) { + // Allow to allocate the minimum chunk size is 256 B. + alignment = 1 << 8; + } + size_t remaining = size % alignment; + return remaining == 0 ? size : size + (alignment - remaining); +} + +typedef std::vector>> + GradientAndLoDTensor; + +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) +FusedAllReduceOpHandle::FusedAllReduceOpHandle( + ir::Node *node, const std::vector &local_scopes, + const std::vector &places, const size_t num_of_all_reduce, + const platform::NCCLContextMap *ctxs) + : OpHandleBase(node), + local_scopes_(local_scopes), + places_(places), + num_of_all_reduce_(num_of_all_reduce), + nccl_ctxs_(ctxs) { + if (nccl_ctxs_) { + for (auto &p : places_) { + this->SetDeviceContext(p, nccl_ctxs_->DevCtx(p)); + } + } + PADDLE_ENFORCE_EQ(places_.size(), local_scopes_.size()); +} +#else + +FusedAllReduceOpHandle::FusedAllReduceOpHandle( + ir::Node *node, const std::vector &local_scopes, + const std::vector &places, const size_t num_of_all_reduce) + : OpHandleBase(node), + local_scopes_(local_scopes), + places_(places), + num_of_all_reduce_(num_of_all_reduce) { + PADDLE_ENFORCE_EQ(places_.size(), local_scopes_.size()); +} + +#endif + +void FusedAllReduceOpHandle::RunImpl() { + platform::RecordEvent record_event(Name()); + + VLOG(4) << this->DebugString(); + + WaitInputVarGenerated(); + // The input: grad0(dev0), grad0(dev1), grad1(dev0), grad1(dev1)... + // The output: grad0(dev0), grad0(dev1), grad1(dev0), grad1(dev1)... + auto in_var_handles = DynamicCast(this->Inputs()); + auto out_var_handles = DynamicCast(this->Outputs()); + + size_t place_num = places_.size(); + PADDLE_ENFORCE_EQ( + in_var_handles.size(), place_num * num_of_all_reduce_, + "The NoDummyInputSize should be equal to the number of places."); + PADDLE_ENFORCE_EQ( + in_var_handles.size(), out_var_handles.size(), + "The NoDummyInputSize and NoDummyOutputSize should be equal."); + + GradientAndLoDTensor grads_tensor; + grads_tensor.resize(place_num); + + int64_t numel = -1; + auto dtype = static_cast(0); + for (size_t scope_idx = 0; scope_idx < local_scopes_.size(); ++scope_idx) { + auto &g_tensor = grads_tensor.at(scope_idx); + g_tensor.reserve(num_of_all_reduce_); + + GetGradLoDTensor(scope_idx, in_var_handles, out_var_handles, &g_tensor); + + int64_t element_num = 0; + framework::proto::VarType::Type ele_dtype = + static_cast(0); + GetDTypeAndNumel(g_tensor, &ele_dtype, &element_num); + + if (numel == -1) { + numel = element_num; + } + if (dtype == static_cast(0)) { + dtype = ele_dtype; + PADDLE_ENFORCE_NE(ele_dtype, + static_cast(0)); + } + PADDLE_ENFORCE_EQ(ele_dtype, dtype); + + // Check whether the address space is contiguous. + std::sort( + g_tensor.begin(), g_tensor.end(), + [](const std::pair &grad1, + const std::pair &grad2) -> bool { + return grad1.second->data() < grad2.second->data(); + }); + + size_t size_of_dtype = framework::SizeOfType(dtype); + for (size_t k = 1; k < g_tensor.size(); ++k) { + const void *cur_address = g_tensor.at(k - 1).second->data(); + int64_t len = g_tensor.at(k - 1).second->numel(); + auto offset = Alignment(len * size_of_dtype, places_[0]); + void *infer_next_address = reinterpret_cast( + reinterpret_cast(cur_address) + offset); + const void *next_address = g_tensor.at(k).second->data(); + + VLOG(10) << string::Sprintf( + "Input[%d](%s) address: 0X%02x, Input[%d](%s) address: 0X%02x, Infer " + "input[%d] address: 0X%02x. The offset: %d", + k - 1, g_tensor.at(k - 1).first, cur_address, g_tensor.at(k).first, k, + next_address, k, infer_next_address, offset); + PADDLE_ENFORCE_EQ(infer_next_address, next_address, + "The address is not consistent."); + } + } + + if (!FLAGS_skip_fused_all_reduce_check) { + for (size_t scope_idx = 0; scope_idx < place_num; ++scope_idx) { + for (size_t j = 1; j < num_of_all_reduce_; ++j) { + PADDLE_ENFORCE_EQ(grads_tensor.at(0).at(j).first, + grads_tensor.at(scope_idx).at(j).first); + } + } + } + + std::vector lod_tensor_data; + for (size_t scope_idx = 0; scope_idx < place_num; ++scope_idx) { + auto data = grads_tensor.at(scope_idx).at(0).second->data(); + lod_tensor_data.emplace_back(data); + } + + if (platform::is_gpu_place(places_[0])) { +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + PADDLE_ENFORCE(nccl_ctxs_, "nccl_ctxs should not be nullptr."); + int nccl_dtype = platform::ToNCCLDataType(dtype); + std::vector> all_reduce_calls; + for (size_t i = 0; i < local_scopes_.size(); ++i) { + auto &p = places_[i]; + void *buffer = const_cast(lod_tensor_data.at(i)); + + int dev_id = boost::get(p).device; + auto &nccl_ctx = nccl_ctxs_->at(dev_id); + auto stream = nccl_ctx.stream(); + auto comm = nccl_ctx.comm_; + all_reduce_calls.emplace_back([=] { + PADDLE_ENFORCE(platform::dynload::ncclAllReduce( + buffer, buffer, numel, static_cast(nccl_dtype), + ncclSum, comm, stream)); + }); + } + + this->RunAndRecordEvent([&] { + if (all_reduce_calls.size() == 1UL) { + // Do not use NCCLGroup when manage NCCL by per thread per device + all_reduce_calls[0](); + } else { + platform::NCCLGroupGuard guard; + for (auto &call : all_reduce_calls) { + call(); + } + } + }); +#else + PADDLE_THROW("Not compiled with CUDA"); +#endif + } else { + // Special handle CPU only Operator's gradient. Like CRF + auto grad_name = grads_tensor.at(0).at(0).first; + auto &trg = *this->local_scopes_[0] + ->FindVar(kLocalExecScopeName) + ->Get() + ->FindVar(grad_name) + ->GetMutable(); + + // Reduce All data to trg in CPU + ReduceBufferData func(lod_tensor_data, trg.data(), numel); + VisitDataType(trg.type(), func); + + for (size_t i = 1; i < local_scopes_.size(); ++i) { + auto &scope = + *local_scopes_[i]->FindVar(kLocalExecScopeName)->Get(); + auto &p = places_[i]; + auto *var = scope.FindVar(grad_name); + auto *dev_ctx = dev_ctxes_.at(p); + size_t size = numel * SizeOfType(trg.type()); + RunAndRecordEvent(p, [&trg, var, dev_ctx, p, size] { + auto dst_ptr = var->GetMutable()->data(); + platform::CPUPlace cpu_place; + memory::Copy(cpu_place, dst_ptr, cpu_place, trg.data(), size); + }); + } + } +} + +void FusedAllReduceOpHandle::GetGradLoDTensor( + const size_t &scope_idx, const std::vector &in_var_handles, + const std::vector &out_var_handles, + std::vector> *grad_tensor) const { + auto *local_scope = + local_scopes_.at(scope_idx)->FindVar(kLocalExecScopeName)->Get(); + size_t place_num = places_.size(); + + for (size_t j = 0; j < in_var_handles.size(); j += place_num) { + auto var_name = in_var_handles[j]->name(); + PADDLE_ENFORCE_EQ(var_name, out_var_handles[j]->name()); + auto &lod_tensor = local_scope->FindVar(var_name)->Get(); + PADDLE_ENFORCE_EQ(lod_tensor.place(), places_.at(scope_idx)); + grad_tensor->emplace_back(std::make_pair(var_name, &lod_tensor)); + } +} + +void FusedAllReduceOpHandle::GetDTypeAndNumel( + const std::vector> &grad_tensor, + proto::VarType::Type *dtype, int64_t *numel) const { + *numel = 0; + size_t size_of_dtype = 0; + for (size_t i = 0; i < grad_tensor.size(); ++i) { + // Get dtype + auto ele_type = grad_tensor.at(i).second->type(); + if (i == 0) { + *dtype = ele_type; + size_of_dtype = framework::SizeOfType(ele_type); + } + PADDLE_ENFORCE_EQ(ele_type, *dtype); + + // Get element number + int64_t len = grad_tensor.at(i).second->numel(); + PADDLE_ENFORCE_GT(len, 0); + // Alignment(len) + *numel += Alignment(len * size_of_dtype, places_[0]) / size_of_dtype; + } +} + +std::string FusedAllReduceOpHandle::Name() const { return "fused_all_reduce"; } +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/fused_all_reduce_op_handle.h b/paddle/fluid/framework/details/fused_all_reduce_op_handle.h new file mode 100644 index 0000000000000000000000000000000000000000..79772c61f8c8b7abe3cf26dd8a94c2acdc0872a0 --- /dev/null +++ b/paddle/fluid/framework/details/fused_all_reduce_op_handle.h @@ -0,0 +1,76 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include "paddle/fluid/framework/details/op_handle_base.h" +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/scope.h" +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) +#include "paddle/fluid/platform/nccl_helper.h" +#endif + +namespace paddle { +namespace framework { +namespace details { + +struct FusedAllReduceOpHandle : public OpHandleBase { +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + FusedAllReduceOpHandle(ir::Node *node, + const std::vector &local_scopes, + const std::vector &places, + const size_t num_of_all_reduce, + const platform::NCCLContextMap *ctxs); +#else + FusedAllReduceOpHandle(ir::Node *node, + const std::vector &local_scopes, + const std::vector &places, + const size_t num_of_all_reduce); +#endif + std::string Name() const override; + + // Delay and buffer nccl_all_reduce together can significantly increase + // performance. Disable this feature by returning false. + bool IsMultiDeviceTransfer() override { return true; }; + + protected: + void RunImpl() override; + + private: + std::vector local_scopes_; + std::vector places_; + size_t num_of_all_reduce_; +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + const platform::NCCLContextMap *nccl_ctxs_; +#endif + + // Check the dtype of the input + void GetDTypeAndNumel( + const std::vector> &g_tensor, + proto::VarType::Type *dtype, int64_t *total_num) const; + + // Get gradient's name and LoDTensor + void GetGradLoDTensor(const size_t &scope_idx, + const std::vector &in_var_handles, + const std::vector &out_var_handles, + std::vector> + *grad_tensor) const; +}; + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/fused_broadcast_op_handle.cc b/paddle/fluid/framework/details/fused_broadcast_op_handle.cc new file mode 100644 index 0000000000000000000000000000000000000000..f48561ea32e6a3bbc7e9f2a8326b080ad21c6d61 --- /dev/null +++ b/paddle/fluid/framework/details/fused_broadcast_op_handle.cc @@ -0,0 +1,55 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/details/fused_broadcast_op_handle.h" +#include "paddle/fluid/framework/details/container_cast.h" +#include "paddle/fluid/framework/details/variable_visitor.h" +#include "paddle/fluid/platform/profiler.h" + +namespace paddle { +namespace framework { +namespace details { + +void FusedBroadcastOpHandle::RunImpl() { + platform::RecordEvent record_event(Name()); + + if (places_.size() == 1UL) return; + + auto in_var_handles = DynamicCast(inputs_); + auto out_var_handles = DynamicCast(outputs_); + + WaitInputVarGenerated(); + + std::vector var_scopes; + for (auto *s : local_scopes_) { + var_scopes.emplace_back(s->FindVar(kLocalExecScopeName)->Get()); + } + + size_t place_num = places_.size(); + PADDLE_ENFORCE_EQ(in_var_handles.size() * place_num, out_var_handles.size()); + + for (size_t i = 0; i < in_var_handles.size(); ++i) { + BroadcastOneVar( + *in_var_handles[i], + std::vector(out_var_handles.begin() + i * place_num, + out_var_handles.begin() + (i + 1) * place_num), + var_scopes); + } +} + +std::string FusedBroadcastOpHandle::Name() const { return "fused_broadcast"; } + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/fused_broadcast_op_handle.h b/paddle/fluid/framework/details/fused_broadcast_op_handle.h new file mode 100644 index 0000000000000000000000000000000000000000..e43d545c9c0d062a8814a2f5d27f9ca20290b9dd --- /dev/null +++ b/paddle/fluid/framework/details/fused_broadcast_op_handle.h @@ -0,0 +1,57 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include + +#include "paddle/fluid/framework/details/broadcast_op_handle.h" +#include "paddle/fluid/framework/details/multi_devices_helper.h" +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/framework/selected_rows.h" +#include "paddle/fluid/platform/device_context.h" + +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) +#include "paddle/fluid/platform/nccl_helper.h" +#endif + +namespace paddle { +namespace framework { +namespace details { + +struct FusedBroadcastOpHandle : public BroadcastOpHandle { + public: +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + FusedBroadcastOpHandle(ir::Node *node, + const std::vector local_scopes, + const std::vector &places, + const platform::NCCLContextMap *nccl_ctx) + : BroadcastOpHandle(node, local_scopes, places, nccl_ctx) {} +#else + FusedBroadcastOpHandle(ir::Node* node, const std::vector local_scopes, + const std::vector& places) + : BroadcastOpHandle(node, local_scopes, places) {} +#endif + std::string Name() const override; + + protected: + void RunImpl() override; +}; + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/fused_broadcast_op_handle_test.cc b/paddle/fluid/framework/details/fused_broadcast_op_handle_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..6d53dac5c0a20b4340e71274a00a7f3c0cd08ff6 --- /dev/null +++ b/paddle/fluid/framework/details/fused_broadcast_op_handle_test.cc @@ -0,0 +1,168 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/details/fused_broadcast_op_handle.h" +#include "gtest/gtest.h" +#include "paddle/fluid/framework/details/broadcast_op_handle_test.h" + +namespace paddle { +namespace framework { +namespace details { + +struct TestFusedBroadcastOpHandle : TestBroadcastOpHandle { + std::vector out_varnames_; + std::vector> nodes_; + + void InitFusedBroadcastOp(std::vector input_scope_idxes) { + nodes_.clear(); + // initialize scope and var + for (size_t i = 0; i < place_list_.size(); ++i) { + local_scopes_.push_back(&(g_scope_.NewScope())); + Scope& local_scope = local_scopes_.back()->NewScope(); + *local_scopes_.back() + ->Var(details::kLocalExecScopeName) + ->GetMutable() = &local_scope; + for (size_t j = 0; j < input_scope_idxes.size(); ++j) { + local_scope.Var("out_var" + std::to_string(j)); + if (i == j) local_scope.Var("in_var" + std::to_string(j)); + } + param_scopes_.emplace_back(&local_scope); + } + + // create op handle node + nodes_.emplace_back( + ir::CreateNodeForTest("fused_broadcast", ir::Node::Type::kOperation)); + if (use_gpu_) { +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + op_handle_ = new FusedBroadcastOpHandle( + nodes_.back().get(), local_scopes_, place_list_, nccl_ctxs_.get()); +#else + PADDLE_THROW("CUDA is not supported."); +#endif + } else { +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + op_handle_ = new FusedBroadcastOpHandle( + nodes_.back().get(), local_scopes_, place_list_, nccl_ctxs_.get()); +#else + op_handle_ = new FusedBroadcastOpHandle(nodes_.back().get(), + local_scopes_, place_list_); +#endif + } + + for (size_t i = 0; i < input_scope_idxes.size(); ++i) { + // add input var handle + nodes_.emplace_back(ir::CreateNodeForTest("in_node" + std::to_string(i), + ir::Node::Type::kVariable)); + VarHandle* in_var_handle = new VarHandle( + nodes_.back().get(), 1, input_scope_idxes[i], + "in_var" + std::to_string(i), place_list_[input_scope_idxes[i]]); + vars_.emplace_back(in_var_handle); + op_handle_->AddInput(in_var_handle); + + // add output var handle + for (size_t j = 0; j < place_list_.size(); ++j) { + nodes_.emplace_back(ir::CreateNodeForTest( + "out_node" + std::to_string(i), ir::Node::Type::kVariable)); + VarHandle* out_var_handle = + new VarHandle(nodes_.back().get(), 2, j, + "out_var" + std::to_string(i), place_list_[j]); + vars_.emplace_back(out_var_handle); + op_handle_->AddOutput(out_var_handle); + } + } + } + + void TestFusedBroadcastLoDTensor(std::vector input_scope_idxes) { + std::vector> send_vec; + f::LoD lod{{0, 10, 20}}; + for (size_t i = 0; i < input_scope_idxes.size(); ++i) { + const std::string varname("in_var" + std::to_string(i)); + float val_scalar = static_cast(i); + send_vec.push_back( + InitLoDTensor(varname, input_scope_idxes[i], lod, val_scalar)); + } + + op_handle_->Run(false); + + WaitAll(); + for (size_t i = 0; i < input_scope_idxes.size(); ++i) { + const std::string& varname("out_var" + std::to_string(i)); + for (size_t j = 0; j < place_list_.size(); ++j) { + LoDTensorEqual(varname, send_vec[i], lod, param_scopes_[j]); + } + } + } + + void TestFusedBroadcastSelectedRows(std::vector input_scope_idxes) { + std::vector> send_vector; + std::vector rows{0, 1, 2, 3, 3, 0, 14, 7, 3, 1, + 2, 4, 6, 3, 1, 1, 1, 1, 3, 7}; + int height = static_cast(kDims[0] * 2); + for (size_t i = 0; i < input_scope_idxes.size(); ++i) { + const std::string varname("in_var" + std::to_string(i)); + float val_scalar = static_cast(i); + send_vector.push_back(InitSelectedRows(varname, input_scope_idxes[i], + rows, height, val_scalar)); + } + + op_handle_->Run(false); + + WaitAll(); + for (size_t i = 0; i < input_scope_idxes.size(); ++i) { + const std::string& varname("out_var" + std::to_string(i)); + for (size_t j = 0; j < place_list_.size(); ++j) { + SelectedRowsEqual(varname, input_scope_idxes[i], send_vector[i], rows, + height); + } + } + } +}; + +TEST(FusedBroadcastTester, CPULodTensor) { + TestFusedBroadcastOpHandle test_op; + std::vector input_scope_idxes = {0, 1}; + test_op.InitCtxOnGpu(false); + test_op.InitFusedBroadcastOp(input_scope_idxes); + test_op.TestFusedBroadcastLoDTensor(input_scope_idxes); +} + +TEST(FusedBroadcastTester, CPUSelectedRows) { + TestFusedBroadcastOpHandle test_op; + std::vector input_scope_idxes = {0, 1}; + test_op.InitCtxOnGpu(false); + test_op.InitFusedBroadcastOp(input_scope_idxes); + test_op.TestFusedBroadcastSelectedRows(input_scope_idxes); +} + +#ifdef PADDLE_WITH_CUDA +TEST(FusedBroadcastTester, GPULodTensor) { + TestFusedBroadcastOpHandle test_op; + std::vector input_scope_idxes = {0, 1}; + test_op.InitCtxOnGpu(true); + test_op.InitFusedBroadcastOp(input_scope_idxes); + test_op.TestFusedBroadcastLoDTensor(input_scope_idxes); +} + +TEST(FusedBroadcastTester, GPUSelectedRows) { + TestFusedBroadcastOpHandle test_op; + std::vector input_scope_idxes = {0, 1}; + test_op.InitCtxOnGpu(true); + test_op.InitFusedBroadcastOp(input_scope_idxes); + test_op.TestFusedBroadcastSelectedRows(input_scope_idxes); +} +#endif + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/gather_op_handle.cc b/paddle/fluid/framework/details/gather_op_handle.cc index 9aae19fc73de4387186da47c55710c94d53f1b88..179cca44cb1871bb9667074f6c6b32edee42be09 100644 --- a/paddle/fluid/framework/details/gather_op_handle.cc +++ b/paddle/fluid/framework/details/gather_op_handle.cc @@ -36,7 +36,7 @@ void GatherOpHandle::RunImpl() { VarHandle *out_var_handle; { - auto out_var_handles = DynamicCast(outputs_); + auto out_var_handles = DynamicCast(this->Outputs()); PADDLE_ENFORCE_EQ(out_var_handles.size(), 1, "The number of output should be one."); out_var_handle = out_var_handles.front(); @@ -49,7 +49,7 @@ void GatherOpHandle::RunImpl() { auto in_0_handle = in_var_handles[0]; auto pre_in_var = - var_scopes.at(in_0_handle->scope_idx_)->FindVar(in_0_handle->name_); + var_scopes.at(in_0_handle->scope_idx())->FindVar(in_0_handle->name()); PADDLE_ENFORCE_NOT_NULL(pre_in_var); PADDLE_ENFORCE(pre_in_var->IsType(), @@ -65,7 +65,7 @@ void GatherOpHandle::RunImpl() { // Gather the inputs for (auto *in_handle : in_var_handles) { auto *in_var = - var_scopes.at(in_handle->scope_idx_)->FindVar(in_handle->name_); + var_scopes.at(in_handle->scope_idx())->FindVar(in_handle->name()); PADDLE_ENFORCE_NOT_NULL(in_var); VariableVisitor::EnforceShapeAndDTypeEQ(*in_var, *pre_in_var); @@ -77,7 +77,7 @@ void GatherOpHandle::RunImpl() { } // NOTE: The Places of all input tensor must be all on CPU or all on GPU. - platform::Place t_out_p = out_var_handle->place_; + platform::Place t_out_p = out_var_handle->place(); if (platform::is_gpu_place(pre_in_value.place())) { PADDLE_ENFORCE(platform::is_gpu_place(t_out_p), "Places of input and output must be all on GPU."); @@ -85,8 +85,8 @@ void GatherOpHandle::RunImpl() { t_out_p = platform::CPUPlace(); } - auto out_var = - var_scopes.at(out_var_handle->scope_idx_)->FindVar(out_var_handle->name_); + auto out_var = var_scopes.at(out_var_handle->scope_idx()) + ->FindVar(out_var_handle->name()); PADDLE_ENFORCE_NOT_NULL(out_var); auto out_value = out_var->GetMutable(); out_value->set_height(pre_in_value.height()); @@ -99,9 +99,9 @@ void GatherOpHandle::RunImpl() { Tensor *out_tensor = out_value->mutable_value(); // copy - auto dev_ctx = dev_ctxes_[out_var_handle->place_]; - RunAndRecordEvent(out_var_handle->place_, [in_tensors, out_tensor, &dev_ctx, - t_out_p] { + auto dev_ctx = dev_ctxes_.at(out_var_handle->place()); + RunAndRecordEvent(out_var_handle->place(), [in_tensors, out_tensor, &dev_ctx, + t_out_p] { int s = 0, e = 0; for (size_t j = 0; j < in_tensors.size(); ++j) { e += in_tensors[j].dims()[0]; diff --git a/paddle/fluid/framework/details/gather_op_handle_test.cc b/paddle/fluid/framework/details/gather_op_handle_test.cc index ed67e88ff6a7fe9efd93e5dfd4d7bdf4c43aac2e..e8cb7feb8bea92a7486b8a9d84ba4b9e2b93dbfb 100644 --- a/paddle/fluid/framework/details/gather_op_handle_test.cc +++ b/paddle/fluid/framework/details/gather_op_handle_test.cc @@ -31,9 +31,10 @@ struct TestGatherOpHandle { std::vector local_scopes_; std::vector param_scopes_; Scope g_scope_; - std::unique_ptr op_handle_; - std::vector> vars_; + OpHandleBase* op_handle_; + std::vector vars_; std::vector gpu_list_; + std::vector> nodes_; void WaitAll() { for (size_t j = 0; j < ctxs_.size(); ++j) { @@ -70,7 +71,7 @@ struct TestGatherOpHandle { } void InitGatherOp(size_t input_scope_idx) { - std::vector> nodes; + nodes_.clear(); for (size_t j = 0; j < gpu_list_.size(); ++j) { local_scopes_.push_back(&(g_scope_.NewScope())); Scope& local_scope = local_scopes_.back()->NewScope(); @@ -82,44 +83,45 @@ struct TestGatherOpHandle { } param_scopes_[input_scope_idx]->Var("out"); - nodes.emplace_back( + nodes_.emplace_back( ir::CreateNodeForTest("node", ir::Node::Type::kOperation).release()); - op_handle_.reset( - new GatherOpHandle(nodes.back().get(), local_scopes_, gpu_list_)); + op_handle_ = + new GatherOpHandle(nodes_.back().get(), local_scopes_, gpu_list_); // add input for (size_t j = 0; j < gpu_list_.size(); ++j) { op_handle_->SetDeviceContext(gpu_list_[j], ctxs_[j].get()); - nodes.emplace_back( + nodes_.emplace_back( ir::CreateNodeForTest("node1", ir::Node::Type::kVariable).release()); auto* in_var_handle = - new VarHandle(nodes.back().get(), 1, j, "input", gpu_list_[j]); + new VarHandle(nodes_.back().get(), 1, j, "input", gpu_list_[j]); vars_.emplace_back(in_var_handle); op_handle_->AddInput(in_var_handle); } // add dummy var - nodes.emplace_back( + nodes_.emplace_back( ir::CreateNodeForTest("node2", ir::Node::Type::kVariable).release()); - vars_.emplace_back(new DummyVarHandle(nodes.back().get())); + vars_.emplace_back(new DummyVarHandle(nodes_.back().get())); DummyVarHandle* in_dummy_var_handle = - static_cast(vars_.back().get()); + static_cast(vars_.back()); in_dummy_var_handle->ClearGeneratedOp(); op_handle_->AddInput(in_dummy_var_handle); // add output - nodes.emplace_back( + nodes_.emplace_back( ir::CreateNodeForTest("node3", ir::Node::Type::kVariable).release()); - auto* out_var_handle = new VarHandle(nodes.back().get(), 2, input_scope_idx, - "out", gpu_list_[input_scope_idx]); + auto* out_var_handle = + new VarHandle(nodes_.back().get(), 2, input_scope_idx, "out", + gpu_list_[input_scope_idx]); vars_.emplace_back(out_var_handle); op_handle_->AddOutput(out_var_handle); // add dummy var - nodes.emplace_back( + nodes_.emplace_back( ir::CreateNodeForTest("node4", ir::Node::Type::kVariable).release()); - vars_.emplace_back(new DummyVarHandle(nodes.back().get())); + vars_.emplace_back(new DummyVarHandle(nodes_.back().get())); DummyVarHandle* dummy_var_handle = - static_cast(vars_.back().get()); + static_cast(vars_.back()); op_handle_->AddOutput(dummy_var_handle); } diff --git a/paddle/fluid/framework/details/graph_test_base.h b/paddle/fluid/framework/details/graph_test_base.h new file mode 100644 index 0000000000000000000000000000000000000000..d139f8488309eecf89c924a346ab0e574edc86dc --- /dev/null +++ b/paddle/fluid/framework/details/graph_test_base.h @@ -0,0 +1,80 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include "glog/logging.h" +#include "gtest/gtest.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_helper.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/program_desc.h" + +namespace paddle { +namespace framework { + +class DummyOp : public OperatorBase { + public: + DummyOp(const std::string& type, const VariableNameMap& inputs, + const VariableNameMap& outputs, const AttributeMap& attrs) + : OperatorBase(type, inputs, outputs, attrs) {} + + private: + void RunImpl(const Scope& scope, + const platform::Place& place) const override {} +}; + +class SumOpMaker : public OpProtoAndCheckerMaker { + public: + void Make() { + AddInput("X", "").AsDuplicable(); + AddOutput("Out", ""); + AddComment(""); + } +}; + +class AssignOpMaker : public OpProtoAndCheckerMaker { + public: + void Make() { + AddInput("X", "").AsDuplicable(); + AddOutput("Out", ""); + AddComment(""); + } +}; + +class SplitOpMaker : public OpProtoAndCheckerMaker { + public: + void Make() { + AddInput("X", ""); + AddOutput("Out", "").AsDuplicable(); + AddComment(""); + } +}; + +class DummyVarTypeInference : public VarTypeInference { + public: + void operator()(framework::InferVarTypeContext* ctx) const override { + auto& inputs = ctx->Input("X"); + auto type = ctx->GetType(inputs.front()); + auto out_var_name = ctx->Output("Out").front(); + ctx->SetType(out_var_name, type); + } +}; + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/inplace_op_pass.cc b/paddle/fluid/framework/details/inplace_op_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..84c9e4a379a5e07dc3a8e85409c804eebc390c73 --- /dev/null +++ b/paddle/fluid/framework/details/inplace_op_pass.cc @@ -0,0 +1,555 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/details/inplace_op_pass.h" +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "paddle/fluid/framework/details/memory_optimize_pass.h" +#include "paddle/fluid/framework/ir/graph_helper.h" +#include "paddle/fluid/framework/op_info.h" + +// NOTE(dzhwinter): inplace means one op output variable reuse the input space. +// By our design, one operator only can read its input(const Variable), +// write its output(non-const Variable). If one operator is inplaced, means +// user have chance to write the space before reading happens. +// Especially when some optimize code writing style is applied. +// +// +// /* wrong case in operator */ +// /*In this case, a larger allocation is allocated, input content is lost*/ +// const Tensor* in = ctx.Input("In") +// Tensor* out = ctx.Output("Out"); +// auto* out_ptr = out->mutable_data(ctx.GetPlace()); +// out_ptr[0] = 0; // input contect is overwrited. + +// NOTE(dzhwinter): +// Only for backward compacity and stable. if enable_inplace_whitelist is turn +// on. +// only the ops in whitelist will be use inplace strategy. +// if not, all the op will be inplaced if it registered with InplaceClass +DEFINE_bool( + enable_inplace_whitelist, false, + "If this option turns on, only these op in whitelist can be inplaced." + "If it turns off, all of the running op can be candidate of inplaced op." + "Such as scale, elementwise_add" + "By default, it's turned off"); + +DECLARE_string(memory_optimize_debug); + +// clang-format off +const std::string kInplacedOpWhiteList[] = { // NOLINT + "sigmoid", + "exp", + "relu", + "tanh", + "sqrt", + "ceil", + "floor", + "reciprocal", + "relu6", + "soft_relu", + "hard_sigmoid", + "batch_norm", + "batch_norm_grad", + "sum", + "sum_grad", + "scale", + "reshape", + "elementwise_add", + "elementwise_add_grad", +}; +// clang-format on + +namespace paddle { +namespace framework { +namespace details { + +static inline ir::Node* GetNextCascadeInplacedVar(ir::Node* var) { + // if next op is inplaced, then return the output var + // otherwise return nullptr + PADDLE_ENFORCE(var && var->IsVar() && !var->IsCtrlVar()); + ir::Node* inplaced_var = nullptr; + for (auto* next_op : var->outputs) { + for (auto* output : next_op->outputs) { + if (output->IsVar() && !output->IsCtrlVar() && + output->Name() == var->Name()) { + inplaced_var = output; + } + } + } + return inplaced_var; +} + +static inline ir::Node* GetPrevCascadeInplacedVar(ir::Node* var) { + PADDLE_ENFORCE(var && var->IsVar() && !var->IsCtrlVar()); + if (var->inputs.empty()) return nullptr; + auto* prev_op = var->inputs.at(0); + auto input_it = std::find_if(prev_op->inputs.begin(), prev_op->inputs.end(), + [&](ir::Node* node) { + if (node->IsVar() && !node->IsCtrlVar() && + node->Name() == var->Name()) { + return true; + } else { + return false; + } + }); + return input_it == prev_op->inputs.end() ? nullptr : *input_it; +} + +InplacePass::InplacePass() : Pass() { + if (FLAGS_enable_inplace_whitelist) { + for (auto& s : kInplacedOpWhiteList) { + whitelist_.emplace(s); + } + } +} + +void InplacePass::InitSSAGraphNodes() const { + std::unordered_map> all_vars; + for (auto* op : view_.AllOps()) { + for (auto* node : op->inputs) { + if (!node->IsVar() || node->IsCtrlVar()) continue; + if (all_vars[node->Name()].count(node) == 0) { + all_vars[node->Name()].emplace(node); + var_nodes_[node->Name()].emplace_back(node); + } + } + for (auto* node : op->outputs) { + if (!node->IsVar() || node->IsCtrlVar()) continue; + if (all_vars[node->Name()].count(node) == 0) { + all_vars[node->Name()].emplace(node); + var_nodes_[node->Name()].emplace_back(node); + } + } + } +} + +void InplacePass::ApplyImpl(ir::Graph* graph) const { + var_nodes_.clear(); + view_.Build(graph); + InitSSAGraphNodes(); + + auto cnt = 0; + for (auto* op : view_.AllOps()) { + VLOG(4) << "Handle op " << cnt++ << ": " << op->Name(); + if (FLAGS_enable_inplace_whitelist && !whitelist_.count(op->Name())) + continue; + TryInplaceOpInputOutput(op, graph); + } +} + +void InplacePass::InplaceModifyDesc(const std::string& var, + const std::string& cache_var, + const size_t& idx) const { + for (size_t i = idx; i < view_.AllOps().size(); ++i) { + ir::Node* op = view_.AllOps()[i]; + PADDLE_ENFORCE(op->IsOp() && op->Op()); + auto* op_desc = op->Op(); + op_desc->RenameInput(var, cache_var); + op_desc->RenameOutput(var, cache_var); + + op_desc->Flush(); + } +} + +const NodeSwapQueue InplacePass::TryInplaceModifyVar( + const std::string& var, const std::string& cache_var, const size_t& idx, + ir::Graph* graph) const { + PADDLE_ENFORCE(var_nodes_[var].size() >= 1 && + var_nodes_[var].at(0)->Var() != nullptr); + std::unique_ptr var_desc(new VarDesc(*var_nodes_[var].at(0)->Var())); + var_desc->SetName(cache_var); + + NodeSwapQueue swap_nodes; + + for (size_t i = idx; i < view_.AllOps().size(); ++i) { + auto* op = view_.AllOps()[i]; + + // redirect the input to the latest version of cache_var + for (auto* node : op->inputs) { + if (node->Name() == var) { + ir::Node* cache_node = graph->CreateVarNode(var_desc.get()); + + // swap node to cache_node + cache_node->outputs.insert(cache_node->outputs.end(), + node->outputs.begin(), node->outputs.end()); + PADDLE_ENFORCE(node->inputs.size() == 1 && node->inputs[0]->IsOp()); + auto* prev_op = node->inputs[0]; + std::replace(prev_op->outputs.begin(), prev_op->outputs.end(), node, + cache_node); + cache_node->inputs.emplace_back(prev_op); + for (auto* next_op : node->outputs) { + std::replace(next_op->inputs.begin(), next_op->inputs.end(), node, + cache_node); + } + + swap_nodes.emplace_back(std::make_pair(node, cache_node)); + } + } + + // if we need to rename the output, + // always create a newer version of cache_var + for (auto* node : op->outputs) { + if (node->Name() == var) { + ir::Node* cache_node = graph->CreateVarNode(var_desc.get()); + // swap node to cache node + cache_node->outputs.insert(cache_node->outputs.end(), + node->outputs.begin(), node->outputs.end()); + cache_node->inputs.emplace_back(op); + std::replace(op->outputs.begin(), op->outputs.end(), node, cache_node); + for (auto* next_op : node->outputs) { + std::replace(next_op->inputs.begin(), next_op->inputs.end(), node, + cache_node); + } + + swap_nodes.emplace_back(std::make_pair(node, cache_node)); + } + } + } + + return swap_nodes; +} + +void InplacePass::CommitModify(const NodeSwapQueue& swap_nodes, + ir::Graph* graph) const { + for (auto& pair : swap_nodes) { + auto *node = pair.first, *cache_node = pair.second; + const std::string var = node->Name(), cache_var = cache_node->Name(); + var_nodes_[cache_var].emplace_back(cache_node); + graph->RemoveNode(node); + auto& nodes = var_nodes_.at(var); + // release unused var in graph. Because python side memory optimize + // may reused the var in same name, so we only clear the var node + // after current inplaced index. + nodes.erase(std::remove(nodes.begin(), nodes.end(), node), nodes.end()); + } +} + +void InplacePass::WithdrawModify(const NodeSwapQueue& nodes, + ir::Graph* graph) const { + for (auto& pair : nodes) { + auto *node = pair.first, *cache_node = pair.second; + const std::string var = node->Name(), cache_var = cache_node->Name(); + auto* prev_op = node->inputs[0]; + std::replace(prev_op->outputs.begin(), prev_op->outputs.end(), cache_node, + node); + for (auto* next_op : node->outputs) { + std::replace(next_op->inputs.begin(), next_op->inputs.end(), cache_node, + node); + } + graph->RemoveNode(cache_node); + } +} + +void InplacePass::TryInplaceOpInputOutput(ir::Node* op, + ir::Graph* graph) const { + VLOG(4) << "Try to inplace op " << op->Name(); + // some pre-requirments need to meet if the op want to inplaced. + PADDLE_ENFORCE(op->Op() != nullptr, "op_desc is nullptr"); + + auto* op_desc = op->Op(); + auto& infer_inplace = + OpInfoMap::Instance().Get(op_desc->Type()).infer_inplace_; + + // 1. infer_inplace_ is registered. + if (!static_cast(infer_inplace)) return; + PADDLE_ENFORCE(static_cast(infer_inplace), + "%s's infer_inplace has not been registered", op_desc->Type()); + + auto in_to_outs = infer_inplace(*op_desc); + + auto& all_ops = view_.AllOps(); + auto cursor = std::find(all_ops.begin(), all_ops.end(), op); + size_t idx = std::distance(all_ops.begin(), cursor); + + for (auto& pair : in_to_outs) { + auto& in_para_name = pair.first; + auto& out_para_name = pair.second; + + auto input_vars = op->Op()->Input(in_para_name); + if (!input_vars.size()) { + VLOG(4) << "Parameter " << in_para_name << " is empty skip " + << in_para_name << " => " << out_para_name << " pair"; + continue; + } + auto output_vars = op->Op()->Output(out_para_name); + if (!output_vars.size()) { + VLOG(4) << "Parameter " << out_para_name << " is empty skip " + << in_para_name << " => " << out_para_name << " pair"; + continue; + } + auto in_var_name = input_vars.at(0); + auto out_var_name = output_vars.at(0); + auto* in_node = view_.GetNodeByName(in_var_name, op->inputs); + auto* out_node = view_.GetNodeByName(out_var_name, op->outputs); + + VLOG(4) << "Try to inplace " << in_var_name << " with " << out_var_name; + + if (var_nodes_[in_var_name].back() != in_node) { + VLOG(4) << "SKIP since " << in_var_name + << " is also used as output by other ops"; + continue; + } + + bool can_replace = true; + if (in_var_name == out_var_name) { + can_replace = false; + VLOG(4) << "SKIP: Input variable " << in_var_name << " & Output variable " + << out_var_name << " are the same"; + } else if (!NodeCanReused(in_node)) { + can_replace = false; + VLOG(4) << "SKIP: Input varialbe " << in_var_name << "cannot be reused"; + } else if (!NodeCanReused(out_node)) { + can_replace = false; + VLOG(4) << "SKIP: Output variable " << out_var_name + << " cannot be reused"; + } else if (details::NodeSize(*in_node->Var()) != + details::NodeSize(*out_node->Var())) { + can_replace = false; + VLOG(4) << "SKIP: Input and Output varialbe size not match"; + } + + if (!can_replace) continue; + + // 2. there is no external pending op on the input node + // if (view_.PendingOpsOnVar(in_node).size() > 1) { + if (in_node->outputs.size() > 1 && !view_.CheckDeps(in_node, op)) { + VLOG(4) << string::Sprintf( + "Skiped pair %s => %s. %s input has external dependency." + "inplace such pair will overwrite the memory.", + out_var_name, in_var_name, op->Name()); + continue; + } + + // 3. if output has been memory optimize by python(fluid.memory_optmize()). + // this candidate can not be inplaced. Will be deprecated in the future. + if (view_.InSkipSet(out_node->Name())) { + VLOG(4) << string::Sprintf( + "Skiped %s => %s reused previous memory block in python memory " + "optmize," + "it inplace may generate a circle", + out_var_name, in_var_name, op->Name()); + continue; + } + + // Debug Interface. Which would be skipped by the pass. + if (out_node->Name() == FLAGS_memory_optimize_debug) { + VLOG(3) << "Skiped var by force. FLAGS_memory_optimize_debug=" + << out_node->Name(); + continue; + } + + // NOTE(dzhwinter): + // two stage commit of inplaced process. if after inplace happens generate a + // circle, + // then withdraw the changes. Otherwise, safely add the node. + auto swap_nodes = + TryInplaceModifyVar(out_var_name, in_var_name, idx, graph); + + if (!ir::HasCircle(*graph)) { + VLOG(3) << string::Sprintf("!!! %s, %s => %s inplaced", op->Name(), + out_var_name, in_var_name); + InplaceModifyDesc(out_var_name, in_var_name, idx); + CommitModify(swap_nodes, graph); + } else { + VLOG(3) << string::Sprintf( + "Skiped pair %s => %s, inplace will generate a circle. withdraw %s", + out_var_name, in_var_name, op->Name()); + WithdrawModify(swap_nodes, graph); + } + } +} + +void GraphView::TopoSort(ir::Graph* graph) { + // + ops_.clear(); + auto deps_num = [](ir::Node* op) { + auto cnt = 0; + for (auto& var : op->inputs) + if (var->inputs.size() > 0) ++cnt; + return cnt; + }; + + std::queue> ready_ops; + + int level = 0; + auto nodes = graph->Nodes(); + std::unordered_map deps_map; + for (auto& node : nodes) { + if (node->IsOp() && node->Op() != nullptr) { + deps_map[node] = deps_num(node); + if (0 == deps_map[node]) { + ready_ops.push({node, level}); + } + } + } + + while (!ready_ops.empty()) { + auto item = ready_ops.front(); + ready_ops.pop(); + + ops_.emplace_back(item.first); + // record level when pop from queue + op_level_[item.first] = item.second; + + for (auto node : item.first->outputs) { + for (auto op : node->outputs) { + --deps_map[op]; + if (deps_map[op] == 0) ready_ops.push({op, item.second + 1}); + } + } + } + + bool all_ops_checked = true; + for (auto& node : nodes) { + if (node->IsOp() && node->Op() != nullptr && deps_map[node] > 0) { + all_ops_checked = false; + break; + } + } + + PADDLE_ENFORCE(all_ops_checked, "All ops deps should be 0 after analysis"); +} + +// return true if current op node depeneds on all other op that use the same +// variable node +bool GraphView::CheckDeps(ir::Node* var, ir::Node* current_op) const { + // get op list that rely on the same variable + auto op_list = var->outputs; + for (auto& op : op_list) { + if (op == current_op) continue; + + VLOG(4) << " GraphView::CheckDeps : " << op->Name() << " & " + << current_op->Name(); + if (!CheckOpDeps(op, current_op)) return false; + VLOG(4) << ""; + } + return true; +} + +// check if op2 depends on op1's output +bool GraphView::CheckOpDeps(ir::Node* op1, ir::Node* op2) const { + if (VLOG_IS_ON(4)) { + auto print_op = [&](ir::Node* op, const char* name) { + std::ostringstream os; + os << " " << name << " : " << op->Name() << " "; + os << "Input args : "; + for (auto& arg : op->inputs) os << arg->Name() << " "; + os << "Output args : "; + for (auto& arg : op->outputs) os << arg->Name() << " "; + os << "Level : " << op_level_.at(op); + VLOG(4) << os.str(); + }; + print_op(op1, "OP1"); + print_op(op2, "OP2"); + } + if (op1 == op2) return true; + if (op_level_.at(op1) >= op_level_.at(op2)) return false; + + for (auto& var : op2->inputs) + if (var->inputs.size() > 0 && CheckOpDeps(op1, var->inputs[0])) return true; + + return false; +} + +ir::Node* GraphView::GetNodeByName(const std::string& name, + const std::vector& nodes) const { + // nodes should be op->inputs/outputs + // node in same node do have different name. + std::unordered_set nodes_in_op; + bool has_dup_node = + std::all_of(nodes.begin(), nodes.end(), [&nodes_in_op](ir::Node* node) { + if (!node->IsVar() || node->IsCtrlVar() || node->Var() == nullptr) { + if (nodes_in_op.count(node->Name())) return true; + nodes_in_op.emplace(node->Name()); + } + return false; + }); + PADDLE_ENFORCE(has_dup_node == false, "nodes has same name!"); + ir::Node* node = nullptr; + for (auto* it : nodes) { + if (!it->IsVar() || it->IsCtrlVar() || it->Var() == nullptr) continue; + if (it->Name() == name) { + node = it; + break; + } + } + PADDLE_ENFORCE(node != nullptr, + string::Sprintf("Not found var %s in nodes!", name)); + return node; +} + +std::vector GraphView::PendingOpsOnVar(ir::Node* node) { + // get the pending ops depends on same var node. + // because node also maybe a inplaced variable, so need to backtrack all the + // previous inplaced vars. + std::vector pending_ops; + ir::Node* p = node; + while (p != nullptr) { + pending_ops.insert(pending_ops.end(), p->outputs.begin(), p->outputs.end()); + p = GetPrevCascadeInplacedVar(p); + } + return pending_ops; +} + +void GraphView::Build(ir::Graph* g) { + // track the var nodes in correct order. + // Because we insert some new created node. Which may have data race between + // nodes. + // resolve data harzards depends on the var nodes in right order. + TopoSort(g); + + // 2. track the nodes which used by parameter server. + // these node can not be inplaced, otherwise trainer + // pserver can not find each other name. + auto update_skip_set = [&](ir::Node* node) { + for (auto& in : node->inputs) { + if (in->IsVar() && in->Var() != nullptr) dup_nodes_.emplace(in->Name()); + } + for (auto& out : node->outputs) { + if (out->IsVar() && out->Var() != nullptr) + dup_nodes_.emplace(out->Name()); + } + }; + for (auto& node : g->Nodes()) { + if (!node->IsOp()) continue; + // avoid optimize the variable used in sub-blocks + if (OpHasSubBlock(node->Op())) update_skip_set(node); + + if (node->Name() == "send") update_skip_set(node); + if (node->Name() == "recv") update_skip_set(node); + if (node->Name() == "prefetch") update_skip_set(node); + } +} + +const std::vector& GraphView::AllOps() { return ops_; } + +bool GraphView::InSkipSet(const std::string& var) const { + return dup_nodes_.count(var); +} + +} // namespace details +} // namespace framework +} // namespace paddle + +REGISTER_PASS(inplace_pass, paddle::framework::details::InplacePass); diff --git a/paddle/fluid/framework/details/inplace_op_pass.h b/paddle/fluid/framework/details/inplace_op_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..fbec973ddaa7673601780810cfbbf8c1128af513 --- /dev/null +++ b/paddle/fluid/framework/details/inplace_op_pass.h @@ -0,0 +1,99 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may abtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include +#include +#include +#include +#include +#include "paddle/fluid/framework/details/memory_optimize_helper.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/pass.h" + +namespace paddle { +namespace framework { +namespace details { + +class GraphView { + public: + GraphView() = default; + + void Build(ir::Graph* g); + + const std::vector& AllOps(); + + ir::Node* GetNodeByName(const std::string& name, + const std::vector& nodes) const; + + std::vector PendingOpsOnVar(ir::Node* var); + + // Will Deperated in the future. + // NOTE(dzhwinter) : + // 1. Python memory optimize will reuse + // memory based var name, so different op output may + // have the same variable name. enable inplace on such node + // will generate a circle in ssa graph. + // 2. DistributeTranspiler will use unique name to + // map the parameter and gradient, must be skipped. + bool InSkipSet(const std::string& var) const; + + bool CheckDeps(ir::Node* var, ir::Node* current_op) const; + bool CheckOpDeps(ir::Node* op1, ir::Node* op2) const; + void TopoSort(ir::Graph* g); + + private: + std::vector ops_; + std::unordered_set dup_nodes_; // mem opt affect nodes + std::map> adj_list_; + std::unordered_map op_level_; +}; + +// swap pairs in sequence +typedef std::vector> NodeSwapQueue; +class InplacePass : public ir::Pass { + public: + InplacePass(); + + protected: + void ApplyImpl(ir::Graph* graph) const override; + + void InitSSAGraphNodes() const; + + private: + const NodeSwapQueue TryInplaceModifyVar(const std::string& var, + const std::string& cache_var, + const size_t& idx, + ir::Graph* graph) const; + + void CommitModify(const NodeSwapQueue&, ir::Graph* graph) const; + + void WithdrawModify(const NodeSwapQueue& nodes, ir::Graph* graph) const; + + void InplaceModifyDesc(const std::string& in_var, const std::string& out_var, + const size_t& idx) const; + + void TryInplaceOpInputOutput(ir::Node* op, ir::Graph* graph) const; + + mutable std::map> var_nodes_; + + mutable std::unordered_set whitelist_; + mutable GraphView view_; +}; + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/memory_optimize_helper.cc b/paddle/fluid/framework/details/memory_optimize_helper.cc new file mode 100644 index 0000000000000000000000000000000000000000..1af57dc4087d2fd734c43e9549a4bd4526af4d35 --- /dev/null +++ b/paddle/fluid/framework/details/memory_optimize_helper.cc @@ -0,0 +1,568 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/details/memory_optimize_helper.h" +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/framework/var_desc.h" +#include "paddle/fluid/platform/cpu_info.h" + +#ifdef PADDLE_WITH_CUDA +#include "paddle/fluid/platform/gpu_info.h" +#endif // PADDLE_WITH_CUDA + +namespace paddle { +namespace framework { +namespace details { +using paddle::framework::VarDesc; + +std::vector SortOpLikeDescOrder(const ir::Graph& graph) { + PADDLE_ENFORCE(graph.Has(kStaleProgramOpDescs), + "Graph has no attribute of kStaleProgramOpDescs."); + // 1. get op desc order + auto& op_descs = graph.Get>(kStaleProgramOpDescs); + + // 2. topology sort order + auto nodes = graph.Nodes(); + std::deque ops; + FilterVariables(nodes, [&](ir::Node* op) { + if (op->IsOp() && op->Op() != nullptr) { + ops.emplace_back(op); + } + }); + std::unordered_map op_deps; + std::list ready_ops; + std::unordered_map> pending_ops; + + for (auto* op : ops) { + std::unordered_set preceding_op; + for (auto* in : op->inputs) { + if (in->inputs.empty()) continue; + PADDLE_ENFORCE(in->inputs.size() == 1 && in->inputs[0]->IsOp()); + preceding_op.emplace(in->inputs[0]); + pending_ops[in->inputs[0]].emplace(op); + } + op_deps[op] = preceding_op.size(); + if (preceding_op.empty()) { + ready_ops.emplace_back(op); + } + } + + // 3. generated op list based desc order and the topology order + std::vector ret; + std::list op_descs_list(op_descs.begin(), op_descs.end()); + + auto update_by_found_node = [&](ir::Node* found_node) { + for (auto* pending_op : pending_ops[found_node]) { + if (--op_deps[pending_op] == 0) { + ready_ops.emplace_back(pending_op); + } + } + ready_ops.remove(found_node); + ret.emplace_back(found_node); + }; + + while (!ready_ops.empty()) { + bool all_of_ready_op_unmatched = true; + for (auto it = op_descs_list.begin(); it != op_descs_list.end();) { + auto op_desc = *it; + ir::Node* found_node = nullptr; + for (auto* op : ready_ops) { + if (IsSameDesc(op->Op(), op_desc)) { + found_node = op; + break; + } + } + + // 3.1 op desc deleted by other pass + if (found_node == nullptr) { + ++it; + continue; + } else { + all_of_ready_op_unmatched = false; + it = op_descs_list.erase(it); + } + update_by_found_node(found_node); + } + + // 3.2 op descs are added by other pass + // preceding op non empty means some new op descs are + // created, but not contained in return node list. + // these new op desc may depend on each other. + std::list prev_ready_ops(ready_ops); + if (all_of_ready_op_unmatched) { + for (auto op : prev_ready_ops) { + update_by_found_node(op); + } + } + } + + PADDLE_ENFORCE(std::all_of( + op_deps.begin(), op_deps.end(), + [&](const std::pair& p) { return p.second == 0; })); + + return ret; +} + +size_t NodeSize(const VarDesc& node) { + auto shape = node.GetShape(); + int size = + std::accumulate(shape.begin(), shape.end(), 1, std::multiplies()); + size_t type_size = SizeOfType(node.GetDataType()); + return type_size * std::abs(size); +} + +size_t NodeSize(ir::Node* n) { return NodeSize(*(n->Var())); } + +std::string DebugStringImpl(VarDesc* var) { + std::stringstream ss; + ss << var->Name(); + ss << "["; + try { + auto shape = var->GetShape(); + for (size_t i = 0; i < shape.size(); ++i) { + if (i != shape.size() - 1) { + ss << shape[i] << ","; + } else { + ss << shape[i]; + } + } + ss << "]"; + } catch (...) { + ss << "Var has no VarDesc !!! Name:" << var->Name(); + } + return ss.str(); +} + +std::string DebugString(ir::Node* var) { + return DebugStringImpl(GetVarDesc(var)); +} + +// NOTE(dzh): based ir node, if a large node has been reused +// by a small size node, then next time it appear in pool, it will +// have the small size. Find the original node shap from blockdesc. +VarDesc* GetVarDesc(ir::Node* n) { + PADDLE_ENFORCE(n->IsVar() && !n->IsCtrlVar() && n->inputs.size() == 1); + return n->Var(); +} + +struct NodeComparator { + bool operator()(ir::Node* lhs, ir::Node* rhs) const { + if (lhs->Var()->GetType() != rhs->Var()->GetType()) return false; + auto* lhs_desc = GetVarDesc(lhs); + auto* rhs_desc = GetVarDesc(rhs); + // match data type + if (lhs_desc->GetDataType() != rhs_desc->GetDataType()) { + return false; + } + // match shape + auto lhs_shape = lhs_desc->GetShape(); + auto rhs_shape = rhs_desc->GetShape(); + if ((lhs_shape[0] == -1 && rhs_shape[0] == -1) || + (lhs_shape[0] != -1 && rhs_shape[0] != -1)) { + return NodeSize(lhs) == NodeSize(rhs); + } else { + return false; + } + } +}; + +void OrderedSet::Insert(ir::Node* var) { + PADDLE_ENFORCE(var->IsVar() && !var->IsCtrlVar()); + if (mark_table_.count(var->Name()) != 0) { + mark_table_[var->Name()]->emplace_back(var); + return; + } + + auto* var_desc = var->Var(); + auto var_shape = var_desc->GetShape(); + int batch_size = static_cast(var_shape[0]); + + NodeComparator functor; + Iter it = nodes_.begin(); + while (it != nodes_.end()) { + auto& prev = it->front(); + auto* cache_desc = GetVarDesc(prev); + int cache_batch_size = cache_desc->GetShape()[0]; + if ((cache_batch_size == -1 && batch_size == -1) || + (cache_batch_size != -1 && batch_size != -1)) { + if (functor(prev, var)) { + ++it; + } else { + break; + } + } else if (cache_batch_size == -1 && batch_size != -1) { + ++it; + } else if (cache_batch_size != -1 && batch_size == -1) { + break; + } + } + + it = nodes_.insert(it, {var}); + mark_table_[var->Name()] = it; +} + +int OrderedSet::GetNodeIndexInPool(ir::Node* var) { + return std::distance(nodes_.begin(), mark_table_[var->Name()]); +} + +ir::Node* OrderedSet::FindBestFitNode(ir::Node* var) const { + ir::Node* found_node = nullptr; + NodeComparator functor; + + for (auto it = nodes_.begin(); it != nodes_.end(); ++it) { + auto& candidate = it->front(); + if (functor(var, candidate)) { + found_node = candidate; + break; + } + } + return found_node; +} + +ir::Node* OrderedSet::FindNextBestFitNode(ir::Node* var, ir::Node* prev) const { + ir::Node* found_node = nullptr; + NodeComparator functor; + auto it = + std::find_if(nodes_.begin(), nodes_.end(), [&](const NodeVector& v) { + if (v.front() == prev) + return true; + else + return false; + }); + PADDLE_ENFORCE(it != nodes_.end(), "Not found previous in node list!"); + for (it = std::next(it); it != nodes_.end(); ++it) { + auto& candidate = it->front(); + if (functor(var, candidate)) { + found_node = candidate; + break; + } + } + return found_node; +} + +bool OrderedSet::Has(ir::Node* var) const { + if (mark_table_.count(var->Name())) { + auto& node_in_samename = mark_table_.at(var->Name()); + auto iter = + std::find_if(node_in_samename->begin(), node_in_samename->end(), + [&](ir::Node* n) { return n->Name() == var->Name(); }); + return iter != node_in_samename->end(); + } + return false; +} + +void OrderedSet::Erase(const std::string& var) { + PADDLE_ENFORCE(mark_table_.count(var)); + nodes_.erase(mark_table_[var]); + mark_table_.erase(var); +} + +void OrderedSet::Erase(ir::Node* var) { + PADDLE_ENFORCE(var != nullptr); + Erase(var->Name()); +} + +std::string OrderedSet::ToString() const { + std::stringstream ss; + for (auto it = nodes_.begin(); it != nodes_.end(); ++it) { + for (auto& node : *it) { + ss << DebugString(node) << " "; + } + } + return ss.str(); +} + +bool NodeCanReused(ir::Node* node) { + // valid the node is a var node + // vars can be @EMPTY@, @LR_DECAY_REUSE_ID@. For example, while_grad + if (node == nullptr || !node->IsVar() || node->IsCtrlVar() || + node->Name() == kEmptyVarName) + return false; + + bool flag = true; + // op output force generated in cpu, can not be reused. + for (auto* op : node->inputs) { + if (op->Op()->HasAttr("force_cpu")) { + flag &= framework::AttrReader(op->Op()->GetAttrMap()) + .Get("force_cpu") == 0; + } + } + // var desc validation. + flag &= NodeCanReused(*node->Var()); + return flag; +} + +int MinChunkSize() { + int size{0}; +#ifdef PADDLE_WITH_CUDA + size = platform::GpuMinChunkSize(); +#else + size = platform::CpuMinChunkSize(); +#endif // PADDLE_WITH_CUDA + return size; +} + +bool NodeCanReused(const VarDesc& node) { + auto type = node.GetType(); + // only these types holds bulk of gpu memory + // FIXME(liuwei1031) did not find good ways to test SELECTED_ROWS and + // LOD_TENSOR_ARRAY re-use logic, + // disable them in version 1.4 + // if (!(type == proto::VarType::LOD_TENSOR || + // type == proto::VarType::SELECTED_ROWS || + // type == proto::VarType::LOD_TENSOR_ARRAY)) { + // return false; + // } + if (type != proto::VarType::LOD_TENSOR) return false; + + // persistable variable is parameter + if (node.Persistable()) { + return false; + } + // shape < min_chunk_size is meaningless. + // further more, fetched loss always has size = 1 + // which should not be reused. + auto shape = node.GetShape(); + int size = std::abs( + std::accumulate(shape.begin(), shape.end(), 1, std::multiplies())); + if (shape.empty() || size < MinChunkSize()) { + return false; + } + return true; +} + +bool OpHasSubBlock(OpDesc* desc) { + const AttributeMap& attrs = desc->GetAttrMap(); + for (auto& attr : attrs) { + if (attr.second.type() == typeid(BlockDesc*) || // NOLINT + attr.second.type() == typeid(std::vector)) // NOLINT + return true; + } + return false; +} + +ControlFlowGraph::ControlFlowGraph(const ir::Graph& graph) { + ops_ = SortOpLikeDescOrder(graph); + ConnectNodes(); +} + +void ControlFlowGraph::BuildCFGGraph() { + // FIXME(dzh): same effect with ConnectNodes, but use the control + // link to build dependency graph, it goes wrong in transformer. + for (ir::Node* op : ops_) { + for (auto& input_var : op->inputs) { + if (!input_var->inputs.empty()) { + PADDLE_ENFORCE( + input_var->inputs.size() == 1 && input_var->inputs[0]->IsOp(), + "Preceding Op Node of Var Node must be unique"); + auto* pred_op = input_var->inputs[0]; + if (pred_op->Op() != nullptr) { + predecessors_[op].insert(pred_op); + successors_[pred_op].insert(op); + } + } + if (input_var->IsVar() && !input_var->IsCtrlVar()) { + uses_[op].insert(input_var->Name()); + } + } + for (auto& output_var : op->outputs) { + // output var may be used by many op + for (auto* succ_op : output_var->outputs) { + if (succ_op->Op() != nullptr) { + successors_[op].insert(succ_op); + predecessors_[succ_op].insert(op); + } + } + if (output_var->IsVar() && !output_var->IsCtrlVar()) { + defs_[op].insert(output_var->Name()); + } + } + } +} + +void ControlFlowGraph::ConnectNodes() { + for (size_t i = 0; i < ops_.size(); ++i) { + auto& op = ops_[i]; + try { + auto& next_op = ops_.at(i + 1); + successors_[op].insert(next_op); + predecessors_[next_op].insert(op); + } catch (...) { + // do nothing + } + + FilterVariables(op->inputs, + [&](ir::Node* var) { uses_[op].emplace(var->Name()); }); + + FilterVariables(op->outputs, + [&](ir::Node* var) { defs_[op].emplace(var->Name()); }); + } +} + +void ControlFlowGraph::LiveVariableAnalysis() { + // NOTE(dzh): variable liveless analysis (a.k.a reversed_ops algorithm) + // compute the liveness of for each variable though reversed_ops algorithm. + // It iterates the operators from end to begin, compute the live in/live out + // variable set for each op, then the diff between in/out will be used for + // the variable reuse. For detail refer to + // http://www.cs.cornell.edu/courses/cs4120/2013fa/lectures/lec26-fa13.pdf + std::list work_list(ops_.rbegin(), ops_.rend()); + while (!work_list.empty()) { + ir::Node* op = work_list.front(); + work_list.pop_front(); + // get the live_in calculated before. Empty if first. + auto prev_live_in = std::move(live_in_[op]); + for (auto& s : successors_[op]) { + for (auto& var : live_in_[s]) { + live_out_[op].insert(var); + } + } + for (auto& var : uses_[op]) { + live_in_[op].insert(var); + } + for (auto& var : live_out_[op]) { + live_in_[op].insert(var); + } + for (auto& var : defs_[op]) { + if (uses_[op].count(var)) continue; + live_in_[op].erase(var); + } + + // If the live_in is not changed, then the liveness analysis of + // predecessors is completed. + // + // Otherwise, recalculate the predecessors liveness + if (live_in_[op] != prev_live_in) { + for (auto& pre : predecessors_[op]) { + work_list.push_back(pre); + } + } + } + + for (auto* op : ops_) { + unlived_vars_[op] = std::set(); + for (auto& var : this->LiveIn(op)) { + if (!this->LiveOut(op).count(var)) { + unlived_vars_[op].insert(var); + } + } + } +} + +void ControlFlowGraph::RenameVarInCFGGraph(const std::string& old_node, + const std::string& new_node, + int begin_idx) { + std::vector need_update(ops_.size(), false); + // update graph from begin idx to the end + for (size_t i = begin_idx; i != ops_.size(); ++i) { + auto* op = ops_[i]; + if (uses_[op].find(old_node) != uses_[op].end()) { + uses_[op].erase(old_node); + uses_[op].insert(new_node); + } + if (defs_[op].find(old_node) != defs_[op].end()) { + defs_[op].erase(old_node); + defs_[op].insert(new_node); + } + if (live_in_[op].find(old_node) != live_in_[op].end()) { + live_in_[op].erase(old_node); + live_in_[op].insert(new_node); + need_update[i] = true; + } + if (live_out_[op].find(old_node) != live_out_[op].end()) { + live_out_[op].erase(old_node); + live_out_[op].insert(new_node); + need_update[i] = true; + } + } + + for (size_t i = begin_idx; i < ops_.size(); ++i) { + if (!need_update[i]) continue; + auto* op = ops_[i]; + for (auto& var : this->LiveIn(op)) { + if (!this->LiveOut(op).count(var)) { + unlived_vars_[op].insert(var); + } + } + } +} + +const std::set& ControlFlowGraph::LiveIn(ir::Node* op) const { + auto it = live_in_.find(op); + PADDLE_ENFORCE( + it != live_in_.end(), + string::Sprintf("Expect %s in live_in, but Not Found.", op->Name())); + return it->second; +} + +const std::set& ControlFlowGraph::LiveOut(ir::Node* op) const { + auto it = live_out_.find(op); + PADDLE_ENFORCE( + it != live_out_.end(), + string::Sprintf("Expect %s in live_out, but Not Found.", op->Name())); + return it->second; +} + +const std::set& ControlFlowGraph::Use(ir::Node* op) const { + auto it = uses_.find(op); + PADDLE_ENFORCE( + it != uses_.end(), + string::Sprintf("Expect %s in use, but Not Found.", op->Name())); + return it->second; +} + +const std::set& ControlFlowGraph::Unlived(ir::Node* op) const { + auto it = unlived_vars_.find(op); + PADDLE_ENFORCE( + it != unlived_vars_.end(), + string::Sprintf("Expect %s in unlived_set, but Not Found.", op->Name())); + return it->second; + return it->second; +} + +const std::vector& ControlFlowGraph::Ops() const { return ops_; } + +std::vector& ControlFlowGraph::Ops() { return ops_; } + +ir::Node* ControlFlowGraph::GetNodeByName(const std::string& name, + ir::Node* op) const { + // in ssa-graph, different version nodes have same name, + // this function get the latest version var before target op + // It may return nullptr, such as data node. + ir::Node* found_node = nullptr; + for (auto* node : ops_) { + if (node == op) break; + for (auto& output : node->outputs) { + PADDLE_ENFORCE((output != nullptr && output->IsVar()), + "Output is empty!"); + if (output->Var() && output->Name() == name) { + found_node = output; + } + } + } + return found_node; +} + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/memory_optimize_helper.h b/paddle/fluid/framework/details/memory_optimize_helper.h new file mode 100644 index 0000000000000000000000000000000000000000..65c7017d2d462976cf8cd4d7b5f660e279e12b6a --- /dev/null +++ b/paddle/fluid/framework/details/memory_optimize_helper.h @@ -0,0 +1,181 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "paddle/fluid/framework/data_type.h" +#include "paddle/fluid/framework/ir/graph.h" + +namespace paddle { +namespace framework { +namespace details { + +std::vector SortOpLikeDescOrder(const ir::Graph& graph); + +// NOTE(dzh): A ordered set for node reuse in memory optimize. +// the orderedset sort node in ascend order(by node bytes size). +// in fluid, -1 means the batch_size, which is determined in runtime. +// So the reuse happens between nodes who's batch_size both are -1 +// simultaneously or not. +// +// sort rule: +// rule 0 : smaller node ranking in front. +// rule 1 : batch_size equal -1 ranking in the front than the node not. +// +// For example, +// node0[-1, 1] node1[-1, 1, 1], node2[1,1], node3[1,1024], .. + +class OrderedSet { + public: + // nodes with same name exists in pool. + using NodeVector = std::vector; + using Iter = typename std::list::iterator; + using ConstIter = typename std::list::const_iterator; + + void Insert(ir::Node* var); + void Erase(ir::Node* var); + void Erase(const std::string& var); + bool Has(ir::Node* var) const; + void Clear() { + mark_table_.clear(); + nodes_.clear(); + } + // find the bestfit shape node block with var. + ir::Node* FindBestFitNode(ir::Node* var) const; + ir::Node* FindNextBestFitNode(ir::Node* var, ir::Node* prev) const; + // map store non-const iterator, can not promise const + int GetNodeIndexInPool(ir::Node* var); + // pool all node to string + std::string ToString() const; + + Iter begin() { return nodes_.begin(); } + Iter end() { return nodes_.end(); } + ConstIter begin() const { return nodes_.begin(); } + ConstIter end() const { return nodes_.end(); } + + size_t size() const { return nodes_.size(); } + + private: + // for searching. + std::unordered_map mark_table_; + // node pool + std::list nodes_; +}; + +class ControlFlowGraph { + public: + ControlFlowGraph() = default; + // IR Graph + explicit ControlFlowGraph(const ir::Graph& graph); + + void LiveVariableAnalysis(); + + void RenameVarInCFGGraph(const std::string& old_node, + const std::string& new_node, int begin_idx); + + const std::set& LiveIn(ir::Node* op) const; + const std::set& LiveOut(ir::Node* op) const; + const std::set& Use(ir::Node* op) const; + const std::set& Unlived(ir::Node* op) const; + const std::vector& Ops() const; + std::vector& Ops(); + + // for ssa-graph nodes + ir::Node* GetNodeByName(const std::string& name, ir::Node* op) const; + + private: + void BuildCFGGraph(); + void ConnectNodes(); + + using NodeListMap = std::unordered_map>; + using VarSetMap = std::map>; + // successors ops use the output variables. + NodeListMap successors_; + // predecessors ops generated input variables. + NodeListMap predecessors_; + // variables lived before run current op. + VarSetMap live_in_; + // variables lived after run current op. + VarSetMap live_out_; + VarSetMap uses_; // op inputs + VarSetMap defs_; // op outputs + std::unordered_map> unlived_vars_; + + std::vector ops_; // op sequence by topology sort +}; + +// valid a tensor can be reuse or not +bool NodeCanReused(ir::Node* node); + +// valid a tensor can be reuse or not. +bool NodeCanReused(const VarDesc& node); + +// check op has subblock or not +bool OpHasSubBlock(OpDesc* desc); + +// node memory size in bytes +size_t NodeSize(ir::Node* n); + +// node memory size in bytes +size_t NodeSize(const VarDesc&); + +std::string DebugString(ir::Node* var); + +VarDesc* GetVarDesc(ir::Node* n); + +static inline bool IsSameDesc(OpDesc* op1, OpDesc* op2) { + return op1->Type() == op2->Type() && op1->Inputs() == op2->Inputs() && + op1->Outputs() == op2->Outputs(); +} + +template +class FilterVariableImpl { + public: + void operator()(const Container& nodes, Callback callback) { + for (auto* node : nodes) { + callback(node); + } + } +}; + +// filter var node for op->inputs/outputs +template +class FilterVariableImpl, Callback> { + public: + void operator()(const std::vector& nodes, Callback callback) { + for (auto* var : nodes) { + if (var->IsVar() && !var->IsCtrlVar()) { + callback(var); + } + } + } +}; + +template +void FilterVariables(const Container& nodes, Callback callback) { + FilterVariableImpl()(nodes, callback); +} + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/memory_optimize_helper_test.cc b/paddle/fluid/framework/details/memory_optimize_helper_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..3fb02f69b1bb65a74a2e5f69e9de7994b4d012db --- /dev/null +++ b/paddle/fluid/framework/details/memory_optimize_helper_test.cc @@ -0,0 +1,526 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/details/memory_optimize_helper.h" +#include +#include +#include +#include +#include +#include +#include +#include +#include "glog/logging.h" +#include "gtest/gtest.h" +#include "paddle/fluid/framework/details/graph_test_base.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_helper.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/framework/program_desc.h" + +namespace paddle { +namespace framework { +namespace details { + +TEST(OrderedSet, Normal) { + OrderedSet pool; + std::vector> nodes; + + // clang-format off + std::vector> shapes = {{-1, 10}, + {-1, 20}, + {1, 2}, + {5, 2}, + {10, 20}, + {-1, 2, 5}, + {-1, 1, 5}, + {-1, 1}}; + // clang-format on + const int COUNT = shapes.size(); + ProgramDesc prog; + BlockDesc* block_desc = prog.MutableBlock(0); + auto* op_desc = block_desc->AppendOp(); + op_desc->SetType("dummy"); + std::unique_ptr op = ir::CreateNodeForTest(op_desc); + + for (int i = 0; i < COUNT; ++i) { + auto desc = block_desc->Var(std::to_string(i)); + desc->SetShape(shapes[i]); + std::unique_ptr node = ir::CreateNodeForTest(desc); + node->inputs.emplace_back(op.get()); + nodes.emplace_back(std::move(node)); + } + + // Insert + for (auto& node : nodes) { + pool.Insert(node.get()); + } + + // Has/size + ASSERT_EQ(pool.size(), shapes.size()); + for (auto& node : nodes) { + ASSERT_TRUE(pool.Has(node.get())); + } + + // assert its order and interface. + std::cout << pool.ToString() << std::endl; + pool.Erase(nodes.front().get()); + std::cout << pool.ToString() << std::endl; + + ASSERT_EQ(pool.size(), static_cast(COUNT - 1)); + ASSERT_EQ(pool.GetNodeIndexInPool(nodes.back().get()), 0); + + { + auto v1 = block_desc->Var("11"); + v1->SetShape({-1, 256, 56, 56}); + std::unique_ptr node1 = ir::CreateNodeForTest(v1); + node1->inputs.emplace_back(op.get()); + auto* cache = pool.FindBestFitNode(node1.get()); + ASSERT_EQ(cache, nullptr); + } + { + auto v2 = block_desc->Var("12"); + v2->SetShape({-1, 2, 5}); + std::unique_ptr node1 = ir::CreateNodeForTest(v2); + node1->inputs.emplace_back(op.get()); + auto* cache = pool.FindBestFitNode(node1.get()); + ASSERT_EQ(pool.GetNodeIndexInPool(cache), 2); // match 6:[-1,2,5] + } + { + auto v3 = block_desc->Var("13"); + v3->SetShape({2, 5}); + std::unique_ptr node1 = ir::CreateNodeForTest(v3); + node1->inputs.emplace_back(op.get()); + auto* cache = pool.FindBestFitNode(node1.get()); + ASSERT_EQ(pool.GetNodeIndexInPool(cache), 5); // match 4:[5,2] + } +} + +TEST(OrderedSet, FindBestFitNode) { + OrderedSet pool; + std::vector> nodes; + ProgramDesc prog; + BlockDesc* block_desc = prog.MutableBlock(0); + auto* op_desc = block_desc->AppendOp(); + op_desc->SetType("dummy"); + std::unique_ptr op = ir::CreateNodeForTest(op_desc); + + { + auto desc = block_desc->Var("a"); + desc->SetShape({128, 128}); + std::unique_ptr node = ir::CreateNodeForTest(desc); + node->inputs.emplace_back(op.get()); + nodes.emplace_back(std::move(node)); + } + { + auto desc = block_desc->Var("b"); + desc->SetShape({128, 129}); + std::unique_ptr node = ir::CreateNodeForTest(desc); + node->inputs.emplace_back(op.get()); + nodes.emplace_back(std::move(node)); + } + { + auto desc = block_desc->Var("c"); + desc->SetShape({128, 128}); + std::unique_ptr node = ir::CreateNodeForTest(desc); + node->inputs.emplace_back(op.get()); + nodes.emplace_back(std::move(node)); + } + + for (auto& node : nodes) { + pool.Insert(node.get()); + } + + auto* n = nodes[0].get(); + auto* cache = pool.FindBestFitNode(n); + ASSERT_TRUE(cache->Name() == "a" || cache->Name() == "c"); + auto* cache_b = pool.FindNextBestFitNode(n, cache); + ASSERT_TRUE(cache_b->Name() != cache->Name()); + ASSERT_TRUE(cache_b->Name() == "a" || cache_b->Name() == "c"); + cache = pool.FindNextBestFitNode(n, cache_b); + ASSERT_TRUE(cache == nullptr); +} + +} // namespace details +} // namespace framework +} // namespace paddle + +REGISTER_OPERATOR(sum, paddle::framework::DummyOp, + paddle::framework::SumOpMaker, + paddle::framework::DummyVarTypeInference); +REGISTER_OPERATOR(assign, paddle::framework::DummyOp, + paddle::framework::AssignOpMaker, + paddle::framework::DummyVarTypeInference); +REGISTER_OPERATOR(dummy, paddle::framework::DummyOp, + paddle::framework::SumOpMaker, + paddle::framework::DummyVarTypeInference); +/* + https://en.wikipedia.org/wiki/Live_variable_analysis + Create a customed classical dependency graph, left row is the instruction + number. + 1. a = 1 + 2. b = a + 3. c = a + 4. d = b + c + 5. e = d + + a--------+ + | | + b c + | | + d--------+ + | + e + Then analysis these variable's liveness range + */ + +namespace paddle { +namespace framework { +namespace details { + +inline static ProgramDesc FillProgramDesc() { + ProgramDesc prog; + prog.MutableBlock(0)->Var("a")->SetType(proto::VarType::LOD_TENSOR); + prog.MutableBlock(0)->Var("b")->SetType(proto::VarType::LOD_TENSOR); + prog.MutableBlock(0)->Var("c")->SetType(proto::VarType::LOD_TENSOR); + prog.MutableBlock(0)->Var("d")->SetType(proto::VarType::LOD_TENSOR); + prog.MutableBlock(0)->Var("e")->SetType(proto::VarType::LOD_TENSOR); + { + auto* op = prog.MutableBlock(0)->AppendOp(); + op->SetType("assign"); + op->SetInput("X", {"a"}); + op->SetOutput("Out", {"b"}); + } + { + auto* op = prog.MutableBlock(0)->AppendOp(); + op->SetType("assign"); + op->SetInput("X", {"a"}); + op->SetOutput("Out", {"c"}); + } + { + auto* op = prog.MutableBlock(0)->AppendOp(); + op->SetType("sum"); + op->SetInput("X", {"b", "c"}); + op->SetOutput("Out", {"d"}); + } + { + auto* op = prog.MutableBlock(0)->AppendOp(); + op->SetType("assign"); + op->SetInput("X", {"d"}); + op->SetOutput("Out", {"e"}); + } + return prog; +} + +TEST(CFGGraph, IRGraph) { + // prepare ir graph + auto prog = FillProgramDesc(); + ir::Graph graph(prog); + + ControlFlowGraph cfg(graph); + cfg.LiveVariableAnalysis(); + + // test assign op + ASSERT_TRUE((std::set{"a"} == cfg.LiveIn(cfg.Ops()[0]))); + ASSERT_TRUE((std::set{"a", "b"} == cfg.LiveOut(cfg.Ops()[0]))); + + // test assign op + ASSERT_TRUE((std::set{"a", "b"} == cfg.LiveIn(cfg.Ops()[1]))); + ASSERT_TRUE((std::set{"b", "c"} == cfg.LiveOut(cfg.Ops()[1]))); + + // test sum op + ASSERT_TRUE((std::set{"b", "c"} == cfg.LiveIn(cfg.Ops()[2]))); + ASSERT_TRUE((std::set{"d"} == cfg.LiveOut(cfg.Ops()[2]))); + + // test assign op + ASSERT_TRUE((std::set{"d"} == cfg.LiveIn(cfg.Ops()[3]))); + ASSERT_TRUE((std::set{} == cfg.LiveOut(cfg.Ops()[3]))); +} + +// 1. normal test +TEST(SortOpLikeDescOrder, NormalTest) { + auto prog = FillProgramDesc(); + ir::Graph graph(prog); + + auto nodes = SortOpLikeDescOrder(graph); + auto op_descs = prog.Block(0).AllOps(); + for (size_t i = 0; i < nodes.size(); ++i) { + auto node = nodes[i]; + auto op_desc = op_descs[i]; + ASSERT_TRUE(IsSameDesc(node->Op(), op_desc)); + } +} + +// 2. remove some op_desc +TEST(SortOpLikeDescOrder, RemoveOpDesc) { + auto prog = FillProgramDesc(); + ir::Graph graph(prog); + auto nodes = graph.Nodes(); + auto op_descs = prog.Block(0).AllOps(); + ir::Node* found_node = nullptr; + for (auto node : nodes) { + if (node->IsOp() && node->outputs.back()->Name() == "e") { + found_node = node; + break; + } + } + PADDLE_ENFORCE(found_node != nullptr); + for (auto it = op_descs.begin(); it != op_descs.end();) { + if (IsSameDesc(*it, found_node->Op())) { + it = op_descs.erase(it); + } else { + ++it; + } + } + + auto find_node_in_graph = [&](std::string s) { + ir::Node* ret = nullptr; + for (auto n : graph.Nodes()) { + if (n->Name() == s) { + ret = n; + break; + } + } + PADDLE_ENFORCE(ret != nullptr); + return ret; + }; + + ir::Node* e = find_node_in_graph("e"); + ir::Node* d = find_node_in_graph("d"); + std::remove(d->outputs.begin(), d->outputs.end(), found_node); + graph.RemoveNode(found_node); + graph.RemoveNode(e); + + // other node keeps the same order + auto remain_nodes = SortOpLikeDescOrder(graph); + for (size_t i = 0; i < remain_nodes.size(); ++i) { + auto node = remain_nodes[i]; + auto op_desc = op_descs[i]; + ASSERT_TRUE(IsSameDesc(node->Op(), op_desc)); + } +} + +// 3. add some op_desc +TEST(SortOpLikeDescOrder, AddOpDesc) { + auto prog = FillProgramDesc(); + ir::Graph graph(prog); + + auto find_node_in_graph = [&](std::string s) { + ir::Node* ret = nullptr; + for (auto n : graph.Nodes()) { + if (n->Name() == s) { + ret = n; + break; + } + } + PADDLE_ENFORCE(ret != nullptr); + return ret; + }; + + // cached desc different with real one + // mimic the intermidiete pass modify the programdesc. + std::vector op_descs = graph.OriginProgram().Block(0).AllOps(); + + auto op = prog.MutableBlock(0)->AppendOp(); + prog.MutableBlock(0)->Var("d1")->SetType(proto::VarType::LOD_TENSOR); + op->SetType("sum"); + op->SetInput("X", {"b", "c"}); + op->SetOutput("Out", {"d1"}); + ir::Node* node = graph.CreateOpNode(op); + ir::Node* d1 = graph.CreateVarNode(prog.MutableBlock(0)->Var("d1")); + ir::Node* b = find_node_in_graph("b"); + ir::Node* c = find_node_in_graph("c"); + node->outputs.emplace_back(d1); + node->inputs.emplace_back(b); + node->inputs.emplace_back(c); + d1->inputs.emplace_back(node); + b->outputs.emplace_back(node); + c->outputs.emplace_back(node); + op_descs.insert(op_descs.begin() + 4, op); + + auto nodes = SortOpLikeDescOrder(graph); + + for (size_t i = 0; i < nodes.size(); ++i) { + auto node = nodes[i]; + auto op_desc = op_descs[i]; + ASSERT_TRUE(IsSameDesc(node->Op(), op_desc)); + } +} + +// 4. add and delete some op_desc +TEST(SortOpLikeDescOrder, AddAndDeleteOpDesc) { + auto prog = FillProgramDesc(); + ir::Graph graph(prog); + + auto find_node_in_graph = [&](std::string s) { + ir::Node* ret = nullptr; + for (auto n : graph.Nodes()) { + if (n->Name() == s) { + ret = n; + break; + } + } + PADDLE_ENFORCE(ret != nullptr); + return ret; + }; + + std::vector op_descs = graph.OriginProgram().Block(0).AllOps(); + + // remove sum node + ir::Node* found_node = nullptr; + auto nodes = graph.Nodes(); + for (auto node : nodes) { + if (node->Name() == "sum") { + found_node = node; + break; + } + } + PADDLE_ENFORCE(found_node != nullptr); + for (auto it = op_descs.begin(); it != op_descs.end();) { + if (IsSameDesc(*it, found_node->Op())) { + it = op_descs.erase(it); + } else { + ++it; + } + } + { + ir::Node* d = find_node_in_graph("d"); + ir::Node* c = find_node_in_graph("c"); + ir::Node* e = find_node_in_graph("e"); + std::remove(d->outputs.begin(), d->outputs.end(), found_node); + std::remove(c->outputs.begin(), c->outputs.end(), found_node); + ir::Node* pending_op = found_node->outputs[0]->outputs[0]; + graph.RemoveNode(e); + graph.RemoveNode(pending_op); + graph.RemoveNode(found_node); + } + + // add node + auto op = prog.MutableBlock(0)->AppendOp(); + prog.MutableBlock(0)->Var("d1")->SetType(proto::VarType::LOD_TENSOR); + op->SetType("sum"); + op->SetInput("X", {"b", "c"}); + op->SetOutput("Out", {"d1"}); + { + ir::Node* node = graph.CreateOpNode(op); + ir::Node* d1 = graph.CreateVarNode(prog.MutableBlock(0)->Var("d1")); + ir::Node* b = find_node_in_graph("b"); + ir::Node* c = find_node_in_graph("c"); + node->outputs.emplace_back(d1); + node->inputs.emplace_back(b); + node->inputs.emplace_back(c); + b->outputs.emplace_back(node); + c->outputs.emplace_back(node); + } + op_descs.insert(op_descs.begin() + 2, op); + + // check the order + auto mynodes = SortOpLikeDescOrder(graph); + for (size_t i = 0; i < mynodes.size(); ++i) { + auto node = mynodes[i]; + auto op_desc = op_descs[i]; + ASSERT_TRUE(IsSameDesc(node->Op(), op_desc)); + } +} + +// 5. add and replace some op_desc inplace. +TEST(SortOpLikeDescOrder, AddAndReplaceOpDescInplace) { + auto prog = FillProgramDesc(); + ir::Graph graph(prog); + std::vector op_descs = graph.OriginProgram().Block(0).AllOps(); + + auto find_node_in_graph = [&](std::string s) { + ir::Node* ret = nullptr; + for (auto n : graph.Nodes()) { + if (n->Name() == s) { + ret = n; + break; + } + } + PADDLE_ENFORCE(ret != nullptr); + return ret; + }; + + // add node + auto op = prog.MutableBlock(0)->AppendOp(); + prog.MutableBlock(0)->Var("d1")->SetType(proto::VarType::LOD_TENSOR); + op->SetType("sum"); + op->SetInput("X", {"b", "c"}); + op->SetOutput("Out", {"d1"}); + { + ir::Node* node = graph.CreateOpNode(op); + ir::Node* d1 = graph.CreateVarNode(prog.MutableBlock(0)->Var("d1")); + ir::Node* b = find_node_in_graph("b"); + ir::Node* c = find_node_in_graph("c"); + node->outputs.emplace_back(d1); + node->inputs.emplace_back(b); + node->inputs.emplace_back(c); + d1->inputs.emplace_back(node); + b->outputs.emplace_back(node); + c->outputs.emplace_back(node); + } + + op_descs.emplace_back(op); + + // replace op_desc inplace + auto nodes = graph.Nodes(); + ir::Node* found_node = nullptr; + for (auto node : nodes) { + if (node->IsOp() && node->Op() && node->Name() == "assign") { + if (node->outputs.size() == 1 && node->outputs[0]->Name() == "e") { + found_node = node; + break; + } + } + } + { + ir::Node* d = find_node_in_graph("d"); + ir::Node* e = find_node_in_graph("e"); + std::remove(d->outputs.begin(), d->outputs.end(), found_node); + std::remove(e->inputs.begin(), e->inputs.end(), found_node); + graph.RemoveNode(found_node); + } + op_descs.erase(op_descs.begin() + 3); + + auto replace_op = prog.MutableBlock(0)->AppendOp(); + replace_op->SetType("sum"); + replace_op->SetInput("X", {"d", "d1"}); + replace_op->SetOutput("Out", {"e"}); + { + ir::Node* sum2 = graph.CreateOpNode(replace_op); + ir::Node* e = find_node_in_graph("e"); + ir::Node* d = find_node_in_graph("d"); + ir::Node* d1 = find_node_in_graph("d1"); + sum2->inputs.emplace_back(d); + sum2->inputs.emplace_back(d1); + sum2->outputs.emplace_back(e); + e->inputs.emplace_back(sum2); + d->outputs.emplace_back(sum2); + d1->outputs.emplace_back(sum2); + } + + op_descs.emplace_back(replace_op); + // compare op order + auto graph_nodes = SortOpLikeDescOrder(graph); + for (size_t i = 0; i < graph_nodes.size(); ++i) { + auto node = graph_nodes[i]; + auto op_desc = op_descs[i]; + ASSERT_TRUE(IsSameDesc(node->Op(), op_desc)); + } +} + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/memory_optimize_pass.cc b/paddle/fluid/framework/details/memory_optimize_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..ddaef206028b16dd10c2beb57ce6bf30103a8d10 --- /dev/null +++ b/paddle/fluid/framework/details/memory_optimize_pass.cc @@ -0,0 +1,340 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/details/memory_optimize_pass.h" +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "gflags/gflags.h" +#include "paddle/fluid/framework/data_type.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_helper.h" + +DEFINE_bool(enable_subgraph_optimize, false, + "SubGraph also reuse global graph variables, it will reduce the " + "memory occupation" + "but a higher risk of memory reuse error. default disabled."); +DEFINE_string(memory_optimize_debug, "", + "debug the operator output variable when do the variable reuse." + "memory reuse pass." + "only for debug, default disabled."); + +namespace paddle { +namespace framework { +namespace details { + +void MemoryOptimizePass::ApplyImpl(ir::Graph* graph) const { + auto nodes = graph->Nodes(); + CollectSkipVarsSet(nodes); + + cfg_.reset(new details::ControlFlowGraph(*graph)); + cfg_->LiveVariableAnalysis(); + InitSSAGraphNodes(); + + int reuse_id = 0; + for (size_t idx = 0; idx < cfg_->Ops().size(); ++idx) { + auto& op = cfg_->Ops()[idx]; + auto* op_desc = op->Op(); + // some op in graph has no op desc + if (op_desc == nullptr) continue; + if (OpHasSubBlock(op_desc)) { + if (FLAGS_enable_subgraph_optimize) { + SubGraphOptimize(op_desc); + } else { + VLOG(3) << op->Name() + << " has subblock, but disable subgraph optimize. skipped."; + continue; + } + } + + for (auto& var : op->outputs) { + if (var->IsVar() && !var->IsCtrlVar() && skip_set_.count(var->Name())) { + VLOG(3) << "Skip set contains variable of " << var->Name() + << "disable reuse on it. skipped"; + continue; + } + if (NodeCanReused(var) && cfg_->Use(op).count(var->Name()) == 0) { + ir::Node* cache = pool_.FindBestFitNode(var); + while (cache != nullptr && var->Name() == cache->Name()) { + VLOG(3) << "The same cache variable is cascade reused. " + << cache->Name() << " is re-filled to the pool after " + << "the reused op is finished. Current op can not " + << "replace it again. Skip this candidate."; + cache = pool_.FindNextBestFitNode(var, cache); + } + if (var->Name() == FLAGS_memory_optimize_debug) { + VLOG(3) << "start match var " << DebugString(var) << " of op " + << op->Name(); + VLOG(3) << pool_.ToString(); + VLOG(3) << "matched in pool : " + << ((cache == nullptr) ? "False" : "True"); + } + + if (cache != nullptr) { + int node_idx_in_pool = pool_.GetNodeIndexInPool(cache); + VLOG(3) << string::Sprintf( + "!!! %s, %s => %s, cache idx %d, pool size %d", + std::to_string(reuse_id++), DebugString(var), DebugString(cache), + node_idx_in_pool, static_cast(pool_.size())); + // NOTE(dzhwinter): update the ProgramDesc/IR Graph + // and the CFG Graph on the fly. + // + // IR Graph define the dependence relationship between nodes. + // + // ProgramDesc defines the input/output vars. Its used in + // CreateOp, CreateVar when running happens. + // + // CFG Graph store the liveness information, when reuse happens + // we also need to update the variable liveness. + const std::string var_name = var->Name(); + const std::string cache_name = cache->Name(); + + cfg_->RenameVarInCFGGraph(var_name, cache_name, idx); + RenameVarInGraphDesc(var_name, cache_name, idx); + RenameVarInGraphNode(var_name, cache_name, idx, graph); + pool_.Erase(cache_name); + } + } + } + // fill the pool + for (auto& var : cfg_->Unlived(op)) { + ir::Node* var_node = cfg_->GetNodeByName(var, op); + if (var_node == nullptr || var_node->IsCtrlVar()) continue; + if (NodeCanReused(var_node) && !pool_.Has(var_node)) { + pool_.Insert(var_node); + } + } + } + graph->ResolveHazard(var_nodes_); +} + +void MemoryOptimizePass::SubGraphOptimize(OpDesc* op_desc) const { + // conditional block, while op and their grad op + auto* sub_block_desc = + AttrReader(op_desc->GetAttrMap()).Get("sub_block"); + + // create a mirror block to construct an IR Graph. + ProgramDesc prog; + auto* copy_block = prog.MutableBlock(0); + for (auto* op : sub_block_desc->AllOps()) { + auto* copy_op = copy_block->AppendOp(); + copy_op->CopyFrom(*op); + copy_op->Flush(); + } + + for (auto* var : sub_block_desc->AllVars()) { + auto* copy_var = copy_block->Var(var->Name()); + copy_var->SetDataType(var->GetDataType()); + // only lod tensor can be reused. So ignore the multiple dims case. + copy_var->SetType(var->GetType()); + copy_var->SetShape(var->GetShape()); + copy_var->SetPersistable(var->Persistable()); + } + + ir::Graph sub_graph(prog); + std::unordered_set sub_graph_all_ops; + FilterVariables(sub_graph.Nodes(), [&](ir::Node* var) { + // sub_graph_all_ops.emplace(var); + if (var->IsVar() && !var->IsCtrlVar()) { + sub_graph_all_ops.emplace(var); + } + }); + int sub_reuse_id = 0; + // subgraph nodes is unordered, reuse need to follow the desc order. + // find the right op node through the descs + for (auto* sub_op_desc : sub_block_desc->AllOps()) { + ir::Node* sub_op = nullptr; + for (auto* node : sub_graph_all_ops) { + if (node->Op() == sub_op_desc) { + sub_op = node; + break; + } + } + PADDLE_ENFORCE(sub_op != nullptr); + for (auto* var : sub_op->outputs) { + if (NodeCanReused(var)) { + ir::Node* cache = pool_.FindBestFitNode(var); + if (cache != nullptr) { + if (var->Var()->GetDataType() != cache->Var()->GetDataType()) { + continue; + } + int node_idx_in_pool = pool_.GetNodeIndexInPool(cache); + VLOG(3) << string::Sprintf( + "!!! %s, %s => %s, cache idx %d, pool size %d", + std::to_string(sub_reuse_id++), DebugString(var), + DebugString(cache), node_idx_in_pool, + static_cast(pool_.size())); + // NOTE(dzh): subblock is not in IR graph. Modify the block_desc + // immediately to make the subblock variable reuse strategy take + // effect. Because it is a single op in graph. No need to + // update the ir nodes. + // FIXME(liuwei1031): Graph is not aware of the existence of + // BlockDescs and ProgramDescs. + // The operations related to BlockDesc or ProgramDesc should perform + // on Graph or Node directly! + sub_op_desc->Rename(var->Name(), cache->Name()); + if (sub_op_desc->Block() != nullptr && + sub_op_desc->Block()->HasVar(var->Name())) { + sub_op_desc->Block()->RemoveVar(var->Name()); + } + } + } + } + } +} + +void MemoryOptimizePass::CollectSkipVarsSet( + const std::unordered_set& nodes) const { + auto update_skip_set = [&](OpDesc* op_desc) { + auto inputs = op_desc->InputArgumentNames(); + auto outputs = op_desc->OutputArgumentNames(); + skip_set_.insert(inputs.begin(), inputs.end()); + skip_set_.insert(outputs.begin(), outputs.end()); + }; + for (auto& op : nodes) { + if (!op->IsOp() || op->Op() == nullptr) continue; + auto* op_desc = op->Op(); + // NOTE(dzhwinter): + // current block can not reuse next level block vars. + if (OpHasSubBlock(op_desc)) update_skip_set(op_desc); + // NOTE(dzhwinter): + // distributed ops input/output name need to + // keep same bettwen trainer/pserver + if (op_desc->Type() == "send") update_skip_set(op_desc); + if (op_desc->Type() == "recv") update_skip_set(op_desc); + if (op_desc->Type() == "prefetch") update_skip_set(op_desc); + } +} + +void MemoryOptimizePass::RenameVarInGraphDesc(const std::string& var, + const std::string& cache_var, + size_t idx) const { + for (size_t i = idx; i < cfg_->Ops().size(); ++i) { + auto* op = cfg_->Ops()[i]; + PADDLE_ENFORCE(op->IsOp() && op->Op()); + auto* op_desc = op->Op(); + op_desc->RenameInput(var, cache_var); + op_desc->RenameOutput(var, cache_var); + if (op_desc->Block() != nullptr) { + op_desc->Block()->RemoveVar(var); + } else { + LOG(WARNING) << "op " << op->Name() << " not know its block." + << "Is the op_desc created without block pointer? " + << "Can not find " << var << " in Block(0)"; + } + op_desc->Flush(); + } +} + +void MemoryOptimizePass::InitSSAGraphNodes() const { + std::unordered_map> all_vars; + if (var_nodes_.empty()) { + for (auto* op : cfg_->Ops()) { + for (auto* node : op->inputs) { + if (all_vars[node->Name()].count(node) == 0) { + all_vars[node->Name()].emplace(node); + var_nodes_[node->Name()].emplace_back(node); + } + } + for (auto* node : op->outputs) { + if (all_vars[node->Name()].count(node) == 0) { + all_vars[node->Name()].emplace(node); + var_nodes_[node->Name()].emplace_back(node); + } + } + } + } +} + +void MemoryOptimizePass::RenameVarInGraphNode(const std::string& var, + const std::string& cache_var, + size_t idx, + ir::Graph* graph) const { + // if replace happens, we need to create a newer version cache_var + // but use the same dims/data_type with var. + PADDLE_ENFORCE(var_nodes_[var].size() >= 1 && + var_nodes_[var].at(0)->Var() != nullptr); + std::unique_ptr var_desc(new VarDesc(*var_nodes_[var].at(0)->Var())); + var_desc->SetName(cache_var); + + for (size_t i = idx; i < cfg_->Ops().size(); ++i) { + auto* op = cfg_->Ops()[i]; + + // redirect the input to the latest version of cache_var + for (auto* node : op->inputs) { + if (node->Name() == var) { + ir::Node* cache_node = var_nodes_[cache_var].back(); + + // swap node to cache_node + cache_node->outputs.insert(cache_node->outputs.end(), + node->outputs.begin(), node->outputs.end()); + PADDLE_ENFORCE(node->inputs.size() == 1 && node->inputs[0]->IsOp()); + auto* prev_op = node->inputs[0]; + std::replace(prev_op->outputs.begin(), prev_op->outputs.end(), node, + cache_node); + for (auto* next_op : node->outputs) { + std::replace(next_op->inputs.begin(), next_op->inputs.end(), node, + cache_node); + } + + // erase unused node + auto& nodes = var_nodes_.at(var); + nodes.erase(std::remove(nodes.begin(), nodes.end(), node), nodes.end()); + graph->RemoveNode(node); + } + } + + // if we need to rename the output, + // always create a newer version of cache_var + for (auto* node : op->outputs) { + if (node->Name() == var) { + ir::Node* cache_node = graph->CreateVarNode(var_desc.get()); + var_nodes_[cache_var].emplace_back(cache_node); + + // swap node to cache node + cache_node->outputs.insert(cache_node->outputs.end(), + node->outputs.begin(), node->outputs.end()); + cache_node->inputs.emplace_back(op); + std::replace(op->outputs.begin(), op->outputs.end(), node, cache_node); + for (auto* next_op : node->outputs) { + std::replace(next_op->inputs.begin(), next_op->inputs.end(), node, + cache_node); + } + + // erase unused node + auto& nodes = var_nodes_.at(var); + nodes.erase(std::remove(nodes.begin(), nodes.end(), node), nodes.end()); + graph->RemoveNode(node); + } + } + } +} + +} // namespace details +} // namespace framework +} // namespace paddle + +REGISTER_PASS(memory_optimize_pass, + paddle::framework::details::MemoryOptimizePass) + .RequireGraphAttr(paddle::framework::details::kStaleProgramOpDescs); diff --git a/paddle/fluid/framework/details/memory_optimize_pass.h b/paddle/fluid/framework/details/memory_optimize_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..ce94890b3856fa6bf167b8a08c814f81e422c372 --- /dev/null +++ b/paddle/fluid/framework/details/memory_optimize_pass.h @@ -0,0 +1,71 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "paddle/fluid/framework/data_type.h" +#include "paddle/fluid/framework/details/memory_optimize_helper.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/pass.h" + +namespace paddle { +namespace framework { +namespace details { + +class MemoryOptimizePass : public ir::Pass { + protected: + void ApplyImpl(ir::Graph* graph) const override; + // fill the variable map(var_nodes) by version. + void InitSSAGraphNodes() const; + + private: + // update program descs + void RenameVarInGraphDesc(const std::string& var, + const std::string& cache_var, size_t idx) const; + // update ir nodes + void RenameVarInGraphNode(const std::string& var, + const std::string& cache_var, size_t idx, + ir::Graph* graph) const; + + void SubGraphOptimize(OpDesc* op_desc) const; + // 1. scan op with subblock and collect the output/input vars. + // while, while_grad, conditional_block + // 2. scan distributed ops and collect the output/input vars + void CollectSkipVarsSet(const std::unordered_set&) const; + + private: + // Reuse Node Pool, Owned. + mutable OrderedSet pool_; + // controlflow Graph + mutable std::unique_ptr cfg_; + // skip set + mutable std::unordered_set skip_set_; + // var nodes + mutable std::map> var_nodes_; +}; + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/modify_op_lock_and_record_event_pass.cc b/paddle/fluid/framework/details/modify_op_lock_and_record_event_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..ae363f96393bddac4c88c7caf0ef6087ea848fb9 --- /dev/null +++ b/paddle/fluid/framework/details/modify_op_lock_and_record_event_pass.cc @@ -0,0 +1,58 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/details/modify_op_lock_and_record_event_pass.h" +#include "paddle/fluid/framework/details/computation_op_handle.h" +#include "paddle/fluid/framework/details/multi_devices_helper.h" +#include "paddle/fluid/framework/details/op_graph_view.h" +#include "paddle/fluid/framework/ir/graph_helper.h" + +namespace paddle { +namespace framework { +namespace details { + +static bool IsLockAndRecordEventFreeComputationOpHandle( + ComputationOpHandle *op, const OpGraphView &graph_view) { + if (!platform::is_gpu_place(op->GetPlace())) return false; + for (auto &pending_op : graph_view.PendingOps(op)) { + auto *tmp = dynamic_cast(pending_op); + if (tmp == nullptr || !(tmp->GetPlace() == op->GetPlace())) { + return false; + } + } + return true; +} + +void ModifyOpLockAndRecordEventPass::ApplyImpl(ir::Graph *ir_graph) const { + auto all_ops = ir::FilterByNodeWrapper(*ir_graph); + OpGraphView graph_view(all_ops); + for (auto &op : all_ops) { + auto *compute_op = dynamic_cast(op); + if (compute_op == nullptr) continue; + bool is_lock_and_record_event_free = + IsLockAndRecordEventFreeComputationOpHandle(compute_op, graph_view); + compute_op->SetLockAndRecordEventFree(is_lock_and_record_event_free); + if (is_lock_and_record_event_free) { + VLOG(10) << "Set is_lock_and_record_event_free be true in op " + << compute_op->DebugString(); + } + } +} + +} // namespace details +} // namespace framework +} // namespace paddle + +REGISTER_PASS(modify_op_lock_and_record_event_pass, + paddle::framework::details::ModifyOpLockAndRecordEventPass); diff --git a/paddle/fluid/framework/details/modify_op_lock_and_record_event_pass.h b/paddle/fluid/framework/details/modify_op_lock_and_record_event_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..54d52d6240a830dfc66f13c26fb79a896897f980 --- /dev/null +++ b/paddle/fluid/framework/details/modify_op_lock_and_record_event_pass.h @@ -0,0 +1,31 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/pass.h" + +namespace paddle { +namespace framework { +namespace details { + +class ModifyOpLockAndRecordEventPass : public ir::Pass { + protected: + void ApplyImpl(ir::Graph* graph) const override; +}; + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/multi_devices_graph_check_pass.cc b/paddle/fluid/framework/details/multi_devices_graph_check_pass.cc index c9c255864a2477ed29873f8521acce37fa928c06..9859b04dec4193812769cc63d4489a9150b973f2 100644 --- a/paddle/fluid/framework/details/multi_devices_graph_check_pass.cc +++ b/paddle/fluid/framework/details/multi_devices_graph_check_pass.cc @@ -12,76 +12,85 @@ // See the License for the specific language governing permissions and // limitations under the License. -#include "paddle/fluid/framework/details/multi_devices_graph_check_pass.h" #include +#include "paddle/fluid/framework/details/multi_devices_helper.h" #include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_helper.h" namespace paddle { namespace framework { namespace details { -bool SSAGraghBuilderWithChecker::IsValidGraph(const ir::Graph *graph) const { - std::unordered_map pending_ops; - std::unordered_set pending_vars; - std::unordered_set ready_vars; - std::unordered_set ready_ops; +class SSAGraghBuilderWithChecker : public ir::Pass { + protected: + void ApplyImpl(ir::Graph *graph) const override { + PADDLE_ENFORCE(IsValidGraph(graph)); + } - auto insert_pending_var = [&](VarHandleBase *var) { - pending_vars.insert(var); - if (var->GeneratedOp() == nullptr) { - ready_vars.emplace(var); - } - }; + bool IsValidGraph(const ir::Graph *graph) const { + std::unordered_map pending_ops; + std::unordered_set pending_vars; + std::unordered_set ready_vars; + std::unordered_set ready_ops; - for (auto &var_map : graph->Get(kGraphVars)) { - for (auto &name_pair : var_map) { - for (auto &version_pair : name_pair.second) { - insert_pending_var(version_pair.get()); + auto insert_pending_var = [&](VarHandleBase *var) { + pending_vars.insert(var); + if (var->GeneratedOp() == nullptr) { + ready_vars.emplace(var); } - } - } + }; - for (auto &var : graph->Get(kGraphDepVars)) { - insert_pending_var(var.get()); - } + for (auto &var_map : graph->Get(kGraphVars)) { + for (auto &name_pair : var_map) { + for (auto &version_pair : name_pair.second) { + insert_pending_var(version_pair); + } + } + } - for (auto &op : graph->Get(kGraphOps)) { - if (op->Inputs().empty()) { - ready_ops.insert(op.get()); - } else { - pending_ops.insert({op.get(), op.get()->NoDupInputSize()}); + for (auto &var : graph->Get(kGraphDepVars)) { + insert_pending_var(var); } - } - auto run_all_ops = [&](std::unordered_set &set) { - for (auto *op : set) { - for (auto out : op->Outputs()) { - ready_vars.emplace(out); + for (OpHandleBase *op : ir::FilterByNodeWrapper(*graph)) { + if (op->Inputs().empty()) { + ready_ops.insert(op); + } else { + pending_ops.insert({op, op->NoDupInputSize()}); } } - set.clear(); - }; - while (!pending_vars.empty()) { - run_all_ops(ready_ops); + auto run_all_ops = [&](std::unordered_set &set) { + for (auto *op : set) { + for (auto out : op->Outputs()) { + ready_vars.emplace(out); + } + } + set.clear(); + }; - if (ready_vars.empty()) { - return false; - } + while (!pending_vars.empty()) { + run_all_ops(ready_ops); - for (auto ready_var : ready_vars) { - pending_vars.erase(ready_var); - for (auto *op : ready_var->PendingOps()) { - auto &deps = --pending_ops[op]; - if (deps == 0) { - ready_ops.insert(op); + if (ready_vars.empty()) { + return false; + } + + for (auto ready_var : ready_vars) { + pending_vars.erase(ready_var); + for (auto *op : ready_var->PendingOps()) { + auto &deps = --pending_ops[op]; + if (deps == 0) { + ready_ops.insert(op); + } } } + ready_vars.clear(); } - ready_vars.clear(); + return true; } - return true; -} +}; + } // namespace details } // namespace framework } // namespace paddle @@ -89,6 +98,4 @@ bool SSAGraghBuilderWithChecker::IsValidGraph(const ir::Graph *graph) const { REGISTER_PASS(multi_devices_check_pass, paddle::framework::details::SSAGraghBuilderWithChecker) .RequireGraphAttr(paddle::framework::details::kGraphVars) - .RequireGraphAttr(paddle::framework::details::kGraphDepVars) - .RequireGraphAttr(paddle::framework::details::kGraphOps) - .RequireGraphAttr(paddle::framework::details::kShardedVarDevice); + .RequireGraphAttr(paddle::framework::details::kGraphDepVars); diff --git a/paddle/fluid/framework/details/multi_devices_graph_check_pass.h b/paddle/fluid/framework/details/multi_devices_graph_check_pass.h deleted file mode 100644 index 1e2b1867c376956d7d2dac465c13e2f3f64ba7eb..0000000000000000000000000000000000000000 --- a/paddle/fluid/framework/details/multi_devices_graph_check_pass.h +++ /dev/null @@ -1,38 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#pragma once - -#include "paddle/fluid/framework/details/multi_devices_helper.h" - -#include - -namespace paddle { -namespace framework { -namespace details { - -class SSAGraghBuilderWithChecker : public ir::Pass { - protected: - std::unique_ptr ApplyImpl( - std::unique_ptr graph) const override { - PADDLE_ENFORCE(IsValidGraph(graph.get())); - return graph; - } - - bool IsValidGraph(const ir::Graph* graph) const; -}; - -} // namespace details -} // namespace framework -} // namespace paddle diff --git a/paddle/fluid/framework/details/multi_devices_graph_pass.cc b/paddle/fluid/framework/details/multi_devices_graph_pass.cc index 134fcee826715672a6e021e9bf694bb771ebb830..f213e07b555ca9fc4b73a2f91412063f4e7f47d4 100644 --- a/paddle/fluid/framework/details/multi_devices_graph_pass.cc +++ b/paddle/fluid/framework/details/multi_devices_graph_pass.cc @@ -11,17 +11,20 @@ // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. // See the License for the specific language governing permissions and // limitations under the License. +#include "paddle/fluid/framework/details/multi_devices_graph_pass.h" #include #include +#include #include +#include +#include #include #include - #include "paddle/fluid/framework/details/all_reduce_op_handle.h" #include "paddle/fluid/framework/details/broadcast_op_handle.h" #include "paddle/fluid/framework/details/computation_op_handle.h" -#include "paddle/fluid/framework/details/data_balance_op_handle.h" -#include "paddle/fluid/framework/details/multi_devices_graph_pass.h" +#include "paddle/fluid/framework/details/fetch_barrier_op_handle.h" +#include "paddle/fluid/framework/details/fused_broadcast_op_handle.h" #include "paddle/fluid/framework/details/reduce_op_handle.h" #include "paddle/fluid/framework/details/rpc_op_handle.h" #include "paddle/fluid/framework/details/scale_loss_grad_op_handle.h" @@ -29,11 +32,25 @@ #include "paddle/fluid/framework/ir/node.h" #include "paddle/fluid/framework/op_info.h" #include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/operators/math/math_function.h" namespace paddle { namespace framework { namespace details { + namespace { +// TODO(panyx0718): Clean this up as well. +// all operators. NOTE that even we use a vector here, the operators is +// unordered. +typedef std::vector GraphOps; +const char kGraphOps[] = "ops"; + +bool OpHaveRole(const ir::Node &node, const framework::OpRole &role) { + return boost::get( + node.Op()->GetAttr(OpProtoAndCheckerMaker::OpRoleAttrName())) == + static_cast(role); +} + void PolishGraphToSupportDataHazards(ir::Graph *graph) { for (auto &var_map : graph->Get(kGraphVars)) { for (auto &name_pair : var_map) { @@ -91,7 +108,7 @@ VarHandle *CreateOrGetLatestVarHandle(ir::Graph *graph, ir::Node *node, } var_holder.emplace_back(var); } else { - var = var_holder.rbegin()->get(); + var = *var_holder.rbegin(); } return var; } @@ -120,174 +137,27 @@ void AddOutputToLeafOps(ir::Graph *graph) { } } // namespace -static const char kLossVarName[] = "loss_var_name"; -static const char kPlaces[] = "places"; -static const char kParams[] = "params"; -static const char kLocalScopes[] = "local_scopes"; -static const char kStrategy[] = "strategy"; +void MultiDevSSAGraphBuilderBase::CheckGraph(const ir::Graph &graph) const {} -void MultiDevSSAGraphBuilder::Init() const { +void MultiDevSSAGraphBuilderBase::Init() const { all_vars_.clear(); - balance_vars_.clear(); loss_var_name_ = Get(kLossVarName); + VLOG(10) << "Init MultiDevSSAGraphBuilder, loss name: " << loss_var_name_; places_ = Get>(kPlaces); local_scopes_ = Get>(kLocalScopes); strategy_ = Get(kStrategy); -#ifdef PADDLE_WITH_CUDA - nccl_ctxs_ = &Get("nccl_ctxs"); +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + nccl_ctxs_ = &Get(kNCCLCtxs); #endif - - for (auto &p : Get>(kParams)) { - grad_names_.insert(GradVarName(p)); - } - balance_vars_.resize(places_.size(), 0); - if (strategy_.enable_data_balance_ && places_.size() == 1) { - LOG(WARNING) << "It is no need to enable data balance when there is only " - "one place. enable_data_balance is set to False."; - strategy_.enable_data_balance_ = false; - } -} - -void MultiDevSSAGraphBuilder::CreateOpHandleIOs(ir::Graph *result, - ir::Node *node, - size_t place_id) const { - auto p = places_[place_id]; - auto *op_handle = result->Get(kGraphOps).back().get(); - op_handle->SetDeviceContext(p, - platform::DeviceContextPool::Instance().Get(p)); - - for (ir::Node *input : node->inputs) { - VarHandle *var = CreateOrGetLatestVarHandle(result, input, p, place_id); - op_handle->AddInput(var); - } - - for (ir::Node *output : node->outputs) { - ir::Node *new_node = nullptr; - if (output->Var()) { - new_node = result->CreateVarNode(output->Var()); - } else { - new_node = - result->CreateEmptyNode(output->Name(), ir::Node::Type::kVariable); - } - CreateOpOutput(result, op_handle, new_node, p, place_id); - } + PADDLE_ENFORCE_EQ(places_.size(), local_scopes_.size()); } -std::vector MultiDevSSAGraphBuilder::FindDistTrainSendVars( - const std::vector &nodes) const { - std::vector send_vars; - // since parameters are all in block 0, - // it's enough to only scan send ops in block 0 - for (auto &node : nodes) { - OpDesc *op = node->Op(); - // TODO(Yancey1989): use a graceful method to find send op, - // instead of the the hard code string - if (op->Type() == "send") { - auto op_vars = op->InputArgumentNames(); - send_vars.reserve(send_vars.size() + - std::distance(op_vars.begin(), op_vars.end())); - send_vars.insert(send_vars.end(), op_vars.begin(), op_vars.end()); - } - } - return send_vars; -} - -std::vector MultiDevSSAGraphBuilder::FindDistTrainRecvVars( - const std::vector &nodes) const { - std::vector recv_vars; - for (auto &node : nodes) { - OpDesc *op = node->Op(); - // TODO(Yancey1989): use a graceful method to find recv op, - // instead of the hard code string - if (op->Type() == "recv") { - auto op_vars = op->OutputArgumentNames(); - recv_vars.reserve(recv_vars.size() + - std::distance(op_vars.begin(), op_vars.end())); - recv_vars.insert(recv_vars.end(), op_vars.begin(), op_vars.end()); - } - } - return recv_vars; -} - -size_t MultiDevSSAGraphBuilder::GetAppropriateDeviceID( - const std::vector &var_names) const { - int64_t numel_sum = 0; - for (auto var_name : var_names) { - if (all_vars_.find(var_name) == all_vars_.end()) continue; - auto var_desc = all_vars_.at(var_name); - PADDLE_ENFORCE_NOT_NULL(var_desc); - auto dim = framework::make_ddim(var_desc->GetShape()); - int64_t numel = framework::product(dim); - PADDLE_ENFORCE_GT(numel, 0); - numel_sum += numel; - } - - auto smallest = - std::min_element(std::begin(balance_vars_), std::end(balance_vars_)); - size_t dev_id = - static_cast(std::distance(std::begin(balance_vars_), smallest)); - balance_vars_[dev_id] += numel_sum; - return dev_id; -} - -// Topology sort the graph nodes from inputs to outputs. -// Since SSAGraphBuilder depends on forward/backward nodes to assign devices -// to parameter/gradients before optimizer ops, topo sort is insufficient. ( -// some optimizer ops might not depend on any nodes), we manually move all -// optimizer nodes after last backward nodes. -// However, the assumption by SSAGraphBuilder should be relaxed in the future. -std::vector SortOpsAndDelayOptimizeOp(const ir::Graph &graph) { - std::vector ret = ir::TopologySortOperations(graph); - size_t last_backward = 0; - for (size_t i = 0; i < ret.size(); ++i) { - if (boost::get( - ret[i]->Op()->GetAttr(OpProtoAndCheckerMaker::OpRoleAttrName())) == - static_cast(OpRole::kBackward)) { - last_backward = i; - } - } - - std::vector optimize_ops; - std::vector sorted_ret; - for (size_t i = 0; i < ret.size(); ++i) { - if (i < last_backward) { - if (boost::get(ret[i]->Op()->GetAttr( - OpProtoAndCheckerMaker::OpRoleAttrName())) == - static_cast(OpRole::kOptimize)) { - optimize_ops.push_back(ret[i]); - } else { - sorted_ret.push_back(ret[i]); - } - } else if (i == last_backward) { - sorted_ret.push_back(ret[i]); - // Verify that no operations before optimize ops depends on optimize ops. - std::unordered_set optimize_set(optimize_ops.begin(), - optimize_ops.end()); - for (ir::Node *n : sorted_ret) { - for (ir::Node *in : n->inputs) { - for (ir::Node *pre_n : in->inputs) { - PADDLE_ENFORCE(optimize_set.find(pre_n) == optimize_set.end(), - "optimize operations cannot be depended by forward " - "or backward node %s -> %s", - pre_n->Name(), n->Name()); - } - } - } - sorted_ret.insert(sorted_ret.end(), optimize_ops.begin(), - optimize_ops.end()); - } else { - sorted_ret.push_back(ret[i]); - } - } - return sorted_ret; -} - -std::unique_ptr MultiDevSSAGraphBuilder::ApplyImpl( - std::unique_ptr graph) const { +void MultiDevSSAGraphBuilderBase::ApplyImpl(ir::Graph *graph) const { Init(); - // Give the topology sort order and rebuild the graph structure. - std::vector sorted_ops = SortOpsAndDelayOptimizeOp(*graph); + CheckGraph(*graph); + std::vector sorted_ops = SortOperations(*graph); + auto nodes = graph->ReleaseNodes(); ir::Graph &result = *graph; @@ -296,178 +166,183 @@ std::unique_ptr MultiDevSSAGraphBuilder::ApplyImpl( all_vars_.emplace(node->Name(), node->Var()); } } - std::unordered_set og_has_been_broadcast; // We cannot invoke resize. It is a bug of GCC 4.8 result.Set(kGraphVars, new GraphVars(places_.size())); result.Set(kGraphDepVars, new GraphDepVars); result.Set(kGraphOps, new GraphOps); - result.Set(kShardedVarDevice, new ShardedVarDevice); - // find send/recv vars so that we can place the distributed training - // related op in the place 0 - auto send_vars = FindDistTrainSendVars(sorted_ops); - auto recv_vars = FindDistTrainRecvVars(sorted_ops); - - std::vector> bcast_var_name_set; - bcast_var_name_set.resize(places_.size()); - - size_t cur_device_id = 0; bool is_forwarding = true; - bool is_dist_train = false; for (ir::Node *node : sorted_ops) { - if (boost::get( - node->Op()->GetAttr(OpProtoAndCheckerMaker::OpRoleAttrName())) == - static_cast(OpRole::kRPC)) { - int op_dev_id = CreateRPCOp(&result, node); - PADDLE_ENFORCE(op_dev_id != -1, - "Can not schedule the RPC operator to the right place."); - if (node->Op()->Type() == "recv") { - auto recv_vars_attr = - boost::get>(node->Op()->GetNullableAttr( - OpProtoAndCheckerMaker::OpRoleVarAttrName())); - PADDLE_ENFORCE(recv_vars_attr.size() == 2UL); // [parameter, gradient] - if (recv_vars_attr[0].find(".block") == std::string::npos) { - bcast_var_name_set[op_dev_id].emplace(recv_vars_attr[0]); - } - } - is_dist_train = true; - } else if (boost::get(node->Op()->GetAttr( - OpProtoAndCheckerMaker::OpRoleAttrName())) == - static_cast(OpRole::kDist)) { - int op_dev_id = CreateDistTrainOp(&result, node); - if (node->Op()->Type() == "concat") { - auto origin_param_name = node->Op()->OutputArgumentNames()[0]; - bcast_var_name_set[op_dev_id].emplace(origin_param_name); - } - } else if (IsScaleLossOp(node)) { - // user can customize loss@grad if not use_default_grad_scale_ - if (strategy_.gradient_scale_ != - BuildStrategy::GradientScaleStrategy::kCustomized) { - // TODO(paddle-dev): Why is there no input for this op_handle? - auto loss_grad_name = node->Op()->OutputArgumentNames()[0]; - CreateScaleLossGradOp(&result, loss_grad_name); - } - // This assumes the backward generating code will ensure IsScaleLossOp - // is true only for the op that scale the final scalar loss. - // It also assumes backward op will always follow the forward op in - // the block. - is_forwarding = false; + if (DealWithSpecialOp(&result, node)) { + continue; } else { - int op_dev_id = GetOpDeviceID(result, node); - if (op_dev_id != -1) { // This op only runs on one specific device. - CreateComputationalOp(&result, node, op_dev_id); - for (ir::Node *n : node->outputs) { - graph->Get(kShardedVarDevice) - .emplace(n->Name(), op_dev_id); - } + // This op runs on all devices + if (IsScaleLossOp(node)) { + // user can customize loss@grad if not use_default_grad_scale_ + InsertScaleLossGradOp(&result, node); + // This assumes the backward generating code will ensure IsScaleLossOp + // is true only for the op that scale the final scalar loss. + // It also assumes backward op will always follow the forward op in + // the block. + is_forwarding = false; } else { - // This op runs on all devices, and its output may have parameter's - // gradients. - // TODO(paddle-dev): Why is so special about "read" op? - if (node->Op()->Type() == "read" && strategy_.enable_data_balance_) { - node->Op()->SetAttr("throw_eof_exp", false); - CreateComputationalOps(&result, node, places_.size()); - const auto &data_var_names = node->Op()->Output("Out"); - InsertDataBalanceOp(&result, data_var_names); - } else { - CreateComputationalOps(&result, node, places_.size()); - } + CreateComputationalOps(&result, node, places_.size()); + } + + // Insert collective ops if nranks > 1 + if (!is_forwarding && Get(kNRanks) > 1) { + try { + bool is_bk_op = + static_cast(boost::get(node->Op()->GetAttr( + OpProtoAndCheckerMaker::OpRoleAttrName())) & + static_cast(OpRole::kBackward)); + // optimize op is already processed in DealWithSpecialOp, + // here we only consider backward op + if (!is_bk_op) continue; + + /* + * the op that will generate the gradient of on parameter will have + one attr op_role_var + * to record the parameter and gradient, like: + attrs { + name: "op_role_var" + type: STRINGS + strings: "fc_1.b_0" + strings: "fc_1.b_0@GRAD" + } + */ - if (!is_forwarding && places_.size() > 1) { // Currently, we assume that once gradient is generated, it can be // broadcast, and each gradient is only broadcast once. - if (static_cast(boost::get(node->Op()->GetAttr( - OpProtoAndCheckerMaker::OpRoleAttrName())) & - static_cast(OpRole::kBackward))) { - try { - auto backward_vars = boost::get>( - node->Op()->GetNullableAttr( - OpProtoAndCheckerMaker::OpRoleVarAttrName())); - - PADDLE_ENFORCE_EQ(backward_vars.size() % 2, 0); - - for (size_t i = 0; i < backward_vars.size(); i += 2) { - auto &p_name = backward_vars[i]; - auto &g_name = backward_vars[i + 1]; - VLOG(10) << "Bcast " << g_name << " for parameter " << p_name; - - switch (strategy_.reduce_) { - case BuildStrategy::ReduceStrategy::kReduce: - cur_device_id = GetAppropriateDeviceID({g_name}); - CreateReduceOp(&result, g_name, cur_device_id); - graph->Get(kShardedVarDevice) - .emplace(g_name, cur_device_id); - if (!is_dist_train) { - bcast_var_name_set[cur_device_id].emplace(p_name); - } - break; - case BuildStrategy::ReduceStrategy::kAllReduce: - if (IsSparseGradient(g_name)) { - CreateReduceOp(&result, g_name, 0); - CreateBroadcastOp(&result, g_name, 0); - } else { - InsertAllReduceOp(&result, g_name); - } - break; - default: - LOG(FATAL) << "Unknown reduce strategy "; - break; - } - } - } catch (boost::bad_get e) { + auto backward_vars = + boost::get>(node->Op()->GetNullableAttr( + OpProtoAndCheckerMaker::OpRoleVarAttrName())); + PADDLE_ENFORCE_EQ(backward_vars.size() % 2, 0); + for (size_t i = 0; i < backward_vars.size(); i += 2) { + auto &p_name = backward_vars[i]; + auto &g_name = backward_vars[i + 1]; + VLOG(10) << "Bcast " << g_name << " for parameter " << p_name + << " op_type " << node->Op()->Type(); + if (NeedCollectiveForGrad(g_name, sorted_ops)) { + InsertCollectiveOp(&result, p_name, g_name); } } + } catch (boost::bad_get e) { } } } } - bool use_gpu = false; -#ifdef PADDLE_WITH_CUDA - use_gpu = nccl_ctxs_ != nullptr; -#endif - // Insert broadcast operators principle: - // 1. Broadcast optimized parameters in Reduce strategy; - // 2. No need broadcast optimized parameters in AllReduce strategy because of - // the optimization sub-graph would be run on every GPU; - // 3. Allways broadcast received parameters in Distribute Training. - if ((use_gpu && - strategy_.reduce_ == BuildStrategy::ReduceStrategy::kReduce) || - is_dist_train) { - for (size_t dev_id = 0; dev_id < bcast_var_name_set.size(); ++dev_id) { - auto &to_bcast_set = bcast_var_name_set[dev_id]; - for (auto &bcast_name : to_bcast_set) { - CreateBroadcastOp(&result, bcast_name, dev_id); - } - } - } + InsertPostprocessOps(&result); + /* Dependency graph has been constructed. However, there are still data hazards need to be handled. - */ + */ PolishGraphToSupportDataHazards(&result); /* * Only variables should be the leaves of graph. */ AddOutputToLeafOps(&result); - PADDLE_ENFORCE(!ir::HasCircle(result)); - return graph; + + result.Erase(kGraphOps); } -bool MultiDevSSAGraphBuilder::IsSparseGradient(const std::string &og) const { - PADDLE_ENFORCE(all_vars_.count(og) != 0); - if (all_vars_.at(og)->GetType() == proto::VarType::SELECTED_ROWS) { - return true; +void MultiDevSSAGraphBuilderBase::InsertScaleLossGradOp( + ir::Graph *result, const ir::Node *node) const { + // user can customize loss@grad if not use_default_grad_scale_ + size_t loss_scale = 0; + switch (this->strategy_.gradient_scale_) { + case BuildStrategy::GradientScaleStrategy::kOne: + loss_scale = 1; + break; + case BuildStrategy::GradientScaleStrategy::kCoeffNumDevice: + loss_scale = Get(kNRanks); + break; + case BuildStrategy::GradientScaleStrategy::kCustomized: + loss_scale = 0; + break; + default: + LOG(FATAL) << "Unknown gradient scale strategy."; + break; + } + + VLOG(3) << "loss_scale: " << loss_scale; + + if (loss_scale) { + // TODO(paddle-dev): Why is there no input for this op_handle? + auto loss_grad_name = node->Op()->OutputArgumentNames()[0]; + auto out_dtype = this->all_vars_.at(loss_grad_name)->GetDataType(); + this->CreateScaleLossGradOp(result, loss_grad_name, node->outputs[0], + loss_scale, out_dtype); } +} + +bool MultiDevSSAGraphBuilderBase::DealWithSpecialOp(ir::Graph *result, + ir::Node *node) const { return false; } -void MultiDevSSAGraphBuilder::SetCommunicationContext( +std::vector MultiDevSSAGraphBuilderBase::SortOperations( + const ir::Graph &graph) const { + return ir::TopologySortOperations(graph); +} + +bool MultiDevSSAGraphBuilderBase::UseGPU() const { + bool use_gpu = false; +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + use_gpu = nccl_ctxs_ != nullptr; +#endif + return use_gpu; +} + +bool MultiDevSSAGraphBuilderBase::NeedCollectiveForGrad( + const std::string &grad_name, std::vector ops) const { + // if we have allreduce_op for current gradient variable in the graph, + // then we don't need to add allreduce_op_handle for this gradient + // NOTE: This is for the case that all gradients should add collective ops + for (auto *node : ops) { + if (node->Op()->Type() != "allreduce") continue; + for (auto in_name : node->Op()->InputArgumentNames()) { + if (in_name == grad_name) { + return false; + } + } + } + return true; +} + +void MultiDevSSAGraphBuilderBase::CreateOpHandleIOs(ir::Graph *result, + ir::Node *node, + size_t place_id) const { + auto p = places_[place_id]; + auto *op_handle = result->Get(kGraphOps).back(); + op_handle->SetDeviceContext(p, + platform::DeviceContextPool::Instance().Get(p)); + + for (ir::Node *input : node->inputs) { + VarHandle *var = CreateOrGetLatestVarHandle(result, input, p, place_id); + op_handle->AddInput(var); + } + + for (ir::Node *output : node->outputs) { + ir::Node *new_node = nullptr; + if (output->Var()) { + new_node = result->CreateVarNode(output->Var()); + } else { + new_node = + result->CreateEmptyNode(output->Name(), ir::Node::Type::kVariable); + } + CreateOpOutput(result, op_handle, new_node, p, place_id); + } +} + +void MultiDevSSAGraphBuilderBase::SetCommunicationContext( OpHandleBase *op_handle, const platform::Place &p) const { -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) if (nccl_ctxs_ == nullptr) { op_handle->SetDeviceContext(p, platform::DeviceContextPool::Instance().Get(p)); @@ -478,10 +353,10 @@ void MultiDevSSAGraphBuilder::SetCommunicationContext( #endif } -void MultiDevSSAGraphBuilder::CreateBroadcastOp(ir::Graph *result, - const std::string &p_name, - size_t src_dev_id) const { -#ifdef PADDLE_WITH_CUDA +void MultiDevSSAGraphBuilderBase::CreateBroadcastOp(ir::Graph *result, + const std::string &p_name, + size_t src_dev_id) const { +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) auto *op_handle = new BroadcastOpHandle( result->CreateEmptyNode("broadcast", ir::Node::Type::kOperation), local_scopes_, places_, nccl_ctxs_); @@ -493,7 +368,7 @@ void MultiDevSSAGraphBuilder::CreateBroadcastOp(ir::Graph *result, result->Get(kGraphOps).emplace_back(op_handle); auto *in = - result->Get(kGraphVars).at(src_dev_id).at(p_name).back().get(); + result->Get(kGraphVars).at(src_dev_id).at(p_name).back(); op_handle->AddInput(in); for (size_t i = 0; i < places_.size(); ++i) { @@ -508,107 +383,109 @@ void MultiDevSSAGraphBuilder::CreateBroadcastOp(ir::Graph *result, } } -void MultiDevSSAGraphBuilder::CreateComputationalOp(ir::Graph *result, - ir::Node *node, - int dev_id) const { +void MultiDevSSAGraphBuilderBase::CreateFusedBroadcastOp( + ir::Graph *result, + const std::vector> &bcast_varnames) const { +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + auto *op_handle = new FusedBroadcastOpHandle( + result->CreateEmptyNode("fused_broadcast", ir::Node::Type::kOperation), + local_scopes_, places_, nccl_ctxs_); +#else + auto *op_handle = new FusedBroadcastOpHandle( + result->CreateEmptyNode("fused_broadcast", ir::Node::Type::kOperation), + local_scopes_, places_); +#endif + result->Get(kGraphOps).emplace_back(op_handle); + + for (size_t i = 0; i < places_.size(); ++i) { + auto &p = places_[i]; + SetCommunicationContext(op_handle, p); + } + + for (size_t dev_id = 0; dev_id < bcast_varnames.size(); ++dev_id) { + for (auto &p_name : bcast_varnames[dev_id]) { + auto *in = + result->Get(kGraphVars).at(dev_id).at(p_name).back(); + op_handle->AddInput(in); + for (size_t out_dev_id = 0; out_dev_id < places_.size(); ++out_dev_id) { + auto &p = places_[out_dev_id]; + auto &vars = + result->Get(kGraphVars).at(out_dev_id).at(p_name); + auto *out_var = new VarHandle( + result->CreateEmptyNode(p_name, ir::Node::Type::kVariable), + vars.size(), out_dev_id, p_name, p); + vars.emplace_back(out_var); + op_handle->AddOutput(out_var); + } + } + } +} + +void MultiDevSSAGraphBuilderBase::CreateComputationalOp(ir::Graph *result, + ir::Node *node, + size_t dev_id) const { result->Get(kGraphOps).emplace_back( new ComputationOpHandle(result->CreateOpNode(node->Op()), - local_scopes_[dev_id], places_[dev_id])); + local_scopes_[dev_id], places_[dev_id], dev_id)); CreateOpHandleIOs(result, node, dev_id); } -void MultiDevSSAGraphBuilder::InsertAllReduceOp(ir::Graph *result, - const std::string &og) const { -#ifdef PADDLE_WITH_CUDA - result->Get(kGraphOps).emplace_back(new AllReduceOpHandle( - result->CreateEmptyNode("allreduce", ir::Node::Type::kOperation), - local_scopes_, places_, nccl_ctxs_)); +void MultiDevSSAGraphBuilderBase::CreateAllReduceOp(ir::Graph *result, + const std::string &og, + bool is_encoded) const { + OpHandleBase *op_handle = nullptr; + + auto append_allreduce_op = [&]( + const std::vector &scopes, + const std::vector &places) -> OpHandleBase * { +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + result->Get(kGraphOps).emplace_back(new AllReduceOpHandle( + result->CreateEmptyNode("allreduce", ir::Node::Type::kOperation), + scopes, places, nccl_ctxs_, is_encoded, + static_cast(strategy_.trainers_endpoints_.size()) * + places_.size())); #else - result->Get(kGraphOps).emplace_back(new AllReduceOpHandle( - result->CreateEmptyNode("allreduce", ir::Node::Type::kOperation), - local_scopes_, places_)); + result->Get(kGraphOps).emplace_back(new AllReduceOpHandle( + result->CreateEmptyNode("allreduce", ir::Node::Type::kOperation), + scopes, places)); #endif - auto *op_handle = result->Get(kGraphOps).back().get(); + return result->Get(kGraphOps).back(); + }; + + if (!strategy_.enable_parallel_graph_) + op_handle = append_allreduce_op(local_scopes_, places_); for (size_t i = 0; i < places_.size(); ++i) { - auto &p = places_[i]; - SetCommunicationContext(op_handle, p); + if (strategy_.enable_parallel_graph_) { + op_handle = append_allreduce_op({local_scopes_[i]}, {places_[i]}); + } + + SetCommunicationContext(op_handle, places_[i]); auto &vars = result->Get(kGraphVars)[i][og]; PADDLE_ENFORCE(!vars.empty()); auto &prev_grad = vars.back(); - op_handle->AddInput(prev_grad.get()); + op_handle->AddInput(prev_grad); + VLOG(10) << "all_reduce_op_handle add input " << prev_grad->DebugString(); auto var = new VarHandle(result->CreateEmptyNode(og, ir::Node::Type::kVariable), - vars.size(), i, og, p); + vars.size(), i, og, places_[i]); vars.emplace_back(var); op_handle->AddOutput(var); + VLOG(10) << "all_reduce_op_handle add output " << og + << ", handle:" << var->DebugString(); } } -void MultiDevSSAGraphBuilder::InsertDataBalanceOp( - ir::Graph *result, const std::vector &datas) const { -#ifdef PADDLE_WITH_CUDA - result->Get(kGraphOps).emplace_back(new DataBalanceOpHandle( - result->CreateEmptyNode("data_balance", ir::Node::Type::kOperation), - local_scopes_, places_, nccl_ctxs_)); -#else - result->Get(kGraphOps).emplace_back(new DataBalanceOpHandle( - result->CreateEmptyNode("data_balance", ir::Node::Type::kOperation), - local_scopes_, places_)); -#endif - auto *op_handle = result->Get(kGraphOps).back().get(); - for (size_t i = 0; i < places_.size(); ++i) { - auto &p = places_[i]; - SetCommunicationContext(op_handle, p); - for (const std::string &d_name : datas) { - auto &vars = result->Get(kGraphVars)[i][d_name]; - PADDLE_ENFORCE(!vars.empty()); - op_handle->AddInput(vars.back().get()); - auto var = new VarHandle( - result->CreateEmptyNode(d_name, ir::Node::Type::kVariable), - vars.size(), i, d_name, p); - vars.emplace_back(var); - op_handle->AddOutput(var); - } - } -} - -int MultiDevSSAGraphBuilder::GetOpDeviceID(const ir::Graph &graph, - ir::Node *node) const { - if (strategy_.reduce_ != BuildStrategy::ReduceStrategy::kReduce) { - return -1; - } - int op_role = boost::get( - node->Op()->GetAttr(framework::OpProtoAndCheckerMaker::OpRoleAttrName())); - if (op_role != static_cast(framework::OpRole::kOptimize)) { - return -1; - } - auto param_grad = boost::get>( - node->Op()->GetAttr(OpProtoAndCheckerMaker::OpRoleVarAttrName())); - - PADDLE_ENFORCE_EQ(param_grad.size(), 2U); - int dev_id = GetVarDeviceID(graph, param_grad[1]); - PADDLE_ENFORCE_NE(dev_id, -1, "dev_id should not be -1.[%s, %s, %s]", - node->Op()->Type(), param_grad[0], param_grad[1]); - return dev_id; -} - -int MultiDevSSAGraphBuilder::GetVarDeviceID(const ir::Graph &graph, - const std::string &varname) const { - auto &sharded_var_device = graph.Get(kShardedVarDevice); - auto got = sharded_var_device.find(varname); - return got == sharded_var_device.end() ? -1 : got->second; -} - -void MultiDevSSAGraphBuilder::CreateScaleLossGradOp( - ir::Graph *result, const std::string &loss_grad_name) const { +void MultiDevSSAGraphBuilderBase::CreateScaleLossGradOp( + ir::Graph *result, const std::string &loss_grad_name, + ir::Node *out_var_node, size_t loss_scale, + proto::VarType::Type dtype) const { for (size_t i = 0; i < places_.size(); ++i) { - // Insert ScaleCost OpHandle auto *dev_ctx = platform::DeviceContextPool::Instance().Get(places_[i]); auto *op_handle = new ScaleLossGradOpHandle( result->CreateEmptyNode("scale_loss_grad", ir::Node::Type::kOperation), - local_scopes_.size(), local_scopes_[i], places_[i], dev_ctx); + loss_scale, local_scopes_[i], places_[i], dev_ctx, dtype); result->Get(kGraphOps).emplace_back(op_handle); // FIXME: Currently ScaleLossGradOp only use device_count as scale @@ -617,29 +494,25 @@ void MultiDevSSAGraphBuilder::CreateScaleLossGradOp( // loss->pending_ops_.emplace_back(op_handle); // op_handle->inputs_.emplace_back(loss); - CreateOpOutput( - result, op_handle, - result->CreateEmptyNode(loss_grad_name, ir::Node::Type::kVariable), - places_[i], i); + CreateOpOutput(result, op_handle, + result->CreateVarNode(out_var_node->Var()), places_[i], i); } } -void MultiDevSSAGraphBuilder::CreateComputationalOps(ir::Graph *result, - ir::Node *node, - size_t num_places) const { +void MultiDevSSAGraphBuilderBase::CreateComputationalOps( + ir::Graph *result, ir::Node *node, size_t num_places) const { for (size_t scope_idx = 0; scope_idx < num_places; ++scope_idx) { auto p = places_[scope_idx]; auto s = local_scopes_[scope_idx]; - result->Get(kGraphOps).emplace_back( - new ComputationOpHandle(result->CreateOpNode(node->Op()), s, p)); + result->Get(kGraphOps).emplace_back(new ComputationOpHandle( + result->CreateOpNode(node->Op()), s, p, scope_idx)); CreateOpHandleIOs(result, node, scope_idx); } } -VarHandle *MultiDevSSAGraphBuilder::CreateReduceOp(ir::Graph *result, - const std::string &og, - int dst_dev_id) const { -#ifdef PADDLE_WITH_CUDA +VarHandle *MultiDevSSAGraphBuilderBase::CreateReduceOp( + ir::Graph *result, const std::string &og, size_t dst_dev_id) const { +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) result->Get(kGraphOps).emplace_back(new ReduceOpHandle( result->CreateEmptyNode("reduce", ir::Node::Type::kOperation), local_scopes_, places_, nccl_ctxs_)); @@ -648,7 +521,7 @@ VarHandle *MultiDevSSAGraphBuilder::CreateReduceOp(ir::Graph *result, result->CreateEmptyNode("reduce", ir::Node::Type::kOperation), local_scopes_, places_)); #endif - auto *op_handle = result->Get(kGraphOps).back().get(); + auto *op_handle = result->Get(kGraphOps).back(); for (size_t i = 0; i < places_.size(); ++i) { auto &p = places_[i]; @@ -656,7 +529,7 @@ VarHandle *MultiDevSSAGraphBuilder::CreateReduceOp(ir::Graph *result, auto &vars = result->Get(kGraphVars)[i][og]; PADDLE_ENFORCE(!vars.empty()); auto &prev_grad = vars.back(); - op_handle->AddInput(prev_grad.get()); + op_handle->AddInput(prev_grad); } auto &vars = result->Get(kGraphVars)[dst_dev_id][og]; auto var = @@ -667,63 +540,284 @@ VarHandle *MultiDevSSAGraphBuilder::CreateReduceOp(ir::Graph *result, return var; } -int MultiDevSSAGraphBuilder::CreateDistTrainOp(ir::Graph *result, - ir::Node *node) const { - int op_dev_id = -1; - std::vector input_var_names; - std::vector output_var_names; - for (ir::Node *input : node->inputs) { - input_var_names.push_back(input->Name()); +bool MultiDevSSAGraphBuilderBase::IsScaleLossOp(ir::Node *node) const { + return !loss_var_name_.empty() && node->Op() && + boost::get( + node->Op()->GetAttr(OpProtoAndCheckerMaker::OpRoleAttrName())) == + (static_cast(OpRole::kBackward) | + static_cast(OpRole::kLoss)); +} + +bool MultiDevSSAGraphBuilderBase::IsSparseGradient( + const std::string &og) const { + PADDLE_ENFORCE(all_vars_.count(og) != 0); + return all_vars_.at(og)->GetType() == proto::VarType::SELECTED_ROWS; +} + +void AllReduceSSAGraphBuilder::InsertCollectiveOp( + ir::Graph *result, const std::string &p_name, + const std::string &g_name) const { + if (IsSparseGradient(g_name)) { + CreateReduceOp(result, g_name, 0); + CreateBroadcastOp(result, g_name, 0); + } else { + CreateAllReduceOp(result, g_name); } - for (ir::Node *output : node->outputs) { - output_var_names.push_back(output->Name()); +} + +int BalanceVarSSAGraphBuilder::GetVarDeviceID( + const std::string &varname) const { + auto got = sharded_var_device_.find(varname); + if (got == sharded_var_device_.end()) { + auto pos = varname.find(framework::kNewGradSuffix); + if (pos != std::string::npos) { + got = sharded_var_device_.find(varname.substr(0, pos)); + } } + return got == sharded_var_device_.end() ? -1 : got->second; +} - if (node->Op()->Type() == "split_byref" || - node->Op()->Type() == "split_selected_rows") { - // TODO(paddle-dev): getting the first var is not safe. - op_dev_id = GetVarDeviceID(*result, input_var_names[0]); - if (strategy_.reduce_ == BuildStrategy::ReduceStrategy::kAllReduce) { - op_dev_id = GetAppropriateDeviceID(input_var_names); - for (auto &varname : input_var_names) { - result->Get(kShardedVarDevice) - .emplace(varname, op_dev_id); +int BalanceVarSSAGraphBuilder::GetOpDeviceID(ir::Node *node) const { + if (strategy_.reduce_ != BuildStrategy::ReduceStrategy::kReduce) { + return -1; + } + if (!OpHaveRole(*node, framework::OpRole::kOptimize)) { + return -1; + } + auto param_grad = boost::get>( + node->Op()->GetAttr(OpProtoAndCheckerMaker::OpRoleVarAttrName())); + + PADDLE_ENFORCE_EQ(param_grad.size(), 2U); + int dev_id = GetVarDeviceID(param_grad[1]); + PADDLE_ENFORCE_NE(dev_id, -1, "dev_id should not be -1.[%s, %s, %s]", + node->Op()->Type(), param_grad[0], param_grad[1]); + return dev_id; +} + +size_t BalanceVarSSAGraphBuilder::GetAppropriateDeviceID( + const std::vector &var_names) const { + int64_t numel_sum = 0; + for (auto var_name : var_names) { + if (all_vars_.find(var_name) == all_vars_.end()) continue; + auto var_desc = all_vars_.at(var_name); + PADDLE_ENFORCE_NOT_NULL(var_desc); + auto dim = framework::make_ddim(var_desc->GetShape()); + int64_t numel = framework::product(dim); + PADDLE_ENFORCE_GT(numel, 0); + numel_sum += numel; + } + + auto smallest = + std::min_element(std::begin(balance_vars_), std::end(balance_vars_)); + size_t dev_id = + static_cast(std::distance(std::begin(balance_vars_), smallest)); + balance_vars_[dev_id] += numel_sum; + return dev_id; +} + +void BalanceVarSSAGraphBuilder::ResetState() const { + balance_vars_.clear(); + sharded_var_device_.clear(); + + balance_vars_.resize(places_.size(), 0); +} + +void ReduceSSAGraphBuilder::Init() const { + MultiDevSSAGraphBuilderBase::Init(); + ResetState(); +} + +void ReduceSSAGraphBuilder::ResetState() const { + BalanceVarSSAGraphBuilder::ResetState(); + bcast_var_name_set_.clear(); + bcast_var_name_set_.resize(places_.size()); +} + +void ReduceSSAGraphBuilder::InsertCollectiveOp( + ir::Graph *result, const std::string &p_name, + const std::string &g_name) const { + size_t cur_device_id = GetAppropriateDeviceID({g_name}); + CreateReduceOp(result, g_name, cur_device_id); + sharded_var_device_.emplace(g_name, cur_device_id); + bcast_var_name_set_[cur_device_id].emplace(p_name); +} + +bool ReduceSSAGraphBuilder::DealWithSpecialOp(ir::Graph *result, + ir::Node *node) const { + int op_dev_id = BalanceVarSSAGraphBuilder::GetOpDeviceID(node); + if (op_dev_id != -1) { + // This op only runs on one specific device. + CreateComputationalOp(result, node, op_dev_id); + for (ir::Node *n : node->outputs) { + sharded_var_device_.emplace(n->Name(), op_dev_id); + } + return true; + } + return false; +} + +void ReduceSSAGraphBuilder::InsertPostprocessOps(ir::Graph *result) const { + if (UseGPU()) { + if (strategy_.fuse_broadcast_ops_) { + CreateFusedBroadcastOp(result, bcast_var_name_set_); + } else { + for (size_t dev_id = 0; dev_id < bcast_var_name_set_.size(); ++dev_id) { + auto &to_bcast_set = bcast_var_name_set_[dev_id]; + for (auto &bcast_name : to_bcast_set) { + CreateBroadcastOp(result, bcast_name, dev_id); + } } } - for (auto &varname : output_var_names) { - result->Get(kShardedVarDevice) - .emplace(varname, op_dev_id); + } +} + +int ReduceSSAGraphBuilder::GetOpDeviceID( + ir::Node *node, + std::unordered_map> *delay_ops) const { + if (!OpHaveRole(*node, framework::OpRole::kOptimize)) { + return -1; + } + + auto param_grad = boost::get>( + node->Op()->GetAttr(OpProtoAndCheckerMaker::OpRoleVarAttrName())); + + PADDLE_ENFORCE_EQ(param_grad.size(), 2U); + int dev_id = GetVarDeviceID(param_grad[1]); + + if (dev_id == -1) { + (*delay_ops)[param_grad[1]].push_back(node); + return -2; + } + return dev_id; +} + +std::vector ReduceSSAGraphBuilder::SortOperations( + const ir::Graph &graph) const { + std::vector sorted_ops = ir::TopologySortOperations(graph); + return SortForReduceMode(sorted_ops); +} + +std::vector ReduceSSAGraphBuilder::SortForReduceMode( + const std::vector &topo_ops) const { + std::vector sorted_ops; + std::unordered_map> delayed_op; + sorted_ops.reserve(topo_ops.size()); + ResetState(); + + auto insert_delayed_op = [&](const std::string &var_name, int dev_id) { + sharded_var_device_.emplace(var_name, dev_id); + if (delayed_op.count(var_name)) { + auto &ops = delayed_op.at(var_name); + sorted_ops.insert(sorted_ops.end(), ops.begin(), ops.end()); + delayed_op.at(var_name).clear(); } - } else if (node->Op()->Type() == "concat") { - op_dev_id = GetVarDeviceID(*result, input_var_names[0]); - for (auto &varname : output_var_names) { - result->Get(kShardedVarDevice) - .emplace(varname, op_dev_id); + }; + + for (ir::Node *node : topo_ops) { + int op_dev_id = GetOpDeviceID(node, &delayed_op); + if (op_dev_id > -1) { + // This op only runs on one specific device. + sorted_ops.emplace_back(node); + for (ir::Node *n : node->outputs) { + insert_delayed_op(n->Name(), op_dev_id); + } + } else if (op_dev_id == -1) { + // This op runs on all devices, and its output may have parameter's + // gradients. + sorted_ops.emplace_back(node); + bool is_bk_op = + static_cast(boost::get(node->Op()->GetAttr( + OpProtoAndCheckerMaker::OpRoleAttrName())) & + static_cast(OpRole::kBackward)); + if (!is_bk_op) continue; + // Currently, we assume that once gradient is generated, it can be + // broadcast, and each gradient is only broadcast once. + std::vector backward_vars; + try { + backward_vars = + boost::get>(node->Op()->GetNullableAttr( + OpProtoAndCheckerMaker::OpRoleVarAttrName())); + } catch (boost::bad_get e) { + } + PADDLE_ENFORCE_EQ(backward_vars.size() % 2, 0); + + for (size_t i = 0; i < backward_vars.size(); i += 2) { + auto &g_name = backward_vars[i + 1]; + size_t cur_device_id = GetAppropriateDeviceID({g_name}); + insert_delayed_op(g_name, static_cast(cur_device_id)); + } + } else if (op_dev_id == -2) { + // The Op on which the Op depends has not yet been generated. } - } else { - LOG(ERROR) << "got unexpected dist op: " << node->Op()->Type(); - PADDLE_THROW( - "the distribute training related op should be in [split_byref, " - "concat]."); } - PADDLE_ENFORCE(op_dev_id != -1, - "can not find right place for distributed op: %s", - node->Op()->Type()); + PADDLE_ENFORCE_EQ(sorted_ops.size(), topo_ops.size()); - CreateComputationalOp(result, node, op_dev_id); - return op_dev_id; + ResetState(); + return sorted_ops; +} + +void DistSSAGraphBuilder::Init() const { + MultiDevSSAGraphBuilderBase::Init(); + ResetState(); +} + +void DistSSAGraphBuilder::ResetState() const { + BalanceVarSSAGraphBuilder::ResetState(); + bcast_var_name_set_.clear(); + bcast_var_name_set_.resize(places_.size()); +} + +bool DistSSAGraphBuilder::DealWithSpecialOp(ir::Graph *result, + ir::Node *node) const { + bool insert_op = false; + if (OpHaveRole(*node, OpRole::kRPC)) { + int op_dev_id = CreateRPCOp(result, node); + PADDLE_ENFORCE(op_dev_id != -1, + "Can not schedule the RPC operator to the right place."); + if (node->Op()->Type() == "recv") { + auto recv_vars_attr = + boost::get>(node->Op()->GetNullableAttr( + OpProtoAndCheckerMaker::OpRoleVarAttrName())); + PADDLE_ENFORCE(recv_vars_attr.size() == 2UL); // [parameter, gradient] + if (recv_vars_attr[0].find(".block") == std::string::npos) { + bcast_var_name_set_[op_dev_id].emplace(recv_vars_attr[0]); + } + } + insert_op = true; + need_broadcast_var_ = true; + } else if (OpHaveRole(*node, OpRole::kDist)) { + int op_dev_id = CreateDistTrainOp(result, node); + if (node->Op()->Type() == "concat") { + // the input(block of parameter) of concat is on different device, + // the output(parameter) will on one device. + auto origin_param_name = node->Op()->OutputArgumentNames()[0]; + bcast_var_name_set_[op_dev_id].emplace(origin_param_name); + } + insert_op = true; + } else { + int op_dev_id = GetOpDeviceID(node); + if (op_dev_id != -1) { // This op only runs on one specific device. + // optimize op will be processed here. + CreateComputationalOp(result, node, op_dev_id); + for (ir::Node *n : node->outputs) { + sharded_var_device_.emplace(n->Name(), op_dev_id); + } + insert_op = true; + } + } + return insert_op; } void SetOpInputsAllPlaces(ir::Graph *result, ir::Node *node, int num_places) { - auto *op_handle = result->Get(kGraphOps).back().get(); + auto *op_handle = result->Get(kGraphOps).back(); for (ir::Node *input : node->inputs) { VarHandle *var = nullptr; for (int place_offset = 0; place_offset < num_places; ++place_offset) { auto &var_holders = result->Get(kGraphVars)[place_offset]; auto &var_holder = var_holders[input->Name()]; if (!var_holder.empty()) { - var = var_holder.rbegin()->get(); + var = *var_holder.rbegin(); op_handle->AddInput(var); } } @@ -731,12 +825,11 @@ void SetOpInputsAllPlaces(ir::Graph *result, ir::Node *node, int num_places) { } // Create RPC related op handles that connects its in ops and out ops. -int MultiDevSSAGraphBuilder::CreateRPCOp(ir::Graph *result, - ir::Node *node) const { +int DistSSAGraphBuilder::CreateRPCOp(ir::Graph *result, ir::Node *node) const { int op_dev_id = -1; if (node->Op()->Type() == "send") { // TODO(paddle-dev): getting the first var is not safe. - op_dev_id = GetVarDeviceID(*result, node->inputs[0]->Name()); + op_dev_id = GetVarDeviceID(node->inputs[0]->Name()); PADDLE_ENFORCE(!ir::IsControlDepVar(*node->inputs[0]), "This hack no longer holds, please fix."); // the variable name which contains .block means it was splited by @@ -754,11 +847,9 @@ int MultiDevSSAGraphBuilder::CreateRPCOp(ir::Graph *result, VLOG(10) << "send grad " << input_var_names[0] << " origin " << send_param_grad[1] << " place: " << op_dev_id; for (auto &varname : input_var_names) { - result->Get(kShardedVarDevice) - .emplace(varname, op_dev_id); + sharded_var_device_.emplace(varname, op_dev_id); } - result->Get(kShardedVarDevice) - .emplace(send_param_grad[1], op_dev_id); + sharded_var_device_.emplace(send_param_grad[1], op_dev_id); } } else if (node->Op()->Type() == "recv") { std::vector output_var_names; @@ -768,7 +859,7 @@ int MultiDevSSAGraphBuilder::CreateRPCOp(ir::Graph *result, auto recv_param_grad = boost::get>( node->Op()->GetAttr(OpProtoAndCheckerMaker::OpRoleVarAttrName())); if (recv_param_grad.size() == 2U) { - op_dev_id = GetVarDeviceID(*result, recv_param_grad[1]); + op_dev_id = GetVarDeviceID(recv_param_grad[1]); VLOG(10) << "recv param " << recv_param_grad[0] << " get grad place: " << recv_param_grad[1] << " place: " << op_dev_id; @@ -776,8 +867,7 @@ int MultiDevSSAGraphBuilder::CreateRPCOp(ir::Graph *result, op_dev_id = GetAppropriateDeviceID(output_var_names); } for (auto &varname : output_var_names) { - result->Get(kShardedVarDevice) - .emplace(varname, op_dev_id); + sharded_var_device_.emplace(varname, op_dev_id); } } else { // send_barrier, fetch_barrier will run on place 0; @@ -786,9 +876,17 @@ int MultiDevSSAGraphBuilder::CreateRPCOp(ir::Graph *result, PADDLE_ENFORCE(op_dev_id != -1, "can not find the right place for rpc op: %s", node->Op()->Type()); - result->Get(kGraphOps).emplace_back(new RPCOpHandle( - result->CreateOpNode(node->Op()), *node->Op(), local_scopes_[op_dev_id], - node->Op()->Type(), places_[op_dev_id])); + + // Create fetch_barrier op handle to enable output on all devices. + // **NOTE** fetch_barrier should output variables list same as recv op does. + if (node->Op()->Type() == "fetch_barrier") { + result->Get(kGraphOps).emplace_back(new FetchBarrierOpHandle( + result->CreateOpNode(node->Op()), local_scopes_, places_)); + } else { + result->Get(kGraphOps).emplace_back(new RPCOpHandle( + result->CreateOpNode(node->Op()), *node->Op(), local_scopes_[op_dev_id], + node->Op()->Type(), places_[op_dev_id])); + } if (node->Op()->Type() == "send") { CreateOpHandleIOs(result, node, op_dev_id); @@ -796,7 +894,7 @@ int MultiDevSSAGraphBuilder::CreateRPCOp(ir::Graph *result, // send_barrier, recv, fetch_barrier's inputs are deps var, get them from // all places auto p = places_[op_dev_id]; - auto *op_handle = result->Get(kGraphOps).back().get(); + auto *op_handle = result->Get(kGraphOps).back(); op_handle->SetDeviceContext(p, platform::DeviceContextPool::Instance().Get(p)); @@ -804,8 +902,8 @@ int MultiDevSSAGraphBuilder::CreateRPCOp(ir::Graph *result, for (ir::Node *output : node->outputs) { int outvar_dev_id = op_dev_id; if (node->Op()->Type() == "fetch_barrier") { - outvar_dev_id = GetVarDeviceID(*result, output->Name()); - PADDLE_ENFORCE_NE(outvar_dev_id, -1); + outvar_dev_id = GetVarDeviceID(output->Name()); + PADDLE_ENFORCE_NE(outvar_dev_id, -1, "output name %s", output->Name()); } p = places_[outvar_dev_id]; ir::Node *new_node = nullptr; @@ -821,21 +919,154 @@ int MultiDevSSAGraphBuilder::CreateRPCOp(ir::Graph *result, return op_dev_id; } -bool MultiDevSSAGraphBuilder::IsScaleLossOp(ir::Node *node) const { - return boost::get( - node->Op()->GetAttr(OpProtoAndCheckerMaker::OpRoleAttrName())) == - (static_cast(OpRole::kBackward) | - static_cast(OpRole::kLoss)) && - !loss_var_name_.empty(); // If loss_var is empty. This is test mode +int DistSSAGraphBuilder::CreateDistTrainOp(ir::Graph *result, + ir::Node *node) const { + int op_dev_id = -1; + std::vector input_var_names; + std::vector output_var_names; + for (ir::Node *input : node->inputs) { + input_var_names.push_back(input->Name()); + } + for (ir::Node *output : node->outputs) { + output_var_names.push_back(output->Name()); + } + + if (node->Op()->Type() == "split_byref" || + node->Op()->Type() == "split_selected_rows" || + node->Op()->Type() == "split_ids") { + // TODO(paddle-dev): getting the first var is not safe. + op_dev_id = GetVarDeviceID(input_var_names[0]); + if (strategy_.reduce_ == BuildStrategy::ReduceStrategy::kAllReduce) { + op_dev_id = GetAppropriateDeviceID(input_var_names); + for (auto &varname : input_var_names) { + sharded_var_device_.emplace(varname, op_dev_id); + } + } + for (auto &varname : output_var_names) { + sharded_var_device_.emplace(varname, op_dev_id); + } + } else if (node->Op()->Type() == "concat") { + op_dev_id = GetVarDeviceID(input_var_names[0]); + for (auto &varname : output_var_names) { + sharded_var_device_.emplace(varname, op_dev_id); + } + } else { + LOG(ERROR) << "got unexpected dist op: " << node->Op()->Type(); + PADDLE_THROW( + "the distribute training related op should be in [split_byref, " + "concat]."); + } + + PADDLE_ENFORCE(op_dev_id != -1, + "can not find right place for distributed op: %s", + node->Op()->Type()); + + CreateComputationalOp(result, node, op_dev_id); + return op_dev_id; } + +bool DistSSAGraphBuilder::IsEncoded(const std::string &p_name) const { + auto u_name = p_name + "__dgc_u__"; + auto it = all_vars_.find(u_name); + if (it == all_vars_.end()) { + VLOG(10) << "can't find u_name, so it's not encoded:" << u_name; + return false; + } + + return true; +} + +void DistSSAGraphBuilder::InsertCollectiveOp(ir::Graph *result, + const std::string &p_name, + const std::string &g_name) const { + // collective gradient to each device + size_t cur_device_id = 0; + switch (strategy_.reduce_) { + case BuildStrategy::ReduceStrategy::kReduce: + cur_device_id = GetAppropriateDeviceID({g_name}); + CreateReduceOp(result, g_name, cur_device_id); + sharded_var_device_.emplace(g_name, cur_device_id); + break; + case BuildStrategy::ReduceStrategy::kAllReduce: + if (IsSparseGradient(g_name)) { + CreateReduceOp(result, g_name, 0); + CreateBroadcastOp(result, g_name, 0); + } else { +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + CreateAllReduceOp(result, g_name, IsEncoded(p_name)); +#else + PADDLE_ENFORCE(false, "Compiled withoud cuda!"); +#endif + } + break; + default: + LOG(FATAL) << "Unknown reduce strategy."; + break; + } +} + +void DistSSAGraphBuilder::InsertPostprocessOps(ir::Graph *result) const { + // broad cast received parameters when training in parameter server mode. + if (need_broadcast_var_) { + // There are 4 conditions: + // 1. GPU && Reduce: Reduce gradient then broadcast gradient to other GPUS. + // Need to broadcast received parameters to other GPU. + // 2. GPU && AllReduce: AllReduce all graident to each GPU. Need to + // broadcast received parameters to other GPU. + // 3. CPU && AllReduce: AllReduce all gradient to each thread. Need to + // broadcast received parameters to other scope. + // 4. CPU && Reduce: because all parameters share the same memory, did not + // broadcast received parameters. + if (!UseGPU() && + strategy_.reduce_ == BuildStrategy::ReduceStrategy::kReduce) { + return; + } + if (strategy_.fuse_broadcast_ops_) { + CreateFusedBroadcastOp(result, bcast_var_name_set_); + } else { + for (size_t dev_id = 0; dev_id < bcast_var_name_set_.size(); ++dev_id) { + auto &to_bcast_set = bcast_var_name_set_[dev_id]; + for (auto &bcast_name : to_bcast_set) { + CreateBroadcastOp(result, bcast_name, dev_id); + } + } + } + } +} + +std::unordered_set &MultiDevSSAGraphBuilder() { + static std::unordered_set regs; + return regs; +} + +static int MultiDevSSAGraphBuilderRegister(const std::string &builder_mode) { + MultiDevSSAGraphBuilder().insert(builder_mode); + return 0; +} + } // namespace details } // namespace framework } // namespace paddle -REGISTER_PASS(multi_devices_pass, - paddle::framework::details::MultiDevSSAGraphBuilder) - .RequirePassAttr(paddle::framework::details::kLossVarName) - .RequirePassAttr(paddle::framework::details::kPlaces) - .RequirePassAttr(paddle::framework::details::kParams) - .RequirePassAttr(paddle::framework::details::kLocalScopes) - .RequirePassAttr(paddle::framework::details::kStrategy); +#define REGISTER_MULTI_DEVICES_PASS(pass_name, pass_class) \ + STATIC_ASSERT_GLOBAL_NAMESPACE( \ + _reg_ssa_graph_builder_##pass_name, \ + "REGISTER_MULTI_DEVICES_PASS must be called in global namespace."); \ + int _reg_ssa_graph_builder_entry_##pass_name = \ + paddle::framework::details::MultiDevSSAGraphBuilderRegister(#pass_name); \ + REGISTER_PASS(pass_name, pass_class) \ + .RequirePassAttr(paddle::framework::details::kLossVarName) \ + .RequirePassAttr(paddle::framework::details::kPlaces) \ + .RequirePassAttr(paddle::framework::details::kLocalScopes) \ + .RequirePassAttr(paddle::framework::details::kStrategy) \ + .RequirePassAttr(paddle::framework::details::kNRanks) + +REGISTER_MULTI_DEVICES_PASS(reduce_mode_multi_devices_pass, + paddle::framework::details::ReduceSSAGraphBuilder); +REGISTER_MULTI_DEVICES_PASS( + all_reduce_mode_multi_devices_pass, + paddle::framework::details::AllReduceSSAGraphBuilder); +REGISTER_MULTI_DEVICES_PASS(dist_multi_devices_pass, + paddle::framework::details::DistSSAGraphBuilder); +REGISTER_MULTI_DEVICES_PASS(async_multi_devices_pass, + paddle::framework::details::AsyncSSAGraphBuilder); diff --git a/paddle/fluid/framework/details/multi_devices_graph_pass.h b/paddle/fluid/framework/details/multi_devices_graph_pass.h index cdf9f13cde608b546d17a1e53e0f6acea9e12566..7cc68dd2d5a422cfa1ac3a4bfdd48545a6e5691d 100644 --- a/paddle/fluid/framework/details/multi_devices_graph_pass.h +++ b/paddle/fluid/framework/details/multi_devices_graph_pass.h @@ -13,10 +13,13 @@ // limitations under the License. #pragma once + +#include #include +#include +#include #include #include - #include "paddle/fluid/framework/details/build_strategy.h" #include "paddle/fluid/framework/details/multi_devices_helper.h" #include "paddle/fluid/framework/ir/graph.h" @@ -30,71 +33,182 @@ namespace framework { class Scope; namespace details { -class MultiDevSSAGraphBuilder : public ir::Pass { +constexpr char kLossVarName[] = "loss_var_name"; +constexpr char kStrategy[] = "strategy"; +constexpr char kNRanks[] = "nranks"; + +class MultiDevSSAGraphBuilderBase : public ir::Pass { protected: - std::unique_ptr ApplyImpl( - std::unique_ptr graph) const override; + void ApplyImpl(ir::Graph *graph) const override; - private: - void CreateOpHandleIOs(ir::Graph *result, ir::Node *node, - size_t device_id) const; - void Init() const; + virtual void Init() const; -#ifdef PADDLE_WITH_CUDA - mutable platform::NCCLContextMap *nccl_ctxs_; -#endif + virtual void CheckGraph(const ir::Graph &graph) const; - int GetVarDeviceID(const ir::Graph &graph, const std::string &varname) const; + virtual std::vector SortOperations(const ir::Graph &graph) const; - bool IsScaleLossOp(ir::Node *node) const; + virtual void InsertCollectiveOp(ir::Graph *result, const std::string &p_name, + const std::string &g_name) const = 0; - int CreateRPCOp(ir::Graph *result, ir::Node *node) const; - int CreateDistTrainOp(ir::Graph *result, ir::Node *node) const; + virtual bool DealWithSpecialOp(ir::Graph *result, ir::Node *node) const; + + virtual void InsertPostprocessOps(ir::Graph *result) const = 0; - std::vector FindDistTrainSendVars( - const std::vector &nodes) const; + bool UseGPU() const; - std::vector FindDistTrainRecvVars( - const std::vector &nodes) const; + virtual bool NeedCollectiveForGrad(const std::string &grad_name, + std::vector ops) const; + + bool IsScaleLossOp(ir::Node *node) const; void CreateComputationalOps(ir::Graph *result, ir::Node *node, size_t num_places) const; void CreateScaleLossGradOp(ir::Graph *result, - const std::string &loss_grad_name) const; + const std::string &loss_grad_name, + ir::Node *out_var_node, size_t loss_scale, + proto::VarType::Type dtype) const; VarHandle *CreateReduceOp(ir::Graph *result, const std::string &og, - int dst_dev_id) const; - void CreateComputationalOp(ir::Graph *result, ir::Node *node, - int dev_id) const; + size_t dst_dev_id) const; - int GetOpDeviceID(const ir::Graph &graph, ir::Node *node) const; + void CreateComputationalOp(ir::Graph *result, ir::Node *node, + size_t dev_id) const; - void InsertAllReduceOp(ir::Graph *result, const std::string &og) const; + bool IsSparseGradient(const std::string &og) const; - void InsertDataBalanceOp(ir::Graph *result, - const std::vector &datas) const; + void CreateAllReduceOp(ir::Graph *result, const std::string &og, + bool is_encoded = false) const; void CreateBroadcastOp(ir::Graph *result, const std::string &p_name, size_t src_dev_id) const; - bool IsSparseGradient(const std::string &og) const; + void InsertScaleLossGradOp(ir::Graph *result, const ir::Node *node) const; - size_t GetAppropriateDeviceID( - const std::vector &var_names) const; + void CreateFusedBroadcastOp( + ir::Graph *result, + const std::vector> &bcast_varnames) const; void SetCommunicationContext(OpHandleBase *op_handle, const platform::Place &p) const; + void CreateOpHandleIOs(ir::Graph *result, ir::Node *node, + size_t device_id) const; + +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + mutable platform::NCCLContextMap *nccl_ctxs_; +#endif + mutable std::string loss_var_name_; mutable std::vector places_; mutable std::vector local_scopes_; - mutable std::unordered_set grad_names_; mutable BuildStrategy strategy_; mutable std::unordered_map all_vars_; +}; + +class AllReduceSSAGraphBuilder : public MultiDevSSAGraphBuilderBase { + protected: + virtual void InsertCollectiveOp(ir::Graph *result, const std::string &p_name, + const std::string &g_name) const; + + virtual void InsertPostprocessOps(ir::Graph *result) const {} +}; + +class AsyncSSAGraphBuilder : public MultiDevSSAGraphBuilderBase { + protected: + void InsertCollectiveOp(ir::Graph *result, const std::string &p_name, + const std::string &g_name) const override {} + + bool NeedCollectiveForGrad(const std::string &grad_name, + std::vector ops) const { + return false; + } + + bool DealWithSpecialOp(ir::Graph *result, ir::Node *node) const override { + if (node->Op()->Type() == "recv") { + VLOG(1) << "set recv op do_not_run to true"; + node->Op()->SetAttr("do_not_run", true); + node->Op()->Flush(); + } else if (node->Name() == "lookup_table" || node->Name() == "nce" || + node->Name() == "hierarchical_sigmoid") { + // in async_mode, we do not need remote prefetch, because communicator + // will do async parameter recv. + VLOG(1) << "set " << node->Name() << " op remote_prefetch to false"; + node->Op()->SetAttr("remote_prefetch", false); + node->Op()->Flush(); + } + return false; + } + + void InsertPostprocessOps(ir::Graph *result) const override {} +}; + +class BalanceVarSSAGraphBuilder : public MultiDevSSAGraphBuilderBase { + protected: + int GetVarDeviceID(const std::string &varname) const; + + int GetOpDeviceID(ir::Node *node) const; + + size_t GetAppropriateDeviceID( + const std::vector &var_names) const; + + virtual void ResetState() const; + + mutable std::unordered_map sharded_var_device_; mutable std::vector balance_vars_; }; + +class ReduceSSAGraphBuilder : public BalanceVarSSAGraphBuilder { + protected: + virtual void Init() const; + + virtual void InsertCollectiveOp(ir::Graph *result, const std::string &p_name, + const std::string &g_name) const; + + virtual bool DealWithSpecialOp(ir::Graph *result, ir::Node *node) const; + + virtual void InsertPostprocessOps(ir::Graph *result) const; + + virtual std::vector SortOperations(const ir::Graph &graph) const; + + virtual void ResetState() const; + + int GetOpDeviceID(ir::Node *node, + std::unordered_map> + *delay_ops) const; + + std::vector SortForReduceMode( + const std::vector &topo_ops) const; + + mutable std::vector> bcast_var_name_set_; +}; + +class DistSSAGraphBuilder : public BalanceVarSSAGraphBuilder { + protected: + virtual void Init() const; + + virtual bool DealWithSpecialOp(ir::Graph *result, ir::Node *node) const; + + virtual void InsertPostprocessOps(ir::Graph *result) const; + + virtual void InsertCollectiveOp(ir::Graph *result, const std::string &p_name, + const std::string &g_name) const; + + virtual void ResetState() const; + + int CreateRPCOp(ir::Graph *result, ir::Node *node) const; + + int CreateDistTrainOp(ir::Graph *result, ir::Node *node) const; + + mutable std::vector> bcast_var_name_set_; + mutable bool need_broadcast_var_{false}; + + bool IsEncoded(const std::string &p_name) const; +}; + +std::unordered_set &MultiDevSSAGraphBuilder(); + } // namespace details } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/details/multi_devices_graph_print_pass.cc b/paddle/fluid/framework/details/multi_devices_graph_print_pass.cc index 361c91dc78c08a2cbf84ee88211d389c1e2312e5..34c38ea81a9e4832f7e1b63e1e6db4ea27704c34 100644 --- a/paddle/fluid/framework/details/multi_devices_graph_print_pass.cc +++ b/paddle/fluid/framework/details/multi_devices_graph_print_pass.cc @@ -13,8 +13,11 @@ // limitations under the License. #include "paddle/fluid/framework/details/multi_devices_graph_print_pass.h" +#include #include +#include #include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_helper.h" namespace paddle { namespace framework { @@ -51,18 +54,18 @@ void GraphvizSSAGraphPrinter::Print(const ir::Graph &graph, vars[var_ptr] = cur_var_id; if (var_handle_ptr) { - sout << "var_" << cur_var_id << " [label=\"" << var_handle_ptr->name_ + sout << "var_" << cur_var_id << " [label=\"" << var_handle_ptr->name() << "\\n" - << var_handle_ptr->place_ << "\\n" - << "scope: " << var_handle_ptr->scope_idx_ << "\\n" - << "v" << var_handle_ptr->version_ << "\"]" << std::endl; + << var_handle_ptr->place() << "\\n" + << "scope: " << var_handle_ptr->scope_idx() << "\\n" + << "v" << var_handle_ptr->version() << "\"]" << std::endl; } else if (dummy_ptr) { sout << "var_" << cur_var_id << " [label=\"dummy\"]" << std::endl; } }); size_t op_id = 0; - for (auto &op : graph.Get(kGraphOps)) { + for (auto &op : ir::FilterByNodeWrapper(graph)) { std::string op_name = "op_" + std::to_string(op_id++); sout << op_name << " [label=\"" << op->Name() << "\", shape=rect]" << std::endl; @@ -84,4 +87,5 @@ void GraphvizSSAGraphPrinter::Print(const ir::Graph &graph, } // namespace paddle REGISTER_PASS(multi_devices_print_pass, - paddle::framework::details::SSAGraghBuilderWithPrinter); + paddle::framework::details::SSAGraghBuilderWithPrinter) + .RequirePassAttr(paddle::framework::details::kGraphvizPath); diff --git a/paddle/fluid/framework/details/multi_devices_graph_print_pass.h b/paddle/fluid/framework/details/multi_devices_graph_print_pass.h index c00685fa1629c0722c315c726053c2cba8bf17e7..6d57d75e8a5541ac39e6dbe231c3f47daaa4206a 100644 --- a/paddle/fluid/framework/details/multi_devices_graph_print_pass.h +++ b/paddle/fluid/framework/details/multi_devices_graph_print_pass.h @@ -14,8 +14,10 @@ #pragma once +#include #include #include +#include #include #include #include "paddle/fluid/framework/details/multi_devices_helper.h" @@ -24,6 +26,8 @@ namespace paddle { namespace framework { namespace details { +constexpr char kGraphvizPath[] = "debug_graphviz_path"; + class SSAGraphPrinter { public: virtual ~SSAGraphPrinter() {} @@ -37,13 +41,11 @@ class GraphvizSSAGraphPrinter : public SSAGraphPrinter { class SSAGraghBuilderWithPrinter : public ir::Pass { protected: - std::unique_ptr ApplyImpl( - std::unique_ptr graph) const override { + void ApplyImpl(ir::Graph* graph) const override { std::unique_ptr fout( - new std::ofstream(Get("debug_graphviz_path"))); + new std::ofstream(Get(kGraphvizPath))); PADDLE_ENFORCE(fout->good()); Get("graph_printer").Print(*graph, *fout); - return graph; } }; diff --git a/paddle/fluid/framework/details/multi_devices_helper.h b/paddle/fluid/framework/details/multi_devices_helper.h index 175c5a9950be69d7bf6ae9e386af762007a18a51..6e6ef074db3450ebbb5567743b908e0aee382c27 100644 --- a/paddle/fluid/framework/details/multi_devices_helper.h +++ b/paddle/fluid/framework/details/multi_devices_helper.h @@ -16,8 +16,10 @@ #include #include +#include +#include +#include #include - #include "paddle/fluid/framework/details/op_handle_base.h" #include "paddle/fluid/framework/details/var_handle.h" @@ -35,23 +37,36 @@ namespace details { // The outside vector is the device vector. Each element of this vector is a // map from variable name to variables. The variables, who have the same name, // will have a differsent version. The offset in the -// `std::vector>` is the version of varaibles. -typedef std::vector< - std::unordered_map>>> +// `std::vector` is the version of varaibles. +typedef std::vector>> GraphVars; -const char kGraphVars[] = "vars"; +constexpr char kGraphVars[] = "vars"; + +constexpr char kPlaces[] = "places"; +constexpr char kLocalScopes[] = "local_scopes"; +constexpr char kNCCLCtxs[] = "nccl_ctxs"; // aux variables to represent dependency. Useful to resolve data hazard. -typedef std::unordered_set> GraphDepVars; -const char kGraphDepVars[] = "dep_vars"; +typedef std::unordered_set GraphDepVars; +constexpr char kGraphDepVars[] = "dep_vars"; + +typedef std::unordered_set FusedVars; +constexpr char kFusedVars[] = "fused_vars"; +constexpr char kFusedVarNamePrefix[] = "@FUSEDVAR@"; + +typedef std::string FusedOptType; +constexpr char kFusedOptType[] = "fused_opt_type"; + +typedef std::string FusedGrads; +constexpr char kFusedGrads[] = "fused_gradients"; + +typedef std::vector> ParamsAndGrads; +constexpr char kParamsAndGrads[] = "params_grads"; -// all operators. NOTE that even we use a vector here, the operators is -// unordered. -typedef std::vector> GraphOps; -const char kGraphOps[] = "ops"; +typedef std::vector>> + GroupGradsAndParams; +constexpr char kGroupGradsAndParams[] = "group_grads_params"; -typedef std::unordered_map ShardedVarDevice; -const char kShardedVarDevice[] = "sharded_var_device"; } // namespace details } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/details/op_graph_view.cc b/paddle/fluid/framework/details/op_graph_view.cc new file mode 100644 index 0000000000000000000000000000000000000000..d3865c2c2919c2d43521e4f51013e5fa1b10416d --- /dev/null +++ b/paddle/fluid/framework/details/op_graph_view.cc @@ -0,0 +1,69 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/details/op_graph_view.h" +#include +#include + +namespace paddle { +namespace framework { +namespace details { + +OpGraphView::OpGraphView(const std::vector &ops) { Build(ops); } + +void OpGraphView::Build(const std::vector &ops) { + preceding_ops_.clear(); + pending_ops_.clear(); + for (auto &op : ops) { + preceding_ops_[op]; + pending_ops_[op]; + for (auto &var : op->Outputs()) { + for (auto &pending_op : var->PendingOps()) { + preceding_ops_[pending_op].insert(op); + pending_ops_[op].insert(pending_op); + } + } + } + PADDLE_ENFORCE( + preceding_ops_.size() == ops.size() && pending_ops_.size() == ops.size(), + "There are duplicate ops in graph."); +} + +std::unordered_set OpGraphView::AllOps() const { + std::unordered_set ret; + ret.reserve(preceding_ops_.size()); + for (auto &pair : preceding_ops_) { + ret.insert(pair.first); + } + return ret; +} + +bool OpGraphView::HasOp(OpHandleBase *op) const { + return preceding_ops_.count(op) != 0; +} + +void OpGraphView::EnforceHasOp(OpHandleBase *op) const { + PADDLE_ENFORCE(HasOp(op), "Cannot find op %s in OpGraphView", + op == nullptr ? "nullptr" : op->DebugString()); +} + +const std::unordered_set &OpGraphView::PendingOps( + OpHandleBase *op) const { + EnforceHasOp(op); + return pending_ops_.at(op); +} + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/op_graph_view.h b/paddle/fluid/framework/details/op_graph_view.h new file mode 100644 index 0000000000000000000000000000000000000000..77aa02eba56acb3bb20a5c5a55c75af78a3c1c81 --- /dev/null +++ b/paddle/fluid/framework/details/op_graph_view.h @@ -0,0 +1,76 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include "paddle/fluid/framework/details/op_handle_base.h" + +namespace paddle { +namespace framework { +namespace details { + +class OpGraphView { + public: + explicit OpGraphView(const std::vector &ops); + + std::unordered_set AllOps() const; + + const std::unordered_set &PendingOps(OpHandleBase *op) const; + + bool HasOp(OpHandleBase *op) const; + + // Use a visitor to visit all pending ops of op + // Stop when callback returns false + template + bool VisitAllPendingOps(OpHandleBase *op, Callback &&callback) const; + + private: + void Build(const std::vector &ops); + void EnforceHasOp(OpHandleBase *op) const; + + std::unordered_map> + preceding_ops_; + std::unordered_map> + pending_ops_; +}; + +template +bool OpGraphView::VisitAllPendingOps(OpHandleBase *op, + Callback &&callback) const { + EnforceHasOp(op); + std::unordered_set visited; + std::queue q; + q.push(op); + do { + op = q.front(); + q.pop(); + for (auto &pending_op : pending_ops_.at(op)) { + if (visited.count(pending_op) == 0) { + visited.insert(pending_op); + if (!callback(pending_op)) { + return false; + } + } + } + } while (!q.empty()); + return true; +} + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/op_handle_base.cc b/paddle/fluid/framework/details/op_handle_base.cc index 3812f0abf1b7069525c4420054c61c01c908acfe..69cd84ebf2d678c089141f09a92c46e3a03fe4d9 100644 --- a/paddle/fluid/framework/details/op_handle_base.cc +++ b/paddle/fluid/framework/details/op_handle_base.cc @@ -13,6 +13,7 @@ // limitations under the License. #include "paddle/fluid/framework/details/op_handle_base.h" #include +#include namespace paddle { namespace framework { @@ -41,15 +42,42 @@ OpHandleBase::~OpHandleBase() { void OpHandleBase::Run(bool use_cuda) { #ifdef PADDLE_WITH_CUDA - if (events_.empty() && use_cuda) { + if (events_.empty() && use_cuda && dev_ctxes_.size() > 0) { for (auto &p : dev_ctxes_) { int dev_id = boost::get(p.first).device; PADDLE_ENFORCE(cudaSetDevice(dev_id)); PADDLE_ENFORCE( cudaEventCreateWithFlags(&events_[dev_id], cudaEventDisableTiming)); } + if (IsMultiDeviceTransfer() && dev_ctxes_.size() > 0) { + for (auto &out_var : outputs_) { + auto *out_var_handle = dynamic_cast(out_var); + if (out_var_handle) { + int dev_id = + boost::get(out_var_handle->place()).device; + out_var_handle->SetGenerateEvent(events_.at(dev_id)); + } + } + } else { + PADDLE_ENFORCE_EQ(dev_ctxes_.size(), 1UL, + "%s should have only one dev_ctx.", Name()); + auto &place = dev_ctxes_.begin()->first; + int dev_id = boost::get(place).device; + for (auto &out_var : outputs_) { + auto *out_var_handle = dynamic_cast(out_var); + if (out_var_handle) { + PADDLE_ENFORCE( + platform::is_same_place(place, out_var_handle->place()), + "The place of output(%s) is not consistent with the " + "place of current op(%s).", + out_var_handle->Name(), Name()); + out_var_handle->SetGenerateEvent(events_.at(dev_id)); + } + } + } } #else + PADDLE_ENFORCE(!use_cuda); #endif @@ -93,17 +121,48 @@ void OpHandleBase::AddOutput(VarHandleBase *out) { void OpHandleBase::WaitInputVarGenerated() { for (auto in_var : inputs_) { if (NeedWait(in_var)) { - for (auto &pair : dev_ctxes_) { - in_var->GeneratedOp()->RecordWaitEventOnCtx(pair.second); + // Dummy Variable is used to represent dependencies between operators, so + // there doesn't add event for it. + auto *in_var_handle = dynamic_cast(in_var); + if (in_var_handle) { + auto &place = in_var_handle->place(); + if (platform::is_gpu_place(place)) { +#ifdef PADDLE_WITH_CUDA + auto stream = + static_cast(dev_ctxes_.at(place)) + ->stream(); + PADDLE_ENFORCE( + cudaStreamWaitEvent(stream, in_var_handle->GetEvent(), 0)); +#else + PADDLE_THROW("Doesn't compile the GPU."); +#endif + } + // There are nothing to do when the place is CPUPlace. } } } } void OpHandleBase::WaitInputVarGenerated(const platform::Place &place) { - for (auto *in : inputs_) { - if (NeedWait(in)) { - in->GeneratedOp()->RecordWaitEventOnCtx(dev_ctxes_[place]); + for (auto in_var : inputs_) { + if (NeedWait(in_var)) { + // Dummy Variable is used to represent dependencies between operators, so + // there doesn't add event for it. + auto *in_var_handle = dynamic_cast(in_var); + if (in_var_handle) { + if (platform::is_gpu_place(in_var_handle->place())) { +#ifdef PADDLE_WITH_CUDA + auto stream = static_cast( + dev_ctxes_.at(in_var_handle->place())) + ->stream(); + PADDLE_ENFORCE( + cudaStreamWaitEvent(stream, in_var_handle->GetEvent(), 0)); +#else + PADDLE_THROW("Doesn't compile the GPU."); +#endif + } + // There are nothing to do when the place is CPUPlace. + } } } } diff --git a/paddle/fluid/framework/details/op_handle_base.h b/paddle/fluid/framework/details/op_handle_base.h index d09b94a3fd32952985a37cf4246c7640d2db4f56..e0aa352e95bc3685a1f4879bffa6e86eecd7e7f9 100644 --- a/paddle/fluid/framework/details/op_handle_base.h +++ b/paddle/fluid/framework/details/op_handle_base.h @@ -25,13 +25,16 @@ namespace paddle { namespace framework { namespace details { -constexpr char kLocalExecScopeName[] = "@LCOAL_SCOPE@"; +constexpr char kLocalExecScopeName[] = "@LOCAL_SCOPE@"; // Wraps ir::Node and provide helper utilities. // It's responsible for populating necessary fields of ir::Node. class OpHandleBase { public: - explicit OpHandleBase(ir::Node *node) : node_(node) {} + // Owned by `node`. No need to be deleted explicitly. + explicit OpHandleBase(ir::Node *node) : node_(node) { + node_->WrappedBy(this); + } virtual ~OpHandleBase(); @@ -67,6 +70,9 @@ class OpHandleBase { auto it = dev_ctxes_.find(place); return it != dev_ctxes_.end() ? it->second : nullptr; } + const std::map &DeviceContext() { + return dev_ctxes_; + } void SetDeviceContext(platform::Place place, platform::DeviceContext *ctx_) { dev_ctxes_[place] = ctx_; diff --git a/paddle/fluid/framework/details/op_registry.h b/paddle/fluid/framework/details/op_registry.h index eea7e712f8f6e187cdceedce77cc76d1d4ca2101..18de595983f52e56dba4f5069257f354132db51b 100644 --- a/paddle/fluid/framework/details/op_registry.h +++ b/paddle/fluid/framework/details/op_registry.h @@ -16,8 +16,13 @@ limitations under the License. */ #include #include +#include +#include +#include #include #include "paddle/fluid/framework/grad_op_desc_maker.h" +#include "paddle/fluid/framework/inplace_op_inference.h" +#include "paddle/fluid/framework/no_need_buffer_vars_inference.h" #include "paddle/fluid/framework/op_info.h" #include "paddle/fluid/framework/op_proto_maker.h" #include "paddle/fluid/framework/operator.h" @@ -32,24 +37,87 @@ enum OpInfoFillType { kOpProtoAndCheckerMaker = 1, kGradOpDescMaker = 2, kVarTypeInference = 3, - kShapeInference = 4 + kShapeInference = 4, + kInplaceOpInference = 5, + kNoNeedBufferVarsInference = 6, + kUnknown = -1 +}; + +namespace internal { +template +struct TypePair { + using Type = T; + static constexpr OpInfoFillType kFillType = kType; +}; + +using OpRegistryClasses = std::tuple< // NOLINT + TypePair, // NOLINT + TypePair, // NOLINT + TypePair, // NOLINT + TypePair, // NOLINT + TypePair, // NOLINT + TypePair, // NOLINT + TypePair // NOLINT + >; + +static constexpr int kOpRegistryClassNumber = + std::tuple_size::value; + +template +struct IsMatchedBaseTypeImpl { + using PairType = typename std::tuple_element::type; + static constexpr bool kValue = + std::is_base_of::value; +}; + +template +struct IsMatchedBaseTypeImpl { + static constexpr bool kValue = false; +}; + +template +static inline constexpr bool IsMatchedBaseType() { + return IsMatchedBaseTypeImpl< + T, kPos, (kPos >= 0 && kPos < kOpRegistryClassNumber)>::kValue; +} + +template +struct OpInfoFillTypeGetterImpl {}; + +// This case should not happen +template +struct OpInfoFillTypeGetterImpl {}; + +template +struct OpInfoFillTypeGetterImpl { + static constexpr OpInfoFillType kType = kUnknown; +}; + +template +struct OpInfoFillTypeGetterImpl { + static constexpr OpInfoFillType kType = + OpInfoFillTypeGetterImpl()>::kType; +}; + +template +struct OpInfoFillTypeGetterImpl { + using PairType = typename std::tuple_element::type; + static constexpr OpInfoFillType kType = PairType::kFillType; }; +template +using OpInfoFillTypeGetter = + OpInfoFillTypeGetterImpl()>; + +} // namespace internal + template struct OpInfoFillTypeID { static constexpr OpInfoFillType ID() { - return std::is_base_of::value - ? kOperator - : (std::is_base_of::value - ? kOpProtoAndCheckerMaker - : (std::is_base_of::value - ? kGradOpDescMaker - : (std::is_base_of::value - ? kVarTypeInference - : (std::is_base_of::value - ? kShapeInference - : static_cast( - -1))))); + return internal::OpInfoFillTypeGetter::kType; } }; @@ -116,15 +184,19 @@ struct OpInfoFiller { T maker(fwd_op, no_grad_set, grad_to_var, grad_block); return maker(); }; + + info->use_default_grad_op_desc_maker_ = + std::is_base_of, T>::value || + std::is_base_of, T>::value; } }; template struct OpInfoFiller { void operator()(const char* op_type, OpInfo* info) const { - info->infer_var_type_ = [](const OpDesc& fwd_op, BlockDesc* block) { + info->infer_var_type_ = [](InferVarTypeContext* context) { T inference; - inference(fwd_op, block); + inference(context); }; } }; @@ -139,6 +211,34 @@ struct OpInfoFiller { } }; +template +struct OpInfoFiller { + void operator()(const char* op_type, OpInfo* info) const { + info->infer_inplace_ = [](const OpDesc& op_desc) { + T infer; + return infer(op_desc); + }; + } +}; + +template +struct OpInfoFiller { + void operator()(const char* op_type, OpInfo* info) const { + info->infer_no_need_buffer_vars_ = [](const VariableNameMap& inputs, + const VariableNameMap& outputs, + const AttributeMap& attrs) { + T infer(inputs, outputs, attrs); + return infer(); + }; + } +}; + +// A fake OpInfoFiller of void +template <> +struct OpInfoFiller { + void operator()(const char* op_type, OpInfo* info) const {} +}; + } // namespace details } // namespace framework diff --git a/paddle/fluid/framework/details/parallel_ssa_graph_executor.cc b/paddle/fluid/framework/details/parallel_ssa_graph_executor.cc new file mode 100644 index 0000000000000000000000000000000000000000..137e0dd7708dcc77c3a927979cfb357249f1fdc9 --- /dev/null +++ b/paddle/fluid/framework/details/parallel_ssa_graph_executor.cc @@ -0,0 +1,169 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/details/parallel_ssa_graph_executor.h" +#include +#include +#include "paddle/fluid/framework/ir/graph_helper.h" + +namespace paddle { +namespace framework { +namespace details { + +std::vector> +ParallelSSAGraphExecutor::SeparateMultiDevicesGraph(ir::Graph *graph) { + std::vector> graphs; + graphs.reserve(places_.size()); + for (size_t i = 0; i < places_.size(); ++i) { + ProgramDesc empty; + graphs.emplace_back(std::unique_ptr(new ir::Graph(empty))); + auto &g = graphs.back(); + g->Set(kGraphVars, new GraphVars(1UL)); + g->Set(kGraphDepVars, new GraphDepVars); + auto &stale_ops = + graph->Get>(details::kStaleProgramOpDescs); + g->Erase(details::kStaleProgramOpDescs); + g->Set>(details::kStaleProgramOpDescs, + new std::vector(stale_ops)); + } + auto op_handles = ir::FilterByNodeWrapper(*graph); + + for (auto &op : op_handles) { + auto &dev_ctx = op->DeviceContext(); + auto &p = dev_ctx.begin()->first; + int dev_id = boost::get(p).device; + auto &dev_dummys = graphs[dev_id]->Get(kGraphDepVars); + graphs[dev_id]->AddNode(graph->RemoveNode(op->Node()).release()); + + for (auto &var : op->Inputs()) { + auto dummy_ptr = dynamic_cast(var); + if (dummy_ptr) { + dev_dummys.insert(var); + if (graph->Nodes().count(var->Node())) + graphs[dev_id]->AddNode(graph->RemoveNode(var->Node()).release()); + } + } + for (auto &var : op->Outputs()) { + auto dummy_ptr = dynamic_cast(var); + if (dummy_ptr) { + dev_dummys.insert(var); + if (graph->Nodes().count(var->Node())) + graphs[dev_id]->AddNode(graph->RemoveNode(var->Node()).release()); + } + } + } + + for (size_t dev_id = 0; dev_id < places_.size(); ++dev_id) { + auto &dev_vars = graphs[dev_id]->Get(kGraphVars)[0]; + auto &origin_vars = graph->Get(kGraphVars)[dev_id]; + for (auto &name_pair : origin_vars) { + dev_vars.emplace(name_pair.first, name_pair.second); + for (auto &version_pair : name_pair.second) { + if (graph->Nodes().count(version_pair->Node())) { + graphs[dev_id]->AddNode( + graph->RemoveNode(version_pair->Node()).release()); + } + } + } + } + + return graphs; +} + +ParallelSSAGraphExecutor::ParallelSSAGraphExecutor( + const ExecutionStrategy &strategy, const std::vector &local_scopes, + const std::vector &places, ir::Graph *graph) + : strategy_(std::move(strategy)), + local_scopes_(std::move(local_scopes)), + pool_(places.size() >= 2 ? new ::ThreadPool(places.size()) : nullptr), + places_(std::move(places)), + // TODO(Yancey1989): Copying graphs is not safely since it deleted the + // attrs. + graphs_(SeparateMultiDevicesGraph(graph)) { + PADDLE_ENFORCE_EQ(places_.size(), local_scopes_.size()); + + auto seq_allreduce_pass = + ir::PassRegistry::Instance().Get("all_reduce_deps_pass"); + for (size_t i = 0; i < graphs_.size(); ++i) { + graphs_[i].reset(seq_allreduce_pass->Apply(graphs_[i].release())); + } + + // set the correct size of thread pool to each device. + strategy_.num_threads_ = strategy_.num_threads_ < places_.size() + ? 1UL + : strategy_.num_threads_ / places_.size(); + VLOG(1) << "set num_threads: " << strategy_.num_threads_ + << " to run the operators of the graph on each device."; + for (size_t i = 0; i < places.size(); ++i) { + executors_.emplace_back(new details::ThreadedSSAGraphExecutor( + strategy_, local_scopes_, {places_[i]}, graphs_.at(i).get())); + } +} + +FeedFetchList ParallelSSAGraphExecutor::Run( + const std::vector &fetch_tensors) { + std::vector> run_futures; + + std::vector fetch_data; + FeedFetchList ret; + + fetch_data.reserve(places_.size()); + ret.reserve(fetch_tensors.size()); + exception_holder_.Clear(); + + for (size_t i = 0; i < places_.size(); ++i) { + auto call = [this, i, &fetch_tensors]() -> FeedFetchList { + try { + return executors_[i]->Run(fetch_tensors); + } catch (...) { + exception_holder_.Catch(std::current_exception()); + } + return FeedFetchList(); + }; + + if (pool_) { + run_futures.emplace_back(pool_->enqueue(std::move(call))); + } else { + fetch_data.emplace_back(call()); + } + } + + if (pool_) { + for (auto &f : run_futures) { + if (exception_holder_.IsCaught()) { + f.wait(); + } else { + fetch_data.emplace_back(f.get()); + } + } + } + if (exception_holder_.IsCaught()) { + exception_holder_.ReThrow(); + } + + for (size_t fetch_idx = 0; fetch_idx < fetch_tensors.size(); ++fetch_idx) { + std::vector lodtensor_ptrs; + lodtensor_ptrs.reserve(local_scopes_.size()); + for (size_t scope_idx = 0; scope_idx < local_scopes_.size(); ++scope_idx) { + lodtensor_ptrs.push_back(&fetch_data.at(scope_idx).at(fetch_idx)); + } + ret.emplace_back(); + ret.back().MergeLoDTensor(lodtensor_ptrs, platform::CPUPlace()); + } + return ret; +} + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/parallel_ssa_graph_executor.h b/paddle/fluid/framework/details/parallel_ssa_graph_executor.h new file mode 100644 index 0000000000000000000000000000000000000000..1e421f2a3a51363fe368859f7a34593c8c894077 --- /dev/null +++ b/paddle/fluid/framework/details/parallel_ssa_graph_executor.h @@ -0,0 +1,57 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include + +#include "ThreadPool.h" +#include "paddle/fluid/framework/details/multi_devices_helper.h" +#include "paddle/fluid/framework/details/threaded_ssa_graph_executor.h" +#include "paddle/fluid/framework/ir/graph.h" + +namespace paddle { +namespace framework { +namespace details { + +class ParallelSSAGraphExecutor : public SSAGraphExecutor { + public: + ParallelSSAGraphExecutor(const ExecutionStrategy &strategy, + const std::vector &local_scopes, + const std::vector &places, + ir::Graph *graph); + ~ParallelSSAGraphExecutor() final = default; + + const ir::Graph &Graph() const override { return *graphs_[0]; } + + FeedFetchList Run(const std::vector &fetch_tensors) override; + + private: + std::vector> SeparateMultiDevicesGraph( + ir::Graph *graph); + + ExecutionStrategy strategy_; + std::vector local_scopes_; + std::unique_ptr<::ThreadPool> pool_{nullptr}; + std::vector places_; + std::vector> graphs_; + + std::vector> executors_; + ExceptionHolder exception_holder_; +}; + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/reduce_and_gather.h b/paddle/fluid/framework/details/reduce_and_gather.h index bd6153c0c736f6e32378eebcbf6c4d7e402c9b42..0de8e436518ea353a185087b0e4668b5d200c966 100644 --- a/paddle/fluid/framework/details/reduce_and_gather.h +++ b/paddle/fluid/framework/details/reduce_and_gather.h @@ -53,7 +53,32 @@ struct ReduceLoDTensor { } }; -inline void GatherSelectedRows( +struct ReduceBufferData { + const std::vector &src_data_; + void *dst_data_; + int64_t numel_; + + ReduceBufferData(const std::vector &src, void *dst, + int64_t numel) + : src_data_(src), dst_data_(dst), numel_(numel) {} + + template + void apply() const { + T *dst_data = reinterpret_cast(dst_data_); + for (size_t i = 0; i < src_data_.size(); ++i) { + auto srd_data = reinterpret_cast(src_data_[i]); + VLOG(10) << "dst: " << dst_data_ << ", " << srd_data; + if (srd_data == dst_data_) { + continue; + } + + std::transform(srd_data, srd_data + numel_, dst_data, dst_data, + [](T a, T b) -> T { return a + b; }); + } + } +}; + +inline void GatherLocalSelectedRows( const std::vector &src_selecte_rows_, const std::vector &in_places, const std::map &dev_ctxes, diff --git a/paddle/fluid/framework/details/reduce_op_handle.cc b/paddle/fluid/framework/details/reduce_op_handle.cc index 7fc06f234d42a992328c0b6164f17945d8075c28..4e2477c205db5966aa0b2d0c7a608be94a69eb82 100644 --- a/paddle/fluid/framework/details/reduce_op_handle.cc +++ b/paddle/fluid/framework/details/reduce_op_handle.cc @@ -16,6 +16,12 @@ #include "paddle/fluid/framework/details/container_cast.h" #include "paddle/fluid/framework/details/reduce_and_gather.h" #include "paddle/fluid/framework/details/variable_visitor.h" +#if defined PADDLE_WITH_CUDA && defined PADDLE_WITH_DISTRIBUTE +#include "paddle/fluid/operators/distributed/collective_client.h" +#include "paddle/fluid/operators/distributed/collective_server.h" +#include "paddle/fluid/operators/distributed/request_handler.h" +#endif +#include "paddle/fluid/operators/math/selected_rows_functor.h" #include "paddle/fluid/platform/profiler.h" DEFINE_bool( @@ -26,8 +32,114 @@ namespace paddle { namespace framework { namespace details { +std::once_flag CollectiveContext::init_flag_; +std::unique_ptr CollectiveContext::context_; + +static inline std::string GetRemoteVarName(const std::string &var_name, + int trainer_id) { + return string::Sprintf("%s_merged_tmp@trainer_%d", var_name, trainer_id); +} + +void ReduceOpHandle::Wait( + const std::map &dev_ctxes) { + // TODO(gongwb): use event wait? + for (auto &dev_ctx : dev_ctxes) { + dev_ctx.second->Wait(); + } +} + +#if defined PADDLE_WITH_CUDA && defined PADDLE_WITH_DISTRIBUTE +template +void ReduceOpHandle::GatherSelectedRows( + const std::vector &src_selected_rows, + const std::vector &in_places, + const std::map &dev_ctxes, + VarHandle *out_var_handle, const platform::Place &out_place, + SelectedRows *dst_selected_rows) { + const CollectiveContext &collective_context = + *CollectiveContext::GetInstance(); + + // 1. gather local selected rows, merge them + std::string gathered_var_name = out_var_handle->name() + "_gathered_tmp"; + auto scope = local_scopes_.at(out_var_handle->scope_idx()); + auto gathered_var_mid = scope->Var(gathered_var_name); + auto gathered_select_rows = + gathered_var_mid->GetMutable(); + GatherLocalSelectedRows(src_selected_rows, in_places, dev_ctxes, out_place, + gathered_select_rows); + // FIXME(gongwb): remove this Wait. + Wait(dev_ctxes); + + // merge them + auto merged_dev_ctx = dynamic_cast(dev_ctxes.at(out_place)); + std::string merged_var_name = + GetRemoteVarName(out_var_handle->name(), collective_context.trainer_id_); + auto merged_select_rows = + scope->Var(merged_var_name)->GetMutable(); + operators::math::scatter::MergeAdd merge_func; + merge_func(*merged_dev_ctx, *gathered_select_rows, merged_select_rows); + + // 2. start collective server if it doesn't exist + operators::distributed::CollectiveServer *server = + operators::distributed::CollectiveServer::GetInstance( + collective_context.endpoints_[collective_context.trainer_id_], + collective_context.endpoints_.size() - 1); + + auto rpc_server = server->GetRPCServer(); + rpc_server->RegisterVar(merged_var_name, + operators::distributed::kRequestGetMonomerVariable, + scope, merged_dev_ctx); + + // 3. gather them from all remote nodes. + std::vector remote; + operators::distributed::CollectiveClient *client = + operators::distributed::CollectiveClient::GetInstance(); + + std::vector vars; + for (unsigned int i = 0; i < collective_context.endpoints_.size(); i++) { + if (i == (unsigned)collective_context.trainer_id_) continue; + + operators::distributed::RemoteVar var; + var.trainer_id_ = i; + var.var_name_ = GetRemoteVarName(out_var_handle->name(), i); + var.ep_ = collective_context.endpoints_[i]; + + vars.push_back(var); + VLOG(4) << "gather from:" << var.String(); + } + + // erase gathered vars + merged_dev_ctx->Wait(); + scope->EraseVars(std::vector{gathered_var_name}); + + PADDLE_ENFORCE(client->Gather(vars, &remote, *merged_dev_ctx, scope)); + PADDLE_ENFORCE(remote.size() == vars.size()); + + // 4. merged local selected rows. + std::vector all; + all.resize(collective_context.endpoints_.size()); + for (auto v : vars) { + all[v.trainer_id_] = + scope->FindVar(v.var_name_)->GetMutable(); + } + all[collective_context.trainer_id_] = merged_select_rows; + + merge_func(*merged_dev_ctx, all, dst_selected_rows); + + rpc_server->WaitVarBarrier(merged_var_name); + rpc_server->ClearVar(merged_var_name); + + // 5. clear mid vars + std::vector tmp_vars{merged_var_name}; + for (auto r : vars) { + tmp_vars.push_back(r.var_name_); + } + scope->EraseVars(tmp_vars); +} +#endif + void ReduceOpHandle::RunImpl() { - platform::RecordEvent record_event(Name(), dev_ctxes_.begin()->second); + platform::RecordEvent record_event(Name()); if (places_.size() == 1) return; // the input and output may have dummy var. @@ -41,7 +153,7 @@ void ReduceOpHandle::RunImpl() { { auto out_var_handles = DynamicCast(outputs_); - PADDLE_ENFORCE_EQ(out_var_handles.size(), 1, + PADDLE_ENFORCE_EQ(out_var_handles.size(), 1UL, "The number of output should be one."); out_var_handle = out_var_handles.front(); } @@ -54,7 +166,7 @@ void ReduceOpHandle::RunImpl() { } auto pre_in_var = - var_scopes.at(in_0_handle->scope_idx_)->FindVar(in_0_handle->name_); + var_scopes.at(in_0_handle->scope_idx())->FindVar(in_0_handle->name()); PADDLE_ENFORCE_NOT_NULL(pre_in_var); // Wait input done, this Wait is asynchronous operation @@ -63,15 +175,15 @@ void ReduceOpHandle::RunImpl() { // NOTE: The Places of all input tensor must be all on CPU or all on GPU. std::vector in_places; // used to get dev_ctx for (auto *in_handle : in_var_handles) { - in_places.emplace_back(in_handle->place_); + in_places.emplace_back(in_handle->place()); auto in_var = - var_scopes.at(in_handle->scope_idx_)->FindVar(in_handle->name_); + var_scopes.at(in_handle->scope_idx())->FindVar(in_handle->name()); PADDLE_ENFORCE_NOT_NULL(in_var); VariableVisitor::EnforceShapeAndDTypeEQ(*pre_in_var, *in_var); } - auto out_var = - var_scopes.at(out_var_handle->scope_idx_)->FindVar(out_var_handle->name_); + auto out_var = var_scopes.at(out_var_handle->scope_idx()) + ->FindVar(out_var_handle->name()); PADDLE_ENFORCE_NOT_NULL(out_var); // NOTE: The tensors' Place of input and output must be all on GPU or all on @@ -79,9 +191,9 @@ void ReduceOpHandle::RunImpl() { auto in_p = VariableVisitor::GetMutableTensor(pre_in_var).place(); platform::Place t_out_p; if (platform::is_gpu_place(in_p)) { - PADDLE_ENFORCE(platform::is_gpu_place(out_var_handle->place_), + PADDLE_ENFORCE(platform::is_gpu_place(out_var_handle->place()), "Places of input and output must be all on GPU."); - t_out_p = out_var_handle->place_; + t_out_p = out_var_handle->place(); } else { t_out_p = platform::CPUPlace(); } @@ -90,8 +202,36 @@ void ReduceOpHandle::RunImpl() { this->RunAndRecordEvent([&] { std::vector in_selected_rows = GetInputValues(in_var_handles, var_scopes); - GatherSelectedRows(in_selected_rows, in_places, dev_ctxes_, t_out_p, - out_var->GetMutable()); + + const CollectiveContext &collective_context = + *CollectiveContext::GetInstance(); + VLOG(10) << "GatherSelectedRows CollectiveContext:" + << collective_context.String(); + + // TODO(gongwb): add cpu support + if (collective_context.endpoints_.size() <= 1 || + is_cpu_place(in_places[0]) || is_cpu_place(t_out_p)) { + GatherLocalSelectedRows(in_selected_rows, in_places, dev_ctxes_, + t_out_p, + out_var->GetMutable()); + return; + } + +#if defined PADDLE_WITH_CUDA && defined PADDLE_WITH_DISTRIBUTE + if (in_selected_rows[0]->value().type() == + framework::proto::VarType::FP32) { + GatherSelectedRows( + in_selected_rows, in_places, dev_ctxes_, out_var_handle, t_out_p, + out_var->GetMutable()); + } else if (in_selected_rows[0]->value().type() == + framework::proto::VarType::FP64) { + GatherSelectedRows( + in_selected_rows, in_places, dev_ctxes_, out_var_handle, t_out_p, + out_var->GetMutable()); + } else { + PADDLE_THROW("only support double or float when gather SelectedRows"); + } +#endif }); } else { std::vector lod_tensors = @@ -106,17 +246,17 @@ void ReduceOpHandle::RunImpl() { if (!FLAGS_cpu_deterministic) { ReduceLoDTensor func(lod_tensors, out_var->GetMutable()); - VisitDataType(ToDataType(lod_tensors[0]->type()), func); + VisitDataType(lod_tensors[0]->type(), func); } else { // We sum lod_tensors to reduce_sum_trg which is in local_scopes_0 // here, but it doesn't mean reduce_sum_trg must be in local_scopes_0. auto &reduce_sum_trg = *this->local_scopes_[0] ->FindVar(kLocalExecScopeName) ->Get() - ->FindVar(out_var_handle->name_) + ->FindVar(out_var_handle->name()) ->GetMutable(); ReduceLoDTensor func(lod_tensors, &reduce_sum_trg); - VisitDataType(ToDataType(lod_tensors[0]->type()), func); + VisitDataType(lod_tensors[0]->type(), func); auto trg = out_var->GetMutable(); if (reduce_sum_trg.data() != trg->data()) { @@ -125,13 +265,13 @@ void ReduceOpHandle::RunImpl() { } }); } else if (paddle::platform::is_gpu_place(lod_tensors[0]->place())) { -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) auto pre_in = pre_in_var->Get(); VariableVisitor::ShareDimsAndLoD(*pre_in_var, out_var); VariableVisitor::GetMutableTensor(out_var).mutable_data( - out_var_handle->place_, pre_in.type()); + out_var_handle->place(), pre_in.type()); - auto out_p = out_var_handle->place_; + auto out_p = out_var_handle->place(); int root_id = boost::get(out_p).device; std::vector> all_reduce_calls; for (size_t i = 0; i < var_scopes.size(); ++i) { @@ -146,7 +286,7 @@ void ReduceOpHandle::RunImpl() { if (root_id == dev_id) { recvbuffer = out_var->GetMutable()->mutable_data( - out_var_handle->place_); + out_var_handle->place()); } int type = platform::ToNCCLDataType(lod_tensor.type()); @@ -180,8 +320,8 @@ std::vector ReduceOpHandle::GetInputValues( const std::vector &var_scopes) const { std::vector in_selected_rows; for (auto *in_handle : in_var_handles) { - auto &in_sr = var_scopes.at(in_handle->scope_idx_) - ->FindVar(in_handle->name_) + auto &in_sr = var_scopes.at(in_handle->scope_idx()) + ->FindVar(in_handle->name()) ->Get(); in_selected_rows.emplace_back(&in_sr); } diff --git a/paddle/fluid/framework/details/reduce_op_handle.h b/paddle/fluid/framework/details/reduce_op_handle.h index a6289b055f97b7b0e57928358d84117b33cf2df8..5491f00f45e9d48c5eb7455396ac51801f2c40ab 100644 --- a/paddle/fluid/framework/details/reduce_op_handle.h +++ b/paddle/fluid/framework/details/reduce_op_handle.h @@ -23,19 +23,45 @@ #include "paddle/fluid/framework/scope.h" #include "paddle/fluid/framework/selected_rows.h" #include "paddle/fluid/platform/device_context.h" -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) #include "paddle/fluid/platform/nccl_helper.h" #endif namespace paddle { namespace framework { namespace details { +struct CollectiveContext { + std::vector endpoints_; + int trainer_id_{0}; + + std::string String() const { + std::stringstream ss; + ss << "endpoints_:"; + for (auto e : endpoints_) { + ss << e << ","; + } + + ss << "trainer_id_:" << trainer_id_; + + return ss.str(); + } + + static CollectiveContext *GetInstance() { + std::call_once(init_flag_, + [&]() { context_.reset(new CollectiveContext()); }); + return context_.get(); + } + + private: + static std::once_flag init_flag_; + static std::unique_ptr context_; +}; struct ReduceOpHandle : public OpHandleBase { std::vector local_scopes_; std::vector places_; -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) const platform::NCCLContextMap *nccl_ctxs_; ReduceOpHandle(ir::Node *node, const std::vector &local_scopes, const std::vector &places, @@ -46,7 +72,8 @@ struct ReduceOpHandle : public OpHandleBase { nccl_ctxs_(nccl_ctxs) { if (nccl_ctxs_) { for (auto &p_ctx : nccl_ctxs_->contexts_) { - dev_ctxes_[platform::CUDAPlace(p_ctx.first)] = p_ctx.second.ctx_.get(); + this->SetDeviceContext(platform::CUDAPlace(p_ctx.first), + p_ctx.second.ctx_.get()); } } } @@ -63,6 +90,19 @@ struct ReduceOpHandle : public OpHandleBase { protected: void RunImpl() override; +#if defined PADDLE_WITH_CUDA && defined PADDLE_WITH_DISTRIBUTE + template + void GatherSelectedRows( + const std::vector &src_selecte_rows_, + const std::vector &in_places, + const std::map &dev_ctxes, + VarHandle *out_var_handle, const platform::Place &out_place, + SelectedRows *dst_selecte_rows); +#endif + + void Wait( + const std::map &dev_ctxes); + template std::vector GetInputValues( const std::vector &in_var_handles, diff --git a/paddle/fluid/framework/details/reduce_op_handle_test.cc b/paddle/fluid/framework/details/reduce_op_handle_test.cc index 3a9a58412391b188c5e804b41fa47b3607a36bd1..6cee4770e64354cf8a719b0e11b1816b345dd8bd 100644 --- a/paddle/fluid/framework/details/reduce_op_handle_test.cc +++ b/paddle/fluid/framework/details/reduce_op_handle_test.cc @@ -30,12 +30,12 @@ struct TestReduceOpHandle { Scope g_scope_; std::vector local_scopes_; std::vector param_scopes_; - std::unique_ptr op_handle_; - std::vector> vars_; + OpHandleBase *op_handle_; + std::vector vars_; std::vector gpu_list_; std::vector> ctxs_; -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) std::unique_ptr nccl_ctxs_; #endif @@ -43,7 +43,7 @@ struct TestReduceOpHandle { for (size_t j = 0; j < ctxs_.size(); ++j) { ctxs_[j]->Wait(); } -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) if (nccl_ctxs_) { nccl_ctxs_->WaitAll(); } @@ -53,7 +53,7 @@ struct TestReduceOpHandle { void InitCtxOnGpu(bool use_gpu) { use_gpu_ = use_gpu; if (use_gpu) { -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) int count = p::GetCUDADeviceCount(); if (count <= 1) { LOG(WARNING) << "Cannot test multi-gpu Broadcast, because the CUDA " @@ -77,7 +77,7 @@ struct TestReduceOpHandle { gpu_list_.push_back(p); ctxs_.emplace_back(new p::CPUDeviceContext(p)); } -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) nccl_ctxs_.reset(nullptr); #endif } @@ -99,14 +99,14 @@ struct TestReduceOpHandle { nodes.emplace_back(new ir::Node("node")); if (use_gpu_) { -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) op_handle_.reset(new ReduceOpHandle(nodes.back().get(), local_scopes_, gpu_list_, nccl_ctxs_.get())); #else PADDLE_THROW("CUDA is not support."); #endif } else { -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) op_handle_.reset(new ReduceOpHandle(nodes.back().get(), local_scopes_, gpu_list_, nccl_ctxs_.get())); #else diff --git a/paddle/fluid/framework/details/reference_count_op_handle.h b/paddle/fluid/framework/details/reference_count_op_handle.h deleted file mode 100644 index fc479a4c4a1e7d5c824d3c202e0cccf743dd52c9..0000000000000000000000000000000000000000 --- a/paddle/fluid/framework/details/reference_count_op_handle.h +++ /dev/null @@ -1,138 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#pragma once - -#include -#include -#include -#include - -#include "paddle/fluid/framework/details/op_handle_base.h" -#include "paddle/fluid/framework/garbage_collector.h" -#include "paddle/fluid/framework/scope.h" -#include "paddle/fluid/framework/selected_rows.h" -#include "paddle/fluid/framework/tensor.h" - -namespace paddle { -namespace framework { -namespace details { - -using ReferenceCountMap = std::unordered_map; -using AtomicReferenceCountMap = - std::unordered_map>; -using DeviceReferenceCountMap = - std::unordered_map>; -using AtomicDeviceReferenceCountMap = - std::unordered_map>; -using DeviceGarbageCollectorMap = - std::unordered_map>>; - -class ReferenceCountOpHandle : public OpHandleBase { - public: - ReferenceCountOpHandle(ir::Node *node, const Scope *scope, - const platform::CUDAPlace &place, - const std::vector &var_names, - GarbageCollector *gc, - AtomicReferenceCountMap *ref_cnts) - : OpHandleBase(node), scope_(scope), gc_(gc), ref_cnts_(ref_cnts) { - dev_ctx_ = static_cast( - platform::DeviceContextPool::Instance().Get(place)); - if (IsStreamGarabageCollector()) { - PADDLE_ENFORCE(cudaSetDevice(place.device)); - PADDLE_ENFORCE(cudaEventCreateWithFlags(&event_, cudaEventDisableTiming)); - } - - for (auto &name : var_names) AddVar(name); - } - - ~ReferenceCountOpHandle() { - if (IsStreamGarabageCollector()) { - auto gpu_place = boost::get(dev_ctx_->GetPlace()); - PADDLE_ENFORCE(cudaSetDevice(gpu_place.device)); - PADDLE_ENFORCE(cudaEventDestroy(event_)); - } - } - - std::string Name() const override { return "reference_count"; } - - void AddVar(const std::string &name) { - auto it = var_names_.find(name); - if (it != var_names_.end()) - ++(it->second); - else - var_names_[name] = 1; - } - - protected: - void RunImpl() override { - auto *exec_scope = scope_->FindVar(kLocalExecScopeName)->Get(); - std::vector tensors; - for (auto &pair : var_names_) { - auto &name = pair.first; - auto it = ref_cnts_->find(name); - if (it == ref_cnts_->end()) continue; - - auto *var = exec_scope->FindVar(name); - if (var == nullptr) continue; - - if (var->IsType()) { - if (it->second.fetch_sub(pair.second) <= pair.second) { - tensors.emplace_back(var->GetMutable()); - } - } else if (var->IsType()) { - if (it->second.fetch_sub(pair.second) <= pair.second) { - tensors.emplace_back( - var->GetMutable()->mutable_value()); - } - } - } - - if (!tensors.empty()) { - ClearTensors(tensors); - } - } - - private: - void ClearTensors(const std::vector &tensors) { - auto *gc = dynamic_cast *>(gc_); - if (gc != nullptr) { - auto compute_stream = dev_ctx_->stream(); - auto callback_stream = gc->stream(); - auto callback_func = [=]() { - PADDLE_ENFORCE(cudaEventRecord(event_, compute_stream)); - PADDLE_ENFORCE(cudaStreamWaitEvent(callback_stream, event_, 0)); - }; - gc_->Add(tensors, callback_func); - } else { - gc_->Add(tensors); - } - } - - bool IsStreamGarabageCollector() const { - return dynamic_cast *>(gc_) != nullptr; - } - - const Scope *scope_; - platform::CUDADeviceContext *dev_ctx_; - std::unordered_map var_names_; - GarbageCollector *gc_; // not own - AtomicReferenceCountMap *ref_cnts_; // not own - cudaEvent_t event_; -}; - -} // namespace details -} // namespace framework -} // namespace paddle diff --git a/paddle/fluid/framework/details/reference_count_pass.cc b/paddle/fluid/framework/details/reference_count_pass.cc index 2d1f688d64ece3322e253b0c070264b9eb73d678..25337872c10f932b6e9ecf4f0a6fb9bed332b11c 100644 --- a/paddle/fluid/framework/details/reference_count_pass.cc +++ b/paddle/fluid/framework/details/reference_count_pass.cc @@ -12,184 +12,335 @@ // See the License for the specific language governing permissions and // limitations under the License. +#include #include #include +#include +#include +#include +#include #include #include "paddle/fluid/framework/details/computation_op_handle.h" +#include "paddle/fluid/framework/details/eager_deletion_op_handle.h" #include "paddle/fluid/framework/details/multi_devices_helper.h" +#include "paddle/fluid/framework/details/op_graph_view.h" #include "paddle/fluid/framework/details/reference_count_pass.h" +#include "paddle/fluid/framework/details/reference_count_pass_helper.h" +#include "paddle/fluid/framework/ir/graph_helper.h" namespace paddle { namespace framework { namespace details { -static ComputationOpHandle *FindNextComputationOpHandle(VarHandle *var_in) { - std::queue queue; - queue.push(var_in); - do { - auto *var = queue.front(); - queue.pop(); - for (auto *op : var->PendingOps()) { - auto *compute_op = dynamic_cast(op); - if (compute_op != nullptr && compute_op->GetPlace() == var_in->place_) { - return compute_op; +// A functor to shrink/remove operators who depend on other operators in a set +class ShrinkDepsOpFunctor { + private: + enum RelationShip { kSame = 0, kNoDeps = 1, kBefore = 2, kAfter = 3 }; + + public: + explicit ShrinkDepsOpFunctor(const std::vector &all_ops) + : graph_(all_ops) {} + + template + OpSet operator()(const OpSet &op_set) const { + using KeyType = typename OpSet::key_type; + static_assert( + std::is_base_of::type>::value, + "Key type of OpSet must be OpHandleBase, or derived of OpHandleBase"); + + if (op_set.size() <= 1) return op_set; + std::vector ops(op_set.begin(), op_set.end()); + OpSet ret; + auto rels = GetRelations(ops); + auto not_before = [](RelationShip r) { return r != kBefore; }; + for (size_t i = 0; i < rels.size(); ++i) { + if (std::all_of(rels[i].begin(), rels[i].end(), not_before)) { + ret.emplace(static_cast(ops[i])); + } + } + return ret; + } + + private: + std::vector> GetRelations( + const std::vector &ops) const { + std::unordered_map op_to_idx; + for (size_t i = 0; i < ops.size(); ++i) { + PADDLE_ENFORCE(graph_.HasOp(ops[i]), "Op does not exist in graph"); + op_to_idx[ops[i]] = i; + } + + PADDLE_ENFORCE(op_to_idx.size() == ops.size(), "Duplicate ops"); + + std::vector> ret(ops.size()); + for (auto &e : ret) { + e.assign(ops.size(), kSame); + } + + size_t found_num = ops.size(); + size_t total_num = ops.size() * ops.size(); + auto visitor = [&](OpHandleBase *op, size_t i) { + auto it = op_to_idx.find(op); + if (it != op_to_idx.end()) { + size_t j = it->second; + if (i != j && ret[i][j] == kSame) { + ret[i][j] = kBefore; + ret[j][i] = kAfter; + found_num += 2; + if (found_num == total_num) { + return false; + } + } + } + return true; + }; + + for (size_t i = 0; i < ops.size(); ++i) { + auto sub_visitor = [&, i](OpHandleBase *op) { return visitor(op, i); }; + if (!graph_.VisitAllPendingOps(ops[i], sub_visitor)) { + break; + } + } + + for (size_t i = 0; i < ops.size(); ++i) { + for (size_t j = i + 1; j < ops.size(); ++j) { + if (ret[i][j] != kSame) continue; + ret[i][j] = kNoDeps; + ret[j][i] = kNoDeps; } - for (auto *out_var : op->Outputs()) { - queue.push(out_var); + } + + return ret; + } + + const OpGraphView graph_; +}; + +/** + * Find the nearest downstream computation op handle. If the op is a + * computation op, just return itself. + */ +static ComputationOpHandle *FindNextComputationOpHandleOrReturnItself( + OpHandleBase *op, size_t scope_idx) { + std::queue q; + std::unordered_set visited; + q.push(op); + do { + auto *op = q.front(); + q.pop(); + auto *compute_op = dynamic_cast(op); + if (compute_op != nullptr && compute_op->GetScopeIdx() == scope_idx) { + return compute_op; + } + for (auto *out_var : op->Outputs()) { + for (auto *pending_op : out_var->PendingOps()) { + if (visited.count(pending_op)) continue; + visited.insert(pending_op); } } - } while (!queue.empty()); + } while (!q.empty()); return nullptr; } -std::unique_ptr ReferenceCountPass::ApplyImpl( - std::unique_ptr graph) const { - auto &ref_cnts = Get(kGlobalReferenceCount); - auto &cur_ref_cnts = Get(kCurReferenceCount); - auto &gcs = Get(kGarbageCollector); - - // It is not easy to find the right reference counts of varaibles in graph - // Step 1: Find all variables in computation ops - // Step 2: Find all variables in non-computation ops which refers to variables - // in computation ops - std::unordered_set names; - std::unordered_map> - compute_ref_cnt_map; - - auto get_ref_cnts_from_compute_op = [&]( - const std::unique_ptr &op, - const std::vector &vars) { - std::vector var_names_in_op; - auto *compute_op = dynamic_cast(op.get()); - if (compute_op == nullptr || - !platform::is_gpu_place(compute_op->GetPlace())) - return var_names_in_op; - auto place = boost::get(compute_op->GetPlace()); - for (VarHandleBase *var_handle_base : vars) { - auto *var_handle = dynamic_cast(var_handle_base); - if (var_handle == nullptr || !var_handle->Node()->IsVar()) continue; - - if (!platform::is_gpu_place(var_handle->place_) || - boost::get(var_handle->place_) != place) - continue; +static std::unordered_set +ExtractComputationOpFromLastLivedVar(VarHandle *var, size_t scope_idx, + const ShrinkDepsOpFunctor &shrink_func, + bool *ok) { + // stage one. Get last op for variable. + std::unordered_set candidates; + { + if (var->PendingOps().empty() && var->GeneratedOp()) { + // No operator depends on this variable. So the last operator is the op + // who generates this variable. + candidates.emplace(var->GeneratedOp()); + } else { + candidates = var->PendingOps(); + } - VarDesc *var_desc = var_handle->Node()->Var(); - auto var_name = var_handle->Node()->Name(); + // No pending ops or generated op is nullptr + if (candidates.empty()) { + *ok = false; + return {}; + } + } - // This is weird but there is really some variables without var_desc - // in computation_op - if (var_desc == nullptr) { - var_desc = compute_op->Node()->Op()->Block()->FindVar(var_name); - if (var_desc == nullptr) continue; + // stage two. Try to cast them to computation op. + // return (*ok=false) when failed. + // + // The reason why we cannot make any types of op handle to be the last lived + // op is: + // some op handle may operate on many DeviceContext, however, our garbage + // collector can only wait one DeviceContext for now. So currently, we wait + // the nearest compute op. + std::unordered_set computation_op; + { + for (auto *op : candidates) { + auto *compute_op = + FindNextComputationOpHandleOrReturnItself(op, scope_idx); + if (compute_op == nullptr) { + *ok = false; + return {}; } + computation_op.emplace(compute_op); + } + } - if (var_desc->Persistable()) continue; - auto var_type = var_desc->Proto()->type().type(); - if (var_type != proto::VarType::LOD_TENSOR && - var_type != proto::VarType::SELECTED_ROWS) { + // stage three. Try to shrink computation op if they depend on each other. + // Get the smallest set of the most ops. + *ok = true; + return shrink_func(computation_op); +} + +/** + * Shrink op dependencies according to no need buffer vars. + * + * If some ops do not need Tensor buffer of any input, + * just remove the dependency of this op, i.e, decrease reference count. + * + * For example, input Y of elementwise_add_grad op is only used to infer shape + * and lod of Y@GRAD, we do not need the buffer of input Y. Data buffer of + * input Y can be collected before elementwise_add_grad op runs. + * + * This method returns whether the dependency count decreases to 0, and + * shrinks op dependency if possible. + */ +static bool ShrinkNoNeedBufferVarOpDependency( + const std::string &var_name, + std::unordered_set *op_handles) { + std::vector skip_ops; + for (auto *op_handle : *op_handles) { + auto *op_base = op_handle->GetOp(); + auto &inferer = op_base->Info().NoNeedBufferVarsInferer(); + if (!inferer) { + continue; + } + + std::unordered_set no_need_buffer_vars = + inferer(op_base->Inputs(), op_base->Outputs(), op_base->Attrs()); + + // Check whether var_name occurs in other inputs or outputs of the op + // If it occurs, we cannot decrease the dependency number. + bool occurred_in_other_vars = false; + for (auto &in_pair : op_base->Inputs()) { + if (no_need_buffer_vars.count(in_pair.first) > 0) { continue; } - // compute op only runs in one device - if (ref_cnts[place.device]->count(var_name)) - ++(*ref_cnts[place.device])[var_name]; - else - (*ref_cnts[place.device])[var_name] = 1; - - names.insert(var_name); - var_names_in_op.push_back(var_name); - } - return var_names_in_op; - }; - - auto update_ref_cnts_from_non_compute_op = [&]( - const std::unique_ptr &op, - const std::vector &vars) { - if (dynamic_cast(op.get()) != nullptr) return; - for (VarHandleBase *var_handle_base : vars) { - auto *var_handle = dynamic_cast(var_handle_base); - if (var_handle == nullptr || !var_handle->Node()->IsVar()) continue; - - auto var_name = var_handle->Node()->Name(); - auto var_place = var_handle->place_; - if (!platform::is_gpu_place(var_place)) continue; - auto place = boost::get(var_place); - if (names.count(var_name) == 0) continue; - if (ref_cnts.count(place.device) && - ref_cnts[place.device]->count(var_name)) { - ++(*ref_cnts[place.device])[var_name]; - - auto *next_compute_op = FindNextComputationOpHandle(var_handle); - if (next_compute_op != nullptr) { - if (compute_ref_cnt_map.count(next_compute_op)) { - compute_ref_cnt_map[next_compute_op]->AddVar(var_name); - VLOG(5) << "Add reference count of " << var_name << " to Operator " - << next_compute_op->Name(); - } else { - // Create new reference_count_op_handle - ir::Node *ref_cnt_node = graph->CreateEmptyNode( - "reference_count", ir::Node::Type::kOperation); - auto *ref_cnt_handle = new ReferenceCountOpHandle( - ref_cnt_node, next_compute_op->GetScope(), place, {var_name}, - gcs[place.device].get(), cur_ref_cnts[place.device].get()); - if (next_compute_op->Outputs().empty()) { - auto *dep_var = new DummyVarHandle(graph->CreateControlDepVar()); - next_compute_op->AddOutput(dep_var); - graph->Get(kGraphDepVars).emplace(dep_var); - } - ref_cnt_handle->AddInput(next_compute_op->Outputs().front()); - compute_ref_cnt_map[next_compute_op].reset(ref_cnt_handle); - } - } + auto &args = in_pair.second; + auto iter = std::find(args.begin(), args.end(), var_name); + if (iter != args.end()) { + occurred_in_other_vars = true; + break; + } + } + + if (occurred_in_other_vars) { + continue; + } + + for (auto &out_pair : op_base->Outputs()) { + auto &args = out_pair.second; + auto iter = std::find(args.begin(), args.end(), var_name); + if (iter != args.end()) { + occurred_in_other_vars = true; + break; } } - }; - - auto &all_ops = graph->Get(kGraphOps); - for (auto &op : all_ops) { - auto in_var_names = get_ref_cnts_from_compute_op(op, op->Inputs()); - auto out_var_names = get_ref_cnts_from_compute_op(op, op->Outputs()); - if (in_var_names.empty() && out_var_names.empty()) continue; - in_var_names.insert(in_var_names.end(), out_var_names.begin(), - out_var_names.end()); - auto *compute_op = dynamic_cast(op.get()); - auto place = boost::get(compute_op->GetPlace()); - ir::Node *ref_cnt_node = - graph->CreateEmptyNode("reference_count", ir::Node::Type::kOperation); - auto *ref_cnt_handle = new ReferenceCountOpHandle( - ref_cnt_node, compute_op->GetScope(), place, in_var_names, - gcs[place.device].get(), cur_ref_cnts[place.device].get()); - if (compute_op->Outputs().empty()) { - auto *dep_var = new DummyVarHandle(graph->CreateControlDepVar()); - compute_op->AddOutput(dep_var); - graph->Get(kGraphDepVars).emplace(dep_var); - } - ref_cnt_handle->AddInput(compute_op->Outputs().front()); - compute_ref_cnt_map[compute_op].reset(ref_cnt_handle); - } - for (auto &op : all_ops) { - update_ref_cnts_from_non_compute_op(op, op->Inputs()); - update_ref_cnts_from_non_compute_op(op, op->Outputs()); + if (!occurred_in_other_vars) { + VLOG(2) << "Shrink var " << var_name << " in op " << op_handle->Name(); + skip_ops.emplace_back(op_handle); + } } - std::vector> new_all_ops; - new_all_ops.reserve(compute_ref_cnt_map.size() + all_ops.size()); - for (auto &op : all_ops) { - new_all_ops.emplace_back(std::move(op)); - auto it = compute_ref_cnt_map.find(new_all_ops.back().get()); - if (it != compute_ref_cnt_map.end()) { - // Add LeafNode to ReferenceCountOpHandle - auto *dummy_leaf = new DummyVarHandle(graph->CreateControlDepVar()); - graph->Get(kGraphDepVars).emplace(dummy_leaf); - it->second->AddOutput(dummy_leaf); - new_all_ops.emplace_back(std::move(it->second)); + if (skip_ops.size() == op_handles->size()) { + op_handles->clear(); + return true; + } else { + for (auto *skip_op : skip_ops) { + op_handles->erase(skip_op); } + return false; } +} + +void ReferenceCountPass::ApplyImpl(ir::Graph *graph) const { + auto &ref_cnts = Get>(kGlobalReferenceCount); + auto &last_live_ops_of_vars = + Get>(kLastLiveOpsOfVars); + + PADDLE_ENFORCE(last_live_ops_of_vars.empty() && ref_cnts.empty(), + "Last Live Ops and Reference Counts of vars should be " + "initialized at here."); + + const auto &vars = graph->Get(kGraphVars); - all_ops.swap(new_all_ops); - return graph; + last_live_ops_of_vars.resize(vars.size()); + ref_cnts.resize(vars.size()); + + ShrinkDepsOpFunctor shrink_func( + ir::FilterByNodeWrapper(*graph)); + + for (size_t i = 0; i < vars.size(); ++i) { + for (auto &name_var_pair : vars[i]) { + // Whether this variable can be reused or deleted? If not, we do not + // compute reference counts and dependencies. + VarDesc *var_desc = TryGetLatestVarDesc(name_var_pair.second); + + if (var_desc == nullptr || var_desc->Persistable()) { + continue; + } + + auto var_type = var_desc->Proto()->type().type(); + if (var_type != proto::VarType::LOD_TENSOR && + var_type != proto::VarType::SELECTED_ROWS && + var_type != proto::VarType::LOD_TENSOR_ARRAY) { + // Var type cannot be deleted + continue; + } + + auto &var_name = name_var_pair.first; + auto &var_handles = name_var_pair.second; + + for (auto iter = var_handles.rbegin(); iter != var_handles.rend(); + ++iter) { + bool ok; + auto result = + ExtractComputationOpFromLastLivedVar(*iter, i, shrink_func, &ok); + + // Seldomly, some vars may have no pending or preceding computation ops + // Just break; + if (!ok) break; + VLOG(10) << "Extract " << result.size() << " ops of var " << var_name; + + size_t original_op_deps = result.size(); + // If all ops do not need buffer of var_name, calculate reference count + // of the previous version of var_name. + if (ShrinkNoNeedBufferVarOpDependency(var_name, &result)) { + VLOG(10) << "Try to precede reference count computing at var " + << var_name; + continue; + } + + size_t final_op_deps = result.size(); + if (final_op_deps < original_op_deps) { + VLOG(5) << "Shrink op deps from " << original_op_deps << " to " + << final_op_deps; + } + + PADDLE_ENFORCE(!result.empty(), "Last living ops of %s cannot be empty", + var_name); + ref_cnts[i].emplace(var_name, result.size()); + last_live_ops_of_vars[i].emplace(var_name, std::move(result)); + break; + } + + // Seldomly, all preceding trying failed. + // Just skip this corner case + } + } } } // namespace details @@ -199,5 +350,4 @@ std::unique_ptr ReferenceCountPass::ApplyImpl( REGISTER_PASS(reference_count_pass, paddle::framework::details::ReferenceCountPass) .RequirePassAttr(paddle::framework::details::kGlobalReferenceCount) - .RequirePassAttr(paddle::framework::details::kCurReferenceCount) - .RequirePassAttr(paddle::framework::details::kGarbageCollector); + .RequirePassAttr(paddle::framework::details::kLastLiveOpsOfVars); diff --git a/paddle/fluid/framework/details/reference_count_pass.h b/paddle/fluid/framework/details/reference_count_pass.h index 7081280b0600b9c1985987d02d679c298ad4b8bd..7bb01ee6161eda944006d8d3d0fe6e9f91befcee 100644 --- a/paddle/fluid/framework/details/reference_count_pass.h +++ b/paddle/fluid/framework/details/reference_count_pass.h @@ -14,7 +14,6 @@ #pragma once -#include "paddle/fluid/framework/details/reference_count_op_handle.h" #include "paddle/fluid/framework/ir/graph.h" #include "paddle/fluid/framework/ir/pass.h" @@ -22,14 +21,9 @@ namespace paddle { namespace framework { namespace details { -constexpr char kGlobalReferenceCount[] = "reference_count"; -constexpr char kCurReferenceCount[] = "current_reference_count"; -constexpr char kGarbageCollector[] = "garbage_collector"; - class ReferenceCountPass : public ir::Pass { protected: - std::unique_ptr ApplyImpl( - std::unique_ptr graph) const override; + void ApplyImpl(ir::Graph* graph) const override; }; } // namespace details diff --git a/paddle/fluid/framework/details/reference_count_pass_helper.cc b/paddle/fluid/framework/details/reference_count_pass_helper.cc new file mode 100644 index 0000000000000000000000000000000000000000..94de0e6ab0a91d90a7f2c4c4fc14eb78663c95fe --- /dev/null +++ b/paddle/fluid/framework/details/reference_count_pass_helper.cc @@ -0,0 +1,34 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/details/reference_count_pass_helper.h" +#include "paddle/fluid/framework/details/var_handle.h" +#include "paddle/fluid/framework/var_desc.h" + +namespace paddle { +namespace framework { +namespace details { + +VarDesc *TryGetLatestVarDesc(const std::vector &vars) { + VarDesc *var_desc = nullptr; + std::find_if(vars.rbegin(), vars.rend(), [&](VarHandle *var_handle) -> bool { + var_desc = var_handle->Node()->Var(); + return var_desc != nullptr; + }); + return var_desc; +} + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/reference_count_pass_helper.h b/paddle/fluid/framework/details/reference_count_pass_helper.h new file mode 100644 index 0000000000000000000000000000000000000000..ce700119c54ddd711315dfa45d61b9241cfda651 --- /dev/null +++ b/paddle/fluid/framework/details/reference_count_pass_helper.h @@ -0,0 +1,58 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include +#include +#include + +#include "paddle/fluid/framework/garbage_collector.h" + +namespace paddle { +namespace framework { + +class VarDesc; +class VarHandle; + +namespace details { + +class ComputationOpHandle; + +using ReferenceCountMap = std::unordered_map; + +using AtomicReferenceCountMap = + std::unordered_map>; + +using GarbageCollectorMap = + std::map>; + +const char kGlobalReferenceCount[] = "global_reference_count"; +const char kRuntimeReferenceCount[] = "runtime_reference_count"; +const char kGarbageCollector[] = "garbage_collector"; +const char kAllPlaces[] = "all_places"; + +using LastLiveOpsOfVars = + std::unordered_map>; +const char kLastLiveOpsOfVars[] = "last_live_ops_of_var"; + +VarDesc *TryGetLatestVarDesc(const std::vector &vars); + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/rpc_op_handle.cc b/paddle/fluid/framework/details/rpc_op_handle.cc index f44b374edb29228dff5a8bf003d945291f166d49..3e082f247adf7fe22db2b62802f0a87c9c93447a 100644 --- a/paddle/fluid/framework/details/rpc_op_handle.cc +++ b/paddle/fluid/framework/details/rpc_op_handle.cc @@ -29,22 +29,19 @@ RPCOpHandle::RPCOpHandle(ir::Node *node, const framework::OpDesc &op_desc, place_(place) {} void RPCOpHandle::RunImpl() { - // TODO(wuyi): need further analysis whether wait VarDummyHandle. - // Wait input done for (auto *in : inputs_) { - auto &p = static_cast(in)->place_; - // FIXME(Yancey1989): need a better solution instead of use DebugString() - if (ir::IsControlDepVar(*in->Node())) { // HACK + auto &p = static_cast(in)->place(); + if (ir::IsControlDepVar(*in->Node())) { continue; } if (in->GeneratedOp()) { - in->GeneratedOp()->RecordWaitEventOnCtx(dev_ctxes_[p]); + in->GeneratedOp()->RecordWaitEventOnCtx(dev_ctxes_.at(p)); } } - auto &tmp_scope = local_scope_->FindVar(kLocalExecScopeName)->Get(); - // FIXME(wuyi): can not use RunAndRecordEvent here, for it will cause dead - // lock. - op_->Run(*tmp_scope, place_); + this->RunAndRecordEvent([this] { + op_->Run(*local_scope_->FindVar(kLocalExecScopeName)->Get(), + place_); + }); } std::string RPCOpHandle::Name() const { return name_; } diff --git a/paddle/fluid/framework/details/scale_loss_grad_op_handle.cc b/paddle/fluid/framework/details/scale_loss_grad_op_handle.cc index ba243979b34aa1f683de707525403becaf0a1c00..6924549f36d6365534ab288257899a78107675cc 100644 --- a/paddle/fluid/framework/details/scale_loss_grad_op_handle.cc +++ b/paddle/fluid/framework/details/scale_loss_grad_op_handle.cc @@ -22,39 +22,66 @@ namespace details { ScaleLossGradOpHandle::ScaleLossGradOpHandle(ir::Node *node, size_t num_dev, Scope *scope, platform::Place place, - platform::DeviceContext *dev_ctx) + platform::DeviceContext *dev_ctx, + proto::VarType::Type dtype) : OpHandleBase(node), coeff_(static_cast(1.0 / num_dev)), scope_(scope), - place_(place) { - dev_ctxes_[place_] = dev_ctx; + place_(place), + out_dtype_(dtype) { + this->SetDeviceContext(place_, dev_ctx); } ScaleLossGradOpHandle::~ScaleLossGradOpHandle() {} +struct ScaleLossGradFunctor { + float coeff_; + Tensor *out_; + platform::Place place_; + OpHandleBase *op_handle_; + proto::VarType::Type out_dtype_; + platform::DeviceContext *ctx_; + + ScaleLossGradFunctor(float coeff, Tensor *out, platform::Place place, + OpHandleBase *op_handle, proto::VarType::Type dtype, + platform::DeviceContext *ctx) + : coeff_(coeff), out_(out), place_(place), out_dtype_(dtype), ctx_(ctx) {} + + template + void apply() const { + auto *out_data = out_->mutable_data(place_); + if (platform::is_cpu_place(place_)) { + *out_data = static_cast(coeff_); + } else { +#ifdef PADDLE_WITH_CUDA + OutT cast_coeff = static_cast(coeff_); + auto stream = static_cast(ctx_)->stream(); + memory::Copy(boost::get(place_), out_data, + platform::CPUPlace(), &cast_coeff, SizeOfType(out_dtype_), + stream); + VLOG(10) << place_ << "RUN Scale loss grad op"; + +#endif + } + } +}; + void ScaleLossGradOpHandle::RunImpl() { // Doesn't wait any event - std::string var_name = static_cast(this->outputs_[0])->name_; + std::string var_name = static_cast(this->outputs_[0])->name(); auto &local_scope = *scope_->FindVar(kLocalExecScopeName)->Get(); - float *tmp = local_scope.FindVar(var_name) - ->GetMutable() - ->mutable_data(make_ddim({1}), place_); + auto *tensor = local_scope.FindVar(var_name)->GetMutable(); + tensor->Resize(make_ddim({1})); - if (platform::is_cpu_place(place_)) { - *tmp = coeff_; - } else { #ifdef PADDLE_WITH_CUDA - this->RunAndRecordEvent([&] { - auto stream = - static_cast(this->dev_ctxes_[place_]) - ->stream(); - memory::Copy(boost::get(place_), tmp, - platform::CPUPlace(), &coeff_, sizeof(float), stream); - VLOG(10) << place_ << "RUN Scale loss grad op"; - }); + ScaleLossGradFunctor func(coeff_, tensor, place_, this, out_dtype_, + this->dev_ctxes_.at(place_)); + this->RunAndRecordEvent([&] { framework::VisitDataType(out_dtype_, func); }); +#else + ScaleLossGradFunctor func(coeff_, tensor, place_, this, out_dtype_, nullptr); + framework::VisitDataType(out_dtype_, func); #endif - } } std::string ScaleLossGradOpHandle::Name() const { return "Scale LossGrad"; } diff --git a/paddle/fluid/framework/details/scale_loss_grad_op_handle.h b/paddle/fluid/framework/details/scale_loss_grad_op_handle.h index 523b55724c82d4e2bef0520c10e5708c952a3ecc..8bedd1643eb9c5e591fa3c40995fcba08980b9fa 100644 --- a/paddle/fluid/framework/details/scale_loss_grad_op_handle.h +++ b/paddle/fluid/framework/details/scale_loss_grad_op_handle.h @@ -26,8 +26,8 @@ namespace details { struct ScaleLossGradOpHandle : public OpHandleBase { ScaleLossGradOpHandle(ir::Node *node, size_t num_dev, Scope *scope, - platform::Place place, - platform::DeviceContext *context); + platform::Place place, platform::DeviceContext *context, + proto::VarType::Type dtype); ~ScaleLossGradOpHandle() final; @@ -40,6 +40,7 @@ struct ScaleLossGradOpHandle : public OpHandleBase { float coeff_; Scope *scope_; platform::Place place_; + proto::VarType::Type out_dtype_; }; } // namespace details diff --git a/paddle/fluid/framework/details/scope_buffered_ssa_graph_executor.cc b/paddle/fluid/framework/details/scope_buffered_ssa_graph_executor.cc index e5b1eaa7318aecde1dbf89de8fe242a3008db97c..7b13112986f9ad85056a3e8a5a6ed99bd0be95d5 100644 --- a/paddle/fluid/framework/details/scope_buffered_ssa_graph_executor.cc +++ b/paddle/fluid/framework/details/scope_buffered_ssa_graph_executor.cc @@ -16,11 +16,8 @@ #include #include #include -#include "paddle/fluid/framework/executor.h" +#include "paddle/fluid/framework/variable_helper.h" #include "paddle/fluid/platform/profiler.h" -#ifdef PADDLE_WITH_CUDA -#include "paddle/fluid/framework/details/reference_count_op_handle.h" -#endif namespace paddle { namespace framework { @@ -59,43 +56,34 @@ FeedFetchList ScopeBufferedSSAGraphExecutor::Run( } } std::vector fetch_data; - std::exception_ptr eptr; + std::exception_ptr eptr = nullptr; try { fetch_data = underlying_executor_->Run(fetch_tensors); } catch (...) { eptr = std::current_exception(); } - platform::RecordEvent e("ScopeBufferedSSAGraphExecutorAfterRun", nullptr); - drop_scope_counter_ += 1; + platform::RecordEvent e("ScopeBufferedSSAGraphExecutorAfterRun"); + ++drop_scope_counter_; -#ifdef PADDLE_WITH_CUDA - const std::string gc_name = "garbage_collector"; - DeviceGarbageCollectorMap *gc = - Graph().Has(gc_name) ? &(Graph().Get(gc_name)) - : nullptr; -#endif + bool stream_end = false; + if (!fetch_tensors.empty()) { + WaitComputationalStreams(); + stream_end = true; + } - if (!fetch_tensors.empty() || - drop_scope_counter_ == strategy_.num_iteration_per_drop_scope_) { - drop_scope_counter_ = 0; - // Wait All computational streams - for (auto p : places_) { - platform::DeviceContextPool::Instance().Get(p)->Wait(); -#ifdef PADDLE_WITH_CUDA - if (gc != nullptr && platform::is_gpu_place(p)) { - auto gpu_place = boost::get(p); - auto &gc_at_place = gc->at(gpu_place.device); - gc_at_place->Wait(); - gc_at_place->Reset(); - } -#endif + if (drop_scope_counter_ == strategy_.num_iteration_per_drop_scope_) { + if (!stream_end) { + WaitComputationalStreams(); } + for (auto &scope : local_scopes_) { auto &local_scope = *scope->Var(details::kLocalExecScopeName)->GetMutable(); scope->DeleteScope(local_scope); } + + drop_scope_counter_ = 0; } if (eptr) { std::rethrow_exception(eptr); diff --git a/paddle/fluid/framework/details/scope_buffered_ssa_graph_executor.h b/paddle/fluid/framework/details/scope_buffered_ssa_graph_executor.h index 5e87e0bf50b51d2b630aba06a5907dd721754d1f..0f6340213daee98a75401f9db0e628f7b4fd79fc 100644 --- a/paddle/fluid/framework/details/scope_buffered_ssa_graph_executor.h +++ b/paddle/fluid/framework/details/scope_buffered_ssa_graph_executor.h @@ -47,6 +47,14 @@ class ScopeBufferedSSAGraphExecutor : public SSAGraphExecutor { FeedFetchList Run(const std::vector& fetch_tensors) override; + private: + inline void WaitComputationalStreams() { + // Wait All computational streams + for (auto p : places_) { + platform::DeviceContextPool::Instance().Get(p)->Wait(); + } + } + private: size_t drop_scope_counter_{0}; diff --git a/paddle/fluid/framework/details/sequential_execution_pass.cc b/paddle/fluid/framework/details/sequential_execution_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..839f8dc43ed8c6f13380732b221520b3bb59b099 --- /dev/null +++ b/paddle/fluid/framework/details/sequential_execution_pass.cc @@ -0,0 +1,108 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/details/sequential_execution_pass.h" +#include +#include +#include +#include +#include "paddle/fluid/framework/details/memory_optimize_helper.h" +#include "paddle/fluid/framework/op_proto_maker.h" + +namespace paddle { +namespace framework { +namespace details { + +static bool IsSameOpDesc(OpDesc *op1, OpDesc *op2) { + return op1->Type() == op2->Type() && op1->Inputs() == op2->Inputs() && + op1->Outputs() == op2->Outputs(); +} + +void SequentialExecutionPass::ApplyImpl(ir::Graph *graph) const { + // FIXME(zjl): Insert dependencies between some distributed ops may cause + // the multi_devices_graph_pass fails. So we skip these ops here. + // Indeed, maybe we should not insert dependencies between these ops + // casually, which may cause deadlock easily. + // We should add more skipped distributed ops when found errors in + // multi_devices_graph_pass + static std::unordered_set skip_dist_ops{ + "send", "recv", "send_barrier", "fetch_barrier"}; + + auto &ops = graph->Get>(kStaleProgramOpDescs); + std::vector op_node_list; + op_node_list.reserve(ops.size()); + + std::unordered_map op_deps; + std::unordered_map> pending_ops; + std::unordered_set ready_ops; + + for (ir::Node *node : graph->Nodes()) { + if (!node->IsOp()) continue; + std::unordered_set preceding_ops; + for (auto *in : node->inputs) { + PADDLE_ENFORCE(in->IsVar(), + "Preceding Node of Op Nodes must be Var Node"); + if (in->inputs.empty()) continue; + PADDLE_ENFORCE(in->inputs.size() == 1 && in->inputs[0]->IsOp(), + "Preceding Op Node of Var Node must be unique"); + preceding_ops.insert(in->inputs[0]); + pending_ops[in->inputs[0]].insert(node); + } + op_deps[node] = preceding_ops.size(); + if (preceding_ops.empty()) { + ready_ops.insert(node); + } + } + + for (auto *op_desc : ops) { + ir::Node *found_node = nullptr; + for (auto *node : ready_ops) { + if (IsSameOpDesc(op_desc, node->Op())) { + PADDLE_ENFORCE(found_node == nullptr, + "Found multiple op_desc in graph: %s", op_desc->Type()); + found_node = node; + } + } + + PADDLE_ENFORCE_NOT_NULL(found_node, "Cannot find op_desc in graph: %s", + op_desc->Type()); + for (auto *pending_op : pending_ops[found_node]) { + if (--op_deps.at(pending_op) == 0) { + ready_ops.insert(pending_op); + } + } + ready_ops.erase(found_node); + if (skip_dist_ops.count(op_desc->Type()) == 0) { + op_node_list.push_back(found_node); + } + } + + for (size_t i = 1; i < op_node_list.size(); ++i) { + auto *dep_var = graph->CreateControlDepVar(); + op_node_list[i]->inputs.push_back(dep_var); + op_node_list[i - 1]->outputs.push_back(dep_var); + dep_var->outputs.push_back(op_node_list[i]); + dep_var->inputs.push_back(op_node_list[i - 1]); + VLOG(10) << "Add dependencies between " << op_node_list[i - 1]->Name() + << " and " << op_node_list[i]->Name(); + } +} + +} // namespace details +} // namespace framework +} // namespace paddle + +REGISTER_PASS(sequential_execution_pass, + paddle::framework::details::SequentialExecutionPass) + .RequireGraphAttr(paddle::framework::details::kStaleProgramOpDescs); diff --git a/paddle/fluid/framework/details/sequential_execution_pass.h b/paddle/fluid/framework/details/sequential_execution_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..7d6a4f4cc55698d80a60333d2e8d528b4a3b1641 --- /dev/null +++ b/paddle/fluid/framework/details/sequential_execution_pass.h @@ -0,0 +1,31 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/pass.h" + +namespace paddle { +namespace framework { +namespace details { + +class SequentialExecutionPass : public ir::Pass { + protected: + void ApplyImpl(ir::Graph* graph) const override; +}; + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/ssa_graph_executor.cc b/paddle/fluid/framework/details/ssa_graph_executor.cc index 780da5478ff34ecd7096d0ef62b72bf1088dd221..af2cbd5c876fdd7c27cd679f7e9412d1b0604ecc 100644 --- a/paddle/fluid/framework/details/ssa_graph_executor.cc +++ b/paddle/fluid/framework/details/ssa_graph_executor.cc @@ -19,14 +19,16 @@ namespace framework { namespace details { SSAGraphExecutor::~SSAGraphExecutor() {} -void ClearFetchOp(ir::Graph* graph, - std::vector>* fetch_ops) { +void ClearFetchOp(ir::Graph* graph, std::vector* fetch_ops) { if (fetch_ops->empty()) return; for (auto& op : *fetch_ops) { for (auto& out_var : op->Node()->outputs) { graph->RemoveNode(out_var); } + for (auto& in_var : op->Inputs()) { + in_var->RemoveOutput(op, op->Node()); + } graph->RemoveNode(op->Node()); } fetch_ops->clear(); diff --git a/paddle/fluid/framework/details/ssa_graph_executor.h b/paddle/fluid/framework/details/ssa_graph_executor.h index d5cf7737d565c523995e6685b73c57e5a6f0197b..860eaa25b58e4579ad792ff18618de3b90707e8d 100644 --- a/paddle/fluid/framework/details/ssa_graph_executor.h +++ b/paddle/fluid/framework/details/ssa_graph_executor.h @@ -38,8 +38,7 @@ class SSAGraphExecutor { virtual FeedFetchList Run(const std::vector& fetch_tensors) = 0; }; -void ClearFetchOp(ir::Graph* graph, - std::vector>* fetch_ops); +void ClearFetchOp(ir::Graph* graph, std::vector* fetch_ops); } // namespace details } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/details/threaded_ssa_graph_executor.cc b/paddle/fluid/framework/details/threaded_ssa_graph_executor.cc index 31beef3ae829d72570ee7c879dac71ed600cd216..67246a4dd448b0ce2f115d6438c5fdd6cc39ca6d 100644 --- a/paddle/fluid/framework/details/threaded_ssa_graph_executor.cc +++ b/paddle/fluid/framework/details/threaded_ssa_graph_executor.cc @@ -14,7 +14,7 @@ #include "paddle/fluid/framework/details/threaded_ssa_graph_executor.h" -#include "paddle/fluid/framework/details/multi_devices_helper.h" +#include "paddle/fluid/framework/ir/graph_helper.h" #include "paddle/fluid/platform/profiler.h" namespace paddle { @@ -22,99 +22,91 @@ namespace framework { namespace details { ThreadedSSAGraphExecutor::ThreadedSSAGraphExecutor( const ExecutionStrategy &strategy, const std::vector &local_scopes, - const std::vector &places, - std::unique_ptr &&graph) - : graph_(std::move(graph)), - pool_(strategy.num_threads_ >= 2 ? new ::ThreadPool(strategy.num_threads_) - : nullptr), + const std::vector &places, ir::Graph *graph) + : graph_(graph), local_scopes_(local_scopes), places_(places), fetch_ctxs_(places), - running_ops_(0), - strategy_(strategy) {} + strategy_(strategy), + prepare_pool_(1), + pool_(strategy.num_threads_ >= 2 ? new ::ThreadPool(strategy.num_threads_) + : nullptr) { + if (strategy_.num_iteration_per_run_ > 1) { + int read_op_num = 0; + for (auto *node : graph_->Nodes()) { + if (node->IsOp() && node->Name() == "read") { + read_op_num++; + } + } + if (read_op_num == 0) { + LOG(WARNING) << "when num_iteration_per_run_ is larger then 1, the model " + "should use pyreader to feed data!"; + } + } + PrepareOpDeps(); + CopyOpDeps(); +} -FeedFetchList ThreadedSSAGraphExecutor::Run( +inline FeedFetchList ThreadedSSAGraphExecutor::RunImpl( const std::vector &fetch_tensors) { std::unique_ptr event( - new platform::RecordEvent("ThreadedSSAGraphExecutorPrepare", nullptr)); - std::unordered_map pending_ops; - std::unordered_set pending_vars; - BlockingQueue ready_vars; - std::unordered_set ready_ops; + new platform::RecordEvent("ThreadedSSAGraphExecutorPrepare")); + std::unique_ptr op_deps = op_deps_futures_.get(); + CopyOpDeps(); + VLOG(10) << "ThreadedSSAGraphExecutor::Run"; + std::shared_ptr> ready_vars( + new BlockingQueue); + auto &pending_ops = op_deps->pending_ops_; + auto &pending_vars = op_deps->pending_vars_; + auto &ready_ops = op_deps->ready_ops_; + // For ops (e.g. nccl_all_reduce) that need to coordinate multiple // streams from multiple GPUs, it's faster to buffer them and schedule // together since we currently cannot overlap computation and memcpy streams. // Should revisit it if overlapping is available. std::unordered_set delayed_ops; - // Transform SSAGraph to pending_ops & pending_vars - for (auto &var_map : graph_->Get(details::kGraphVars)) { - for (auto &name_pair : var_map) { - for (auto &version_pair : name_pair.second) { - InsertPendingVar(&pending_vars, &ready_vars, version_pair.get()); - } - } - } - for (auto &var : graph_->Get(details::kGraphDepVars)) { - InsertPendingVar(&pending_vars, &ready_vars, var.get()); - } - - for (auto &op : graph_->Get(details::kGraphOps)) { - if (op->Inputs().empty()) { // Special case, Op has no input. - ready_ops.insert(op.get()); - } else { - InsertPendingOp(&pending_ops, op.get()); - } - } - // Step 2. Insert FetchOps - std::vector> fetch_ops; - std::unordered_set> fetch_dependencies; + std::vector fetch_ops; + std::unordered_set fetch_dependencies; FeedFetchList fetch_data(fetch_tensors.size()); - InsertFetchOps(fetch_tensors, &fetch_ops, &fetch_dependencies, &pending_ops, - &pending_vars, &ready_vars, &fetch_data); + InsertFetchOps(fetch_tensors, &fetch_ops, &fetch_dependencies, &ready_ops, + &pending_ops, &pending_vars, &fetch_data); auto run_all_ops = [&](std::unordered_set &set) { for (auto *op : set) { - running_ops_++; - RunOp(&ready_vars, op); + RunOp(ready_vars, op); } set.clear(); }; - // Clean run context run_op_futures_.clear(); exception_holder_.Clear(); event.reset(nullptr); - // Step 3. Execution while (!pending_vars.empty()) { // 1. Run All Ready ops // Keep loop until all vars are ready. - // - // NOTE: DelayedOps have a lower priority. It will be scheduled after all - // ready_ops have been performed. - if (ready_ops.empty() && strategy_.allow_op_delay_ && running_ops_ == 0) { - run_all_ops(delayed_ops); - } else { - run_all_ops(ready_ops); - } + run_all_ops(ready_ops); // 2. Find ready variable bool timeout; - auto cur_ready_vars = ready_vars.PopAll(1, &timeout); - + auto cur_ready_vars = ready_vars->PopAll(1, &timeout); if (timeout) { if (exception_holder_.IsCaught()) { + VLOG(3) << "caught exception " << exception_holder_.Type() + << ", rethrow it"; for (auto &run_op_future : run_op_futures_) { run_op_future.wait(); } + ClearFetchOp(graph_, &fetch_ops); exception_holder_.ReThrow(); } else { continue; } } + // 3. Remove the dependency of ready_var. // Find the ready_ops after the ready_var. for (auto ready_var : cur_ready_vars) { @@ -123,37 +115,41 @@ FeedFetchList ThreadedSSAGraphExecutor::Run( auto &deps = pending_ops[op]; --deps; if (deps == 0) { - if (op->IsMultiDeviceTransfer() && strategy_.allow_op_delay_) { - delayed_ops.insert(op); - } else { - ready_ops.insert(op); - } + ready_ops.insert(op); } } } } PADDLE_ENFORCE(ready_ops.empty()); - // Wait FetchOps. - ClearFetchOp(graph_.get(), &fetch_ops); + ClearFetchOp(graph_, &fetch_ops); return fetch_data; } +FeedFetchList ThreadedSSAGraphExecutor::Run( + const std::vector &fetch_tensors) { + for (size_t j = 0; j < strategy_.num_iteration_per_run_ - 1; ++j) { + RunImpl({}); + } + return RunImpl(fetch_tensors); +} + void ThreadedSSAGraphExecutor::InsertFetchOps( const std::vector &fetch_tensors, - std::vector> *fetch_ops, - std::unordered_set> *fetch_dependencies, + std::vector *fetch_ops, + std::unordered_set *fetch_dependencies, + std::unordered_set *ready_ops, std::unordered_map *pending_ops, std::unordered_set *pending_vars, - BlockingQueue *ready_vars, FeedFetchList *fetch_data) { + FeedFetchList *fetch_data) { std::unordered_map> fetched_vars; - + std::unordered_set local_ready_vars; for (auto &fetch_var_name : fetch_tensors) { for (auto &var_map : graph_->Get(details::kGraphVars)) { auto it = var_map.find(fetch_var_name); if (it != var_map.end()) { - fetched_vars[fetch_var_name].push_back(it->second.rbegin()->get()); + fetched_vars[fetch_var_name].emplace_back(*it->second.rbegin()); } } } @@ -162,8 +158,9 @@ void ThreadedSSAGraphExecutor::InsertFetchOps( auto &var_name = fetch_tensors[i]; auto fetched_var_it = fetched_vars.find(var_name); PADDLE_ENFORCE(fetched_var_it != fetched_vars.end(), - "Cannot find fetched variable.(Perhaps the main_program " - "is not set to ParallelExecutor)"); + "Cannot find fetched variable(%s).(Perhaps the main_program " + "is not set to ParallelExecutor)", + var_name); auto &vars = fetched_var_it->second; @@ -185,9 +182,23 @@ void ThreadedSSAGraphExecutor::InsertFetchOps( auto *fetch_dummy = new DummyVarHandle(fetch_var); op->AddOutput(fetch_dummy); fetch_dependencies->emplace(fetch_dummy); - this->InsertPendingVar(pending_vars, ready_vars, fetch_dummy); - this->InsertPendingOp(pending_ops, op); + + this->InsertPendingVar(pending_vars, &local_ready_vars, fetch_dummy); + + size_t wait_input_num = 0; + std::unordered_set input_set(vars.begin(), vars.end()); + for (auto *var : input_set) { + if (pending_vars->count(var)) { + ++wait_input_num; + } + } + if (wait_input_num) { + pending_ops->insert({op, wait_input_num}); + } else { + ready_ops->insert(static_cast(op)); + } } + PADDLE_ENFORCE_EQ(local_ready_vars.size(), 0); } void ThreadedSSAGraphExecutor::InsertPendingOp( @@ -198,25 +209,79 @@ void ThreadedSSAGraphExecutor::InsertPendingOp( void ThreadedSSAGraphExecutor::InsertPendingVar( std::unordered_set *pending_vars, - BlockingQueue *ready_vars, VarHandleBase *var) const { + std::unordered_set *ready_vars, VarHandleBase *var) const { pending_vars->insert(var); if (var->GeneratedOp() == nullptr) { - ready_vars->Push(var); + ready_vars->insert(var); } } +void ThreadedSSAGraphExecutor::PrepareOpDeps() { + op_deps_.reset(new OpDependentData()); + std::unordered_map &pending_ops = + op_deps_->pending_ops_; + std::unordered_set &pending_vars = op_deps_->pending_vars_; + std::unordered_set &ready_ops = op_deps_->ready_ops_; + std::unordered_set ready_vars; + + // Transform SSAGraph to pending_ops & pending_vars + for (auto &var_map : graph_->Get(details::kGraphVars)) { + for (auto &name_pair : var_map) { + for (auto &version_pair : name_pair.second) { + InsertPendingVar(&pending_vars, &ready_vars, version_pair); + } + } + } + for (auto &var : graph_->Get(details::kGraphDepVars)) { + InsertPendingVar(&pending_vars, &ready_vars, var); + } + + for (auto &op : ir::FilterByNodeWrapper(*graph_)) { + if (op->Inputs().empty()) { // Special case, Op has no input. + ready_ops.insert(op); + } else { + InsertPendingOp(&pending_ops, op); + } + } + for (auto ready_var : ready_vars) { + pending_vars.erase(ready_var); + for (auto *op : ready_var->PendingOps()) { + auto &deps = pending_ops[op]; + --deps; + if (deps == 0) { + ready_ops.insert(op); + } + } + } +} + +void ThreadedSSAGraphExecutor::CopyOpDeps() { + op_deps_futures_ = prepare_pool_.enqueue([&] { + auto *op_deps = new OpDependentData(); + op_deps->pending_ops_.insert(op_deps_->pending_ops_.begin(), + op_deps_->pending_ops_.end()); + op_deps->pending_vars_.insert(op_deps_->pending_vars_.begin(), + op_deps_->pending_vars_.end()); + op_deps->ready_ops_.insert(op_deps_->ready_ops_.begin(), + op_deps_->ready_ops_.end()); + return std::unique_ptr(op_deps); + }); +} + void ThreadedSSAGraphExecutor::RunOp( - BlockingQueue *ready_var_q, details::OpHandleBase *op) { + const std::shared_ptr> &ready_var_q, + details::OpHandleBase *op) { auto op_run = [ready_var_q, op, this] { try { if (VLOG_IS_ON(10)) { VLOG(10) << op << " " << op->Name() << " : " << op->DebugString(); } - op->Run(strategy_.use_cuda_); + if (LIKELY(!strategy_.dry_run_)) { + op->Run(strategy_.use_cuda_); + } VLOG(10) << op << " " << op->Name() << " Done "; - running_ops_--; ready_var_q->Extend(op->Outputs()); - VLOG(10) << op << " " << op->Name() << "Signal posted"; + VLOG(10) << op << " " << op->Name() << " Signal posted"; } catch (...) { exception_holder_.Catch(std::current_exception()); } diff --git a/paddle/fluid/framework/details/threaded_ssa_graph_executor.h b/paddle/fluid/framework/details/threaded_ssa_graph_executor.h index 512f8a4ca5a9b82a395dde11722b8db44ea5ec27..8c026057b480fbc40b7b8f12d8e6b8e54195a141 100644 --- a/paddle/fluid/framework/details/threaded_ssa_graph_executor.h +++ b/paddle/fluid/framework/details/threaded_ssa_graph_executor.h @@ -15,18 +15,22 @@ #pragma once #include +#include #include +#include #include +#include #include #include #include -#include -#include "ThreadPool.h" // ThreadPool in thrird party +#include // ThreadPool in thrird party + #include "paddle/fluid/framework/blocking_queue.h" #include "paddle/fluid/framework/details/exception_holder.h" #include "paddle/fluid/framework/details/execution_strategy.h" #include "paddle/fluid/framework/details/fetch_op_handle.h" +#include "paddle/fluid/framework/details/multi_devices_helper.h" #include "paddle/fluid/framework/details/ssa_graph_executor.h" #include "paddle/fluid/framework/ir/graph.h" @@ -36,52 +40,64 @@ class Scope; namespace details { +struct OpDependentData { + std::unordered_map pending_ops_; + std::unordered_set pending_vars_; + std::unordered_set ready_ops_; +}; + class ThreadedSSAGraphExecutor : public SSAGraphExecutor { public: ThreadedSSAGraphExecutor(const ExecutionStrategy &strategy, const std::vector &local_scopes, const std::vector &places, - std::unique_ptr &&graph); + ir::Graph *graph); const ir::Graph &Graph() const override { return *graph_; } // Run a SSAGraph by a thread pool // Use topological sort algorithm FeedFetchList Run(const std::vector &fetch_tensors) override; - ~ThreadedSSAGraphExecutor() {} + ~ThreadedSSAGraphExecutor() final = default; private: - void RunOp(BlockingQueue *ready_var_q, + inline FeedFetchList RunImpl(const std::vector &fetch_tensors); + void RunOp(const std::shared_ptr> &ready_var_q, details::OpHandleBase *op); private: - std::unique_ptr graph_; - std::unique_ptr<::ThreadPool> pool_; + // Note(zcd): the ThreadPool should be placed last so that ThreadPool should + // be destroyed first. + ir::Graph *graph_; std::vector local_scopes_; std::vector places_; platform::DeviceContextPool fetch_ctxs_; ExceptionHolder exception_holder_; - std::atomic running_ops_; + std::unique_ptr op_deps_; + std::future> op_deps_futures_; + ExecutionStrategy strategy_; + // use std::list because clear(), push_back, and for_each are O(1) + std::list> run_op_futures_; + ::ThreadPool prepare_pool_; + std::unique_ptr<::ThreadPool> pool_; void InsertPendingOp(std::unordered_map *pending_ops, OpHandleBase *op_instance) const; void InsertPendingVar(std::unordered_set *pending_vars, - BlockingQueue *ready_vars, + std::unordered_set *ready_vars, VarHandleBase *var) const; - void InsertFetchOps( - const std::vector &fetch_tensors, - std::vector> *fetch_ops, - std::unordered_set> *fetch_dependencies, - std::unordered_map *pending_ops, - std::unordered_set *pending_vars, - BlockingQueue *ready_vars, FeedFetchList *fetch_data); + void InsertFetchOps(const std::vector &fetch_tensors, + std::vector *fetch_ops, + std::unordered_set *fetch_dependencies, + std::unordered_set *ready_ops, + std::unordered_map *pending_ops, + std::unordered_set *pending_vars, + FeedFetchList *fetch_data); - private: - ExecutionStrategy strategy_; - // use std::list because clear(), push_back, and for_each are O(1) - std::list> run_op_futures_; + void PrepareOpDeps(); + void CopyOpDeps(); }; } // namespace details diff --git a/paddle/fluid/framework/details/var_handle.cc b/paddle/fluid/framework/details/var_handle.cc index 5457870e9ff5d7cf67c9c7076b9aae94eeada779..95d62e66415e7879144d35f858ef04a8a936cd66 100644 --- a/paddle/fluid/framework/details/var_handle.cc +++ b/paddle/fluid/framework/details/var_handle.cc @@ -20,13 +20,20 @@ namespace details { VarHandleBase::~VarHandleBase() {} +VarHandle::~VarHandle() { VLOG(4) << "deleting var handle " << DebugString(); } + std::string VarHandle::DebugString() const { std::stringstream ss; - ss << name_ << ":" << place_; + ss << "name:" << name_ << ", place:" << place_ << ", version:" << version_ + << ", scope_idx:" << scope_idx_; return ss.str(); } std::string DummyVarHandle::DebugString() const { return node_->Name(); } + +DummyVarHandle::~DummyVarHandle() { + VLOG(4) << "deleting dummy var handle " << DebugString(); +} } // namespace details } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/details/var_handle.h b/paddle/fluid/framework/details/var_handle.h index d8c2bc40b9458a1d5a7dd8a32277d04f69295f09..93060ef2593cbc032a382b617f9690e392a15b63 100644 --- a/paddle/fluid/framework/details/var_handle.h +++ b/paddle/fluid/framework/details/var_handle.h @@ -35,11 +35,15 @@ class OpHandleBase; // A variable can only be generated by a single operator. i.e. // This is a single assignment graph. struct VarHandleBase { - explicit VarHandleBase(ir::Node* node) : node_(node) {} + // Owned by `node`. No need to be deleted explicitly. + explicit VarHandleBase(ir::Node* node) : node_(node) { + node_->WrappedBy(this); + } virtual ~VarHandleBase(); virtual std::string DebugString() const = 0; + virtual const std::string& Name() const = 0; void AddInput(OpHandleBase* in, ir::Node* node) { node_->inputs.clear(); @@ -49,6 +53,8 @@ struct VarHandleBase { void AddOutput(OpHandleBase* out, ir::Node* node) { if (pending_ops_.find(out) == pending_ops_.end()) { + PADDLE_ENFORCE(out != nullptr, "The output of %s should not be nullptr", + this->Node()->Name()); pending_ops_.insert(out); node_->outputs.push_back(node); } @@ -90,7 +96,7 @@ struct VarHandleBase { // // NOTE: runtime variables have place. struct VarHandle : public VarHandleBase { - explicit VarHandle(ir::Node* node) : VarHandleBase(node) {} + virtual ~VarHandle(); std::string DebugString() const override; @@ -102,24 +108,57 @@ struct VarHandle : public VarHandleBase { name_(std::move(name)), place_(std::move(place)) {} +#ifdef PADDLE_WITH_CUDA + bool HasEvent() { return has_event_; } + + const cudaEvent_t& GetEvent() { + PADDLE_ENFORCE(HasEvent(), "The event is not set."); + return event_; + } + + void SetGenerateEvent(const cudaEvent_t& event) { + has_event_ = true; + event_ = event; + } +#endif + // version field currently is not used, however, just store the version to // debug easily. + private: size_t version_; size_t scope_idx_; std::string name_; platform::Place place_; +#ifdef PADDLE_WITH_CUDA + // Only when this event is triggered, var is generated. + cudaEvent_t event_; + bool has_event_{false}; +#endif + public: bool IsTheSameVar(const VarHandle& o) const { return o.generated_op_ == generated_op_ && o.name_ == name_ && o.scope_idx_ == scope_idx_; } + + size_t version() const { return version_; } + size_t scope_idx() const { return scope_idx_; } + const std::string& Name() const override { return name_; } + const std::string& name() const { return name_; } + const platform::Place& place() const { return place_; } }; // Dummy Variable. It is used to represent dependencies between operators struct DummyVarHandle : public VarHandleBase { explicit DummyVarHandle(ir::Node* node) : VarHandleBase(node) {} + virtual ~DummyVarHandle(); + std::string DebugString() const override; + + public: + const std::string& Name() const override { return name_; } + std::string name_{"DummyVar"}; }; } // namespace details diff --git a/paddle/fluid/framework/details/variable_visitor.cc b/paddle/fluid/framework/details/variable_visitor.cc index 3dfd14419d94379a0bf79f55d7a139acd77cbd7e..134f759081a0778194c20785e215420d6e2bb622 100644 --- a/paddle/fluid/framework/details/variable_visitor.cc +++ b/paddle/fluid/framework/details/variable_visitor.cc @@ -24,7 +24,7 @@ static void VisitVariable(Variable* var, Func* func) { } else if (var->IsType()) { (*func)(var->GetMutable()); } else { - PADDLE_THROW("Not supported type %s", var->Type().name()); + PADDLE_THROW("Not supported type %s", ToTypeName(var->Type())); } } @@ -35,7 +35,7 @@ static void VisitVariable(const Variable& var, Func* func) { } else if (var.IsType()) { (*func)(var.Get()); } else { - PADDLE_THROW("Not supported type %s", var.Type().name()); + PADDLE_THROW("Not supported type %s", ToTypeName(var.Type())); } } diff --git a/paddle/fluid/framework/details/while_op_eager_deletion_pass.cc b/paddle/fluid/framework/details/while_op_eager_deletion_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..8f7c99f12a6338ad99d988d3eda3759e323f64bb --- /dev/null +++ b/paddle/fluid/framework/details/while_op_eager_deletion_pass.cc @@ -0,0 +1,60 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/details/computation_op_handle.h" +#include "paddle/fluid/framework/details/multi_devices_helper.h" +#include "paddle/fluid/framework/ir/graph_helper.h" +#include "paddle/fluid/operators/controlflow/while_op_helper.h" + +namespace paddle { +namespace framework { +namespace details { + +class WhileOpEagerDeletionPass : public ir::Pass { + protected: + void ApplyImpl(ir::Graph *graph) const override { + auto all_ops = ir::FilterByNodeWrapper(*graph); + + // Find all while_op and while_grad_op + std::unordered_map, + std::vector>> + target_ops; + for (auto *op : all_ops) { + auto compute_op = dynamic_cast(op); + if (compute_op == nullptr) continue; + + if (compute_op->Name() == "while") { + target_ops[compute_op->GetScopeIdx()].first.emplace_back( + compute_op->GetOp()); + } else if (compute_op->Name() == "while_grad") { + target_ops[compute_op->GetScopeIdx()].second.emplace_back( + compute_op->GetOp()); + } + } + + for (auto &ops_pair : target_ops) { + auto &while_ops = ops_pair.second.first; + auto &while_grad_ops = ops_pair.second.second; + operators::PrepareSafeEagerDeletionOnWhileOpAndWhileGradOp( + while_ops, while_grad_ops); + } + } +}; + +} // namespace details +} // namespace framework +} // namespace paddle + +REGISTER_PASS(while_op_eager_deletion_pass, + paddle::framework::details::WhileOpEagerDeletionPass); diff --git a/paddle/fluid/framework/device_worker.cc b/paddle/fluid/framework/device_worker.cc new file mode 100644 index 0000000000000000000000000000000000000000..443acf0a16303ef47d24b3013ed92929d0d7839e --- /dev/null +++ b/paddle/fluid/framework/device_worker.cc @@ -0,0 +1,27 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/device_worker.h" + +namespace paddle { +namespace framework { + +void DeviceWorker::SetRootScope(Scope* root_scope) { root_scope_ = root_scope; } + +void DeviceWorker::SetDataFeed(const std::shared_ptr& data_feed) { + device_reader_ = data_feed; +} + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/device_worker.h b/paddle/fluid/framework/device_worker.h new file mode 100644 index 0000000000000000000000000000000000000000..a7a8663ec3b1c436104f53b6db833bd26f6722f0 --- /dev/null +++ b/paddle/fluid/framework/device_worker.h @@ -0,0 +1,198 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include +#include // NOLINT +#include +#include // NOLINT +#include + +#include "paddle/fluid/framework/data_feed.h" +#include "paddle/fluid/framework/fleet/fleet_wrapper.h" +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/program_desc.h" +#include "paddle/fluid/framework/reader.h" +#include "paddle/fluid/framework/trainer_desc.pb.h" +#include "paddle/fluid/framework/variable_helper.h" +#include "paddle/fluid/operators/reader/blocking_queue.h" +#include "paddle/fluid/platform/place.h" +#include "paddle/fluid/platform/port.h" +#include "paddle/fluid/platform/timer.h" + +namespace paddle { +namespace framework { + +class PullDenseWorker { + public: + virtual ~PullDenseWorker() {} + virtual void Initialize(const TrainerDesc& param); + int Start(); + void Stop(); + void SetRootScope(Scope* scope) { root_scope_ = scope; } + void IncreaseThreadVersion(int thread_id, uint64_t table_id); + void ResetThreadVersion(uint64_t table_id); + void Wait(std::vector<::std::future>* status_vec); + static std::shared_ptr GetInstance() { + if (NULL == s_instance_) { + s_instance_.reset(new paddle::framework::PullDenseWorker()); + } + return s_instance_; + } + + private: + PullDenseWorker() : root_scope_(NULL) {} + void Run(); + bool CheckUpdateParam(uint64_t table_id); + + private: + static std::shared_ptr s_instance_; + std::shared_ptr fleet_ptr_; + PullDenseWorkerParameter param_; + DownpourWorkerParameter dwp_param_; + Scope* root_scope_; + bool running_; + + static std::map last_versions_; + static std::map current_version_; + static std::mutex mutex_for_version_; + static std::map> training_versions_; + static std::map> dense_value_names_; + + std::thread t_; + int thread_num_; + int sleep_time_ms_; + int threshold_; + + std::vector<::std::future> pull_dense_status_; + uint32_t pull_dense_fail_times_ = 0; + std::vector base_norm_param_; + std::vector mean_; + std::vector scale_; + float squared_sum_epsilon_ = 1e-4; + std::mutex mutex_for_mean_scale_; + float total_batch_num_ = 0; +}; + +// should incorporate different type of device +class DeviceWorker { + public: + DeviceWorker() {} + virtual ~DeviceWorker() {} + virtual void Initialize(const TrainerDesc& desc) = 0; + virtual void SetDeviceIndex(int tid) = 0; + virtual void TrainFiles() = 0; + virtual void PrintFetchVars() = 0; + virtual void TrainFilesWithProfiler() = 0; + virtual void CreateDeviceResource(const ProgramDesc& main_prog) = 0; + // will make this zero copy in the future + virtual void BindingDataFeedMemory() = 0; + virtual void SetRootScope(Scope* root_scope); + virtual void SetDataFeed(const std::shared_ptr& data_feed); + virtual void SetPlace(const paddle::platform::Place& place) { + place_ = place; + } + + protected: + Scope* root_scope_; + paddle::platform::Place place_; + std::shared_ptr device_reader_; + int64_t batch_num_; + FetchConfig fetch_config_; +}; + +class CPUWorkerBase : public DeviceWorker { + public: + CPUWorkerBase() {} + virtual ~CPUWorkerBase() {} + virtual void SetDeviceIndex(int tid) { thread_id_ = tid; } + virtual void TrainFiles() = 0; + virtual void TrainFilesWithProfiler() {} + virtual void PrintFetchVars() {} + virtual void CreateDeviceResource(const ProgramDesc& main_prog) {} + + protected: + int thread_id_; +}; + +class HogwildWorker : public CPUWorkerBase { + public: + HogwildWorker() {} + virtual ~HogwildWorker() {} + virtual void Initialize(const TrainerDesc& desc); + virtual void TrainFiles(); + virtual void TrainFilesWithProfiler(); + virtual void PrintFetchVars(); + virtual void CreateDeviceResource(const ProgramDesc& main_prog); + virtual void BindingDataFeedMemory(); + + protected: + void CreateThreadOperators(const ProgramDesc& program); + void CreateThreadScope(const ProgramDesc& program); + std::vector op_names_; + std::vector ops_; + Scope* thread_scope_; + HogwildWorkerParameter param_; + std::vector skip_ops_; +}; + +class DownpourWorker : public HogwildWorker { + public: + DownpourWorker() {} + virtual ~DownpourWorker() {} + virtual void Initialize(const TrainerDesc& desc); + virtual void TrainFiles(); + virtual void TrainFilesWithProfiler(); + + protected: + std::shared_ptr fleet_ptr_; + std::shared_ptr pull_dense_worker_; + void FillSparseValue(size_t table_id); + void PushGradients(); + void CollectLabelInfo(size_t table_id); + + private: + bool need_to_push_dense_; + bool need_to_push_sparse_; + DownpourWorkerParameter param_; + // just save the value in param_ for easy access + std::map label_var_name_; + std::map> sparse_key_names_; + std::map> sparse_value_names_; + std::map> sparse_grad_names_; + std::map> dense_value_names_; + std::map> dense_grad_names_; + + // feasign + std::map> features_; + // feasign stats + std::map> feature_labels_; + // feasign embedding + std::map>> feature_values_; + // feasign embedding gradient + std::map>> feature_grads_; + // skipped ops + std::vector skip_ops_; + + std::shared_ptr _pull_dense_worker; + std::vector<::std::future> push_sparse_status_; + std::vector<::std::future> push_dense_status_; +}; + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/device_worker_factory.cc b/paddle/fluid/framework/device_worker_factory.cc new file mode 100644 index 0000000000000000000000000000000000000000..2a7b368145c3b16873fc90a34fe5bb439d9806dd --- /dev/null +++ b/paddle/fluid/framework/device_worker_factory.cc @@ -0,0 +1,65 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/device_worker_factory.h" +#include +#include +#include + +namespace paddle { +namespace framework { + +typedef std::shared_ptr (*Createdevice_workerFunction)(); +typedef std::unordered_map + device_workerMap; +device_workerMap g_device_worker_map; +#define REGISTER_DEVICE_WORKER_CLASS(device_worker_class) \ + namespace { \ + std::shared_ptr Creator_##device_worker_class() { \ + return std::shared_ptr(new device_worker_class); \ + } \ + class __Registerer_##device_worker_class { \ + public: \ + __Registerer_##device_worker_class() { \ + g_device_worker_map[#device_worker_class] = \ + &Creator_##device_worker_class; \ + } \ + }; \ + __Registerer_##device_worker_class g_registerer_##device_worker_class; \ + } // namespace + +std::string DeviceWorkerFactory::DeviceWorkerTypeList() { + std::string device_worker_types; + for (auto iter = g_device_worker_map.begin(); + iter != g_device_worker_map.end(); ++iter) { + if (iter != g_device_worker_map.begin()) { + device_worker_types += ", "; + } + device_worker_types += iter->first; + } + return device_worker_types; +} + +std::shared_ptr DeviceWorkerFactory::CreateDeviceWorker( + std::string device_worker_class) { + if (g_device_worker_map.count(device_worker_class) < 1) { + exit(-1); + } + return g_device_worker_map[device_worker_class](); +} + +REGISTER_DEVICE_WORKER_CLASS(HogwildWorker); +REGISTER_DEVICE_WORKER_CLASS(DownpourWorker); +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/device_worker_factory.h b/paddle/fluid/framework/device_worker_factory.h new file mode 100644 index 0000000000000000000000000000000000000000..9d0613385e78c9f482840677c71f621e53ed85b5 --- /dev/null +++ b/paddle/fluid/framework/device_worker_factory.h @@ -0,0 +1,31 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include "paddle/fluid/framework/device_worker.h" + +namespace paddle { +namespace framework { + +class DeviceWorkerFactory { + public: + static std::string DeviceWorkerTypeList(); + static std::shared_ptr CreateDeviceWorker( + std::string device_worker_class); +}; +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/device_worker_test.cc b/paddle/fluid/framework/device_worker_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..faa648ab35d2b4d7a553344c2261d2aa07d0829a --- /dev/null +++ b/paddle/fluid/framework/device_worker_test.cc @@ -0,0 +1,24 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include "paddle/fluid/framework/trainer.h" + +namespace paddle { +namespace framework { +TEST() { + // create hogwild device worker +} +} +} diff --git a/paddle/fluid/framework/dim.h b/paddle/fluid/framework/dim.h index 73f92fa389fa3a66a14ae60b8dbfbcae80485658..88aee8379d835ce88b6b348aca99eb4a35bbeb5c 100644 --- a/paddle/fluid/framework/dim.h +++ b/paddle/fluid/framework/dim.h @@ -16,332 +16,184 @@ #include #include #include +#include #include +#include "paddle/fluid/framework/array.h" #include "paddle/fluid/platform/assert.h" +#include "paddle/fluid/platform/enforce.h" #include "paddle/fluid/platform/hostdevice.h" namespace paddle { namespace framework { // Statically sized, statically indexed dimension -template -struct Dim { - static constexpr int dimensions = i; +template +class Dim : public Array { + public: + static_assert(D >= 0, "D must be not less than 0"); - template - HOSTDEVICE Dim(int64_t _head, Args... _tail) : head(_head), tail(_tail...) { - static_assert(sizeof...(_tail) == i - 1, - "Dim initialized with the wrong number of parameters"); - } + static constexpr int kRank = D; + using BaseClass = Array; - HOSTDEVICE - Dim(int64_t _head, const Dim& _tail) : head(_head), tail(_tail) {} + inline Dim(int64_t head, const Dim& tail) { + (*this)[0] = head; + new (this->GetMutable() + 1) Dim(tail); + } - HOSTDEVICE - Dim() : head(0), tail() {} + template + HOSTDEVICE explicit Dim(int64_t head, Args... args) + : BaseClass(head, args...) {} /** Construct a Dim from a linear index and size. Uses Fortran order * indexing. */ - HOSTDEVICE - Dim(int64_t idx, const Dim& size) - : head(idx % size.head), tail(idx / size.head, size.tail) {} + HOSTDEVICE Dim(int64_t idx, const Dim& size); /** Construct a Dim with each dimension set to the given index */ - HOSTDEVICE - Dim(int64_t idx) : head(idx), tail(idx) {} + HOSTDEVICE explicit Dim(int64_t idx) { this->Fill(idx); } - HOSTDEVICE - bool operator==(const Dim& o) const { - return (head == o.head) && (tail == o.tail); - } - - HOSTDEVICE - bool operator!=(const Dim& o) const { return !(*this == o); } - - HOSTDEVICE - int64_t& operator[](int idx); - HOSTDEVICE - int64_t operator[](int idx) const; + HOSTDEVICE Dim() = default; HOST std::string to_string() const; - - int64_t head; - Dim tail; }; -// Base case specialization -template <> -struct Dim<0> { - static constexpr int dimensions = 0; - - HOSTDEVICE - Dim(int64_t _head) {} - - HOSTDEVICE - Dim() {} - - HOSTDEVICE - Dim(int idx, const Dim<0>& size) { -#ifndef __CUDA_ARCH__ - if (idx > 0) { - throw std::invalid_argument("Index out of range."); - } -#else - PADDLE_ASSERT(idx == 0); -#endif - } - - HOSTDEVICE - bool operator==(const Dim<0>& o) const { return true; } - - HOSTDEVICE - bool operator!=(const Dim<0>& o) const { return false; } - - HOSTDEVICE - int64_t& operator[](int idx); - HOSTDEVICE - int64_t operator[](int idx) const; -}; - -namespace { - -// Helper for accessing Dim classes -template -struct DimGetter { - // Return a copy if Dim is const - template - HOSTDEVICE static int64_t impl(const D& d) { - return DimGetter::impl(d.tail); - } - // Return a reference if Dim is mutable - template - HOSTDEVICE static int64_t& impl(D& d) { - return DimGetter::impl(d.tail); +namespace detail { +template +struct FortranOrderIndexingConstructorFunctor { + HOSTDEVICE inline static void Run(const int64_t* in, int64_t* idx, + int64_t* out) { + out[kStart] = (*idx) % in[kStart]; + (*idx) /= in[kStart]; + FortranOrderIndexingConstructorFunctor::Run(in, idx, + out); } }; -// Eureka! We found the element! -template <> -struct DimGetter<0> { - // Return a copy if Dim is const - template - HOSTDEVICE static int64_t impl(const D& d) { - return d.head; - } - // Return a reference if Dim is mutable - template - HOSTDEVICE static int64_t& impl(D& d) { - return d.head; - } +template +struct FortranOrderIndexingConstructorFunctor { + HOSTDEVICE inline static void Run(const int64_t* in, int64_t* idx, + int64_t* out) {} }; +} // namespace detail template -HOSTDEVICE int64_t& indexer(Dim& dim, int idx) { -#ifndef __CUDA_ARCH__ - if (idx < 0) { - throw std::invalid_argument("Tried to access a negative dimension"); - } -#else - PADDLE_ASSERT(idx >= 0); -#endif - if (idx == 0) { - return dim.head; - } - return indexer(dim.tail, idx - 1); -} - -template <> -HOSTDEVICE int64_t& indexer<0>(Dim<0>& dim, int idx) { -#ifndef __CUDA_ARCH__ - throw std::invalid_argument("Invalid index"); -#else - PADDLE_ASSERT(false); -#if CUDA_VERSION < 8000 - // On CUDA versions previous to 8.0, only __shared__ variables - // could be declared as static in the device code. - int64_t head = 0; -#else - static int64_t head = 0; -#endif - return head; -#endif -} - -template -HOSTDEVICE int64_t indexer(const Dim& dim, int idx) { -#ifndef __CUDA_ARCH__ - if (idx < 0) { - throw std::invalid_argument("Tried to access a negative dimension"); - } -#else - PADDLE_ASSERT(idx >= 0); -#endif - if (idx == 0) { - return dim.head; - } - return indexer(dim.tail, idx - 1); -} - -template <> -HOSTDEVICE int64_t indexer<0>(const Dim<0>& dim, int idx) { -#ifndef __CUDA_ARCH__ - throw std::invalid_argument("Invalid index"); -#else - PADDLE_ASSERT(false); -#if CUDA_VERSION < 8000 - // On CUDA versions previous to 8.0, only __shared__ variables - // could be declared as static in the device code. - int64_t head = 0; -#else - static int64_t head = 0; -#endif - return head; -#endif -} - -} // namespace -// Static access to constant Dim -template -HOSTDEVICE int64_t get(const Dim& d) { - return DimGetter::impl(d); +HOSTDEVICE Dim::Dim(int64_t idx, const Dim& size) { + detail::FortranOrderIndexingConstructorFunctor<0, D, D == 0>::Run( + size.Get(), &idx, this->GetMutable()); } -// Static access to mutable Dim -template -HOSTDEVICE int64_t& get(Dim& d) { - return DimGetter::impl(d); +template +HOSTDEVICE inline int64_t get(const Dim& dim) { + return dim[idx]; } -// Dynamic access to constant Dim -template -HOSTDEVICE int64_t Dim::operator[](int i) const { - return indexer(*this, i); +template +HOSTDEVICE inline int64_t& get(Dim& dim) { // NOLINT + return dim[idx]; } -// Dynamic access to mutable Dim -template -HOSTDEVICE int64_t& Dim::operator[](int i) { - return indexer(*this, i); -} - -// Dynamic access to constant Dim -inline HOSTDEVICE int64_t Dim<0>::operator[](int i) const { - return indexer(*this, i); -} - -// Dynamic access to mutable Dim -inline HOSTDEVICE int64_t& Dim<0>::operator[](int i) { - return indexer(*this, i); -} - -// Dynamic access to constant Dim -// without std::enable_if will try to instantiate this on get<0>(d) -template -HOSTDEVICE typename std::enable_if<(l > 0), int64_t>::type get(const Dim& d, - int i) { - return d[i]; +template +HOSTDEVICE inline int64_t get(const Dim& dim, int idx) { + return dim[idx]; } -// Dynamic access to mutable Dim -template -HOSTDEVICE typename std::enable_if<(l > 0), int64_t&>::type get(Dim& d, - int i) { - return d[i]; +template +HOSTDEVICE inline int64_t& get(Dim& dim, int idx) { // NOLINT + return dim[idx]; } // Dot product of two dims -template -HOSTDEVICE int64_t linearize(const Dim& a, const Dim& b) { - return a.head * b.head + linearize(a.tail, b.tail); -} - -// Base case dot product of two Dims -// Notice it is inline because it is no longer a template -template <> -HOSTDEVICE inline int64_t linearize(const Dim<0>& a, const Dim<0>& b) { - return 0; +template +HOSTDEVICE inline int64_t linearize(const Dim& a, const Dim& b) { + return UnrollProduct::Run(a.Get(), b.Get()); } // Product of a Dim -template -HOSTDEVICE int64_t product(const Dim& a, int prod = 1) { - return prod * a.head * product(a.tail); -} - -// Base case product of a Dim -// Notice it is inline because it is no longer a template -template <> -HOSTDEVICE inline int64_t product(const Dim<0>& a, int prod) { - return prod; +template +HOSTDEVICE inline int64_t product(const Dim& a) { + return UnrollProduct::Run(a.Get()); } // Is 0 <= idx_i < size_i for all i? -template -HOSTDEVICE bool contained(const Dim& idx, const Dim& size) { - return ((0 <= idx.head) && (idx.head < size.head) && - contained(idx.tail, size.tail)); -} +namespace detail { +template +struct ContainedFunctor { + HOSTDEVICE static inline bool Run(const int64_t* idx, const int64_t* size) { + return (idx[kStart] >= 0 && idx[kStart] < size[kStart]) && + ContainedFunctor::Run(idx, + size); + } +}; -// Base case of is 0 <= idx_i < size_i ? -// Notice it is inline because it is no longer a template -template <> -HOSTDEVICE inline bool contained(const Dim<0>& idx, const Dim<0>& size) { - return true; +template +struct ContainedFunctor { + HOSTDEVICE static constexpr inline bool Run(const int64_t* idx, + const int64_t* size) { + return true; + } +}; +} // namespace detail + +template +HOSTDEVICE inline bool contained(const Dim& idx, const Dim& size) { + return detail::ContainedFunctor<0, D, D == 0>::Run(idx.Get(), size.Get()); } /** * \brief Compute exclusive prefix-multiply of a Dim. */ -template -HOSTDEVICE Dim ex_prefix_mul(const Dim& src, int mul = 1) { - return Dim(mul, ex_prefix_mul(src.tail, mul * src.head)); -} +namespace detail { +template +struct ExPrefixMulFunctor { + HOSTDEVICE static inline void Run(const int64_t* in, int64_t* out) { + kStart == 0 ? out[kStart] = 1 : out[kStart] = + out[kStart - 1] * in[kStart - 1]; + detail::ExPrefixMulFunctor::Run(in, + out); + } +}; + +template +struct ExPrefixMulFunctor { + HOSTDEVICE static inline void Run(const int64_t* in, int64_t* out) {} +}; +} // namespace detail -///\cond HIDDEN -// Base case of ex_prefix_mul -// Notice it is inline because it is no longer a template -template <> -HOSTDEVICE inline Dim<0> ex_prefix_mul(const Dim<0>& src, int mul) { - return Dim<0>(); +template +HOSTDEVICE inline Dim ex_prefix_mul(const Dim& src) { + Dim ret; + detail::ExPrefixMulFunctor<0, D, D == 0>::Run(src.Get(), ret.GetMutable()); + return ret; } -///\endcond /** * Add two dimensions together */ -template -HOSTDEVICE Dim dim_plus(const Dim& a, const Dim& b) { - return Dim(a.head + b.head, dim_plus(a.tail, b.tail)); -} - -// Base case -template <> -HOSTDEVICE inline Dim<0> dim_plus(const Dim<0>& a, const Dim<0>& b) { - return Dim<0>(); +template +HOSTDEVICE inline Dim dim_plus(const Dim& a, const Dim& b) { + Dim ret; + UnrollAdd::Run(a.Get(), b.Get(), ret.GetMutable()); + return ret; } -template -HOSTDEVICE Dim operator+(const Dim& lhs, const Dim& rhs) { +template +HOSTDEVICE inline Dim operator+(const Dim& lhs, const Dim& rhs) { return dim_plus(lhs, rhs); } /** * Multiply two dimensions together */ -template -HOSTDEVICE Dim dim_mult(const Dim& a, const Dim& b) { - return Dim(a.head * b.head, dim_mult(a.tail, b.tail)); -} - -// Base case -template <> -HOSTDEVICE inline Dim<0> dim_mult(const Dim<0>& a, const Dim<0>& b) { - return Dim<0>(); +template +HOSTDEVICE inline Dim dim_mult(const Dim& a, const Dim& b) { + Dim ret; + UnrollMul::Run(a.Get(), b.Get(), ret.GetMutable()); + return ret; } -template -HOSTDEVICE Dim operator*(const Dim& lhs, const Dim& rhs) { +template +HOSTDEVICE Dim operator*(const Dim& lhs, const Dim& rhs) { return dim_mult(lhs, rhs); } @@ -354,23 +206,32 @@ HOSTDEVICE Dim operator*(const Dim& lhs, const Dim& rhs) { * \return Dim object the same size as \p size with normalized strides * */ +namespace detail { +template +struct NormalizeStridesFunctor { + HOSTDEVICE static void Run(const int64_t* size, const int64_t* stride, + int64_t* ret) { + ret[kStart] = (size[kStart] == 1 ? 0 : stride[kStart]); + NormalizeStridesFunctor::Run( + size, stride, ret); + } +}; -template -HOSTDEVICE Dim normalize_strides(const Dim& size, const Dim& stride) { - int norm_stride = size.head == 1 ? 0 : stride.head; - return Dim(norm_stride, normalize_strides(size.tail, stride.tail)); -} - -///\cond HIDDEN +template +struct NormalizeStridesFunctor { + HOSTDEVICE static void Run(const int64_t* size, const int64_t* stride, + int64_t* ret) {} +}; +} // namespace detail -template <> -HOSTDEVICE inline Dim<0> normalize_strides(const Dim<0>& size, - const Dim<0>& stride) { - return Dim<0>(); +template +HOSTDEVICE Dim normalize_strides(const Dim& size, const Dim& stride) { + Dim ret; + detail::NormalizeStridesFunctor<0, D, D == 0>::Run(size.Get(), stride.Get(), + ret.GetMutable()); + return ret; } -///\endcond - /** * Helper function to create a Dim * @@ -379,25 +240,17 @@ HOSTDEVICE inline Dim<0> normalize_strides(const Dim<0>& size, */ template -HOSTDEVICE Dim make_dim(Args... idxes) { +HOSTDEVICE inline Dim make_dim(Args... idxes) { return Dim(idxes...); } // Allows us to output a Dim -// XXX For some reason, overloading fails to resolve this correctly -template -typename std::enable_if<(i > 1), std::ostream&>::type operator<<( - std::ostream& os, const Dim& d) { - os << d.head << ", " << d.tail; - return os; -} - -// Base case that allows us to output a Dim -// XXX I wish this could be an overload instead of a template -template -typename std::enable_if<(i == 1), std::ostream&>::type operator<<( - std::ostream& os, const Dim& d) { - os << d.head; +template +inline std::ostream& operator<<(std::ostream& os, const Dim& d) { + os << d[0]; + for (int i = 1; i < D; ++i) { + os << ", " << d[i]; + } return os; } @@ -405,17 +258,15 @@ inline std::ostream& operator<<(std::ostream& os, const Dim<0>& d) { return os; } -template -HOST std::string Dim::to_string() const { +template +HOST std::string Dim::to_string() const { std::stringstream stream; - stream << *this; - return stream.str(); } template -HOSTDEVICE Dim linear_to_dimension(int linear_index, Dim extents) { +HOSTDEVICE Dim linear_to_dimension(int linear_index, const Dim& extents) { Dim result; for (int i = 0; i < D - 1; ++i) { @@ -428,5 +279,10 @@ HOSTDEVICE Dim linear_to_dimension(int linear_index, Dim extents) { return result; } +template +inline void static_dim_assign(const T1* in, T2* out) { + UnrollAssign::Run(in, out); +} + } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/dist_multi_trainer.cc b/paddle/fluid/framework/dist_multi_trainer.cc new file mode 100644 index 0000000000000000000000000000000000000000..481e12fcd63e77b6d42143f93df69c0f6abe7f25 --- /dev/null +++ b/paddle/fluid/framework/dist_multi_trainer.cc @@ -0,0 +1,80 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include "paddle/fluid/framework/data_feed_factory.h" +#include "paddle/fluid/framework/data_set.h" +#include "paddle/fluid/framework/device_worker_factory.h" +#include "paddle/fluid/framework/trainer.h" + +namespace paddle { +namespace framework { + +void DistMultiTrainer::Initialize(const TrainerDesc& trainer_desc, + Dataset* dataset) { + thread_num_ = trainer_desc.thread_num(); + SetDataset(dataset); + + dataset->CreateReaders(); + const std::vector> readers = + dataset->GetReaders(); + + thread_num_ = readers.size(); + workers_.resize(thread_num_); + + for (int i = 0; i < thread_num_; ++i) { + workers_[i] = DeviceWorkerFactory::CreateDeviceWorker( + trainer_desc.device_worker_name()); + workers_[i]->SetDeviceIndex(i); + workers_[i]->SetDataFeed(readers[i]); + workers_[i]->Initialize(trainer_desc); + } + + VLOG(3) << "going to initialize pull dense worker"; + pull_dense_worker_ = PullDenseWorker::GetInstance(); + pull_dense_worker_->Initialize(trainer_desc); + VLOG(3) << "initialize pull dense worker"; + SetDebug(trainer_desc.debug()); +} + +void DistMultiTrainer::InitOtherEnv(const ProgramDesc& main_program) { + pull_dense_worker_->SetRootScope(root_scope_); + pull_dense_worker_->Start(); + VLOG(3) << "init other env done."; +} + +void DistMultiTrainer::Run() { + for (int thidx = 0; thidx < thread_num_; ++thidx) { + if (!debug_) { + threads_.push_back( + std::thread(&DeviceWorker::TrainFiles, workers_[thidx].get())); + } else { + threads_.push_back(std::thread(&DeviceWorker::TrainFilesWithProfiler, + workers_[thidx].get())); + } + } +} + +void DistMultiTrainer::Finalize() { + for (auto& th : threads_) { + th.join(); + } + pull_dense_worker_->Stop(); + dataset_ptr_->DestroyReaders(); + root_scope_->DropKids(); +} + +} // end namespace framework +} // end namespace paddle diff --git a/paddle/fluid/framework/dlpack_tensor.cc b/paddle/fluid/framework/dlpack_tensor.cc new file mode 100644 index 0000000000000000000000000000000000000000..39652706c43fb51da99170b361b3e1a6e04c6fc9 --- /dev/null +++ b/paddle/fluid/framework/dlpack_tensor.cc @@ -0,0 +1,124 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/dlpack_tensor.h" +#include "paddle/fluid/framework/data_type.h" +namespace paddle { +namespace framework { + +namespace internal { +template +static ::DLDataType GetDLDataTypeCode() { + ::DLDataType dtype; + if (std::is_same::value || + std::is_floating_point::value) { + dtype.code = kDLFloat; + } else if (std::is_unsigned::value) { + dtype.code = kDLUInt; + } else if (std::is_integral::value) { + dtype.code = kDLInt; + } else { + PADDLE_THROW("Unsupported data type %s", typeid(T).name()); + } + dtype.bits = 8 * sizeof(T); + dtype.lanes = 1; + return dtype; +} + +static std::unordered_map CreateDLDataTypeMap() { + static std::unordered_map result; + +#define REG_DL_DATA_TYPE(cpp_type, proto_type) \ + result[static_cast(proto_type)] = GetDLDataTypeCode() + + _ForEachDataType_(REG_DL_DATA_TYPE); +#undef REG_DL_DATA_TYPE + return result; +} + +static DLDataType GetDLDataTypeFromTypeIndex(proto::VarType::Type type) { + static auto type_to_dtype_map = CreateDLDataTypeMap(); + static auto type_to_dtype_map_end_it = type_to_dtype_map.end(); + auto it = type_to_dtype_map.find(static_cast(type)); + PADDLE_ENFORCE(it != type_to_dtype_map_end_it, "Unsupported data type %d", + type); + return it->second; +#undef REG_DL_DATA_TYPE +} + +struct DLContextVisitor : public boost::static_visitor<::DLContext> { + inline ::DLContext operator()(const platform::CPUPlace &place) const { + ::DLContext ctx; + ctx.device_type = kDLCPU; + ctx.device_id = 0; + return ctx; + } + + inline ::DLContext operator()(const platform::CUDAPlace &place) const { +#ifdef PADDLE_WITH_CUDA + ::DLContext ctx; + ctx.device_type = kDLGPU; + ctx.device_id = place.device; + return ctx; +#else + PADDLE_THROW("platform::CUDAPlace is not supported in CPU only version"); +#endif + } + + inline ::DLContext operator()(const platform::CUDAPinnedPlace &place) const { +#ifdef PADDLE_WITH_CUDA + ::DLContext ctx; + ctx.device_type = kDLCPUPinned; + ctx.device_id = 0; + return ctx; +#else + PADDLE_THROW( + "platform::CUDAPinnedPlace is not supported in CPU only version"); +#endif + } +}; +} // namespace internal + +DLPackTensor::DLPackTensor(const Tensor &tensor, LaneType lanes) { + // init data, data buffer + t_.data = const_cast(tensor.data()); + + // init ctx, DLContext type with device_type and device_id + auto place = tensor.place(); + t_.ctx = boost::apply_visitor(internal::DLContextVisitor(), place); + + // init dtype + t_.dtype = internal::GetDLDataTypeFromTypeIndex(tensor.type()); + t_.dtype.lanes = lanes; + + // init ndim, tensor rank + auto &dims = tensor.dims(); + using DimType = decltype(t_.ndim); // int + t_.ndim = static_cast(dims.size()); + + // init shape, tensor dims + t_.shape = shape_; + for (DimType i = 0; i < t_.ndim; ++i) { + t_.shape[i] = dims[i]; + } + + // init strides, nullptr means the tensor is compact + t_.strides = nullptr; + + // init byte_offset + t_.byte_offset = 0; +} + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/dlpack_tensor.h b/paddle/fluid/framework/dlpack_tensor.h new file mode 100644 index 0000000000000000000000000000000000000000..e48b0d5c88fecf797a61283b004735fdcbabb329 --- /dev/null +++ b/paddle/fluid/framework/dlpack_tensor.h @@ -0,0 +1,45 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include "paddle/fluid/framework/tensor.h" + +namespace paddle { +namespace framework { + +class DLPackTensor { + public: + using LaneType = decltype(::DLTensor::dtype.lanes); // uint16_t + using ShapeType = + std::remove_reference::type; // int64_t + + // lanes is only used in CPU to enable vectorization + explicit DLPackTensor(const Tensor& tensor, LaneType lanes = 1); + + inline operator const ::DLTensor&() const { return t_; } + + inline operator ::DLTensor&() { return t_; } + + private: + ::DLTensor t_; + + // The shape in DLTensor is defined as int64_t* + // Add this member to make TVMTensor init without heap allocation + ShapeType shape_[DDim::kMaxRank]; +}; + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/dlpack_tensor_test.cc b/paddle/fluid/framework/dlpack_tensor_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..c0a8e1bcdfa3a54aea061f1a0815fc1405c76d9c --- /dev/null +++ b/paddle/fluid/framework/dlpack_tensor_test.cc @@ -0,0 +1,101 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/dlpack_tensor.h" +#include +#include +#include + +namespace paddle { +namespace framework { + +namespace { // NOLINT +template +constexpr uint8_t GetDLDataTypeCode() { + return std::is_same::value || + std::is_floating_point::value + ? static_cast(kDLFloat) + : (std::is_unsigned::value + ? static_cast(kDLUInt) + : (std::is_integral::value ? static_cast(kDLInt) + : static_cast(-1))); +} +} // NOLINT + +template +void TestMain(const platform::Place &place, uint16_t lanes) { + DDim dims{4, 5, 6, 7}; + Tensor tensor; + tensor.Resize(dims); + void *p = tensor.mutable_data(place); + + DLPackTensor dlpack_tensor(tensor, lanes); + ::DLTensor &dl_tensor = dlpack_tensor; + + CHECK_EQ(p, dl_tensor.data); + if (platform::is_cpu_place(place)) { + CHECK_EQ(kDLCPU, dl_tensor.ctx.device_type); + CHECK_EQ(0, dl_tensor.ctx.device_id); + } else if (platform::is_gpu_place(place)) { + CHECK_EQ(kDLGPU, dl_tensor.ctx.device_type); + CHECK_EQ(boost::get(place).device, + dl_tensor.ctx.device_id); + } else if (platform::is_cuda_pinned_place(place)) { + CHECK_EQ(kDLCPUPinned, dl_tensor.ctx.device_type); + CHECK_EQ(0, dl_tensor.ctx.device_id); + } else { + CHECK_EQ(false, true); + } + + CHECK_EQ(dims.size(), dl_tensor.ndim); + for (auto i = 0; i < dims.size(); ++i) { + CHECK_EQ(dims[i], dl_tensor.shape[i]); + } + + CHECK_EQ(dl_tensor.strides == nullptr, true); + CHECK_EQ(static_cast(0), dl_tensor.byte_offset); + + CHECK_EQ(lanes, dl_tensor.dtype.lanes); + CHECK_EQ(sizeof(T) * 8, dl_tensor.dtype.bits); + + CHECK_EQ(GetDLDataTypeCode(), dl_tensor.dtype.code); +} + +template +void TestMainLoop() { +#ifdef PADDLE_WITH_CUDA + std::vector places{platform::CPUPlace(), + platform::CUDAPlace(0), + platform::CUDAPinnedPlace()}; + if (platform::GetCUDADeviceCount() > 1) { + places.emplace_back(platform::CUDAPlace(1)); + } +#else + std::vector places{platform::CPUPlace()}; +#endif + std::vector lanes{1, 2}; + for (auto &p : places) { + for (auto &l : lanes) { + TestMain(p, l); + } + } +} +TEST(dlpack, test_all) { +#define TestCallback(cpp_type, proto_type) TestMainLoop() + + _ForEachDataType_(TestCallback); +} + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/downpour_worker.cc b/paddle/fluid/framework/downpour_worker.cc new file mode 100644 index 0000000000000000000000000000000000000000..386ffd84c57063e950cd8b0d57304c66190be4c4 --- /dev/null +++ b/paddle/fluid/framework/downpour_worker.cc @@ -0,0 +1,479 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/device_worker.h" +#include "paddle/fluid/framework/device_worker_factory.h" +#include "paddle/fluid/platform/cpu_helper.h" + +namespace paddle { +namespace framework { + +void DownpourWorker::Initialize(const TrainerDesc& desc) { + param_ = desc.downpour_param(); + for (int i = 0; i < param_.sparse_table_size(); ++i) { + uint64_t table_id = + static_cast(param_.sparse_table(i).table_id()); + TableParameter table = param_.sparse_table(i); + sparse_key_names_[table_id].resize(table.sparse_key_name_size()); + for (int j = 0; j < table.sparse_key_name_size(); ++j) { + sparse_key_names_[table_id][j] = table.sparse_key_name(j); + } + sparse_value_names_[table_id].resize(table.sparse_value_name_size()); + for (int j = 0; j < table.sparse_value_name_size(); ++j) { + sparse_value_names_[table_id][j] = table.sparse_value_name(j); + } + sparse_grad_names_[table_id].resize(table.sparse_grad_name_size()); + for (int j = 0; j < table.sparse_grad_name_size(); ++j) { + sparse_grad_names_[table_id][j] = table.sparse_grad_name(j); + } + label_var_name_[table_id] = table.label_var_name(); + } + + for (int i = 0; i < param_.dense_table_size(); ++i) { + uint64_t table_id = static_cast(param_.dense_table(i).table_id()); + auto table = param_.dense_table(i); + dense_value_names_[table_id].resize(table.dense_value_name_size()); + for (int j = 0; j < table.dense_value_name_size(); ++j) { + dense_value_names_[table_id][j] = table.dense_value_name(j); + } + dense_grad_names_[table_id].resize(table.dense_grad_name_size()); + for (int j = 0; j < table.dense_grad_name_size(); ++j) { + dense_grad_names_[table_id][j] = table.dense_grad_name(j); + } + } + + skip_ops_.resize(param_.skip_ops_size()); + for (int i = 0; i < param_.skip_ops_size(); ++i) { + skip_ops_[i] = param_.skip_ops(i); + } + + need_to_push_sparse_ = param_.push_sparse(); + need_to_push_dense_ = param_.push_dense(); + + fleet_ptr_ = FleetWrapper::GetInstance(); + fetch_config_ = desc.fetch_config(); +} + +void DownpourWorker::CollectLabelInfo(size_t table_idx) { + uint64_t table_id = static_cast( + param_.program_config(0).pull_sparse_table_id(table_idx)); + + TableParameter table; + for (auto i : param_.sparse_table()) { + if (i.table_id() == table_id) { + table = i; + break; + } + } + auto& feature = features_[table_id]; + auto& feature_label = feature_labels_[table_id]; + feature_label.resize(feature.size()); + Variable* var = thread_scope_->FindVar(label_var_name_[table_id]); + LoDTensor* tensor = var->GetMutable(); + int64_t* label_ptr = tensor->data(); + + size_t global_index = 0; + for (size_t i = 0; i < sparse_key_names_[table_id].size(); ++i) { + VLOG(3) << "sparse_key_names_[" << i + << "]: " << sparse_key_names_[table_id][i]; + Variable* fea_var = thread_scope_->FindVar(sparse_key_names_[table_id][i]); + LoDTensor* tensor = fea_var->GetMutable(); + int64_t* ids = tensor->data(); + size_t fea_idx = 0; + // tensor->lod()[0].size() == batch_size + 1 + for (auto lod_idx = 1u; lod_idx < tensor->lod()[0].size(); ++lod_idx) { + for (; fea_idx < tensor->lod()[0][lod_idx]; ++fea_idx) { + // should be skipped feasign defined in protobuf + if (ids[fea_idx] == 0u) { + continue; + } + feature_label[global_index++] = + static_cast(label_ptr[lod_idx - 1]); + } + } + } + CHECK(global_index == feature.size()) + << "expect fea info size:" << feature.size() << " real:" << global_index; +} + +void DownpourWorker::FillSparseValue(size_t table_idx) { + uint64_t table_id = static_cast( + param_.program_config(0).pull_sparse_table_id(table_idx)); + + TableParameter table; + for (auto i : param_.sparse_table()) { + if (i.table_id() == table_id) { + table = i; + break; + } + } + + auto& fea_value = feature_values_[table_id]; + auto fea_idx = 0u; + + std::vector init_value(table.fea_dim()); + for (size_t i = 0; i < sparse_key_names_[table_id].size(); ++i) { + std::string slot_name = sparse_key_names_[table_id][i]; + std::string emb_slot_name = sparse_value_names_[table_id][i]; + Variable* var = thread_scope_->FindVar(slot_name); + LoDTensor* tensor = var->GetMutable(); + int64_t* ids = tensor->data(); + int len = tensor->numel(); + Variable* var_emb = thread_scope_->FindVar(emb_slot_name); + LoDTensor* tensor_emb = var_emb->GetMutable(); + float* ptr = tensor_emb->mutable_data({len, table.emb_dim()}, + platform::CPUPlace()); + memset(ptr, 0, sizeof(float) * len * table.emb_dim()); + auto& tensor_lod = tensor->lod()[0]; + LoD data_lod{tensor_lod}; + tensor_emb->set_lod(data_lod); + for (int index = 0; index < len; ++index) { + if (ids[index] == 0u) { + memcpy(ptr + table.emb_dim() * index, init_value.data() + 2, + sizeof(float) * table.emb_dim()); + continue; + } + memcpy(ptr + table.emb_dim() * index, fea_value[fea_idx].data() + 2, + sizeof(float) * table.emb_dim()); + fea_idx++; + } + } +} + +void DownpourWorker::TrainFilesWithProfiler() { + VLOG(3) << "Begin to train files with profiler"; + platform::SetNumThreads(1); + device_reader_->Start(); + std::vector op_total_time; + std::vector op_name; + for (auto& op : ops_) { + bool need_skip = false; + for (auto t = 0u; t < skip_ops_.size(); ++t) { + if (op->Type().find(skip_ops_[t]) != std::string::npos) { + need_skip = true; + break; + } + } + if (!need_skip) { + op_name.push_back(op->Type()); + } + } + + VLOG(3) << "op name size: " << op_name.size(); + op_total_time.resize(op_name.size()); + for (size_t i = 0; i < op_total_time.size(); ++i) { + op_total_time[i] = 0.0; + } + platform::Timer timeline; + double total_time = 0.0; + double read_time = 0.0; + double pull_sparse_time = 0.0; + double collect_label_time = 0.0; + double fill_sparse_time = 0.0; + double push_sparse_time = 0.0; + double push_dense_time = 0.0; + int cur_batch; + int batch_cnt = 0; + uint64_t total_inst = 0; + timeline.Start(); + while ((cur_batch = device_reader_->Next()) > 0) { + timeline.Pause(); + read_time += timeline.ElapsedSec(); + total_time += timeline.ElapsedSec(); + VLOG(3) << "program config size: " << param_.program_config_size(); + for (int i = 0; i < param_.program_config(0).pull_sparse_table_id_size(); + ++i) { + uint64_t tid = static_cast( + param_.program_config(0).pull_sparse_table_id(i)); + TableParameter table; + for (auto i : param_.sparse_table()) { + if (i.table_id() == tid) { + table = i; + break; + } + } + timeline.Start(); + fleet_ptr_->PullSparseVarsSync(*thread_scope_, tid, + sparse_key_names_[tid], &features_[tid], + &feature_values_[tid], table.fea_dim()); + timeline.Pause(); + pull_sparse_time += timeline.ElapsedSec(); + total_time += timeline.ElapsedSec(); + timeline.Start(); + CollectLabelInfo(i); + timeline.Pause(); + collect_label_time += timeline.ElapsedSec(); + total_time += timeline.ElapsedSec(); + timeline.Start(); + FillSparseValue(i); + timeline.Pause(); + fill_sparse_time += timeline.ElapsedSec(); + total_time += timeline.ElapsedSec(); + } + VLOG(3) << "Fill sparse value for all sparse table done."; + + int run_op_idx = 0; + for (auto& op : ops_) { + bool need_skip = false; + for (auto t = 0u; t < skip_ops_.size(); ++t) { + if (op->Type().find(skip_ops_[t]) != std::string::npos) { + need_skip = true; + break; + } + } + if (!need_skip) { + timeline.Start(); + VLOG(3) << "Going to run op " << op_name[run_op_idx]; + op->Run(*thread_scope_, place_); + VLOG(3) << "Op " << op_name[run_op_idx] << " Finished"; + timeline.Pause(); + op_total_time[run_op_idx++] += timeline.ElapsedSec(); + total_time += timeline.ElapsedSec(); + } + } + + if (need_to_push_sparse_) { + for (int i = 0; i < param_.program_config(0).push_sparse_table_id_size(); + ++i) { + uint64_t tid = static_cast( + param_.program_config(0).push_sparse_table_id(i)); + TableParameter table; + for (auto i : param_.sparse_table()) { + if (i.table_id() == tid) { + table = i; + break; + } + } + timeline.Start(); + fleet_ptr_->PushSparseVarsWithLabelAsync( + *thread_scope_, tid, features_[tid], feature_labels_[tid], + sparse_key_names_[tid], sparse_grad_names_[tid], table.emb_dim(), + &feature_grads_[tid], &push_sparse_status_); + timeline.Pause(); + push_sparse_time += timeline.ElapsedSec(); + total_time += timeline.ElapsedSec(); + } + } + + if (need_to_push_dense_) { + timeline.Start(); + for (int i = 0; i < param_.program_config(0).push_dense_table_id_size(); + ++i) { + uint64_t tid = static_cast( + param_.program_config(0).push_dense_table_id(i)); + fleet_ptr_->PushDenseVarsAsync( + *thread_scope_, tid, dense_grad_names_[tid], &push_sparse_status_); + } + timeline.Pause(); + push_dense_time += timeline.ElapsedSec(); + total_time += timeline.ElapsedSec(); + VLOG(3) << "push sparse and dense gradient done."; + int32_t tmp_push_dense_wait_times = -1; + static uint32_t push_dense_wait_times = + static_cast(tmp_push_dense_wait_times); + if (push_dense_status_.size() >= push_dense_wait_times) { + for (auto& t : push_dense_status_) { + t.wait(); + } + push_dense_status_.resize(0); + } + + if (tmp_push_dense_wait_times == -1) { + push_dense_status_.resize(0); + } + } + + if (need_to_push_sparse_) { + int32_t tmp_push_sparse_wait_times = -1; + static uint32_t push_sparse_wait_times = + static_cast(tmp_push_sparse_wait_times); + if (push_sparse_status_.size() >= push_sparse_wait_times) { + for (auto& t : push_sparse_status_) { + t.wait(); + } + push_sparse_status_.resize(0); + } + + if (tmp_push_sparse_wait_times == -1) { + push_sparse_status_.resize(0); + } + + VLOG(3) << "going to increase thread version"; + VLOG(3) << "push dense table id size: " + << param_.program_config(0).push_dense_table_id_size(); + } + + if (need_to_push_dense_) { + for (int i = 0; i < param_.program_config(0).push_dense_table_id_size(); + ++i) { + uint64_t tid = static_cast( + param_.program_config(0).push_dense_table_id(i)); + pull_dense_worker_->IncreaseThreadVersion(thread_id_, tid); + } + } + + PrintFetchVars(); + thread_scope_->DropKids(); + total_inst += cur_batch; + ++batch_cnt; + + if (thread_id_ == 0) { + // should be configured here + if (batch_cnt > 0 && batch_cnt % 100 == 0) { + for (size_t i = 0; i < op_total_time.size(); ++i) { + fprintf(stderr, "op_name:[%zu][%s], op_mean_time:[%fs]\n", i, + op_name[i].c_str(), op_total_time[i] / batch_cnt); + } + fprintf(stderr, "mean read time: %fs\n", read_time / batch_cnt); + fprintf(stderr, "IO percent: %f\n", read_time / total_time * 100); + fprintf(stderr, "pull sparse time percent: %f\n", + pull_sparse_time / total_time * 100); + fprintf(stderr, "collect label time percent: %f\n", + collect_label_time / total_time * 100); + fprintf(stderr, "fill sparse time percent: %f\n", + fill_sparse_time / total_time * 100); + fprintf(stderr, "push sparse time percent: %f\n", + push_sparse_time / total_time * 100); + fprintf(stderr, "push dense time percent: %f\n", + push_dense_time / total_time * 100); + fprintf(stderr, "%6.2f instances/s\n", total_inst / total_time); + } + } + timeline.Start(); + } +} + +void DownpourWorker::TrainFiles() { + VLOG(3) << "Begin to train files"; + platform::SetNumThreads(1); + device_reader_->Start(); + int batch_cnt = 0; + int cur_batch; + while ((cur_batch = device_reader_->Next()) > 0) { + // pull sparse here + for (int i = 0; i < param_.program_config(0).pull_sparse_table_id_size(); + ++i) { + uint64_t tid = static_cast( + param_.program_config(0).pull_sparse_table_id(i)); + TableParameter table; + for (auto i : param_.sparse_table()) { + if (i.table_id() == tid) { + table = i; + break; + } + } + fleet_ptr_->PullSparseVarsSync(*thread_scope_, tid, + sparse_key_names_[tid], &features_[tid], + &feature_values_[tid], table.fea_dim()); + CollectLabelInfo(i); + FillSparseValue(i); + } + VLOG(3) << "fill sparse value for all sparse table done."; + + // do computation here + for (auto& op : ops_) { + bool need_skip = false; + for (auto t = 0u; t < skip_ops_.size(); ++t) { + if (op->Type().find(skip_ops_[t]) != std::string::npos) { + need_skip = true; + break; + } + } + if (!need_skip) { + op->Run(*thread_scope_, place_); + } + } + + if (need_to_push_sparse_) { + // push gradients here + for (int i = 0; i < param_.program_config(0).push_sparse_table_id_size(); + ++i) { + uint64_t tid = static_cast( + param_.program_config(0).push_sparse_table_id(i)); + TableParameter table; + for (auto i : param_.sparse_table()) { + if (i.table_id() == tid) { + table = i; + break; + } + } + fleet_ptr_->PushSparseVarsWithLabelAsync( + *thread_scope_, tid, features_[tid], feature_labels_[tid], + sparse_key_names_[tid], sparse_grad_names_[tid], table.emb_dim(), + &feature_grads_[tid], &push_sparse_status_); + } + } + + if (need_to_push_dense_) { + for (int i = 0; i < param_.program_config(0).push_dense_table_id_size(); + ++i) { + uint64_t tid = static_cast( + param_.program_config(0).push_dense_table_id(i)); + fleet_ptr_->PushDenseVarsAsync( + *thread_scope_, tid, dense_grad_names_[tid], &push_sparse_status_); + } + + VLOG(3) << "push dense gradient done."; + // the following code should be more precise and clean + // TODO(guru4elephant) + int32_t tmp_push_dense_wait_times = -1; + static uint32_t push_dense_wait_times = + static_cast(tmp_push_dense_wait_times); + + if (push_dense_status_.size() >= push_dense_wait_times) { + for (auto& t : push_dense_status_) { + t.wait(); + } + push_dense_status_.resize(0); + } + + if (tmp_push_dense_wait_times == -1) { + push_dense_status_.resize(0); + } + } + + if (need_to_push_sparse_) { + VLOG(3) << "push sparse gradient done."; + int32_t tmp_push_sparse_wait_times = -1; + static uint32_t push_sparse_wait_times = + static_cast(tmp_push_sparse_wait_times); + if (push_sparse_status_.size() >= push_sparse_wait_times) { + for (auto& t : push_sparse_status_) { + t.wait(); + } + push_sparse_status_.resize(0); + } + + if (tmp_push_sparse_wait_times == -1) { + push_sparse_status_.resize(0); + } + } + + if (need_to_push_dense_) { + for (int i = 0; i < param_.program_config(0).push_dense_table_id_size(); + ++i) { + uint64_t tid = static_cast( + param_.program_config(0).push_dense_table_id(i)); + pull_dense_worker_->IncreaseThreadVersion(thread_id_, tid); + } + } + + PrintFetchVars(); + thread_scope_->DropKids(); + ++batch_cnt; + } +} + +} // end namespace framework +} // end namespace paddle diff --git a/paddle/fluid/framework/eigen.h b/paddle/fluid/framework/eigen.h index 2b265a773fe967f5b2ab38ce795b0f599d859c2a..5bafa4345f42a1f6209b5ee31ae6ba2ded6a899c 100644 --- a/paddle/fluid/framework/eigen.h +++ b/paddle/fluid/framework/eigen.h @@ -13,11 +13,6 @@ See the License for the specific language governing permissions and limitations under the License. */ #pragma once -// logging.h and windows.h conflict -#define GLOG_NO_ABBREVIATED_SEVERITIES -// solve static linking error in windows -// https://github.com/google/glog/issues/301 -#define GOOGLE_GLOG_DLL_DECL #include "paddle/fluid/framework/tensor.h" #include "unsupported/Eigen/CXX11/Tensor" diff --git a/paddle/fluid/framework/executor.cc b/paddle/fluid/framework/executor.cc index b212666637a5289c9c6cd3585655deaeed8afd4b..239a3ce0a84e9d0f4b3395bdbbd3fdae58e8b36a 100644 --- a/paddle/fluid/framework/executor.cc +++ b/paddle/fluid/framework/executor.cc @@ -13,16 +13,33 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/framework/executor.h" - +#include +#include +#include +#include +#include +#include "google/protobuf/io/zero_copy_stream_impl.h" +#include "google/protobuf/message.h" +#include "google/protobuf/text_format.h" #include "paddle/fluid/framework/feed_fetch_method.h" #include "paddle/fluid/framework/lod_rank_table.h" #include "paddle/fluid/framework/lod_tensor_array.h" #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/framework/reader.h" -#include "paddle/fluid/operators/detail/macros.h" +#include "paddle/fluid/framework/trainer_desc.pb.h" +#include "paddle/fluid/framework/trainer_factory.h" +#include "paddle/fluid/framework/transfer_scope_cache.h" +#include "paddle/fluid/framework/variable_helper.h" +#include "paddle/fluid/operators/controlflow/while_op_helper.h" +#include "paddle/fluid/operators/distributed/distributed.h" #include "paddle/fluid/platform/place.h" #include "paddle/fluid/platform/profiler.h" +#ifdef PADDLE_WITH_NGRAPH +#include "paddle/fluid/operators/ngraph/ngraph_engine.h" +DEFINE_bool(use_ngraph, false, "Use NGRAPH to run"); +#endif + DECLARE_bool(benchmark); DEFINE_bool(use_mkldnn, false, "Use MKLDNN to run"); @@ -36,91 +53,33 @@ int kProgramId = -1; ExecutorPrepareContext::ExecutorPrepareContext( const framework::ProgramDesc& prog, size_t block_id) - : prog_(prog), block_id_(block_id) { - if (GetEagerDeletionThreshold() >= 0) { - ref_cnts_ = GetNonPersistableReferenceCount(prog_, block_id_); + : prog_(prog), block_id_(block_id) {} + +void ExecutorPrepareContext::PrepareUnusedVars( + const std::vector& keep_vars, bool force_disable_gc) { + force_disable_gc_ = force_disable_gc; + if (GetEagerDeletionThreshold() < 0 || force_disable_gc_) { + return; } + unused_vars_ = GetUnusedVars(prog_.Block(block_id_), ops_, keep_vars); } ExecutorPrepareContext::~ExecutorPrepareContext() { VLOG(5) << "destroy ExecutorPrepareContext"; } -template -static void DeleteUnusedTensors(const Scope& scope, const OperatorBase* op, - GarbageCollector* gc, - RefCntMap* ref_cnts) { - std::unordered_set erase_tensors; - - auto handler = [&](const VariableNameMap& name_map) { - for (auto& name_pair : name_map) { - for (auto& name : name_pair.second) { - auto it = ref_cnts->find(name); - if (it == ref_cnts->end()) continue; - if ((it->second)-- == 1) { - auto* var = scope.FindVar(name); - if (var != nullptr) { - VLOG(10) << "Erase tensor \'" << name << "\'"; - if (var->IsType()) { - erase_tensors.insert(var->GetMutable()); - } else if (var->IsType()) { - erase_tensors.insert( - var->GetMutable()->mutable_value()); - } - } - } - } - } - }; - - handler(op->Inputs()); - handler(op->Outputs()); - - if (!erase_tensors.empty()) { - gc->Add(erase_tensors); - } -} - Executor::Executor(const platform::Place& place) : place_(place) {} void Executor::Close() { #ifdef PADDLE_WITH_DISTRIBUTE - ::paddle::operators::distributed::RPCClient::GetInstance< - ::paddle::operators::distributed::GRPCClient>() - ->SendComplete(); + // TODO(typhoonzero): complete message will need to use real trainer_id, + // except 0. + auto client = + paddle::operators::distributed::RPCClient::GetInstance(0); + client->SendComplete(); #endif } -void InitializeVariable(Variable* var, proto::VarType::Type var_type) { - if (var_type == proto::VarType::LOD_TENSOR) { - var->GetMutable(); - } else if (var_type == proto::VarType::SELECTED_ROWS) { - var->GetMutable(); - } else if (var_type == proto::VarType::FEED_MINIBATCH) { - var->GetMutable(); - } else if (var_type == proto::VarType::FETCH_LIST) { - var->GetMutable(); - } else if (var_type == proto::VarType::STEP_SCOPES) { - var->GetMutable>(); - } else if (var_type == proto::VarType::LOD_RANK_TABLE) { - var->GetMutable(); - } else if (var_type == proto::VarType::LOD_TENSOR_ARRAY) { - var->GetMutable(); - } else if (var_type == proto::VarType::PLACE_LIST) { - var->GetMutable(); - } else if (var_type == proto::VarType::READER) { - var->GetMutable(); - } else if (var_type == proto::VarType::RAW) { - // GetMutable will be called in operator - } else { - PADDLE_THROW( - "Variable type %d is not in " - "[LOD_TENSOR, SELECTED_ROWS, FEED_MINIBATCH, FETCH_LIST, " - "LOD_RANK_TABLE, PLACE_LIST, READER, RAW]", - var_type); - } -} - void Executor::CreateVariables(const ProgramDesc& pdesc, Scope* scope, int block_id) { auto& global_block = pdesc.Block(block_id); @@ -158,11 +117,42 @@ void Executor::CreateVariables(const ProgramDesc& pdesc, Scope* scope, } } +void Executor::RunFromDataset(const ProgramDesc& main_program, Scope* scope, + Dataset* dataset, + const std::string& trainer_desc_str) { + VLOG(3) << "Start to RunFromDataset in executor"; + TrainerDesc trainer_desc; + google::protobuf::TextFormat::ParseFromString(trainer_desc_str, + &trainer_desc); + VLOG(3) << "Going to create trainer, trainer class is " + << trainer_desc.class_name(); + std::shared_ptr trainer; + trainer = TrainerFactory::CreateTrainer(trainer_desc.class_name()); + // initialize trainer + VLOG(3) << "Going to initialize trainer"; + trainer->Initialize(trainer_desc, dataset); + VLOG(3) << "Set root scope here"; + trainer->SetScope(scope); + // prepare training environment and helper environment + VLOG(3) << "Try to init train environment"; + trainer->InitTrainerEnv(main_program, place_); + VLOG(3) << "Try to init other environment"; + trainer->InitOtherEnv(main_program); + // training and finalize training + VLOG(3) << "Trainer starts to run"; + trainer->Run(); + VLOG(3) << "Trainer going to finalize"; + trainer->Finalize(); + return; +} + void Executor::Run(const ProgramDesc& pdesc, Scope* scope, int block_id, - bool create_local_scope, bool create_vars) { + bool create_local_scope, bool create_vars, + const std::vector& skip_ref_cnt_vars, + bool force_disable_gc) { platform::RecordBlock b(block_id); if (FLAGS_use_mkldnn) EnableMKLDNN(pdesc); - auto ctx = Prepare(pdesc, block_id); + auto ctx = Prepare(pdesc, block_id, skip_ref_cnt_vars, force_disable_gc); RunPreparedContext(ctx.get(), scope, create_local_scope, create_vars); } @@ -328,7 +318,8 @@ void Executor::Run(const ProgramDesc& program, Scope* scope, } std::unique_ptr Executor::Prepare( - const ProgramDesc& program, int block_id) { + const ProgramDesc& program, int block_id, + const std::vector& skip_ref_cnt_vars, bool force_disable_gc) { std::unique_ptr ctx( new ExecutorPrepareContext(program, block_id)); PADDLE_ENFORCE_LT(static_cast(block_id), program.Size()); @@ -336,20 +327,40 @@ std::unique_ptr Executor::Prepare( for (auto& op_desc : block.AllOps()) { ctx->ops_.push_back(OpRegistry::CreateOp(*op_desc)); } +#ifdef PADDLE_WITH_NGRAPH + if (FLAGS_use_ngraph) { + paddle::operators::NgraphEngine::FuseNgraphOps( + ctx->prog_.Block(ctx->block_id_), &ctx->ops_); + } +#endif + ctx->PrepareUnusedVars(skip_ref_cnt_vars, force_disable_gc); return ctx; } std::vector> Executor::Prepare( - const ProgramDesc& program, const std::vector& block_ids) { + const ProgramDesc& program, const std::vector& block_ids, + const std::vector>& skip_ref_cnt_vars, + bool force_disable_gc) { + PADDLE_ENFORCE( + skip_ref_cnt_vars.empty() || skip_ref_cnt_vars.size() == block_ids.size(), + "skip_ref_cnt_vars should be either empty or equals to block number %d", + block_ids.size()); std::vector> result; + size_t idx = 0; for (auto& bid : block_ids) { - auto* ctx = new ExecutorPrepareContext(program, bid); PADDLE_ENFORCE_LT(static_cast(bid), program.Size()); + auto* ctx = new ExecutorPrepareContext(program, bid); auto& block = program.Block(bid); for (auto& op_desc : block.AllOps()) { ctx->ops_.push_back(OpRegistry::CreateOp(*op_desc)); } + if (skip_ref_cnt_vars.empty()) { + ctx->PrepareUnusedVars(std::vector(), force_disable_gc); + } else { + ctx->PrepareUnusedVars(skip_ref_cnt_vars[idx], force_disable_gc); + } result.push_back(std::shared_ptr(ctx)); + ++idx; } return result; } @@ -357,6 +368,7 @@ std::vector> Executor::Prepare( void Executor::RunPreparedContext(ExecutorPrepareContext* ctx, Scope* scope, bool create_local_scope, bool create_vars, bool keep_kids) { + PADDLE_ENFORCE_NOT_NULL(scope); Scope* local_scope = scope; if (create_vars) { if (create_local_scope) { @@ -366,46 +378,42 @@ void Executor::RunPreparedContext(ExecutorPrepareContext* ctx, Scope* scope, } int64_t max_memory_size = GetEagerDeletionThreshold(); - std::unique_ptr> gc; - // WhileOp would set keep_kids to false - // WhileGradOp would need the scopes created in WhileOp - // Perhaps, we should not perform eager deletion in WhileOp - // The scopes and variables created by WhileOp would be deleted - // in WhileGradOp. - if (max_memory_size >= 0 && !keep_kids) { - ctx->ResetReferenceCount(); + std::unique_ptr gc; + // FIXME(zjl): recurrent_op is rather complex, we would + // disable gc forcely in recurrent_op + if (!ctx->force_disable_gc_ && max_memory_size >= 0) { #ifdef PADDLE_WITH_CUDA if (platform::is_gpu_place(place_)) { - gc.reset(new DefaultStreamGarbageCollector( - boost::get(place_), max_memory_size)); - } else { + if (IsFastEagerDeletionModeEnabled()) { + gc.reset(new UnsafeFastGPUGarbageCollector( + boost::get(place_), max_memory_size)); + } else { + gc.reset(new DefaultStreamGarbageCollector( + boost::get(place_), max_memory_size)); + } + } else if (platform::is_cpu_place(place_)) { #endif - gc.reset(new CPUGarbageCollector( - boost::get(place_), max_memory_size)); + gc.reset(new CPUGarbageCollector(boost::get(place_), + max_memory_size)); #ifdef PADDLE_WITH_CUDA } #endif + // If gc is enabled and block size > 1 + if (gc && ctx->prog_.Size() > 1) { + operators::PrepareSafeEagerDeletionOnWhileOpAndWhileGradOp(ctx->block_id_, + ctx->ops_); + } } for (auto& op : ctx->ops_) { op->Run(*local_scope, place_); - if (gc != nullptr) { - DeleteUnusedTensors(*local_scope, op.get(), gc.get(), - &(ctx->cur_ref_cnts_)); - } - - if (FLAGS_benchmark) { - VLOG(2) << "Memory used after operator " + op->Type() + " running: " - << memory::memory_usage(place_); + if (gc) { + DeleteUnusedTensors(*local_scope, op.get(), ctx->unused_vars_, gc.get()); } } - if (gc != nullptr) { - gc->Wait(); - } else { - platform::DeviceContextPool::Instance().Get(place_)->Wait(); - } + platform::DeviceContextPool::Instance().Get(place_)->Wait(); if (local_scope != scope) { scope->DeleteScope(local_scope); @@ -420,13 +428,6 @@ void Executor::RunPreparedContext(ExecutorPrepareContext* ctx, Scope* scope, scope->DropKids(); } } - - if (FLAGS_benchmark) { - VLOG(2) << "-------------------------------------------------------"; - VLOG(2) << "Memory used after deleting local scope: " - << memory::memory_usage(place_); - VLOG(2) << "-------------------------------------------------------"; - } } void Executor::RunPreparedContext( @@ -483,6 +484,5 @@ void Executor::EnableMKLDNN(const ProgramDesc& program) { << "'MKLDNN' is not supported, Please re-compile with WITH_MKLDNN option"; #endif } - } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/executor.h b/paddle/fluid/framework/executor.h index 36b36d49c2728dbef93042158dffa26d8f56d529..6eeeb1efc6117f341026097359199cc26554649d 100644 --- a/paddle/fluid/framework/executor.h +++ b/paddle/fluid/framework/executor.h @@ -15,8 +15,12 @@ limitations under the License. */ #pragma once #include +#include #include +#include #include +#include "paddle/fluid/framework/data_set.h" +#include "paddle/fluid/framework/executor_gc_helper.h" #include "paddle/fluid/framework/garbage_collector.h" #include "paddle/fluid/framework/op_info.h" #include "paddle/fluid/framework/program_desc.h" @@ -26,54 +30,22 @@ limitations under the License. */ namespace paddle { namespace framework { -extern void InitializeVariable(Variable* var, proto::VarType::Type var_type); - -template -std::unordered_map GetNonPersistableReferenceCount( - const ProgramDesc& prog, size_t block_id) { - auto& block = prog.Block(block_id); - std::unordered_map ref_cnts; - - auto update_ref_cnts = [&](OpDesc* op_desc, const VariableNameMap& name_map) { - for (auto& name_pair : name_map) { - for (auto& name : name_pair.second) { - auto* var_desc = block.FindVar(name); - if (var_desc == nullptr || var_desc->Persistable()) continue; - auto type = var_desc->Proto()->type().type(); - if (type != proto::VarType::LOD_TENSOR && - type != proto::VarType::SELECTED_ROWS) { - continue; - } - - auto it = ref_cnts.find(name); - if (it != ref_cnts.end()) { - ++it->second; - } else { - ref_cnts[name] = 1; - } - } - } - }; - - for (auto op_desc : block.AllOps()) { - update_ref_cnts(op_desc, op_desc->Inputs()); - update_ref_cnts(op_desc, op_desc->Outputs()); - } - return ref_cnts; -} struct ExecutorPrepareContext { ExecutorPrepareContext(const framework::ProgramDesc& prog, size_t block_id); + ~ExecutorPrepareContext(); - void ResetReferenceCount() { cur_ref_cnts_ = ref_cnts_; } + void PrepareUnusedVars(const std::vector& keep_vars, + bool force_disable_gc = false); const framework::ProgramDesc& prog_; - size_t block_id_; + const size_t block_id_; + std::vector> ops_; - std::unordered_map ref_cnts_; - std::unordered_map cur_ref_cnts_; + std::unordered_map> unused_vars_; + bool force_disable_gc_{false}; }; class Executor { @@ -98,7 +70,10 @@ class Executor { * Scope */ void Run(const ProgramDesc& prog, Scope* scope, int block_id, - bool create_local_scope = true, bool create_vars = true); + bool create_local_scope = true, bool create_vars = true, + const std::vector& skip_ref_cnt_vars = + std::vector(), + bool force_disable_gc = false); // This API is very slow. void Run(const ProgramDesc& program, Scope* scope, @@ -109,10 +84,16 @@ class Executor { const std::string& fetch_holder_name = "fetch"); static std::unique_ptr Prepare( - const ProgramDesc& program, int block_id); + const ProgramDesc& program, int block_id, + const std::vector& skip_ref_cnt_vars = + std::vector(), + bool force_disable_gc = false); static std::vector> Prepare( - const ProgramDesc& program, const std::vector& block_ids); + const ProgramDesc& program, const std::vector& block_ids, + const std::vector>& skip_ref_cnt_vars = + std::vector>(), + bool force_disable_gc = false); void CreateVariables(const ProgramDesc& pdesc, Scope* scope, int block_id); @@ -131,6 +112,9 @@ class Executor { void EnableMKLDNN(const ProgramDesc& program); + void RunFromDataset(const ProgramDesc& main_program, Scope* scope, + Dataset* dataset, const std::string& trainer_desc_str); + private: const platform::Place place_; }; diff --git a/paddle/fluid/framework/executor_gc_helper.cc b/paddle/fluid/framework/executor_gc_helper.cc new file mode 100644 index 0000000000000000000000000000000000000000..77b0977b5a47fdf4413e75c4e89cf638949e937f --- /dev/null +++ b/paddle/fluid/framework/executor_gc_helper.cc @@ -0,0 +1,189 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/executor_gc_helper.h" +#include +#include +#include +#include +#include +#include +#include "glog/logging.h" +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/lod_tensor_array.h" +#include "paddle/fluid/framework/selected_rows.h" +#include "paddle/fluid/platform/enforce.h" + +namespace paddle { +namespace framework { + +struct OpInOutInfo { + public: + void Build(const OperatorBase *op) { + is_built_ = true; + auto &inferer = op->Info().NoNeedBufferVarsInferer(); + if (inferer) { + no_need_buffer_ins_ = inferer(op->Inputs(), op->Outputs(), op->Attrs()); + + if (no_need_buffer_ins_.empty()) return; + + for (auto &in_name_pair : op->Inputs()) { + if (no_need_buffer_ins_.count(in_name_pair.first) != 0) { + continue; + } + + for (auto &in_arg_name : in_name_pair.second) { + other_args_set_.insert(in_arg_name); + } + } + + for (auto &out_name_pair : op->Outputs()) { + for (auto &out_arg_name : out_name_pair.second) { + other_args_set_.insert(out_arg_name); + } + } + } + } + + bool IsBuilt() const { return is_built_; } + + bool IsInArgBufferNeeded(const std::string &in_arg_name) const { + return no_need_buffer_ins_.empty() || + other_args_set_.count(in_arg_name) != 0; + } + + private: + // A set to record unused buffer input vars of op + std::unordered_set no_need_buffer_ins_; + // A set to record other args of op (including in, out) + std::unordered_set other_args_set_; + bool is_built_{false}; +}; + +static bool VarCanBeDeleted(const std::string &name, const BlockDesc &block, + const std::unordered_set &skip_vars) { + if (skip_vars.count(name) != 0) { + return false; + } + + auto *var_desc = block.FindVar(name); + if (var_desc == nullptr || var_desc->Persistable()) { + return false; + } + + auto type = var_desc->Proto()->type().type(); + + return type == proto::VarType::LOD_TENSOR || + type == proto::VarType::SELECTED_ROWS || + type == proto::VarType::LOD_TENSOR_ARRAY; +} + +std::unordered_map> GetUnusedVars( + const BlockDesc &block, + const std::vector> &ops, + const std::vector &skip_var_list) { + std::unordered_set skip_vars(skip_var_list.begin(), + skip_var_list.end()); + + std::unordered_map var_op_idx_map; + + for (size_t i = 0; i < ops.size(); ++i) { + auto *op = ops[i].get(); + + OpInOutInfo info; + for (auto &name_pair : op->Inputs()) { + for (auto &name : name_pair.second) { + if (!VarCanBeDeleted(name, block, skip_vars)) { + continue; + } + + // var can be gc-ed + if (!info.IsBuilt()) { + info.Build(op); + } + + if (info.IsInArgBufferNeeded(name)) { + // Update the last living op of variable to current op + var_op_idx_map[name] = i; + } else { + VLOG(10) << "Skip reference count computing of variable " + << name_pair.first << "(" << name << ") in Operator " + << op->Type(); + } + } + } + + for (auto &name_pair : op->Outputs()) { + for (auto &name : name_pair.second) { + if (VarCanBeDeleted(name, block, skip_vars)) { + // Update the last living op of variable to current op + var_op_idx_map[name] = i; + } + } + } + } + + std::unordered_map> result; + for (auto &name_op_idx_pair : var_op_idx_map) { + auto &name = name_op_idx_pair.first; + size_t op_idx = name_op_idx_pair.second; + result[ops[op_idx].get()].emplace_back(name); + } + return result; +} + +void DeleteUnusedTensors( + const Scope &scope, OperatorBase *op, + const std::unordered_map> + &delete_vars_map, + GarbageCollector *gc) { + auto iter = delete_vars_map.find(op); + if (iter == delete_vars_map.end()) { + return; + } + + auto &delete_vars = iter->second; + + std::deque> garbages; + + for (auto &var_name : delete_vars) { + auto *var = scope.FindVar(var_name); + if (var == nullptr) { + continue; + } + + VLOG(2) << "Erase variable " << var_name; + if (var->IsType()) { + garbages.emplace_back(var->GetMutable()->MoveMemoryHolder()); + } else if (var->IsType()) { + garbages.emplace_back( + var->GetMutable()->mutable_value()->MoveMemoryHolder()); + } else if (var->IsType()) { + auto *lod_tensor_arr = var->GetMutable(); + for (auto &t : *lod_tensor_arr) { + garbages.emplace_back(t.MoveMemoryHolder()); + } + } else { + PADDLE_THROW("Type %s of %s is not supported eager deletion", + framework::ToTypeName(var->Type()), var_name); + } + } + + if (!garbages.empty()) { + gc->Add(std::move(garbages)); + } +} + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/executor_gc_helper.h b/paddle/fluid/framework/executor_gc_helper.h new file mode 100644 index 0000000000000000000000000000000000000000..8553273f8242844d0203d7bcd90ea2090b65826c --- /dev/null +++ b/paddle/fluid/framework/executor_gc_helper.h @@ -0,0 +1,42 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include "paddle/fluid/framework/garbage_collector.h" +#include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/framework/scope.h" + +namespace paddle { +namespace framework { + +// Result map: op -> variable names that can be deleted after op runs +std::unordered_map> GetUnusedVars( + const BlockDesc &block, + const std::vector> &ops, + const std::vector &skip_vars); + +// Collect unused tensors after op runs +void DeleteUnusedTensors( + const Scope &scope, OperatorBase *op, + const std::unordered_map> + &delete_vars_map, + GarbageCollector *gc); + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/executor_thread_worker.cc b/paddle/fluid/framework/executor_thread_worker.cc new file mode 100644 index 0000000000000000000000000000000000000000..005d98c6e8fda92ff6c6b3412f89c75760bf0498 --- /dev/null +++ b/paddle/fluid/framework/executor_thread_worker.cc @@ -0,0 +1,698 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/executor_thread_worker.h" +#include +#include +#include "google/protobuf/io/zero_copy_stream_impl.h" +#include "google/protobuf/message.h" +#include "google/protobuf/text_format.h" + +#include "gflags/gflags.h" +#include "paddle/fluid/framework/feed_fetch_method.h" +#include "paddle/fluid/framework/feed_fetch_type.h" +#include "paddle/fluid/framework/lod_rank_table.h" +#include "paddle/fluid/framework/lod_tensor_array.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/reader.h" +#include "paddle/fluid/framework/variable_helper.h" +#include "paddle/fluid/inference/io.h" +#include "paddle/fluid/platform/cpu_helper.h" +#include "paddle/fluid/platform/place.h" +#include "paddle/fluid/platform/timer.h" +#include "paddle/fluid/pybind/pybind.h" +namespace paddle { +namespace framework { + +#ifdef PADDLE_WITH_PSLIB +int DensePullThread::start() { + _running = true; + _t = std::thread(&DensePullThread::run, this); + return 0; +} + +void DensePullThread::run() { + while (_running) { + _pull_dense_status.resize(0); + for (auto& t : _dense_variable_name) { + if (check_update_param(t.first)) { + auto status = pull_dense(t.first); + _pull_dense_status.emplace_back(std::move(status)); + reset_thread_version(t.first); + } + } + if (_pull_dense_status.size() != 0) { + wait_all(); + } + + usleep(_sleep_time_ms * 1000); + } +} +bool DensePullThread::check_update_param(uint64_t table_id) { + { + std::lock_guard lock(_mutex_for_version); + auto& version = _training_versions[table_id]; + _current_version[table_id] = + *(std::min_element(version.begin(), version.end())); + } + if (_current_version[table_id] - _last_versions[table_id] < _threshold) { + return false; + } + return true; +} + +void DensePullThread::reset_thread_version(uint64_t table_id) { + std::lock_guard lock(_mutex_for_version); + _last_versions[table_id] = _current_version[table_id]; +} +std::future DensePullThread::pull_dense(uint64_t table_id) { + auto& regions = _regions[table_id]; + regions.clear(); + auto& variables = _dense_variable_name[table_id]; + regions.resize(variables.size()); + + for (auto i = 0u; i < variables.size(); ++i) { + auto& t = variables[i]; + Variable* var = _root_scope->FindVar(t); + LoDTensor* tensor = var->GetMutable(); + + float* w = tensor->data(); + paddle::ps::Region reg(w, tensor->numel()); + regions[i] = std::move(reg); + } + return _ps_client->pull_dense(regions.data(), regions.size(), table_id); +} + +void DensePullThread::wait_all() { + for (auto& t : _pull_dense_status) { + t.wait(); + auto status = t.get(); + if (status != 0) { + LOG(WARNING) << "pull dense failed times:" << ++_pull_dense_fail_times; + } + } + + if (_pull_dense_fail_times > 20) { + LOG(FATAL) << "pull dense failed times more than 20 times"; + exit(-1); + } + + _pull_dense_status.resize(0); +} + +void DensePullThread::increase_thread_version(int thread_id, + uint64_t table_id) { + std::lock_guard lock(_mutex_for_version); + _training_versions[table_id][thread_id]++; +} +#endif + +void ExecutorThreadWorker::CreateThreadOperators(const ProgramDesc& program) { + auto& block = program.Block(0); + op_names_.clear(); + for (auto& op_desc : block.AllOps()) { + std::unique_ptr local_op = OpRegistry::CreateOp(*op_desc); + op_names_.push_back(op_desc->Type()); + OperatorBase* local_op_ptr = local_op.release(); + ops_.push_back(local_op_ptr); + continue; + } +} + +void ExecutorThreadWorker::CreateThreadResource( + const framework::ProgramDesc& program, + const paddle::platform::Place& place) { + CreateThreadScope(program); + CreateThreadOperators(program); + SetMainProgram(program); + SetPlace(place); +} + +void ExecutorThreadWorker::CreateThreadScope(const ProgramDesc& program) { + auto& block = program.Block(0); + + PADDLE_ENFORCE_NOT_NULL( + root_scope_, "root_scope should be set before creating thread scope"); + + thread_scope_ = &root_scope_->NewScope(); + for (auto& var : block.AllVars()) { + if (var->Persistable()) { + auto* ptr = root_scope_->Var(var->Name()); + InitializeVariable(ptr, var->GetType()); + } else { + auto* ptr = thread_scope_->Var(var->Name()); + InitializeVariable(ptr, var->GetType()); + } + } +} + +void ExecutorThreadWorker::SetDataFeed( + const std::shared_ptr& datafeed) { + thread_reader_ = datafeed; +} + +void ExecutorThreadWorker::BindingDataFeedMemory() { + const std::vector& input_feed = + thread_reader_->GetUseSlotAlias(); + for (auto name : input_feed) { + thread_reader_->AddFeedVar(thread_scope_->Var(name), name); + } +} + +void ExecutorThreadWorker::SetFetchVarNames( + const std::vector& fetch_var_names) { + fetch_var_names_.clear(); + fetch_var_names_.insert(fetch_var_names_.end(), fetch_var_names.begin(), + fetch_var_names.end()); +} + +void ExecutorThreadWorker::SetDevice() { +#if defined _WIN32 || defined __APPLE__ + return; +#else + static unsigned concurrency_cap = std::thread::hardware_concurrency(); + LOG(WARNING) << "concurrency capacity " << concurrency_cap; + int thread_id = this->thread_id_; + + if (static_cast(thread_id) < concurrency_cap) { + unsigned proc = thread_id; + + cpu_set_t mask; + CPU_ZERO(&mask); + CPU_SET(proc, &mask); + + if (-1 == sched_setaffinity(0, sizeof(mask), &mask)) { + VLOG(1) << "WARNING: Failed to set thread affinity for thread " + << thread_id; + } else { + CPU_ZERO(&mask); + if ((0 != sched_getaffinity(0, sizeof(mask), &mask)) || + (CPU_ISSET(proc, &mask) == 0)) { + VLOG(3) << "WARNING: Failed to set thread affinity for thread " + << thread_id; + } + } + } else { + VLOG(1) << "WARNING: Failed to set thread affinity for thread " + << thread_id; + } +#endif +} + +template +void print_lod_tensor(std::string var_name, const LoDTensor& lod_tensor) { + auto inspect = lod_tensor.data(); + auto element_num = lod_tensor.numel(); + + std::ostringstream sstream; + sstream << var_name << " (element num " << element_num << "): ["; + sstream << inspect[0]; + for (int j = 1; j < element_num; ++j) { + sstream << " " << inspect[j]; + } + sstream << "]"; + + std::cout << sstream.str() << std::endl; +} + +static void print_fetch_var(Scope* scope, const std::string& var_name) { + auto& tensor = scope->FindVar(var_name)->Get(); + +#define PrintLoDTensorCallback(cpp_type, proto_type) \ + do { \ + if (tensor.type() == proto_type) { \ + print_lod_tensor(var_name, tensor); \ + return; \ + } \ + } while (0) + + _ForEachDataType_(PrintLoDTensorCallback); + VLOG(1) << "print_fetch_var: unrecognized data type:" << tensor.type(); +} + +void ExecutorThreadWorker::TrainFilesWithTimer() { + platform::SetNumThreads(1); + SetDevice(); + thread_reader_->Start(); + + std::vector op_total_time; + std::vector op_name; + for (auto& op : ops_) { + op_name.push_back(op->Type()); + } + op_total_time.resize(ops_.size()); + for (size_t i = 0; i < op_total_time.size(); ++i) { + op_total_time[i] = 0.0; + } + platform::Timer timeline; + double total_time = 0.0; + double read_time = 0.0; + int cur_batch; + int batch_cnt = 0; + timeline.Start(); + while ((cur_batch = thread_reader_->Next()) > 0) { + timeline.Pause(); + read_time += timeline.ElapsedSec(); + total_time += timeline.ElapsedSec(); + for (size_t i = 0; i < ops_.size(); ++i) { + timeline.Start(); + ops_[i]->Run(*thread_scope_, place_); + timeline.Pause(); + op_total_time[i] += timeline.ElapsedSec(); + total_time += timeline.ElapsedSec(); + } + ++batch_cnt; + thread_scope_->DropKids(); + if (thread_id_ == 0) { + if (batch_cnt > 0 && batch_cnt % 100 == 0) { + for (size_t i = 0; i < ops_.size(); ++i) { + fprintf(stderr, "op_name:[%zu][%s], op_mean_time:[%fs]\n", i, + op_name[i].c_str(), op_total_time[i] / batch_cnt); + } + fprintf(stderr, "mean read time: %fs\n", read_time / batch_cnt); + int fetch_var_num = fetch_var_names_.size(); + for (int i = 0; i < fetch_var_num; ++i) { + print_fetch_var(thread_scope_, fetch_var_names_[i]); + } + fprintf(stderr, "IO percent: %f\n", read_time / total_time); + } + } + timeline.Start(); + } +} + +void ExecutorThreadWorker::TrainFiles() { + platform::SetNumThreads(1); + + // todo: configurable + // SetDevice(); + + int fetch_var_num = fetch_var_names_.size(); + fetch_values_.clear(); + fetch_values_.resize(fetch_var_num); + + thread_reader_->Start(); + + int cur_batch; + int batch_cnt = 0; + while ((cur_batch = thread_reader_->Next()) > 0) { + // executor run here + for (auto& op : ops_) { + op->Run(*thread_scope_, place_); + } + + ++batch_cnt; + thread_scope_->DropKids(); + + if (debug_ == false || thread_id_ != 0) { + continue; + } + + for (int i = 0; i < fetch_var_num; ++i) { + print_fetch_var(thread_scope_, fetch_var_names_[i]); + } // end for (int i = 0...) + } // end while () +} + +void ExecutorThreadWorker::SetThreadId(int tid) { thread_id_ = tid; } + +void ExecutorThreadWorker::SetPlace(const platform::Place& place) { + place_ = place; +} + +void ExecutorThreadWorker::SetMainProgram( + const ProgramDesc& main_program_desc) { + main_program_.reset(new ProgramDesc(main_program_desc)); +} + +void ExecutorThreadWorker::SetRootScope(Scope* g_scope) { + root_scope_ = g_scope; +} + +#ifdef PADDLE_WITH_PSLIB +// AsyncExecutor +void AsyncExecutorThreadWorker::TrainFiles() { + SetDevice(); + + int fetch_var_num = fetch_var_names_.size(); + fetch_values_.clear(); + fetch_values_.resize(fetch_var_num); + + thread_reader_->Start(); + + int cur_batch; + int batch_cnt = 0; + while ((cur_batch = thread_reader_->Next()) > 0) { + // executor run here + TrainOneNetwork(); + + ++batch_cnt; + thread_scope_->DropKids(); + + if (debug_ == false || thread_id_ != 0) { + continue; + } + + for (int i = 0; i < fetch_var_num; ++i) { + print_fetch_var(thread_scope_, fetch_var_names_[i]); + } // end for (int i = 0...) + } // end while () +} + +void AsyncExecutorThreadWorker::SetPSlibPtr( + std::shared_ptr pslib_ptr) { + _pslib_ptr = pslib_ptr; +} + +void AsyncExecutorThreadWorker::SetPullDenseThread( + std::shared_ptr dpt) { + _pull_dense_thread = dpt; +} + +void AsyncExecutorThreadWorker::TrainOneNetwork() { + PrepareParams(); + + for (auto& op : ops_) { + if (op->Type().find("sgd") != std::string::npos) { + continue; + } + bool need_skip = false; + for (auto t = 0u; t < _param_config->skip_op.size(); ++t) { + if (op->Type().find(_param_config->skip_op[t]) != std::string::npos) { + need_skip = true; + break; + } + } + if (!need_skip) { + op->Run(*thread_scope_, place_); + } + } + UpdateParams(); +} + +void AsyncExecutorThreadWorker::SetParamConfig( + AsyncWorkerParamConfig* param_config) { + _param_config = param_config; +} + +void AsyncExecutorThreadWorker::PrepareParams() { + for (auto table_id : _param_config->sparse_table_id) { + PullSparse(table_id); + for (auto& t : _pull_sparse_status) { + t.wait(); + auto status = t.get(); + if (status != 0) { + LOG(ERROR) << "pull sparse failed, status[" << status << "]"; + exit(-1); + } + } + } + _pull_sparse_status.resize(0); + + for (auto table_id : _param_config->sparse_table_id) { + FillSparse(table_id); + } +} + +void AsyncExecutorThreadWorker::UpdateParams() { + for (auto i : _param_config->sparse_table_id) { + PushSparse(i); + } + for (auto i : _param_config->dense_table_id) { + PushDense(i); + } + int32_t tmp_push_dense_wait_times = -1; + int32_t tmp_push_sparse_wait_times = -1; + static uint32_t push_dense_wait_times = + static_cast(tmp_push_dense_wait_times); + static uint32_t push_sparse_wait_times = + static_cast(tmp_push_sparse_wait_times); + + if (_push_dense_status.size() >= push_dense_wait_times) { + for (auto& t : _push_dense_status) { + t.wait(); + } + _push_dense_status.resize(0); + } + if (tmp_push_dense_wait_times == -1) { + _push_dense_status.resize(0); + } + if (_push_sparse_status.size() >= push_sparse_wait_times) { + for (auto& t : _push_sparse_status) { + t.wait(); + } + _push_sparse_status.resize(0); + } + if (tmp_push_sparse_wait_times == -1) { + _push_sparse_status.resize(0); + } + for (auto dense_table_id : _param_config->dense_table_id) { + _pull_dense_thread->increase_thread_version(thread_id_, dense_table_id); + } +} + +void AsyncExecutorThreadWorker::PushDense(int table_id) { + std::vector regions; + for (auto& t : _param_config->dense_gradient_variable_name[table_id]) { + Variable* var = thread_scope_->FindVar(t); + CHECK(var != nullptr) << "var[" << t << "] not found"; + LoDTensor* tensor = var->GetMutable(); + int count = tensor->numel(); + float* g = tensor->data(); + paddle::ps::Region reg(g, count); + regions.emplace_back(std::move(reg)); + } + + auto status = _pslib_ptr->_worker_ptr->push_dense(regions.data(), + regions.size(), table_id); + _push_dense_status.push_back(std::move(status)); +} + +void AsyncExecutorThreadWorker::PullSparse(int table_id) { + auto& features = _features[table_id]; + auto& feature_value = _feature_value[table_id]; + auto fea_dim = _param_config->fea_dim; + // slot id starts from 1 + features.clear(); + features.resize(0); + features.reserve(MAX_FEASIGN_NUM); + const std::vector& feed_vec = thread_reader_->GetUseSlotAlias(); + // slot_idx = 0 is label TODO + for (auto slot_idx = 1u; slot_idx < feed_vec.size(); ++slot_idx) { + Variable* var = thread_scope_->FindVar(feed_vec[slot_idx]); + LoDTensor* tensor = var->GetMutable(); + int64_t* ids = tensor->data(); + int len = tensor->numel(); + for (auto i = 0u; i < len; ++i) { + // todo(colourful-tree): current trick - filter feasign=use_slot_mod( + // bug: datafeed fill use_slot_mod for empty slot) + if (ids[i] == 0u) { + continue; + } + features.push_back(static_cast(ids[i])); + } + } + check_pull_push_memory(features, &feature_value, fea_dim); + + std::vector pull_feature_value; + for (auto i = 0u; i < features.size(); ++i) { + pull_feature_value.push_back(feature_value[i].data()); + } + + auto status = _pslib_ptr->_worker_ptr->pull_sparse( + pull_feature_value.data(), table_id, features.data(), features.size()); + _pull_sparse_status.push_back(std::move(status)); + + auto& push_g = _feature_push_value[table_id]; + check_pull_push_memory(features, &push_g, fea_dim); + collect_feasign_info(table_id); +} + +void AsyncExecutorThreadWorker::FillSparse(int table_id) { + auto slot_dim = _param_config->slot_dim; + auto fea_dim = _param_config->fea_dim; + auto& features = _features[table_id]; + auto& fea_value = _feature_value[table_id]; + + CHECK(features.size() > 0) << "feature size check failed"; + + auto fea_idx = 0u; + + std::vector init_value(fea_dim); + + const std::vector& feed_vec = thread_reader_->GetUseSlotAlias(); + // slot_idx = 0 is label TODO + for (auto slot_idx = 1u; slot_idx < feed_vec.size(); ++slot_idx) { + Variable* var = thread_scope_->FindVar(feed_vec[slot_idx]); + LoDTensor* tensor = var->GetMutable(); + int64_t* ids = tensor->data(); + int len = tensor->numel(); + Variable* var_emb = thread_scope_->FindVar( + _param_config->slot_input_vec[table_id][slot_idx - 1]); + LoDTensor* tensor_emb = var_emb->GetMutable(); + float* ptr = + tensor_emb->mutable_data({len, slot_dim}, platform::CPUPlace()); + memset(ptr, 0, sizeof(float) * len * slot_dim); + auto& tensor_lod = tensor->lod()[0]; + + LoD data_lod{tensor_lod}; + tensor_emb->set_lod(data_lod); + + for (auto index = 0u; index < len; ++index) { + if (ids[index] == 0u) { + memcpy(ptr + slot_dim * index, init_value.data() + 2, + sizeof(float) * slot_dim); + continue; + } + memcpy(ptr + slot_dim * index, fea_value[fea_idx].data() + 2, + sizeof(float) * slot_dim); + fea_idx++; + } + } +} + +void AsyncExecutorThreadWorker::PushSparse(int table_id) { + auto slot_dim = _param_config->slot_dim; + auto fea_dim = _param_config->fea_dim; + auto& features = _features[table_id]; + auto& push_g = _feature_push_value[table_id]; + check_pull_push_memory(features, &push_g, fea_dim); + CHECK(push_g.size() == features.size() + 1) + << "push_g size:" << push_g.size() + << " features size:" << features.size(); + uint64_t fea_idx = 0u; + auto& fea_info = _fea_info[table_id]; + int offset = 2; + const std::vector& feed_vec = thread_reader_->GetUseSlotAlias(); + // slot_idx = 0 is label + for (auto slot_idx = 1u; slot_idx < feed_vec.size(); ++slot_idx) { + if (_param_config->slot_alias_to_table.find(feed_vec[slot_idx]) == + _param_config->slot_alias_to_table.end()) { + LOG(ERROR) << "ERROR slot_idx:" << slot_idx + << " name:" << feed_vec[slot_idx]; + } else if (_param_config->slot_alias_to_table[feed_vec[slot_idx]] != + table_id) { + continue; + } + Variable* g_var = thread_scope_->FindVar( + _param_config->gradient_var[table_id][slot_idx - 1]); + CHECK(g_var != nullptr) + << "var[" << _param_config->gradient_var[table_id][slot_idx - 1] + << "] not found"; + LoDTensor* g_tensor = g_var->GetMutable(); + if (g_tensor == NULL) { + LOG(ERROR) << "var[" + << _param_config->gradient_var[table_id][slot_idx - 1] + << "] not found"; + exit(-1); + } + float* g = g_tensor->data(); + + Variable* var = thread_scope_->FindVar(feed_vec[slot_idx]); + CHECK(var != nullptr) << "var[" << feed_vec[slot_idx] << "] not found"; + LoDTensor* tensor = var->GetMutable(); + if (tensor == NULL) { + LOG(ERROR) << "var[" << feed_vec[slot_idx] << "] not found"; + exit(-1); + } + int len = tensor->numel(); + CHECK(slot_dim * len == g_tensor->numel()) + << "len:" << len << " g_numel:" << g_tensor->numel(); + CHECK(len == tensor->numel()) << "len:" << len + << "t_numel:" << tensor->numel(); + int64_t* ids = tensor->data(); + for (auto id_idx = 0u; id_idx < len; ++id_idx) { + if (ids[id_idx] == 0) { + g += slot_dim; + continue; + } + memcpy(push_g[fea_idx].data() + offset, g, sizeof(float) * slot_dim); + push_g[fea_idx][0] = 1.0f; + CHECK(fea_idx < fea_info.size()) << "fea_idx:" << fea_idx + << " size:" << fea_info.size(); + push_g[fea_idx][1] = static_cast(fea_info[fea_idx].label); + g += slot_dim; + fea_idx++; + } + } + CHECK(fea_idx == features.size()) << "fea_idx:" << fea_idx + << " features size:" << features.size(); + CHECK_GT(features.size(), 0); + + std::vector push_g_vec; + for (auto i = 0u; i < features.size(); ++i) { + push_g_vec.push_back(push_g[i].data()); + } + auto status = _pslib_ptr->_worker_ptr->push_sparse( + table_id, features.data(), (const float**)push_g_vec.data(), + features.size()); + _push_sparse_status.push_back(std::move(status)); +} + +void AsyncExecutorThreadWorker::collect_feasign_info(int table_id) { + auto& fea_info = _fea_info[table_id]; + auto& feature = _features[table_id]; + fea_info.resize(feature.size()); + const std::vector& feed_vec = thread_reader_->GetUseSlotAlias(); + Variable* var = thread_scope_->FindVar(feed_vec[0]); + LoDTensor* tensor = var->GetMutable(); + int64_t* label = tensor->data(); + + int global_index = 0; + for (auto slot_idx = 1u; slot_idx < feed_vec.size(); ++slot_idx) { + Variable* var = thread_scope_->FindVar(feed_vec[slot_idx]); + LoDTensor* tensor = var->GetMutable(); + int64_t* ids = tensor->data(); + + int fea_idx = 0; + for (auto ins_idx = 1u; ins_idx < tensor->lod()[0].size(); ++ins_idx) { + for (; fea_idx < tensor->lod()[0][ins_idx]; ++fea_idx) { + if (ids[fea_idx] == 0u) { + continue; + } + FeasignInfo info{slot_idx, ins_idx, label[ins_idx - 1]}; + + fea_info[global_index++] = std::move(info); + } + } + } + CHECK(global_index == feature.size()) + << "expect fea info size:" << feature.size() << " real:" << global_index; +} + +void AsyncExecutorThreadWorker::check_pull_push_memory( + const std::vector& features, + std::vector>* push_g, int dim) { + push_g->resize(features.size() + 1); + for (auto& t : *push_g) { + t.resize(dim); + } +} + +void AsyncExecutorThreadWorker::check_pull_push_memory( + const std::vector& features, std::vector* push_g, + int dim) { + if (features.size() > push_g->size()) { + push_g->reserve(features.size() + 1); + auto size = features.size() - push_g->size() + 1; + for (auto i = 0u; i < size; ++i) { + float* ptr = new float[dim]; + push_g->push_back(ptr); + } + } +} +#endif + +} // einit_modelnd namespace framework +} // end namespace paddle diff --git a/paddle/fluid/framework/executor_thread_worker.h b/paddle/fluid/framework/executor_thread_worker.h new file mode 100644 index 0000000000000000000000000000000000000000..524922b0322e538d46f93011fbca3223b02d8849 --- /dev/null +++ b/paddle/fluid/framework/executor_thread_worker.h @@ -0,0 +1,245 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include // NOLINT +#include +#include +#include // NOLINT +#include +#include "paddle/fluid/framework/data_feed.h" +#include "paddle/fluid/framework/executor.h" +#include "paddle/fluid/framework/program_desc.h" +#include "paddle/fluid/framework/scope.h" +#ifdef PADDLE_WITH_PSLIB +#include +#endif + +namespace paddle { +namespace framework { + +void CreateTensor(Variable* var, proto::VarType::Type var_type); +#ifdef PADDLE_WITH_PSLIB +static const uint32_t MAX_FEASIGN_NUM = 1000 * 100 * 100; + +struct AsyncWorkerParamConfig { + int slot_dim; + int fea_dim; + int32_t tmp_push_dense_wait_times; + int32_t tmp_push_sparse_wait_times; + + std::vector skip_op; + + std::map> dense_variable_name; + std::map> dense_gradient_variable_name; + std::vector dense_table_id; + // fea_dim for each dense table + std::vector dense_table_size; + std::vector sparse_table_id; + std::map> slot_input_vec; + std::map> gradient_var; + std::map slot_alias_to_table; +}; + +struct DensePullThreadParam { + std::shared_ptr ps_client; + int threshold; + int training_thread_num; + Scope* root_scope; + std::map>* dense_params; + int sleep_time_ms = 2; +}; + +class DensePullThread { + public: + explicit DensePullThread(const DensePullThreadParam& param) + : _running(false) { + _ps_client = param.ps_client; + _threshold = param.threshold; + _thread_num = param.training_thread_num; + _root_scope = param.root_scope; + _sleep_time_ms = param.sleep_time_ms; + + for (auto& t : *param.dense_params) { + _dense_variable_name[t.first].insert(_dense_variable_name[t.first].end(), + t.second.begin(), t.second.end()); + _training_versions[t.first].resize(_thread_num, 0); + _last_versions[t.first] = 0; + _current_version[t.first] = 0; + } + } + + int start(); + + void stop() { + if (_running) { + _running = false; + _t.join(); + } + } + + void increase_thread_version(int thread_id, uint64_t table_id); + void reset_thread_version(uint64_t table_id); + std::future pull_dense(uint64_t table_id); + void pull_dense2(uint64_t table_id); + void wait_all(); + + private: + void run(); + bool check_update_param(uint64_t table_id); + + private: + std::shared_ptr _ps_client; + int _thread_num; + int _threshold; + int _sleep_time_ms; + Scope* _root_scope; + bool _running; + + std::map _last_versions; + std::map _current_version; + std::mutex _mutex_for_version; + std::map> _training_versions; + std::map> _dense_variable_name; + + std::thread _t; + + std::vector<::std::future> _pull_dense_status; + + std::map> _regions; + uint32_t _pull_dense_fail_times = 0; + + std::vector _base_norm_param; + std::vector _mean; + std::vector _scale; + float _squared_sum_epsilon = 1e-4; + std::mutex _mutex_for_mean_scale; + + float _total_batch_num = 0; +}; +#endif + +class ExecutorThreadWorker { + public: + ExecutorThreadWorker() + : thread_id_(-1), root_scope_(NULL), thread_scope_(NULL), debug_(false) {} + virtual ~ExecutorThreadWorker() {} + + void CreateThreadResource(const framework::ProgramDesc& program, + const paddle::platform::Place& place); + void SetThreadId(int tid); + void SetDebug(const bool debug) { debug_ = debug; } + void SetRootScope(Scope* g_scope); + // set cpu device in this function + // cpu binding is used by default + void SetDevice(); + // since we read data into memory that can not be accessed by program + // we need to bind memory of data with corresponding variables in program + // this function should be called after data feed is set + void BindingDataFeedMemory(); + // set data feed declared in executor + void SetDataFeed(const std::shared_ptr& datafeed); + // A multi-thread training function + virtual void TrainFiles(); + // with timer log + virtual void TrainFilesWithTimer(); + // set fetch variable names from python interface assigned by users + void SetFetchVarNames(const std::vector& fetch_var_names); +#ifdef PADDLE_WITH_PSLIB + virtual void SetPSlibPtr( + std::shared_ptr pslib_ptr) {} + virtual void SetPullDenseThread(std::shared_ptr dpt) {} + virtual void SetParamConfig(AsyncWorkerParamConfig* param_config) {} +#endif + + private: + void CreateThreadScope(const framework::ProgramDesc& program); + void CreateThreadOperators(const framework::ProgramDesc& program); + void SetMainProgram(const ProgramDesc& main_program_desc); + void SetPlace(const paddle::platform::Place& place); + + protected: + // thread index + std::shared_ptr thread_reader_; // shared queue, thread buffer + int thread_id_; + // operator name + std::vector op_names_; + // thread level, local operators for forward and backward + std::vector ops_; + // main program for training + std::unique_ptr main_program_; + // execution place + platform::Place place_; + // root scope for model parameters + Scope* root_scope_; + // a thread scope, father scope is global score which is shared + Scope* thread_scope_; + std::vector fetch_var_names_; + std::vector> fetch_values_; + bool debug_; +}; + +#ifdef PADDLE_WITH_PSLIB +class AsyncExecutorThreadWorker : public ExecutorThreadWorker { + public: + AsyncExecutorThreadWorker() {} + virtual ~AsyncExecutorThreadWorker() {} + void SetPSlibPtr(std::shared_ptr pslib_ptr); + void SetPullDenseThread(std::shared_ptr dpt); + void SetParamConfig(AsyncWorkerParamConfig* param_config); + void TrainFiles(); + void TrainOneNetwork(); + void PrepareParams(); + void UpdateParams(); + void PullSparse(int table_id); + void FillSparse(int table_id); + void PushSparse(int table_id); + void PushDense(int table_id); + + void check_pull_push_memory(const std::vector& features, + std::vector* push_g, int dim); + void check_pull_push_memory(const std::vector& features, + std::vector>* push_g, int dim); + void collect_feasign_info(int table_id); + + private: + struct FeasignInfo { + uint32_t slot; + uint32_t ins; + int64_t label; + }; + + std::map> _features; + std::map> _fea_info; + std::map>> _feature_value; + std::map>> _feature_push_value; + + std::shared_ptr _pslib_ptr; + + std::shared_ptr _pull_dense_thread; + + std::vector<::std::future> _pull_sparse_status; + std::vector<::std::future> _pull_dense_status; + std::vector<::std::future> _push_sparse_status; + std::vector<::std::future> _push_dense_status; + + AsyncWorkerParamConfig* _param_config; +}; +#endif + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/feed_fetch_method.cc b/paddle/fluid/framework/feed_fetch_method.cc index 3e9353f5cf67d8de62c5551f12ea786e49190549..96530b2a3f9cfd9462627a42b2bb0fea98758f92 100644 --- a/paddle/fluid/framework/feed_fetch_method.cc +++ b/paddle/fluid/framework/feed_fetch_method.cc @@ -16,7 +16,9 @@ limitations under the License. */ #include #include #include "glog/logging.h" +#include "paddle/fluid/framework/var_type.h" #include "paddle/fluid/framework/variable.h" +#include "paddle/fluid/platform/place.h" namespace paddle { namespace framework { @@ -42,6 +44,7 @@ LoDTensor& GetFetchVariable(const Scope& scope, const std::string& var_name, // Since we want to fetch LodTensor from a variable, the variable must // be created alreadly. Variable* g_fetch_value = scope.FindVar(var_name); + PADDLE_ENFORCE_NOT_NULL(g_fetch_value, "%s is not found.", var_name); PADDLE_ENFORCE(g_fetch_value->IsType(), "Only %s can be invoked by GetFetchVariable", typeid(FeedFetchList).name()); @@ -53,5 +56,12 @@ LoDTensor& GetFetchVariable(const Scope& scope, const std::string& var_name, return tensor; } +LoDTensor& GetVariableTensor(const Scope& scope, const std::string& var_name) { + Variable* var = scope.FindVar(var_name); + PADDLE_ENFORCE(var, "%s no in scope", var_name); + PADDLE_ENFORCE(var->IsType(), "Only support lod tensor now."); + return *var->GetMutable(); +} + } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/feed_fetch_method.h b/paddle/fluid/framework/feed_fetch_method.h index 7f504bfd232862c014cb59b6e8301eec74e0351f..031f8e01aa6128b803dcbfb990778e87d4fafc13 100644 --- a/paddle/fluid/framework/feed_fetch_method.h +++ b/paddle/fluid/framework/feed_fetch_method.h @@ -27,5 +27,7 @@ void SetFeedVariable(Scope* scope, const LoDTensor& input, LoDTensor& GetFetchVariable(const Scope& scope, const std::string& var_name, size_t index); +LoDTensor& GetVariableTensor(const Scope& scope, const std::string& var_name); + } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/fleet/CMakeLists.txt b/paddle/fluid/framework/fleet/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..7d363d1afdc8ac72741e6e4fea02fb96fe9347fa --- /dev/null +++ b/paddle/fluid/framework/fleet/CMakeLists.txt @@ -0,0 +1,5 @@ +if(WITH_PSLIB) + cc_library(fleet_wrapper SRCS fleet_wrapper.cc DEPS framework_proto variable_helper scope pslib_brpc pslib) +else() + cc_library(fleet_wrapper SRCS fleet_wrapper.cc DEPS framework_proto variable_helper scope) +endif(WITH_PSLIB) diff --git a/paddle/fluid/framework/fleet/fleet_wrapper.cc b/paddle/fluid/framework/fleet/fleet_wrapper.cc new file mode 100644 index 0000000000000000000000000000000000000000..394ff24c466622956b18b3012c146f6f9ddd838e --- /dev/null +++ b/paddle/fluid/framework/fleet/fleet_wrapper.cc @@ -0,0 +1,407 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/fleet/fleet_wrapper.h" +#include +#include "paddle/fluid/framework/data_feed.h" +#include "paddle/fluid/framework/scope.h" + +namespace paddle { +namespace framework { + +const uint32_t MAX_FEASIGN_NUM = 1024 * 100 * 100; +std::shared_ptr FleetWrapper::s_instance_ = NULL; +bool FleetWrapper::is_initialized_ = false; + +#ifdef PADDLE_WITH_PSLIB +template +paddle::ps::Archive& operator<<(paddle::ps::Archive& ar, + const MultiSlotType& ins) { + ar << ins.GetType(); + ar << ins.GetOffset(); + ar << ins.GetFloatData(); + ar << ins.GetUint64Data(); + return ar; +} + +template +paddle::ps::Archive& operator>>(paddle::ps::Archive& ar, + MultiSlotType& ins) { + ar >> ins.MutableType(); + ar >> ins.MutableOffset(); + ar >> ins.MutableFloatData(); + ar >> ins.MutableUint64Data(); + return ar; +} +#endif + +#ifdef PADDLE_WITH_PSLIB +std::shared_ptr FleetWrapper::pslib_ptr_ = NULL; +#endif + +void FleetWrapper::InitServer(const std::string& dist_desc, int index) { +#ifdef PADDLE_WITH_PSLIB + if (!is_initialized_) { + VLOG(3) << "Going to init server"; + pslib_ptr_ = std::shared_ptr( + new paddle::distributed::PSlib()); + pslib_ptr_->init_server(dist_desc, index); + is_initialized_ = true; + } else { + VLOG(3) << "Server can be initialized only once"; + } +#endif +} + +void FleetWrapper::InitWorker(const std::string& dist_desc, + const std::vector& host_sign_list, + int node_num, int index) { +#ifdef PADDLE_WITH_PSLIB + if (!is_initialized_) { + VLOG(3) << "Going to init worker"; + pslib_ptr_ = std::shared_ptr( + new paddle::distributed::PSlib()); + pslib_ptr_->init_worker(dist_desc, + const_cast(host_sign_list.data()), + node_num, index); + is_initialized_ = true; + } else { + VLOG(3) << "Worker can be initialized only once"; + } +#endif +} + +void FleetWrapper::StopServer() { +#ifdef PADDLE_WITH_PSLIB + VLOG(3) << "Going to stop server"; + pslib_ptr_->stop_server(); +#endif +} + +uint64_t FleetWrapper::RunServer() { +#ifdef PADDLE_WITH_PSLIB + VLOG(3) << "Going to run server"; + return pslib_ptr_->run_server(); +#else + return 0; +#endif +} + +void FleetWrapper::GatherServers(const std::vector& host_sign_list, + int node_num) { +#ifdef PADDLE_WITH_PSLIB + VLOG(3) << "Going to gather server ips"; + pslib_ptr_->gather_servers(const_cast(host_sign_list.data()), + node_num); +#endif +} + +void FleetWrapper::GatherClients(const std::vector& host_sign_list) { +#ifdef PADDLE_WITH_PSLIB + VLOG(3) << "Going to gather client ips"; + size_t len = host_sign_list.size(); + pslib_ptr_->gather_clients(const_cast(host_sign_list.data()), len); +#endif +} + +std::vector FleetWrapper::GetClientsInfo() { +#ifdef PADDLE_WITH_PSLIB + VLOG(3) << "Going to get client info"; + return pslib_ptr_->get_client_info(); +#endif + return std::vector(); +} + +void FleetWrapper::CreateClient2ClientConnection() { +#ifdef PADDLE_WITH_PSLIB + VLOG(3) << "Going to create client2client connection"; + pslib_ptr_->create_client2client_connection(); +#endif +} + +void FleetWrapper::PullSparseVarsSync( + const Scope& scope, const uint64_t table_id, + const std::vector& var_names, std::vector* fea_keys, + std::vector>* fea_values, int fea_value_dim) { +#ifdef PADDLE_WITH_PSLIB + std::vector<::std::future> pull_sparse_status; + pull_sparse_status.resize(0); + fea_keys->clear(); + fea_keys->resize(0); + fea_keys->reserve(MAX_FEASIGN_NUM); + for (auto name : var_names) { + Variable* var = scope.FindVar(name); + LoDTensor* tensor = var->GetMutable(); + int64_t* ids = tensor->data(); + int len = tensor->numel(); + for (auto i = 0u; i < len; ++i) { + if (ids[i] == 0u) { + continue; + } + fea_keys->push_back(static_cast(ids[i])); + } + } + fea_values->resize(fea_keys->size() + 1); + for (auto& t : *fea_values) { + t.resize(fea_value_dim); + } + std::vector pull_result_ptr; + for (auto& t : *fea_values) { + pull_result_ptr.push_back(t.data()); + } + auto status = pslib_ptr_->_worker_ptr->pull_sparse( + pull_result_ptr.data(), table_id, fea_keys->data(), fea_keys->size()); + pull_sparse_status.push_back(std::move(status)); + for (auto& t : pull_sparse_status) { + t.wait(); + auto status = t.get(); + if (status != 0) { + LOG(ERROR) << "fleet pull sparse failed, status[" << status << "]"; + exit(-1); + } + } +#endif +} + +void FleetWrapper::PullDenseVarsAsync( + const Scope& scope, const uint64_t tid, + const std::vector& var_names, + std::vector<::std::future>* pull_dense_status) { +#ifdef PADDLE_WITH_PSLIB + auto& regions = _regions[tid]; + regions.clear(); + regions.resize(var_names.size()); + for (auto i = 0u; i < var_names.size(); ++i) { + Variable* var = scope.FindVar(var_names[i]); + LoDTensor* tensor = var->GetMutable(); + float* w = tensor->data(); + paddle::ps::Region reg(w, tensor->numel()); + regions[i] = std::move(reg); + } + auto status = + pslib_ptr_->_worker_ptr->pull_dense(regions.data(), regions.size(), tid); + pull_dense_status->push_back(std::move(status)); +#endif +} + +void FleetWrapper::PullDenseVarsSync( + const Scope& scope, const uint64_t tid, + const std::vector& var_names) { +#ifdef PADDLE_WITH_PSLIB + auto& regions = _regions[tid]; + regions.clear(); + regions.reserve(var_names.size()); + for (auto& t : var_names) { + Variable* var = scope.FindVar(t); + LoDTensor* tensor = var->GetMutable(); + float* w = tensor->data(); + paddle::ps::Region reg(w, tensor->numel()); + regions.emplace_back(std::move(reg)); + } + auto status = + pslib_ptr_->_worker_ptr->pull_dense(regions.data(), regions.size(), tid); + status.wait(); +#endif +} + +void FleetWrapper::PushDenseParamSync( + const Scope& scope, const uint64_t table_id, + const std::vector& var_names) { +#ifdef PADDLE_WITH_PSLIB + auto place = platform::CPUPlace(); + std::vector regions; + for (auto& t : var_names) { + Variable* var = scope.FindVar(t); + CHECK(var != nullptr) << "var[" << t << "] not found"; + LoDTensor* tensor = var->GetMutable(); + float* g = tensor->mutable_data(place); + paddle::ps::Region reg(g, tensor->numel()); + regions.emplace_back(std::move(reg)); + } + auto push_status = pslib_ptr_->_worker_ptr->push_dense_param( + regions.data(), regions.size(), table_id); + push_status.wait(); + auto status = push_status.get(); + CHECK(status == 0) << "push dense param failed, status[" << status << "]"; +#endif +} + +void FleetWrapper::PushDenseVarsSync( + Scope* scope, const uint64_t table_id, + const std::vector& var_names) {} + +void FleetWrapper::PushDenseVarsAsync( + const Scope& scope, const uint64_t table_id, + const std::vector& var_names, + std::vector<::std::future>* push_sparse_status) { +#ifdef PADDLE_WITH_PSLIB + std::vector regions; + for (auto& t : var_names) { + Variable* var = scope.FindVar(t); + LoDTensor* tensor = var->GetMutable(); + int count = tensor->numel(); + float* g = tensor->data(); + paddle::ps::Region reg(g, count); + regions.emplace_back(std::move(reg)); + } + auto status = pslib_ptr_->_worker_ptr->push_dense(regions.data(), + regions.size(), table_id); + push_sparse_status->push_back(std::move(status)); +#endif +} + +void FleetWrapper::PushSparseVarsWithLabelAsync( + const Scope& scope, const uint64_t table_id, + const std::vector& fea_keys, const std::vector& fea_labels, + const std::vector& sparse_key_names, + const std::vector& sparse_grad_names, const int emb_dim, + std::vector>* push_values, + std::vector<::std::future>* push_sparse_status) { +#ifdef PADDLE_WITH_PSLIB + int offset = 2; + uint64_t fea_idx = 0u; + for (size_t i = 0; i < sparse_key_names.size(); ++i) { + Variable* g_var = scope.FindVar(sparse_grad_names[i]); + CHECK(g_var != nullptr) << "var[" << sparse_grad_names[i] << "] not found"; + LoDTensor* g_tensor = g_var->GetMutable(); + if (g_tensor == NULL) { + LOG(ERROR) << "var[" << sparse_key_names[i] << "] not found"; + exit(-1); + } + float* g = g_tensor->data(); + Variable* var = scope.FindVar(sparse_key_names[i]); + CHECK(var != nullptr) << "var[" << sparse_key_names[i] << "] not found"; + LoDTensor* tensor = var->GetMutable(); + if (tensor == NULL) { + LOG(ERROR) << "var[" << sparse_key_names[i] << "] not found"; + exit(-1); + } + int len = tensor->numel(); + int64_t* ids = tensor->data(); + push_values->resize(fea_keys.size() + 1); + for (auto& t : *push_values) { + t.resize(emb_dim + offset); + } + + for (auto id_idx = 0u; id_idx < len; ++id_idx) { + if (ids[id_idx] == 0) { + g += emb_dim; + continue; + } + CHECK(fea_idx < (*push_values).size()); + CHECK(fea_idx < fea_labels.size()); + memcpy((*push_values)[fea_idx].data() + offset, g, + sizeof(float) * emb_dim); + (*push_values)[fea_idx][0] = 1.0f; + (*push_values)[fea_idx][1] = static_cast(fea_labels[fea_idx]); + g += emb_dim; + fea_idx++; + } + } + CHECK(fea_idx == fea_keys.size()) << "fea_idx: " << fea_idx + << "features size: " << fea_keys.size(); + std::vector push_g_vec; + for (auto i = 0u; i < fea_keys.size(); ++i) { + push_g_vec.push_back((*push_values)[i].data()); + } + auto status = pslib_ptr_->_worker_ptr->push_sparse( + table_id, fea_keys.data(), (const float**)push_g_vec.data(), + fea_keys.size()); + push_sparse_status->push_back(std::move(status)); + +#endif +} + +int FleetWrapper::RegisterClientToClientMsgHandler(int msg_type, + MsgHandlerFunc handler) { +#ifdef PADDLE_WITH_PSLIB + VLOG(3) << "calling FleetWrapper::RegisterClientToClientMsgHandler"; + VLOG(3) << "pslib_ptr_=" << pslib_ptr_; + VLOG(3) << "_worker_ptr=" << pslib_ptr_->_worker_ptr; + return pslib_ptr_->_worker_ptr->registe_client2client_msg_handler(msg_type, + handler); +#else + VLOG(0) << "FleetWrapper::RegisterClientToClientMsgHandler" + << " does nothing when no pslib"; +#endif + return 0; +} + +std::future FleetWrapper::SendClientToClientMsg( + int msg_type, int to_client_id, const std::string& msg) { +#ifdef PADDLE_WITH_PSLIB + return pslib_ptr_->_worker_ptr->send_client2client_msg(msg_type, to_client_id, + msg); +#else + VLOG(0) << "FleetWrapper::SendClientToClientMsg" + << " does nothing when no pslib"; +#endif + return std::future(); +} + +template +void FleetWrapper::Serialize(const std::vector& t, std::string* str) { +#ifdef PADDLE_WITH_PSLIB + paddle::ps::BinaryArchive ar; + for (size_t i = 0; i < t.size(); ++i) { + ar << *(t[i]); + } + *str = std::string(ar.buffer(), ar.length()); +#else + VLOG(0) << "FleetWrapper::Serialize does nothing when no pslib"; +#endif +} + +template +void FleetWrapper::Deserialize(std::vector* t, const std::string& str) { +#ifdef PADDLE_WITH_PSLIB + if (str.length() == 0) { + return; + } + paddle::ps::BinaryArchive ar; + ar.set_read_buffer(const_cast(str.c_str()), str.length(), nullptr); + if (ar.cursor() == ar.finish()) { + return; + } + while (ar.cursor() < ar.finish()) { + t->push_back(ar.get()); + } + CHECK(ar.cursor() == ar.finish()); + VLOG(3) << "Deserialize size " << t->size(); +#else + VLOG(0) << "FleetWrapper::Deserialize does nothing when no pslib"; +#endif +} + +template void FleetWrapper::Serialize>( + const std::vector*>&, std::string*); +template void FleetWrapper::Deserialize>( + std::vector>*, const std::string&); + +} // end namespace framework +} // end namespace paddle diff --git a/paddle/fluid/framework/fleet/fleet_wrapper.h b/paddle/fluid/framework/fleet/fleet_wrapper.h new file mode 100644 index 0000000000000000000000000000000000000000..386e711ff71dbf978cbcb620589490d3f06d3c53 --- /dev/null +++ b/paddle/fluid/framework/fleet/fleet_wrapper.h @@ -0,0 +1,165 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#ifdef PADDLE_WITH_PSLIB +#include +#include +#endif +#include +#include +#include +#include +#include +#include +#include "paddle/fluid/framework/program_desc.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/framework/variable_helper.h" +#include "paddle/fluid/platform/macros.h" // for DISABLE_COPY_AND_ASSIGN + +namespace paddle { +namespace framework { + +// A wrapper class for pslib.h, this class follows Singleton pattern +// i.e. only initialized once in the current process +// Example: +// std::shared_ptr fleet_ptr = +// FleetWrapper::GetInstance(); +// string dist_desc; +// fleet_ptr->InitServer(dist_desc, 0); +// interface design principles: +// Pull +// Sync: PullSparseVarsSync +// Async: PullSparseVarsAsync(not implemented currently) +// Push +// Sync: PushSparseVarsSync +// Async: PushSparseVarsAsync(not implemented currently) +// Async: PushSparseVarsWithLabelAsync(with special usage) +// Push dense variables to server in Async mode +// Param: scope, table_id, var_names +// Param: push_sparse_status + +class FleetWrapper { + public: + virtual ~FleetWrapper() {} + FleetWrapper() {} + // Pull sparse variables from server in Sync mode + // Param: scope, table_id, var_names, fea_keys + // Param: fea_values + void PullSparseVarsSync(const Scope& scope, const uint64_t table_id, + const std::vector& var_names, + std::vector* fea_keys, + std::vector>* fea_values, + int fea_dim); + + void PullDenseVarsSync(const Scope& scope, const uint64_t table_id, + const std::vector& var_names); + + void PullDenseVarsAsync( + const Scope& scope, const uint64_t table_id, + const std::vector& var_names, + std::vector<::std::future>* pull_dense_status); + + void PushDenseParamSync(const Scope& scope, const uint64_t table_id, + const std::vector& var_names); + + // Push dense variables to server in async mode + // Param: scope, table_id, var_names, + // Param: push_sparse_status + void PushDenseVarsAsync( + const Scope& scope, const uint64_t table_id, + const std::vector& var_names, + std::vector<::std::future>* push_sparse_status); + + void PushDenseVarsSync(Scope* scope, const uint64_t table_id, + const std::vector& var_names); + + // Push sparse variables with labels to server in Async mode + // This is specially designed for click/show stats in server + // Param: scope, table_id, var_grad_names, + // fea_keys, fea_labels, sparse_grad_names + // Param: push_values, push_sparse_status + void PushSparseVarsWithLabelAsync( + const Scope& scope, const uint64_t table_id, + const std::vector& fea_keys, + const std::vector& fea_labels, + const std::vector& sparse_key_names, + const std::vector& sparse_grad_names, const int emb_dim, + std::vector>* push_values, + std::vector<::std::future>* push_sparse_status); + + // Push sparse variables to server in Async mode + // Param: scope, table_id, fea_keys, sparse_grad_names + // Param: push_values, push_sparse_status + /* + void PushSparseVarsAsync( + const Scope& scope, + const uint64_t table_id, + const std::vector& fea_keys, + const std::vector& sparse_grad_names, + std::vector>* push_values, + std::vector<::std::future>* push_sparse_status); + */ + + void InitServer(const std::string& dist_desc, int index); + void InitWorker(const std::string& dist_desc, + const std::vector& host_sign_list, int node_num, + int index); + void StopServer(); + uint64_t RunServer(); + void GatherServers(const std::vector& host_sign_list, int node_num); + // gather client ip + void GatherClients(const std::vector& host_sign_list); + // get client info + std::vector GetClientsInfo(); + // create client to client connection + void CreateClient2ClientConnection(); + + // register client to client communication + typedef std::function MsgHandlerFunc; + int RegisterClientToClientMsgHandler(int msg_type, MsgHandlerFunc handler); + // send client to client message + std::future SendClientToClientMsg(int msg_type, int to_client_id, + const std::string& msg); + + template + void Serialize(const std::vector& t, std::string* str); + template + void Deserialize(std::vector* t, const std::string& str); + static std::shared_ptr GetInstance() { + if (NULL == s_instance_) { + s_instance_.reset(new paddle::framework::FleetWrapper()); + } + return s_instance_; + } + +#ifdef PADDLE_WITH_PSLIB + static std::shared_ptr pslib_ptr_; +#endif + + private: + static std::shared_ptr s_instance_; +#ifdef PADDLE_WITH_PSLIB + std::map> _regions; +#endif + + protected: + static bool is_initialized_; + DISABLE_COPY_AND_ASSIGN(FleetWrapper); +}; + +} // end namespace framework +} // end namespace paddle diff --git a/paddle/fluid/framework/framework.proto b/paddle/fluid/framework/framework.proto index 25f0ba418433571343c5b2bbfdbf9fb940eaec52..efdabffb9b33ddf007c13008d0f3afb7a3961eda 100644 --- a/paddle/fluid/framework/framework.proto +++ b/paddle/fluid/framework/framework.proto @@ -35,6 +35,7 @@ enum AttrType { BLOCK = 8; LONG = 9; BLOCKS = 10; + LONGS = 11; } // OpDesc describes an instance of a C++ framework::OperatorBase @@ -55,6 +56,7 @@ message OpDesc { optional int32 block_idx = 12; optional int64 l = 13; repeated int32 blocks_idx = 14; + repeated int64 longs = 15; }; message Var { @@ -80,7 +82,6 @@ message OpProto { optional bool duplicable = 3 [ default = false ]; optional bool intermediate = 4 [ default = false ]; optional bool dispensable = 5 [ default = false ]; - optional string reuse = 6; } // AttrProto describes the C++ type Attribute. diff --git a/paddle/fluid/framework/garbage_collector.cc b/paddle/fluid/framework/garbage_collector.cc new file mode 100644 index 0000000000000000000000000000000000000000..789b2ef80ec09a69ca227a27c61dd58e58a2fc04 --- /dev/null +++ b/paddle/fluid/framework/garbage_collector.cc @@ -0,0 +1,131 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include +#include +#include +#include // NOLINT +#include +#ifdef PADDLE_WITH_CUDA +#include "paddle/fluid/platform/cuda_device_guard.h" +#endif +#include "gflags/gflags.h" +#include "glog/logging.h" +#include "paddle/fluid/framework/garbage_collector.h" + +namespace paddle { +namespace framework { + +DEFINE_double( + eager_delete_tensor_gb, -1.0, + "Memory size threshold (GB) when the garbage collector clear tensors." + "Disabled when this value is less than 0"); + +DEFINE_bool(fast_eager_deletion_mode, true, + "Fast eager deletion mode. If enabled, memory would release " + "immediately without waiting GPU kernel ends."); + +DEFINE_double(memory_fraction_of_eager_deletion, 1.0, + "Fraction of eager deletion. If less than 1.0, all variables in " + "the program would be sorted according to its memory size, and " + "only the FLAGS_memory_fraction_of_eager_deletion of the largest " + "variables would be deleted."); + +GarbageCollector::GarbageCollector(const platform::Place &place, + size_t max_memory_size) + : max_memory_size_((std::max)(max_memory_size, static_cast(1))) { + garbages_.reset(new GarbageQueue()); + dev_ctx_ = platform::DeviceContextPool::Instance().Get(place); +} + +CPUGarbageCollector::CPUGarbageCollector(const platform::CPUPlace &place, + size_t max_memory_size) + : GarbageCollector(place, max_memory_size) {} + +void CPUGarbageCollector::ClearCallback(const std::function &callback) { + callback(); +} + +#ifdef PADDLE_WITH_CUDA +UnsafeFastGPUGarbageCollector::UnsafeFastGPUGarbageCollector( + const platform::CUDAPlace &place, size_t max_memory_size) + : GarbageCollector(place, max_memory_size) {} + +void UnsafeFastGPUGarbageCollector::ClearCallback( + const std::function &callback) { + callback(); +} + +DefaultStreamGarbageCollector::DefaultStreamGarbageCollector( + const platform::CUDAPlace &place, size_t max_memory_size) + : GarbageCollector(place, max_memory_size) {} + +void DefaultStreamGarbageCollector::Wait() const { + static_cast(this->dev_ctx_) + ->WaitStreamCallback(); +} + +void DefaultStreamGarbageCollector::ClearCallback( + const std::function &callback) { + static_cast(this->dev_ctx_) + ->AddStreamCallback(callback); +} + +StreamGarbageCollector::StreamGarbageCollector(const platform::CUDAPlace &place, + size_t max_memory_size) + : GarbageCollector(place, max_memory_size) { + platform::CUDADeviceGuard guard(place.device); + PADDLE_ENFORCE(cudaStreamCreate(&stream_)); + callback_manager_.reset(new platform::StreamCallbackManager(stream_)); +} + +StreamGarbageCollector::~StreamGarbageCollector() { + auto place = boost::get(this->dev_ctx_->GetPlace()); + platform::CUDADeviceGuard guard(place.device); + PADDLE_ENFORCE(cudaStreamSynchronize(stream_)); + PADDLE_ENFORCE(cudaStreamDestroy(stream_)); +} + +cudaStream_t StreamGarbageCollector::stream() const { return stream_; } + +void StreamGarbageCollector::Wait() const { callback_manager_->Wait(); } + +void StreamGarbageCollector::ClearCallback( + const std::function &callback) { + callback_manager_->AddCallback(callback); +} +#endif + +int64_t GetEagerDeletionThreshold() { + return FLAGS_eager_delete_tensor_gb < 0 + ? -1 + : static_cast(FLAGS_eager_delete_tensor_gb * + (static_cast(1) << 30)); +} + +bool IsFastEagerDeletionModeEnabled() { return FLAGS_fast_eager_deletion_mode; } + +void SetEagerDeletionMode(double threshold, double fraction, bool fast_mode) { + FLAGS_eager_delete_tensor_gb = threshold; + FLAGS_memory_fraction_of_eager_deletion = fraction; + FLAGS_fast_eager_deletion_mode = fast_mode; +} + +double GetEagerDeletionMemoryFraction() { + return FLAGS_memory_fraction_of_eager_deletion; +} + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/garbage_collector.h b/paddle/fluid/framework/garbage_collector.h index b403252c972d26da6deeca54ce88a9547ffe7afa..f0b504627ae0cd99c8b4b15df3dcfc39a56507f2 100644 --- a/paddle/fluid/framework/garbage_collector.h +++ b/paddle/fluid/framework/garbage_collector.h @@ -14,144 +14,85 @@ #pragma once -#include #include #include #include #include // NOLINT +#include +#include "gflags/gflags.h" #include "paddle/fluid/platform/device_context.h" namespace paddle { namespace framework { -// T should have memory_size() and clear() method -template class GarbageCollector { public: - GarbageCollector(const platform::Place &place, size_t max_memory_size) - : max_memory_size_(std::max(max_memory_size, static_cast(1))) { - garbages_.reset(new std::deque()); - dev_ctx_ = platform::DeviceContextPool::Instance().Get(place); - } + using GarbageQueue = std::deque>; - virtual ~GarbageCollector() {} + GarbageCollector(const platform::Place &place, size_t max_memory_size); - void Reset() { - std::lock_guard guard(mutex_); - garbages_.reset(new std::deque()); - cur_memory_size_ = 0; - } + virtual ~GarbageCollector() = default; + + virtual void Wait() const {} template - void Add(const Container &objs) { - Add(objs, []() {}); - } + void Add(Container &&objs); template - void Add(const Container &objs, Callback &&callback) { - std::shared_ptr> clear_deque; - { - std::lock_guard guard(mutex_); - for (auto *obj : objs) { - garbages_->push_back(obj); - cur_memory_size_ += obj->memory_size(); - } - if (cur_memory_size_ >= max_memory_size_) { - cur_memory_size_ = 0; - clear_deque = garbages_; - garbages_.reset(new std::deque()); - } - } - - if (clear_deque != nullptr) { - callback(); - ClearCallback([=]() { - for (auto *obj : *clear_deque) obj->clear(); - }); - } - } - - virtual void Wait() const {} + void Add(Container &&objs, Callback &&callback); protected: virtual void ClearCallback(const std::function &callback) = 0; platform::DeviceContext *dev_ctx_; - std::shared_ptr> garbages_; + std::unique_ptr garbages_; mutable std::mutex mutex_; const size_t max_memory_size_; - size_t cur_memory_size_ = 0; + size_t cur_memory_size_{0}; }; -template -class CPUGarbageCollector : public GarbageCollector { +class CPUGarbageCollector : public GarbageCollector { public: - CPUGarbageCollector(const platform::CPUPlace &place, size_t max_memory_size) - : GarbageCollector(place, max_memory_size) {} + CPUGarbageCollector(const platform::CPUPlace &place, size_t max_memory_size); protected: - void ClearCallback(const std::function &callback) override { - callback(); - } + void ClearCallback(const std::function &callback) override; }; #ifdef PADDLE_WITH_CUDA -template -class DefaultStreamGarbageCollector : public GarbageCollector { +class UnsafeFastGPUGarbageCollector : public GarbageCollector { public: - DefaultStreamGarbageCollector(const platform::CUDAPlace &place, - size_t max_memory_size) - : GarbageCollector(place, max_memory_size) {} + UnsafeFastGPUGarbageCollector(const platform::CUDAPlace &place, + size_t max_memory_size); - cudaStream_t stream() const { - return static_cast(this->dev_ctx_) - ->stream(); - } + protected: + void ClearCallback(const std::function &callback) override; +}; - void Wait() const override { - this->dev_ctx_->Wait(); - static_cast(this->dev_ctx_) - ->WaitStreamCallback(); - } +class DefaultStreamGarbageCollector : public GarbageCollector { + public: + DefaultStreamGarbageCollector(const platform::CUDAPlace &place, + size_t max_memory_size); + + void Wait() const override; protected: - void ClearCallback(const std::function &callback) override { - static_cast(this->dev_ctx_) - ->AddStreamCallback(callback); - } + void ClearCallback(const std::function &callback) override; }; -template -class StreamGarbageCollector : public GarbageCollector { +class StreamGarbageCollector : public GarbageCollector { public: StreamGarbageCollector(const platform::CUDAPlace &place, - size_t max_memory_size) - : GarbageCollector(place, max_memory_size) { - PADDLE_ENFORCE(cudaSetDevice(place.device)); - PADDLE_ENFORCE(cudaStreamCreate(&stream_)); - callback_manager_.reset(new platform::StreamCallbackManager(stream_)); - } + size_t max_memory_size); - ~StreamGarbageCollector() { - auto place = boost::get(this->dev_ctx_->GetPlace()); - PADDLE_ENFORCE(cudaSetDevice(place.device)); - PADDLE_ENFORCE(cudaStreamSynchronize(stream_)); - PADDLE_ENFORCE(cudaStreamDestroy(stream_)); - } + ~StreamGarbageCollector(); - void Wait() const override { - PADDLE_ENFORCE(cudaStreamSynchronize(stream_)); - std::lock_guard guard(this->mutex_); - callback_manager_->Wait(); - } + void Wait() const override; - cudaStream_t stream() const { return stream_; } + cudaStream_t stream() const; protected: - void ClearCallback(const std::function &callback) override { - std::lock_guard guard(this->mutex_); - callback_manager_->AddCallback(callback); - } + void ClearCallback(const std::function &callback) override; private: cudaStream_t stream_; @@ -159,5 +100,40 @@ class StreamGarbageCollector : public GarbageCollector { }; #endif +template +void GarbageCollector::Add(Container &&objs) { + Add(std::forward(objs), []() {}); +} + +template +void GarbageCollector::Add(Container &&objs, Callback &&callback) { + GarbageQueue *garbage_queue = nullptr; + { + std::lock_guard guard(mutex_); + for (auto &obj : objs) { + if (!obj) continue; + cur_memory_size_ += obj->size(); + garbages_->push_back(std::move(obj)); + } + if (cur_memory_size_ >= max_memory_size_) { + cur_memory_size_ = 0; + garbage_queue = garbages_.release(); + garbages_.reset(new GarbageQueue()); + } + } + + if (garbage_queue) { + callback(); + ClearCallback([garbage_queue]() { delete garbage_queue; }); + } +} + +int64_t GetEagerDeletionThreshold(); +bool IsFastEagerDeletionModeEnabled(); + +void SetEagerDeletionMode(double threshold, double fraction, bool fast_mode); + +double GetEagerDeletionMemoryFraction(); + } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/grad_op_desc_maker.h b/paddle/fluid/framework/grad_op_desc_maker.h index 9bccb1a32bf63b30351ef4428594691b0eef0b6a..25a64b69ae8b459d6daefb502e9fba84b5bcf3ba 100644 --- a/paddle/fluid/framework/grad_op_desc_maker.h +++ b/paddle/fluid/framework/grad_op_desc_maker.h @@ -14,7 +14,9 @@ limitations under the License. */ #pragma once #include +#include #include +#include #include #include #include "paddle/fluid/framework/op_desc.h" @@ -55,11 +57,11 @@ class GradOpDescMakerBase { std::back_inserter(ret_val), [this](const std::string& fwd_var_name) -> std::string { auto g_name = GradVarName(fwd_var_name); - if (no_grad_set_.count(g_name)) { - return kEmptyVarName; - } else { + if (no_grad_set_.empty() || !no_grad_set_.count(g_name)) { (*this->grad_to_var_)[g_name] = fwd_var_name; return g_name; + } else { + return kEmptyVarName; } }); if (!drop_empty_grad) { @@ -145,7 +147,7 @@ class SingleGradOpDescMaker : public GradOpDescMakerBase { public: using GradOpDescMakerBase::GradOpDescMakerBase; - std::vector> operator()() const { + std::vector> operator()() const final { std::vector> retv; retv.emplace_back(this->Apply()); return retv; @@ -156,14 +158,14 @@ class SingleGradOpDescMaker : public GradOpDescMakerBase { }; template -class DefaultGradOpDescMaker : public SingleGradOpDescMaker { +class DefaultGradOpDescMaker final : public SingleGradOpDescMaker { public: using SingleGradOpDescMaker::SingleGradOpDescMaker; protected: - virtual std::unique_ptr Apply() const { + std::unique_ptr Apply() const final { auto* grad = new OpDesc(); - grad->SetType(this->GradOpType()); + grad->SetType(this->ForwardOpType() + "_grad"); for (auto& input_param : this->InputNames()) { grad->SetInput(input_param, this->Input(input_param)); @@ -180,18 +182,12 @@ class DefaultGradOpDescMaker : public SingleGradOpDescMaker { return std::unique_ptr(grad); } - - virtual std::string GradOpType() const { - return this->ForwardOpType() + "_grad"; - } }; -class EmptyGradOpMaker : public GradOpDescMakerBase { +class EmptyGradOpMaker final : public GradOpDescMakerBase { public: using GradOpDescMakerBase::GradOpDescMakerBase; - std::vector> operator()() const override { - return {}; - } + std::vector> operator()() const final { return {}; } }; } // namespace framework diff --git a/paddle/fluid/framework/hogwild_worker.cc b/paddle/fluid/framework/hogwild_worker.cc new file mode 100644 index 0000000000000000000000000000000000000000..75c985d10f3b24cc1a49f2e6f87a89550f170c5d --- /dev/null +++ b/paddle/fluid/framework/hogwild_worker.cc @@ -0,0 +1,177 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/device_worker.h" +#include "paddle/fluid/framework/device_worker_factory.h" +#include "paddle/fluid/platform/cpu_helper.h" +#include "paddle/fluid/platform/lodtensor_printer.h" + +namespace paddle { +namespace framework { + +void HogwildWorker::Initialize(const TrainerDesc& desc) { + fetch_config_ = desc.fetch_config(); + param_ = desc.hogwild_param(); + skip_ops_.resize(param_.skip_ops_size()); + for (size_t i = 0; i < param_.skip_ops_size(); ++i) { + skip_ops_[i] = param_.skip_ops(i); + } +} + +void HogwildWorker::CreateThreadOperators(const ProgramDesc& program) { + auto& block = program.Block(0); + op_names_.clear(); + for (auto& op_desc : block.AllOps()) { + std::unique_ptr local_op = OpRegistry::CreateOp(*op_desc); + op_names_.push_back(op_desc->Type()); + OperatorBase* local_op_ptr = local_op.release(); + ops_.push_back(local_op_ptr); + continue; + } +} + +void HogwildWorker::CreateThreadScope(const ProgramDesc& program) { + auto& block = program.Block(0); + + PADDLE_ENFORCE_NOT_NULL( + root_scope_, "root_scope should be set before creating thread scope"); + + thread_scope_ = &root_scope_->NewScope(); + for (auto& var : block.AllVars()) { + if (var->Persistable()) { + auto* ptr = root_scope_->Var(var->Name()); + InitializeVariable(ptr, var->GetType()); + } else { + auto* ptr = thread_scope_->Var(var->Name()); + InitializeVariable(ptr, var->GetType()); + } + } +} + +void HogwildWorker::BindingDataFeedMemory() { + const std::vector& input_feed = + device_reader_->GetUseSlotAlias(); + for (auto name : input_feed) { + device_reader_->AddFeedVar(thread_scope_->Var(name), name); + } +} + +void HogwildWorker::CreateDeviceResource(const ProgramDesc& main_prog) { + CreateThreadScope(main_prog); + CreateThreadOperators(main_prog); +} + +void HogwildWorker::TrainFilesWithProfiler() { + platform::SetNumThreads(1); + device_reader_->Start(); + std::vector op_total_time; + std::vector op_name; + for (auto& op : ops_) { + op_name.push_back(op->Type()); + } + op_total_time.resize(ops_.size()); + for (size_t i = 0; i < op_total_time.size(); ++i) { + op_total_time[i] = 0.0; + } + platform::Timer timeline; + double total_time = 0.0; + double read_time = 0.0; + int cur_batch; + int batch_cnt = 0; + timeline.Start(); + uint64_t total_inst = 0; + while ((cur_batch = device_reader_->Next()) > 0) { + VLOG(3) << "read a batch in thread " << thread_id_; + timeline.Pause(); + read_time += timeline.ElapsedSec(); + total_time += timeline.ElapsedSec(); + for (size_t i = 0; i < ops_.size(); ++i) { + bool need_skip = false; + for (auto t = 0u; t < skip_ops_.size(); ++t) { + if (ops_[i]->Type().find(skip_ops_[t]) != std::string::npos) { + need_skip = true; + break; + } + } + timeline.Start(); + VLOG(3) << "Going to run op " << op_name[i]; + if (!need_skip) { + ops_[i]->Run(*thread_scope_, place_); + } + VLOG(3) << "Op " << op_name[i] << " Finished"; + timeline.Pause(); + op_total_time[i] += timeline.ElapsedSec(); + total_time += timeline.ElapsedSec(); + } + total_inst += cur_batch; + ++batch_cnt; + PrintFetchVars(); + if (thread_id_ == 0) { + if (batch_cnt > 0 && batch_cnt % 100 == 0) { + for (size_t i = 0; i < ops_.size(); ++i) { + fprintf(stderr, "op_name:[%zu][%s], op_mean_time:[%fs]\n", i, + op_name[i].c_str(), op_total_time[i] / batch_cnt); + } + fprintf(stderr, "mean read time: %fs\n", read_time / batch_cnt); + fprintf(stderr, "IO percent: %f\n", read_time / total_time * 100); + fprintf(stderr, "%6.2f instances/s\n", total_inst / total_time); + } + } + thread_scope_->DropKids(); + timeline.Start(); + } +} + +void HogwildWorker::TrainFiles() { + platform::SetNumThreads(1); + + // how to accumulate fetched values here + device_reader_->Start(); + int cur_batch; + while ((cur_batch = device_reader_->Next()) > 0) { + for (auto& op : ops_) { + bool need_skip = false; + for (auto t = 0u; t < skip_ops_.size(); ++t) { + if (op->Type().find(skip_ops_[t]) != std::string::npos) { + need_skip = true; + break; + } + } + if (!need_skip) { + op->Run(*thread_scope_, place_); + } + } + + PrintFetchVars(); + thread_scope_->DropKids(); + } +} + +void HogwildWorker::PrintFetchVars() { + // call count + batch_num_++; + int batch_per_print = fetch_config_.print_period(); + if (thread_id_ == 0) { + if (batch_num_ % batch_per_print == 0) { + int fetch_var_num = fetch_config_.fetch_var_names_size(); + for (int i = 0; i < fetch_var_num; ++i) { + platform::PrintVar(thread_scope_, fetch_config_.fetch_var_names(i), + fetch_config_.fetch_var_str_format(i)); + } + } + } +} + +} // end namespace framework +} // end namespace paddle diff --git a/paddle/fluid/framework/inplace_op_inference.h b/paddle/fluid/framework/inplace_op_inference.h new file mode 100644 index 0000000000000000000000000000000000000000..df46d4f9a805b6e497a6f939e91ecf7dc395e7f0 --- /dev/null +++ b/paddle/fluid/framework/inplace_op_inference.h @@ -0,0 +1,82 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include +#include +#include +#include "glog/logging.h" +#include "paddle/fluid/framework/details/memory_optimize_helper.h" +#include "paddle/fluid/framework/op_desc.h" +#include "paddle/fluid/framework/type_defs.h" + +namespace paddle { +namespace framework { + +/* + Inplace Inference for create In->Out pairs for inplaced operator. + If we specify a pair of corresponding names. For example, X->Out. + then Out will inplaced use X's memory. The base class will do + legality validation for both variables. +*/ + +class InplaceOpInference { + public: + virtual ~InplaceOpInference() {} + virtual std::unordered_map operator()( + const OpDesc& op_desc) const = 0; +}; + +/* + Inplace In and Out for operator only have an Input and an Output. + For example, activation op. + */ +class SingleOpInplaceInToOut : public InplaceOpInference { + public: + std::unordered_map operator()( + const OpDesc& op_desc) const override { + PADDLE_ENFORCE(!op_desc.InputNames().empty(), + "Op inputs must not be empty"); + PADDLE_ENFORCE(!op_desc.OutputNames().empty(), + "Op outputs must not be empty"); + auto x_name = op_desc.InputNames().at(0); + auto out_name = op_desc.OutputNames().at(0); + return std::unordered_map{{x_name, out_name}}; + } +}; + +/* + Gradient op. Inplace output use it's Input. + For example, Input@Grad->Input reuse strategy. + */ +class GradOpInplaceInToOut : public InplaceOpInference { + public: + std::unordered_map operator()( + const OpDesc& op_desc) const override { + std::unordered_map ret; + std::unordered_set output_names(op_desc.OutputNames().begin(), + op_desc.OutputNames().end()); + for (auto& input_name : op_desc.InputNames()) { + if (output_names.count(GradVarName(input_name))) { + ret.insert({input_name, GradVarName(input_name)}); + } + } + return ret; + } +}; + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/inplace_op_inference_test.cc b/paddle/fluid/framework/inplace_op_inference_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..a9b3b889229ee46bf66063c8381bdd02c7229cbd --- /dev/null +++ b/paddle/fluid/framework/inplace_op_inference_test.cc @@ -0,0 +1,316 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. */ + +#include +#include +#include +#include +#include +#include "gtest/gtest.h" +#include "paddle/fluid/framework/details/inplace_op_pass.h" +#include "paddle/fluid/framework/ir/pass_builder.h" +#include "paddle/fluid/framework/op_info.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/framework/program_desc.h" +#include "paddle/fluid/framework/var_type_inference.h" + +namespace paddle { +namespace framework { + +class NOP : public OperatorBase { + public: + NOP(const std::string& type, const VariableNameMap& inputs, + const VariableNameMap& outputs, const AttributeMap& attrs) + : OperatorBase(type, inputs, outputs, attrs) {} + + private: + void RunImpl(const Scope& scope, + const platform::Place& place) const override {} +}; + +class SingleOpMaker : public OpProtoAndCheckerMaker { + public: + void Make() { + AddInput("X", "").AsDuplicable(); + AddOutput("Out", ""); + AddComment(""); + } +}; + +class SingleGradOpMaker : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + protected: + std::unique_ptr Apply() const override { + auto* op = new framework::OpDesc(); + op->SetType("single_op_grad"); + op->SetInput("Out", OutputGrad("Out")); + op->SetOutput(framework::GradVarName("X"), InputGrad("X")); + return std::unique_ptr(op); + } +}; + +class SingleOpShapeInference : public framework::InferShapeBase { + public: + void operator()(framework::InferShapeContext* ctx) const override { + ctx->HasInput("X"); + ctx->HasOutput("Out"); + ctx->SetOutputDim("Out", ctx->GetInputDim("X")); + } +}; + +class SingleGradOpShapeInference : public framework::InferShapeBase { + public: + void operator()(framework::InferShapeContext* ctx) const override { + ctx->HasInput(framework::GradVarName("Out")); + ctx->HasOutput(framework::GradVarName("X")); + ctx->SetOutputDim(framework::GradVarName("X"), ctx->GetInputDim("Out")); + } +}; + +class MultiOutOpMaker : public OpProtoAndCheckerMaker { + public: + void Make() { + AddInput("X", "").AsDuplicable(); + AddInput("Y", "").AsDuplicable(); + AddInput("Z", "").AsDuplicable(); + AddOutput("Out", ""); + AddOutput("YOut", ""); + AddOutput("ZOut", ""); + AddOutput("NotReuseOut", ""); + AddComment(""); + } +}; + +class MultiOutShapeInference : public framework::InferShapeBase { + public: + void operator()(framework::InferShapeContext* ctx) const override { + ctx->ShareDim("X", "Out"); + ctx->ShareDim("Y", "YOut"); + ctx->ShareDim("Z", "ZOut"); + } +}; + +class MultiGradOpMaker : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + protected: + std::unique_ptr Apply() const override { + auto* op = new framework::OpDesc(); + op->SetType("multi_out_grad"); + op->SetInput("X", Input("X")); + op->SetOutput(framework::GradVarName("Y"), OutputGrad("YOut")); + op->SetOutput(framework::GradVarName("X"), OutputGrad("Out")); + op->SetOutput(framework::GradVarName("Z"), OutputGrad("ZOut")); + return std::unique_ptr(op); + } +}; + +class MultiOutGradShapeInference : public framework::InferShapeBase { + public: + void operator()(framework::InferShapeContext* ctx) const override { + ctx->SetOutputDim(framework::GradVarName("Y"), + ctx->GetInputDim(framework::GradVarName("YOut"))); + ctx->SetOutputDim(framework::GradVarName("X"), + ctx->GetInputDim(framework::GradVarName("Out"))); + ctx->SetOutputDim(framework::GradVarName("Z"), + ctx->GetInputDim(framework::GradVarName("ZOut"))); + } +}; + +class MultiOutInplaceInToOut : public framework::InplaceOpInference { + public: + std::unordered_map operator()( + const OpDesc& op_desc) const override { + return std::unordered_map{ + {"X", "Out"}, {"Y", "YOut"}, {"Z", "ZOut"}, + }; + } +}; + +class MultiOutGradInplaceInToOut : public framework::InplaceOpInference { + public: + std::unordered_map operator()( + const OpDesc& op_desc) const override { + return std::unordered_map{ + {framework::GradVarName("YOut"), framework::GradVarName("Y")}, + {framework::GradVarName("Out"), framework::GradVarName("X")}, + {framework::GradVarName("ZOut"), framework::GradVarName("Z")}, + }; + } +}; + +} // namespace framework +} // namespace paddle + +namespace f = paddle::framework; +REGISTER_OPERATOR(single_op, f::NOP, f::SingleOpMaker, f::SingleGradOpMaker, + f::SingleOpInplaceInToOut, f::SingleOpShapeInference); +REGISTER_OPERATOR(single_op_grad, f::NOP, f::SingleOpInplaceInToOut, + f::SingleGradOpShapeInference); +REGISTER_OPERATOR(multi_out_op, f::NOP, f::MultiOutOpMaker, f::MultiGradOpMaker, + f::MultiOutInplaceInToOut, f::MultiOutShapeInference); +REGISTER_OPERATOR(multi_out_grad, f::NOP, f::MultiOutGradInplaceInToOut, + f::MultiOutGradShapeInference); + +namespace paddle { +namespace framework { + +void FakeSuccData(ProgramDesc* prog) { // NOLINT + prog->MutableBlock(0)->Var("test2_a")->SetType(proto::VarType::LOD_TENSOR); + prog->MutableBlock(0)->Var("test2_a")->SetShape({32, 64, 128, 128}); + prog->MutableBlock(0)->Var("test2_b")->SetType(proto::VarType::LOD_TENSOR); + prog->MutableBlock(0)->Var("test2_c")->SetType(proto::VarType::LOD_TENSOR); + prog->MutableBlock(0)->Var("test2_out"); + prog->MutableBlock(0)->Var("test2_out")->SetShape({64, 32, 128, 128}); +} + +void FakeNoInplaceData(ProgramDesc* prog) { // NOLINT + prog->MutableBlock(0)->Var("test2_a")->SetType(proto::VarType::LOD_TENSOR); + prog->MutableBlock(0)->Var("test2_a")->SetShape({32, 64, 128, 128}); + prog->MutableBlock(0)->Var("test2_b")->SetType(proto::VarType::LOD_TENSOR); + prog->MutableBlock(0)->Var("test2_c")->SetType(proto::VarType::LOD_TENSOR); + prog->MutableBlock(0)->Var("test2_out"); + prog->MutableBlock(0)->Var("test2_out")->SetShape({64, 31, 128, 128}); +} + +ir::Node* GetNodeFromGraph(ir::Graph* g, std::string name) { + ir::Node* op_node = nullptr; + for (auto& item : g->Nodes()) { + if (item->Name() == name) { + op_node = item; + break; + } + } + return op_node; +} + +std::unique_ptr test_SingleOpInplaceInToOut( + std::unique_ptr g) { + std::unique_ptr pass(new details::InplacePass()); + ir::Node* op_node = GetNodeFromGraph(g.get(), "single_op"); + EXPECT_NE(op_node, nullptr); + pass->Apply(g.get()); + return g; +} + +TEST(InferInplace, SingleOpInplaceInToOut) { + ProgramDesc prog; + auto* op = prog.MutableBlock(0)->AppendOp(); + op->SetType("single_op"); + op->SetInput("X", {"test2_a", "test2_b", "test2_c"}); + op->SetOutput("Out", {"test2_out"}); + + FakeSuccData(&prog); + std::unique_ptr g(new ir::Graph(prog)); + g = test_SingleOpInplaceInToOut(std::move(g)); + auto op_node = GetNodeFromGraph(g.get(), "single_op"); + + EXPECT_EQ(op_node->outputs[0]->Name(), "test2_a"); +} + +TEST(InferInplace, SingleOpInplaceInToOutNoInplace) { + ProgramDesc prog; + auto* op = prog.MutableBlock(0)->AppendOp(); + op->SetType("single_op"); + op->SetInput("X", {"test2_a", "test2_b", "test2_c"}); + op->SetOutput("Out", {"test2_out"}); + + FakeNoInplaceData(&prog); + std::unique_ptr g(new ir::Graph(prog)); + g = test_SingleOpInplaceInToOut(std::move(g)); + auto op_node = GetNodeFromGraph(g.get(), "single_op"); + + EXPECT_EQ(op_node->outputs[0]->Name(), "test2_out"); +} + +TEST(InferInplace, MultiOutInplaceInToOut) { + ProgramDesc prog; + auto* op = prog.MutableBlock(0)->AppendOp(); + op->SetType("multi_out_op"); + op->SetInput("X", {"a0", "a1"}); + op->SetInput("Y", {"b0"}); + op->SetInput("Z", {"c0", "c1"}); + op->SetOutput("Out", {"o0"}); + op->SetOutput("YOut", {"y0"}); + op->SetOutput("ZOut", {"z0"}); + + prog.MutableBlock(0)->Var("a0")->SetType(proto::VarType::LOD_TENSOR); + prog.MutableBlock(0)->Var("b0")->SetType(proto::VarType::LOD_TENSOR); + prog.MutableBlock(0)->Var("c0")->SetType(proto::VarType::LOD_TENSOR); + prog.MutableBlock(0)->Var("c1")->SetType(proto::VarType::LOD_TENSOR); + prog.MutableBlock(0)->Var("o0"); + prog.MutableBlock(0)->Var("y0"); + prog.MutableBlock(0)->Var("z0"); + prog.MutableBlock(0)->Var("a0")->SetShape({32, 16, 1024, 1024}); + prog.MutableBlock(0)->Var("b0")->SetShape({32, 16, 1024, 1024}); + prog.MutableBlock(0)->Var("c0")->SetShape({32, 16, 1024, 1024}); + prog.MutableBlock(0)->Var("o0")->SetShape({32, 16, 1024, 1024}); + prog.MutableBlock(0)->Var("y0")->SetShape({32, 16, 1024, 1024}); + prog.MutableBlock(0)->Var("z0")->SetShape({32, 16, 1024, 1024}); + + std::unique_ptr g(new ir::Graph(prog)); + std::unique_ptr pass(new details::InplacePass()); + pass->Apply(g.get()); + auto op_node = GetNodeFromGraph(g.get(), "multi_out_op"); + ASSERT_TRUE(op_node != nullptr); + EXPECT_EQ(op_node->outputs[0]->Name(), "a0"); + EXPECT_EQ(op_node->outputs[1]->Name(), "b0"); + EXPECT_EQ(op_node->outputs[2]->Name(), "c0"); +} + +TEST(InferInplace, MultiGradInplaceInToOut) { + ProgramDesc prog; + auto* op = prog.MutableBlock(0)->AppendOp(); + op->SetType("multi_out_grad"); + op->SetInput(GradVarName("Out"), {"o0"}); + op->SetInput(GradVarName("YOut"), {"y0"}); + op->SetInput(GradVarName("ZOut"), {"z0"}); + op->SetOutput(GradVarName("X"), {"a0", "a1"}); + op->SetOutput(GradVarName("Y"), {"b0"}); + op->SetOutput(GradVarName("Z"), {"c0", "c1"}); + + prog.MutableBlock(0)->Var("a0")->SetType(proto::VarType::LOD_TENSOR); + prog.MutableBlock(0)->Var("b0")->SetType(proto::VarType::LOD_TENSOR); + prog.MutableBlock(0)->Var("c0")->SetType(proto::VarType::LOD_TENSOR); + prog.MutableBlock(0)->Var("c1")->SetType(proto::VarType::LOD_TENSOR); + prog.MutableBlock(0)->Var("o0"); + prog.MutableBlock(0)->Var("y0"); + prog.MutableBlock(0)->Var("z0"); + prog.MutableBlock(0)->Var("a0")->SetShape({32, 16, 1024, 1024}); + prog.MutableBlock(0)->Var("b0")->SetShape({32, 16, 1024, 1024}); + prog.MutableBlock(0)->Var("c0")->SetShape({32, 16, 1024, 1024}); + prog.MutableBlock(0)->Var("o0")->SetShape({32, 16, 1024, 1024}); + prog.MutableBlock(0)->Var("y0")->SetShape({32, 16, 1024, 1024}); + prog.MutableBlock(0)->Var("z0")->SetShape({32, 15, 1024, 1024}); + + std::unique_ptr g(new ir::Graph(prog)); + std::unique_ptr pass(new details::InplacePass()); + pass->Apply(g.get()); + auto op_node = GetNodeFromGraph(g.get(), "multi_out_grad"); + ASSERT_TRUE(op_node != nullptr); + EXPECT_EQ(op_node->outputs[0]->Name(), "o0"); + EXPECT_EQ(op_node->outputs[2]->Name(), "y0"); + EXPECT_EQ(op_node->outputs[3]->Name(), "c0"); + + std::unordered_map expects = { + {"o0", "a0"}, {"y0", "b0"}, {"z0", "c0"}, + }; +} + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/io/CMakeLists.txt b/paddle/fluid/framework/io/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..2baef77b9ce32ce616e7781b971665d3d885066c --- /dev/null +++ b/paddle/fluid/framework/io/CMakeLists.txt @@ -0,0 +1,2 @@ +cc_library(fs SRCS fs.cc DEPS string_helper glog boost) +cc_library(shell SRCS shell.cc DEPS string_helper glog) diff --git a/paddle/fluid/framework/io/fs.cc b/paddle/fluid/framework/io/fs.cc new file mode 100644 index 0000000000000000000000000000000000000000..d5bc5df2565b0f25bc29f2fce37c1bd8626a0dbc --- /dev/null +++ b/paddle/fluid/framework/io/fs.cc @@ -0,0 +1,456 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/io/fs.h" +#include + +namespace paddle { +namespace framework { + +static void fs_add_read_converter_internal(std::string& path, // NOLINT + bool& is_pipe, // NOLINT + const std::string& converter) { + if (converter == "") { + return; + } + + if (!is_pipe) { + path = string::format_string("( %s ) < \"%s\"", converter.c_str(), + path.c_str()); + is_pipe = true; + } else { + path = string::format_string("%s | %s", path.c_str(), converter.c_str()); + } +} + +static void fs_add_write_converter_internal(std::string& path, // NOLINT + bool& is_pipe, // NOLINT + const std::string& converter) { + if (converter == "") { + return; + } + + if (!is_pipe) { + path = string::format_string("( %s ) > \"%s\"", converter.c_str(), + path.c_str()); + is_pipe = true; + } else { + path = string::format_string("%s | %s", converter.c_str(), path.c_str()); + } +} + +static std::shared_ptr fs_open_internal(const std::string& path, + bool is_pipe, + const std::string& mode, + size_t buffer_size, + int* err_no = 0) { + std::shared_ptr fp = nullptr; + + if (!is_pipe) { + fp = shell_fopen(path, mode); + } else { + fp = shell_popen(path, mode, err_no); + } + + if (buffer_size > 0) { + char* buffer = new char[buffer_size]; + CHECK_EQ(0, setvbuf(&*fp, buffer, _IOFBF, buffer_size)); + fp = {&*fp, [fp, buffer](FILE*) mutable { // NOLINT + CHECK(fp.unique()); // NOLINT + fp = nullptr; + delete[] buffer; + }}; + } + + return fp; +} + +static bool fs_begin_with_internal(const std::string& path, + const std::string& str) { + return strncmp(path.c_str(), str.c_str(), str.length()) == 0; +} + +static bool fs_end_with_internal(const std::string& path, + const std::string& str) { + return path.length() >= str.length() && + strncmp(&path[path.length() - str.length()], str.c_str(), + str.length()) == 0; +} + +static size_t& localfs_buffer_size_internal() { + static size_t x = 0; + return x; +} + +size_t localfs_buffer_size() { return localfs_buffer_size_internal(); } + +void localfs_set_buffer_size(size_t x) { localfs_buffer_size_internal() = x; } + +std::shared_ptr localfs_open_read(std::string path, + const std::string& converter) { + bool is_pipe = false; + + if (fs_end_with_internal(path, ".gz")) { + fs_add_read_converter_internal(path, is_pipe, "zcat"); + } + + fs_add_read_converter_internal(path, is_pipe, converter); + return fs_open_internal(path, is_pipe, "r", localfs_buffer_size()); +} + +std::shared_ptr localfs_open_write(std::string path, + const std::string& converter) { + shell_execute( + string::format_string("mkdir -p $(dirname \"%s\")", path.c_str())); + + bool is_pipe = false; + + if (fs_end_with_internal(path, ".gz")) { + fs_add_write_converter_internal(path, is_pipe, "gzip"); + } + + fs_add_write_converter_internal(path, is_pipe, converter); + return fs_open_internal(path, is_pipe, "w", localfs_buffer_size()); +} + +int64_t localfs_file_size(const std::string& path) { + struct stat buf; + if (0 != stat(path.c_str(), &buf)) { + LOG(FATAL) << "file stat not zero"; + return -1; + } + return (int64_t)buf.st_size; +} + +void localfs_remove(const std::string& path) { + if (path == "") { + return; + } + + shell_execute(string::format_string("rm -rf %s", path.c_str())); +} + +std::vector localfs_list(const std::string& path) { + if (path == "") { + return {}; + } + + std::shared_ptr pipe; + int err_no = 0; + pipe = shell_popen( + string::format_string("find %s -type f -maxdepth 1", path.c_str()), "r", + &err_no); + string::LineFileReader reader; + std::vector list; + + while (reader.getline(&*pipe)) { + list.push_back(reader.get()); + } + + return list; +} + +std::string localfs_tail(const std::string& path) { + if (path == "") { + return ""; + } + + return shell_get_command_output( + string::format_string("tail -1 %s ", path.c_str())); +} + +bool localfs_exists(const std::string& path) { + std::string test_f = shell_get_command_output( + string::format_string("[ -f %s ] ; echo $?", path.c_str())); + + if (string::trim_spaces(test_f) == "0") { + return true; + } + + std::string test_d = shell_get_command_output( + string::format_string("[ -d %s ] ; echo $?", path.c_str())); + + if (string::trim_spaces(test_d) == "0") { + return true; + } + + return false; +} + +void localfs_mkdir(const std::string& path) { + if (path == "") { + return; + } + + shell_execute(string::format_string("mkdir -p %s", path.c_str())); +} + +static size_t& hdfs_buffer_size_internal() { + static size_t x = 0; + return x; +} + +size_t hdfs_buffer_size() { return hdfs_buffer_size_internal(); } + +void hdfs_set_buffer_size(size_t x) { hdfs_buffer_size_internal() = x; } + +static std::string& hdfs_command_internal() { + static std::string x = "hadoop fs"; + return x; +} + +const std::string& hdfs_command() { return hdfs_command_internal(); } + +void hdfs_set_command(const std::string& x) { hdfs_command_internal() = x; } + +std::shared_ptr hdfs_open_read(std::string path, int* err_no, + const std::string& converter) { + if (fs_end_with_internal(path, ".gz")) { + path = string::format_string("%s -text \"%s\"", hdfs_command().c_str(), + path.c_str()); + } else { + path = string::format_string("%s -cat \"%s\"", hdfs_command().c_str(), + path.c_str()); + } + + bool is_pipe = true; + fs_add_read_converter_internal(path, is_pipe, converter); + return fs_open_internal(path, is_pipe, "r", hdfs_buffer_size(), err_no); +} + +std::shared_ptr hdfs_open_write(std::string path, int* err_no, + const std::string& converter) { + path = string::format_string("%s -put - \"%s\"", hdfs_command().c_str(), + path.c_str()); + bool is_pipe = true; + + if (fs_end_with_internal(path, ".gz\"")) { + fs_add_write_converter_internal(path, is_pipe, "gzip"); + } + + fs_add_write_converter_internal(path, is_pipe, converter); + return fs_open_internal(path, is_pipe, "w", hdfs_buffer_size(), err_no); +} + +void hdfs_remove(const std::string& path) { + if (path == "") { + return; + } + + shell_execute(string::format_string("%s -rmr %s &>/dev/null; true", + hdfs_command().c_str(), path.c_str())); +} + +std::vector hdfs_list(const std::string& path) { + if (path == "") { + return {}; + } + + std::string prefix = "hdfs:"; + + if (fs_begin_with_internal(path, "afs:")) { + prefix = "afs:"; + } + int err_no = 0; + std::vector list; + do { + err_no = 0; + std::shared_ptr pipe; + pipe = shell_popen( + string::format_string("%s -ls %s | ( grep ^- ; [ $? != 2 ] )", + hdfs_command().c_str(), path.c_str()), + "r", &err_no); + string::LineFileReader reader; + list.clear(); + + while (reader.getline(&*pipe)) { + std::vector line = string::split_string(reader.get()); + if (line.size() != 8) { + continue; + } + list.push_back(prefix + line[7]); + } + } while (err_no == -1); + return list; +} + +std::string hdfs_tail(const std::string& path) { + if (path == "") { + return ""; + } + + return shell_get_command_output(string::format_string( + "%s -text %s | tail -1 ", hdfs_command().c_str(), path.c_str())); +} + +bool hdfs_exists(const std::string& path) { + std::string test = shell_get_command_output(string::format_string( + "%s -test -e %s ; echo $?", hdfs_command().c_str(), path.c_str())); + + if (string::trim_spaces(test) == "0") { + return true; + } + + return false; +} + +void hdfs_mkdir(const std::string& path) { + if (path == "") { + return; + } + + shell_execute(string::format_string("%s -mkdir %s; true", + hdfs_command().c_str(), path.c_str())); +} + +int fs_select_internal(const std::string& path) { + if (fs_begin_with_internal(path, "hdfs:")) { + return 1; + } else if (fs_begin_with_internal(path, "afs:")) { + return 1; + } + + return 0; +} + +std::shared_ptr fs_open_read(const std::string& path, int* err_no, + const std::string& converter) { + switch (fs_select_internal(path)) { + case 0: + return localfs_open_read(path, converter); + + case 1: + return hdfs_open_read(path, err_no, converter); + + default: + LOG(FATAL) << "Not supported"; + } + + return {}; +} + +std::shared_ptr fs_open_write(const std::string& path, int* err_no, + const std::string& converter) { + switch (fs_select_internal(path)) { + case 0: + return localfs_open_write(path, converter); + + case 1: + return hdfs_open_write(path, err_no, converter); + + default: + LOG(FATAL) << "Not supported"; + } + + return {}; +} + +std::shared_ptr fs_open(const std::string& path, const std::string& mode, + int* err_no, const std::string& converter) { + if (mode == "r" || mode == "rb") { + return fs_open_read(path, err_no, converter); + } + + if (mode == "w" || mode == "wb") { + return fs_open_write(path, err_no, converter); + } + + LOG(FATAL) << "Unknown mode: " << mode; + return {}; +} + +int64_t fs_file_size(const std::string& path) { + switch (fs_select_internal(path)) { + case 0: + return localfs_file_size(path); + + default: + LOG(FATAL) << "Not supported"; + } + + return 0; +} + +void fs_remove(const std::string& path) { + switch (fs_select_internal(path)) { + case 0: + return localfs_remove(path); + + case 1: + return hdfs_remove(path); + + default: + LOG(FATAL) << "Not supported"; + } +} + +std::vector fs_list(const std::string& path) { + switch (fs_select_internal(path)) { + case 0: + return localfs_list(path); + + case 1: + return hdfs_list(path); + + default: + LOG(FATAL) << "Not supported"; + } + + return {}; +} + +std::string fs_tail(const std::string& path) { + switch (fs_select_internal(path)) { + case 0: + return localfs_tail(path); + + case 1: + return hdfs_tail(path); + + default: + LOG(FATAL) << "Not supported"; + } + + return ""; +} + +bool fs_exists(const std::string& path) { + switch (fs_select_internal(path)) { + case 0: + return localfs_exists(path); + + case 1: + return hdfs_exists(path); + + default: + LOG(FATAL) << "Not supported"; + } + + return false; +} + +void fs_mkdir(const std::string& path) { + switch (fs_select_internal(path)) { + case 0: + return localfs_mkdir(path); + + case 1: + return hdfs_mkdir(path); + + default: + LOG(FATAL) << "Not supported"; + } +} +} // end namespace framework +} // end namespace paddle diff --git a/paddle/fluid/framework/io/fs.h b/paddle/fluid/framework/io/fs.h new file mode 100644 index 0000000000000000000000000000000000000000..3f0174701c24cc5a3eac38d12792650bdbd9463b --- /dev/null +++ b/paddle/fluid/framework/io/fs.h @@ -0,0 +1,101 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include "glog/logging.h" +#include "paddle/fluid/framework/io/shell.h" +#include "paddle/fluid/string/string_helper.h" + +namespace paddle { +namespace framework { + +int fs_select_internal(const std::string& path); + +// localfs +extern size_t localfs_buffer_size(); + +extern void localfs_set_buffer_size(size_t x); + +extern std::shared_ptr localfs_open_read(std::string path, + const std::string& converter); + +extern std::shared_ptr localfs_open_write(std::string path, + const std::string& converter); + +extern int64_t localfs_file_size(const std::string& path); + +extern void localfs_remove(const std::string& path); + +extern std::vector localfs_list(const std::string& path); + +extern std::string localfs_tail(const std::string& path); + +extern bool localfs_exists(const std::string& path); + +extern void localfs_mkdir(const std::string& path); + +// hdfs +extern size_t hdfs_buffer_size(); + +extern void hdfs_set_buffer_size(size_t x); + +extern const std::string& hdfs_command(); + +extern void hdfs_set_command(const std::string& x); + +extern std::shared_ptr hdfs_open_read(std::string path, int* err_no, + const std::string& converter); + +extern std::shared_ptr hdfs_open_write(std::string path, int* err_no, + const std::string& converter); + +extern void hdfs_remove(const std::string& path); + +extern std::vector hdfs_list(const std::string& path); + +extern std::string hdfs_tail(const std::string& path); + +extern bool hdfs_exists(const std::string& path); + +extern void hdfs_mkdir(const std::string& path); + +// aut-detect fs +extern std::shared_ptr fs_open_read(const std::string& path, int* err_no, + const std::string& converter); + +extern std::shared_ptr fs_open_write(const std::string& path, int* err_no, + const std::string& converter); + +extern std::shared_ptr fs_open(const std::string& path, + const std::string& mode, int* err_no, + const std::string& converter = ""); + +extern int64_t fs_file_size(const std::string& path); + +extern void fs_remove(const std::string& path); + +extern std::vector fs_list(const std::string& path); + +extern std::string fs_tail(const std::string& path); + +extern bool fs_exists(const std::string& path); + +extern void fs_mkdir(const std::string& path); +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/io/shell.cc b/paddle/fluid/framework/io/shell.cc new file mode 100644 index 0000000000000000000000000000000000000000..ab671cb5690df51c1cff141906c40cc9e74584fa --- /dev/null +++ b/paddle/fluid/framework/io/shell.cc @@ -0,0 +1,323 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/io/shell.h" + +namespace paddle { +namespace framework { + +std::shared_ptr shell_fopen(const std::string& path, + const std::string& mode) { +#if defined _WIN32 || defined __APPLE__ + return nullptr; +#else + if (shell_verbose()) { + LOG(INFO) << "Opening file[" << path << "] with mode[" << mode << "]"; + } + FILE* fp; + if (!(fp = fopen(path.c_str(), mode.c_str()))) { + LOG(FATAL) << "fopen fail, path[" << path << "], mode[" << mode << "]"; + } + return {fp, [path](FILE* fp) { + if (shell_verbose()) { + LOG(INFO) << "Closing file[" << path << "]"; + } + if (0 != fclose(fp)) { + LOG(FATAL) << "fclose fail, path[" << path << "]"; + } + }}; +#endif +} + +// Close all open file descriptors +// The implementation is async signal safe +// Mostly copy from CPython code +static int close_open_fds_internal() { +#if defined _WIN32 || defined __APPLE__ + return 0; +#else + struct linux_dirent { + long d_ino = 0; // NOLINT + off_t d_off; + unsigned short d_reclen = 0; // NOLINT + char d_name[256]; + }; + + int dir_fd = -1; + if ((dir_fd = open("/proc/self/fd", O_RDONLY)) < 0) { + LOG(FATAL) << "proc/self/fd open fail"; + return -1; + } + char buffer[sizeof(linux_dirent)]; + + for (;;) { + int bytes = 0; + if ((bytes = syscall(SYS_getdents, dir_fd, + reinterpret_cast(buffer), + sizeof(buffer))) < 0) { + LOG(FATAL) << "syscall fail"; + return -1; + } + + if (bytes == 0) { + break; + } + + linux_dirent* entry = NULL; + + for (int offset = 0; offset < bytes; offset += entry->d_reclen) { + entry = reinterpret_cast(buffer + offset); + int fd = 0; + const char* s = entry->d_name; + + while (*s >= '0' && *s <= '9') { + fd = fd * 10 + (*s - '0'); + s++; + } + + if (s != entry->d_name && fd != dir_fd && fd >= 3) { + close(fd); + } + } + } + + close(dir_fd); + return 0; +#endif +} + +static int shell_popen_fork_internal(const char* real_cmd, bool do_read, + int parent_end, int child_end) { +#if defined _WIN32 || defined __APPLE__ + return 0; +#else + int child_pid = -1; + // Too frequent calls to fork() makes openmpi very slow. Use vfork() instead. + // But vfork() is very dangerous. Be careful. + if ((child_pid = vfork()) < 0) { + return -1; + } + + // The following code is async signal safe (No memory allocation, no access to + // global data, etc.) + if (child_pid != 0) { + return child_pid; + } + + int child_std_end = do_read ? 1 : 0; + close(parent_end); + + if (child_end != child_std_end) { + if (dup2(child_end, child_std_end) != child_std_end) { + return -1; + } + close(child_end); + } + + close_open_fds_internal(); + if (execl("/bin/bash", "bash", "-c", real_cmd, NULL) < 0) { + return -1; + } + exit(127); +#endif +} + +std::shared_ptr shell_popen(const std::string& cmd, + const std::string& mode, int* err_no) { +#if defined _WIN32 || defined __APPLE__ + return nullptr; +#else + bool do_read = mode == "r"; + bool do_write = mode == "w"; + if (!(do_read || do_write)) { + *err_no = -1; + return NULL; + } + + if (shell_verbose()) { + LOG(INFO) << "Opening pipe[" << cmd << "] with mode[" << mode << "]"; + } + + std::string real_cmd = "set -o pipefail; " + cmd; + + int pipe_fds[2]; + if (pipe(pipe_fds) != 0) { + *err_no = -1; + return NULL; + } + int parent_end = 0; + int child_end = 0; + + if (do_read) { + parent_end = pipe_fds[0]; + child_end = pipe_fds[1]; + } else if (do_write) { + parent_end = pipe_fds[1]; + child_end = pipe_fds[0]; + } + + int child_pid = shell_popen_fork_internal(real_cmd.c_str(), do_read, + parent_end, child_end); + close(child_end); + fcntl(parent_end, F_SETFD, FD_CLOEXEC); + FILE* fp; + if ((fp = fdopen(parent_end, mode.c_str())) == NULL) { + *err_no = -1; + return NULL; + } + return {fp, [child_pid, cmd, err_no](FILE* fp) { + if (shell_verbose()) { + LOG(INFO) << "Closing pipe[" << cmd << "]"; + } + + if (fclose(fp) != 0) { + *err_no = -1; + } + int wstatus = -1; + waitpid(child_pid, &wstatus, 0); + if (wstatus == 0 || wstatus == (128 + SIGPIPE) * 256 || + (wstatus == -1 && errno == ECHILD)) { + } else { + *err_no = -1; + LOG(WARNING) << "status[" << wstatus << "], cmd[" << cmd << "]" + << ", err_no[" << *err_no << "]"; + } + if (wstatus == -1 && errno == ECHILD) { + LOG(WARNING) << "errno is ECHILD"; + } + }}; +#endif +} + +static int shell_p2open_fork_internal(const char* real_cmd, int pipein_fds[2], + int pipeout_fds[2]) { +#if defined _WIN32 || defined __APPLE__ + return 0; +#else + int child_pid = -1; + if ((child_pid = fork()) < 0) { + return -1; + } + + if (child_pid != 0) { + return child_pid; + } + + close(pipein_fds[0]); + close(pipeout_fds[1]); + + if (pipein_fds[1] != 1) { + if (dup2(pipein_fds[1], 1) != 1) { + return -1; + } + close(pipein_fds[1]); + } + + if (pipeout_fds[0] != 0) { + if (dup2(pipeout_fds[0], 0) != 0) { + return -1; + } + close(pipeout_fds[0]); + } + + close_open_fds_internal(); + if (execl("/bin/sh", "sh", "-c", real_cmd, NULL) < 0) { + return -1; + } + exit(127); +#endif +} + +std::pair, std::shared_ptr> shell_p2open( + const std::string& cmd) { +#if defined _WIN32 || defined __APPLE__ + return {}; +#else + if (shell_verbose()) { + LOG(INFO) << "Opening bidirectional pipe[" << cmd << "]"; + } + + std::string real_cmd = "set -o pipefail; " + cmd; + + int pipein_fds[2]; + int pipeout_fds[2]; + if (pipe(pipein_fds) != 0) { + return {NULL, NULL}; + } + if (pipe(pipeout_fds) != 0) { + return {NULL, NULL}; + } + + int child_pid = + shell_p2open_fork_internal(real_cmd.c_str(), pipein_fds, pipeout_fds); + + close(pipein_fds[1]); + close(pipeout_fds[0]); + fcntl(pipein_fds[0], F_SETFD, FD_CLOEXEC); + fcntl(pipeout_fds[1], F_SETFD, FD_CLOEXEC); + + std::shared_ptr child_life = { + NULL, [child_pid, cmd](void*) { + if (shell_verbose()) { + LOG(INFO) << "Closing bidirectional pipe[" << cmd << "]"; + } + + int wstatus, ret; + + do { + PCHECK((ret = waitpid(child_pid, &wstatus, 0)) >= 0 || + (ret == -1 && errno == EINTR)); + } while (ret == -1 && errno == EINTR); + + PCHECK(wstatus == 0 || wstatus == (128 + SIGPIPE) * 256 || + (wstatus == -1 && errno == ECHILD)) + << "status[" << wstatus << "], cmd[" << cmd << "]"; + + if (wstatus == -1 && errno == ECHILD) { + LOG(WARNING) << "errno is ECHILD"; + } + }}; + + FILE* in_fp; + PCHECK((in_fp = fdopen(pipein_fds[0], "r")) != NULL); + FILE* out_fp; + PCHECK((out_fp = fdopen(pipeout_fds[1], "w")) != NULL); + return {{in_fp, [child_life](FILE* fp) { PCHECK(fclose(fp) == 0); }}, + {out_fp, [child_life](FILE* fp) { PCHECK(fclose(fp) == 0); }}}; +#endif +} + +std::string shell_get_command_output(const std::string& cmd) { +#if defined _WIN32 || defined __APPLE__ + return ""; +#else + int err_no = 0; + do { + err_no = 0; + std::shared_ptr pipe = shell_popen(cmd, "r", &err_no); + string::LineFileReader reader; + + if (reader.getdelim(&*pipe, 0)) { + pipe = nullptr; + if (err_no == 0) { + return reader.get(); + } + } + } while (err_no == -1); + return ""; +#endif +} + +} // end namespace framework +} // end namespace paddle diff --git a/paddle/fluid/framework/io/shell.h b/paddle/fluid/framework/io/shell.h new file mode 100644 index 0000000000000000000000000000000000000000..46fcc92bafa84e4c1b89e4603fe0db364572b73e --- /dev/null +++ b/paddle/fluid/framework/io/shell.h @@ -0,0 +1,66 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#ifdef _WIN32 +#include +#else +#include +#endif +#include +#ifndef _WIN32 +#include +#endif +#include +#include +#include +#include "paddle/fluid/platform/port.h" +#include "paddle/fluid/string/string_helper.h" + +namespace paddle { +namespace framework { + +inline bool& shell_verbose_internal() { + static bool x = false; + return x; +} + +inline bool shell_verbose() { return shell_verbose_internal(); } + +inline void shell_set_verbose(bool x) { shell_verbose_internal() = x; } + +extern std::shared_ptr shell_fopen(const std::string& path, + const std::string& mode); + +extern std::shared_ptr shell_popen(const std::string& cmd, + const std::string& mode, int* err_no); + +extern std::pair, std::shared_ptr> shell_p2open( + const std::string& cmd); + +inline void shell_execute(const std::string& cmd) { + int err_no = 0; + do { + err_no = 0; + shell_popen(cmd, "w", &err_no); + } while (err_no == -1); +} + +extern std::string shell_get_command_output(const std::string& cmd); + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/CMakeLists.txt b/paddle/fluid/framework/ir/CMakeLists.txt index 796ce1f91ce6f3e21dc6f0af8fca4960d43f6e2b..16fc1721eb6f5d2517ad45289f2415ef41749df2 100644 --- a/paddle/fluid/framework/ir/CMakeLists.txt +++ b/paddle/fluid/framework/ir/CMakeLists.txt @@ -5,53 +5,114 @@ file(APPEND ${pass_file} "\#include \"paddle/fluid/framework/ir/pass.h\"\n") # Usage: pass_library(target inference) will append to paddle_inference_pass.h +unset(INFER_IR_PASSES CACHE) # clear the global variable function(pass_library TARGET DEST) set(options "") set(oneValueArgs "") set(multiValueArgs SRCS DEPS) + set(targetPrefix "") + + # Get optional argument + set(extraMacroArgs ${ARGN}) + list(LENGTH extraMacroArgs numExtraMacroArgs) + if(numExtraMacroArgs GREATER 0) + list(GET extraMacroArgs 0 targetPrefix) + endif() + cmake_parse_arguments(op_library "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) - cc_library(${TARGET} SRCS ${TARGET}.cc DEPS graph_pattern_detector pass ${op_library_DEPS}) + if(targetPrefix) + cc_library(${TARGET} SRCS ${targetPrefix}/${TARGET}.cc DEPS graph_pattern_detector pass fuse_pass_base ${op_library_DEPS}) + else() + cc_library(${TARGET} SRCS ${TARGET}.cc DEPS graph_pattern_detector pass fuse_pass_base ${op_library_DEPS}) + endif() + # add more DEST here, such as train, dist and collect USE_PASS into a file automatically. if (${DEST} STREQUAL "base" OR ${DEST} STREQUAL "inference") message(STATUS "add pass ${TARGET} ${DEST}") file(APPEND ${pass_file} "USE_PASS(${TARGET});\n") - set(PASS_LIBRARY ${TARGET} ${PASS_LIBRARY} PARENT_SCOPE) + set(INFER_IR_PASSES ${INFER_IR_PASSES} ${TARGET} CACHE INTERNAL "") endif() endfunction() + cc_library(node SRCS node.cc DEPS proto_desc) cc_library(graph SRCS graph.cc DEPS node pretty_log) cc_library(graph_helper SRCS graph_helper.cc DEPS graph) cc_library(pass SRCS pass.cc DEPS graph node graph_helper) cc_library(graph_traits SRCS graph_traits.cc DEPS graph) cc_library(graph_pattern_detector SRCS graph_pattern_detector.cc DEPS graph graph_helper graph_traits) +cc_library(fuse_pass_base SRCS fuse_pass_base.cc DEPS pass) pass_library(graph_to_program_pass base) pass_library(graph_viz_pass base) +pass_library(lock_free_optimize_pass base) pass_library(fc_fuse_pass inference) -if (WITH_MKLDNN) - pass_library(conv_relu_mkldnn_fuse_pass inference) -endif () pass_library(attention_lstm_fuse_pass inference) pass_library(infer_clean_graph_pass inference) pass_library(fc_lstm_fuse_pass inference) pass_library(embedding_fc_lstm_fuse_pass inference) pass_library(fc_gru_fuse_pass inference) pass_library(seq_concat_fc_fuse_pass inference) +pass_library(multi_batch_merge_pass base) pass_library(conv_bn_fuse_pass inference) +pass_library(seqconv_eltadd_relu_fuse_pass inference) +pass_library(seqpool_concat_fuse_pass inference) +pass_library(repeated_fc_relu_fuse_pass inference) +pass_library(squared_mat_sub_fuse_pass inference) +pass_library(is_test_pass base) +pass_library(conv_elementwise_add_act_fuse_pass inference) +pass_library(conv_elementwise_add2_act_fuse_pass inference) +pass_library(conv_elementwise_add_fuse_pass inference) +pass_library(conv_affine_channel_fuse_pass inference) +pass_library(transpose_flatten_concat_fuse_pass inference) +pass_library(identity_scale_op_clean_pass base) +pass_library(sync_batch_norm_pass base) +pass_library(runtime_context_cache_pass base) +pass_library(expected_kernel_cache_pass base) +pass_library(quant_conv2d_dequant_fuse_pass inference) +pass_library(fillconstant_elementwisemul_fuse inference) + +if(ANAKIN_FOUND) +pass_library(simplify_anakin_priorbox_detection_out_pass inference) +endif() + +if(WITH_MKLDNN) + pass_library(mkldnn_placement_pass base mkldnn) + pass_library(depthwise_conv_mkldnn_pass base mkldnn) + pass_library(conv_bias_mkldnn_fuse_pass inference mkldnn) + pass_library(conv_relu_mkldnn_fuse_pass inference mkldnn) + pass_library(conv_elementwise_add_mkldnn_fuse_pass inference mkldnn) + pass_library(cpu_quantize_placement_pass base mkldnn) + pass_library(cpu_quantize_pass inference mkldnn) + pass_library(cpu_quantize_squash_pass inference mkldnn) +endif() cc_library(fuse_elewise_add_act_pass SRCS fuse_elewise_add_act_pass.cc DEPS pass graph_pattern_detector ) +cc_library(fuse_relu_depthwise_conv_pass SRCS fuse_relu_depthwise_conv_pass.cc DEPS pass graph_pattern_detector ) set(GLOB_PASS_LIB ${PASS_LIBRARY} CACHE INTERNAL "Global PASS library") cc_library(pass_builder SRCS pass_builder.cc DEPS pass) +cc_test(node_test SRCS node_test.cc DEPS node) cc_test(pass_test SRCS pass_test.cc DEPS graph pass graph_helper) cc_test(graph_test SRCS graph_test.cc DEPS graph graph_helper op_registry) cc_test(graph_helper_test SRCS graph_helper_test.cc DEPS graph graph_helper op_registry) cc_test(graph_to_program_pass_test SRCS graph_to_program_pass_test.cc DEPS graph_to_program_pass) cc_test(test_graph_pattern_detector SRCS graph_pattern_detector_tester.cc DEPS graph_pattern_detector) cc_test(test_fc_fuse_pass SRCS fc_fuse_pass_tester.cc DEPS fc_fuse_pass framework_proto) +cc_test(test_seqpool_concat_fuse_pass SRCS seqpool_concat_fuse_pass_tester.cc DEPS seqpool_concat_fuse_pass framework_proto) +cc_test(test_is_test_pass SRCS is_test_pass_tester.cc DEPS is_test_pass) +if(NOT WIN32) + cc_test(test_sync_batch_norm_pass SRCS sync_batch_norm_pass_tester.cc DEPS sync_batch_norm_pass) +endif() if (WITH_MKLDNN) - cc_test(test_conv_relu_mkldnn_fuse_pass SRCS conv_relu_mkldnn_fuse_pass_tester.cc DEPS conv_relu_mkldnn_fuse_pass) + cc_test(test_depthwise_conv_mkldnn_pass SRCS mkldnn/depthwise_conv_mkldnn_pass_tester.cc DEPS depthwise_conv_mkldnn_pass) + cc_test(test_conv_bias_mkldnn_fuse_pass SRCS mkldnn/conv_bias_mkldnn_fuse_pass_tester.cc DEPS conv_bias_mkldnn_fuse_pass naive_executor) + cc_test(test_conv_relu_mkldnn_fuse_pass SRCS mkldnn/conv_relu_mkldnn_fuse_pass_tester.cc DEPS conv_relu_mkldnn_fuse_pass) + cc_test(test_conv_elementwise_add_mkldnn_fuse_pass SRCS mkldnn/conv_elementwise_add_mkldnn_fuse_pass_tester.cc DEPS conv_elementwise_add_mkldnn_fuse_pass) + cc_test(test_mkldnn_placement_pass SRCS mkldnn/mkldnn_placement_pass_tester.cc DEPS mkldnn_placement_pass) + cc_test(test_cpu_quantize_placement_pass SRCS mkldnn/cpu_quantize_placement_pass_tester.cc DEPS cpu_quantize_placement_pass) + cc_test(test_cpu_quantize_pass SRCS mkldnn/cpu_quantize_pass_tester.cc DEPS cpu_quantize_pass naive_executor) + cc_test(test_cpu_quantize_squash_pass SRCS mkldnn/cpu_quantize_squash_pass_tester.cc DEPS cpu_quantize_squash_pass naive_executor) endif () diff --git a/paddle/fluid/framework/ir/attention_lstm_fuse_pass.cc b/paddle/fluid/framework/ir/attention_lstm_fuse_pass.cc index 6090f1fe76a49dddad0640123b1fa4db8c489634..5a82d7927f4cf3ca7e7b27ecdb71eab69e007efb 100644 --- a/paddle/fluid/framework/ir/attention_lstm_fuse_pass.cc +++ b/paddle/fluid/framework/ir/attention_lstm_fuse_pass.cc @@ -14,6 +14,7 @@ #include "paddle/fluid/framework/ir/attention_lstm_fuse_pass.h" #include +#include #include "paddle/fluid/framework/ir/graph_pattern_detector.h" #include "paddle/fluid/framework/ir/graph_viz_pass.h" #include "paddle/fluid/framework/lod_tensor.h" @@ -91,10 +92,10 @@ void FindWhileOp(Graph* graph) { #undef OP_SET_IN #undef OP_SET_OUT - auto* X = graph->RetriveNode(34); - auto* LSTMOUT = graph->RetriveNode(81); - auto* cell_init = graph->RetriveNode(6); - auto* hidden_init = graph->RetriveNode(8); + auto* X = graph->RetrieveNode(34); + auto* LSTMOUT = graph->RetrieveNode(81); + auto* cell_init = graph->RetrieveNode(6); + auto* hidden_init = graph->RetrieveNode(8); auto* lstm_op = graph->CreateOpNode(&op_desc); PrepareParameters(graph, param); @@ -211,12 +212,12 @@ void PrepareLSTMWeight(const LoDTensor& W_forget_w0, VLOG(3) << "LSTMWeight resized to " << out->dims(); float* out_data = out->mutable_data(platform::CPUPlace()); - std::array tensors( - {{W_forget_w0.data(), W_input_w0.data(), - W_output_w0.data(), W_cell_w0.data()}}); - std::array tensors1( - {{W_forget_w1.data(), W_input_w1.data(), - W_output_w1.data(), W_cell_w1.data()}}); + std::array tensors{ + W_forget_w0.data(), W_input_w0.data(), + W_output_w0.data(), W_cell_w0.data()}; + std::array tensors1{ + W_forget_w1.data(), W_input_w1.data(), + W_output_w1.data(), W_cell_w1.data()}; for (int row = 0; row < D; row++) { for (int col = 0; col < 4; col++) { @@ -238,9 +239,9 @@ void PrepareLSTMWeight(const LoDTensor& W_forget_w0, void PrepareLSTMBias(const LoDTensor& B_forget, const LoDTensor& B_input, const LoDTensor& B_output, const LoDTensor& B_cell, LoDTensor* out) { - std::array tensors( - {{B_forget.data(), B_input.data(), B_output.data(), - B_cell.data()}}); + std::array tensors{ + B_forget.data(), B_input.data(), B_output.data(), + B_cell.data()}; PADDLE_ENFORCE_EQ(B_forget.dims().size(), 1); int D = B_forget.dims()[0]; @@ -253,8 +254,7 @@ void PrepareLSTMBias(const LoDTensor& B_forget, const LoDTensor& B_input, // Parameters -std::unique_ptr AttentionLSTMFusePass::ApplyImpl( - std::unique_ptr graph) const { +void AttentionLSTMFusePass::ApplyImpl(ir::Graph* graph) const { PDPattern external_pattern, subblock_pattern; // Use the following variables to tell whether this model is RNN1. @@ -269,12 +269,11 @@ std::unique_ptr AttentionLSTMFusePass::ApplyImpl( } } if (count < specified_vars.size()) { - return graph; + return; } // Continue to fuse. - FindWhileOp(graph.get()); - return graph; + FindWhileOp(graph); } } // namespace ir diff --git a/paddle/fluid/framework/ir/attention_lstm_fuse_pass.h b/paddle/fluid/framework/ir/attention_lstm_fuse_pass.h index a756dfc1b98e1de55c809c73e2c4df1e628950ae..47ed9f0393fb222e612ed3bce1afbc879edb410d 100644 --- a/paddle/fluid/framework/ir/attention_lstm_fuse_pass.h +++ b/paddle/fluid/framework/ir/attention_lstm_fuse_pass.h @@ -22,7 +22,7 @@ namespace ir { class AttentionLSTMFusePass : public FusePassBase { protected: - std::unique_ptr ApplyImpl(std::unique_ptr graph) const; + void ApplyImpl(ir::Graph* graph) const override; }; } // namespace ir diff --git a/paddle/fluid/framework/ir/conv_affine_channel_fuse_pass.cc b/paddle/fluid/framework/ir/conv_affine_channel_fuse_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..fecc159adef1992a90b6ee88b3b7ffceea116243 --- /dev/null +++ b/paddle/fluid/framework/ir/conv_affine_channel_fuse_pass.cc @@ -0,0 +1,218 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/conv_affine_channel_fuse_pass.h" +#include +#include +#include +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/operators/math/cpu_vec.h" +#include "paddle/fluid/platform/enforce.h" + +namespace paddle { +namespace framework { +namespace ir { + +#define GET_CONV_BN_NODES(pattern_name) \ + /* OPERATORS */ \ + GET_IR_NODE_FROM_SUBGRAPH(conv, conv, pattern_name); \ + GET_IR_NODE_FROM_SUBGRAPH(affine_channel, affine_channel, pattern_name); \ + /* CONV inputs */ \ + GET_IR_NODE_FROM_SUBGRAPH(conv_weight, conv_weight, pattern_name); \ + /* CONV outputs */ \ + GET_IR_NODE_FROM_SUBGRAPH(conv_out, conv_out, pattern_name); \ + /* Affine Channel inputs */ \ + GET_IR_NODE_FROM_SUBGRAPH(ac_scale, ac_scale, pattern_name); \ + GET_IR_NODE_FROM_SUBGRAPH(ac_bias, ac_bias, pattern_name); \ + /* Affine channel outputs */ \ + GET_IR_NODE_FROM_SUBGRAPH(ac_out, ac_out, pattern_name); /* Out */ + +void recompute_bias_and_weights(const Scope* scope, ir::Node* conv_weight, + const ir::Node& ac_scale, + const LoDTensor& ac_bias_tensor, + LoDTensor* eltwise_y_in_tensor) { + using EigenVectorArrayMap = + Eigen::Map>; + using ConstEigenVectorArrayMap = + Eigen::Map>; + using EigenMatrixArrayMap = Eigen::Map< + Eigen::Array>; + + // Re-compute bias of conv2d from AffineChannel + PADDLE_ENFORCE_EQ(eltwise_y_in_tensor->dims(), ac_bias_tensor.dims()); + + auto* scale_tensor = scope->FindVar(ac_scale.Name())->GetMutable(); + + ConstEigenVectorArrayMap scale_array(scale_tensor->data(), + scale_tensor->numel(), 1); + ConstEigenVectorArrayMap ac_bias_array(ac_bias_tensor.data(), + ac_bias_tensor.numel(), 1); + + EigenVectorArrayMap eltwise_y_in_array( + eltwise_y_in_tensor->mutable_data(platform::CPUPlace()), + eltwise_y_in_tensor->numel(), 1); + + eltwise_y_in_array = (eltwise_y_in_array * scale_array) + ac_bias_array; + + // Re-compute weight of conv2d from AffineChannel + auto* weights = scope->FindVar(conv_weight->Name())->GetMutable(); + auto weights_shape = weights->dims(); + auto weights_shape_2d = flatten_to_2d(weights_shape, 1); + + EigenMatrixArrayMap weights_array_2d( + weights->mutable_data(platform::CPUPlace()), weights_shape_2d[0], + weights_shape_2d[1]); + + weights_array_2d.colwise() *= scale_array; +} + +void ConvAffineChannelFusePass::ApplyImpl(ir::Graph* graph) const { + PADDLE_ENFORCE(graph); + FusePassBase::Init(name_scope_, graph); + + auto* scope = param_scope(); + PADDLE_ENFORCE(scope); + + GraphPatternDetector gpd; + auto* conv_input = + gpd.mutable_pattern() + ->NewNode(patterns::PDNodeName(name_scope_, "conv_input")) + ->AsInput() + ->assert_is_op_input("conv2d", "Input"); + patterns::ConvAffineChannel conv_ac_pattern(gpd.mutable_pattern(), + name_scope_); + conv_ac_pattern(conv_input, false /*with_eltwise_add*/); + + int found_conv_ac_count = 0; + auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* g) { + VLOG(4) << "handle ConvAffineChannel fuse"; + + GET_CONV_BN_NODES(conv_ac_pattern); + + // check if fuse can be done and if MKL-DNN should be used + FuseOptions fuse_option = FindFuseOption(*conv, *affine_channel); + if (fuse_option == DO_NOT_FUSE) { + VLOG(3) << "do not perform conv+affinechannel fuse"; + return; + } + + // Create eltwise_y (conv bias) variable + VarDesc eltwise_y_in_desc( + patterns::PDNodeName(name_scope_, "eltwise_y_in")); + eltwise_y_in_desc.SetPersistable(true); + auto* eltwise_y_in_node = g->CreateVarNode(&eltwise_y_in_desc); + auto* eltwise_y_in_tensor = + scope->Var(eltwise_y_in_node->Name())->GetMutable(); + + // Get affine_channel bias + auto* ac_bias_tensor = + scope->FindVar(ac_bias->Name())->GetMutable(); + + // Initialize eltwise_y + eltwise_y_in_tensor->Resize(ac_bias_tensor->dims()); + std::fill_n(eltwise_y_in_tensor->mutable_data(platform::CPUPlace()), + eltwise_y_in_tensor->numel(), 0.0f); + + // update weights and biases + recompute_bias_and_weights(scope, conv_weight, *ac_scale, *ac_bias_tensor, + eltwise_y_in_tensor); + + // create an elementwise add node. + OpDesc desc; + desc.SetInput("X", std::vector({conv_out->Name()})); + desc.SetInput("Y", std::vector({eltwise_y_in_node->Name()})); + desc.SetOutput("Out", std::vector({ac_out->Name()})); + desc.SetType("elementwise_add"); + desc.SetAttr("axis", 1); + auto eltwise_op = g->CreateOpNode(&desc); // OpDesc will be copied. + + GraphSafeRemoveNodes(graph, {ac_scale, ac_bias, affine_channel}); + + IR_NODE_LINK_TO(conv_out, eltwise_op); + IR_NODE_LINK_TO(eltwise_y_in_node, eltwise_op); + IR_NODE_LINK_TO(eltwise_op, ac_out); + found_conv_ac_count++; + }; + + gpd(graph, handler); + + AddStatis(found_conv_ac_count); +} + +void ConvEltwiseAddAffineChannelFusePass::ApplyImpl(ir::Graph* graph) const { + PADDLE_ENFORCE(graph); + FusePassBase::Init(name_scope_, graph); + + auto* scope = param_scope(); + PADDLE_ENFORCE(scope); + + GraphPatternDetector gpd; + auto* conv_input = + gpd.mutable_pattern() + ->NewNode(patterns::PDNodeName(name_scope_, "conv_input")) + ->AsInput() + ->assert_is_op_input("conv2d", "Input"); + patterns::ConvAffineChannel conv_ac_pattern(gpd.mutable_pattern(), + name_scope_); + conv_ac_pattern(conv_input, true /*with_eltwise_add*/); + + int found_conv_ac_count = 0; + auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* g) { + VLOG(4) << "handle ConvBN fuse"; + + GET_CONV_BN_NODES(conv_ac_pattern); + // OPERATORS + GET_IR_NODE_FROM_SUBGRAPH(eltwise, eltwise, conv_ac_pattern); + // BIAS inputs + GET_IR_NODE_FROM_SUBGRAPH(eltwise_y_in, eltwise_y_in, conv_ac_pattern); + // BIAS outputs + GET_IR_NODE_FROM_SUBGRAPH(eltwise_out, eltwise_out, conv_ac_pattern); + + // Get eltwise_y (conv bias) variable + auto* eltwise_y_in_tensor = + scope->FindVar(eltwise_y_in->Name())->GetMutable(); + + // Get batch norm bias + auto* ac_bias_tensor = + scope->FindVar(ac_bias->Name())->GetMutable(); + + recompute_bias_and_weights(scope, conv_weight, *ac_scale, *ac_bias_tensor, + eltwise_y_in_tensor); + + // Update the elementwise_add node + eltwise->Op()->SetAttr("axis", 1); + eltwise->Op()->SetOutput("Out", std::vector({ac_out->Name()})); + + GraphSafeRemoveNodes(graph, + {ac_scale, ac_bias, affine_channel, eltwise_out}); + + IR_NODE_LINK_TO(eltwise, ac_out); + + found_conv_ac_count++; + }; + + gpd(graph, handler); + AddStatis(found_conv_ac_count); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(conv_affine_channel_fuse_pass, + paddle::framework::ir::ConvAffineChannelFusePass); +REGISTER_PASS(conv_eltwiseadd_affine_channel_fuse_pass, + paddle::framework::ir::ConvEltwiseAddAffineChannelFusePass); diff --git a/paddle/fluid/framework/ir/conv_affine_channel_fuse_pass.h b/paddle/fluid/framework/ir/conv_affine_channel_fuse_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..d607020a47b8c589775ac763f04e64272dfec4e0 --- /dev/null +++ b/paddle/fluid/framework/ir/conv_affine_channel_fuse_pass.h @@ -0,0 +1,49 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" + +namespace paddle { +namespace framework { +namespace ir { + +/* + * Fuse the Conv and ConvAffineChannel. + */ +class ConvAffineChannelFusePass : public FusePassBase { + public: + virtual ~ConvAffineChannelFusePass() {} + + protected: + void ApplyImpl(ir::Graph*) const override; + const std::string name_scope_{"conv_affine_channel_fuse"}; +}; + +class ConvEltwiseAddAffineChannelFusePass : public FusePassBase { + public: + virtual ~ConvEltwiseAddAffineChannelFusePass() {} + + protected: + void ApplyImpl(ir::Graph*) const override; + const std::string name_scope_{"conv_eltwiseadd_affine_channel_fuse"}; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/conv_bn_fuse_pass.cc b/paddle/fluid/framework/ir/conv_bn_fuse_pass.cc index 04459612a726bcb60f9d752dfd8927b6f5c2500d..876a9996456c256f9b5f511ecd792f915b74b0df 100644 --- a/paddle/fluid/framework/ir/conv_bn_fuse_pass.cc +++ b/paddle/fluid/framework/ir/conv_bn_fuse_pass.cc @@ -101,10 +101,9 @@ void recompute_bias_and_weights(const Scope* scope, weights_array_2d.colwise() *= variance_array; } -std::unique_ptr ConvBNFusePass::ApplyImpl( - std::unique_ptr graph) const { - PADDLE_ENFORCE(graph.get()); - FusePassBase::Init(name_scope_, graph.get()); +void ConvBNFusePass::ApplyImpl(ir::Graph* graph) const { + PADDLE_ENFORCE(graph); + FusePassBase::Init(name_scope_, graph); auto* scope = param_scope(); PADDLE_ENFORCE(scope); @@ -126,12 +125,21 @@ std::unique_ptr ConvBNFusePass::ApplyImpl( // conv, batch_norm, // conv_weight, conv_out, // bn_scale, bn_bias, bn_mean, bn_variance, - // bn_out, bn_mean_out, bn_variance_out, bn_saved_mean, bn_saved_variance + // bn_out, bn_mean_out, bn_variance_out, bn_saved_mean, + // bn_saved_variance GET_CONV_BN_NODES(conv_bn_pattern); + // check if fuse can be done and if MKL-DNN should be used + FuseOptions fuse_option = FindFuseOption(*conv, *batch_norm); + if (fuse_option == DO_NOT_FUSE) { + VLOG(3) << "do not perform conv+bn fuse"; + return; + } + // Create eltwise_y (conv bias) variable VarDesc eltwise_y_in_desc( patterns::PDNodeName(name_scope_, "eltwise_y_in")); + eltwise_y_in_desc.SetPersistable(true); auto* eltwise_y_in_node = g->CreateVarNode(&eltwise_y_in_desc); auto* eltwise_y_in_tensor = scope->Var(eltwise_y_in_node->Name())->GetMutable(); @@ -151,39 +159,68 @@ std::unique_ptr ConvBNFusePass::ApplyImpl( *bn_mean, *bn_variance, eltwise_y_in_tensor, epsilon); - // Create an elementwise add node - OpDesc desc; - desc.SetInput("X", std::vector({conv_out->Name()})); - desc.SetInput("Y", std::vector({eltwise_y_in_node->Name()})); - desc.SetOutput("Out", std::vector({bn_out->Name()})); - desc.SetType("elementwise_add"); - desc.SetAttr("axis", 1); - bool a = boost::get(conv->Op()->GetAttr("use_mkldnn")); - desc.SetAttr("use_mkldnn", a); - auto eltwise_op = g->CreateOpNode(&desc); // OpDesc will be copied. - - GraphSafeRemoveNodes(graph.get(), {bn_scale, bn_bias, bn_mean, bn_variance, - batch_norm, bn_mean_out, bn_variance_out, - bn_saved_mean, bn_saved_variance}); - - PADDLE_ENFORCE(subgraph.count(conv_input)); - IR_NODE_LINK_TO(conv_out, eltwise_op); - IR_NODE_LINK_TO(eltwise_y_in_node, eltwise_op); - IR_NODE_LINK_TO(eltwise_op, bn_out); - - found_conv_bn_count++; + // with MKL-DNN fuse conv+bn into conv with bias + // without MKL-DNN fuse conv+bn into conv+elementwise_add + if (fuse_option == FUSE_MKLDNN) { + auto input_names = conv->Op()->InputNames(); + bool has_bias = std::find(input_names.begin(), input_names.end(), + "Bias") != input_names.end(); + if (has_bias && conv->Op()->Input("Bias").size() > 0) { + // reuse existing conv bias node + auto conv_bias_names = conv->Op()->Input("Bias"); + PADDLE_ENFORCE_EQ(conv_bias_names.size(), 1UL); + auto* conv_bias_var = scope->FindVar(conv_bias_names[0]); + auto* conv_bias_tensor = conv_bias_var->GetMutable(); + PADDLE_ENFORCE_EQ(conv_bias_tensor->dims(), + eltwise_y_in_tensor->dims()); + + auto eigen_conv_bias = EigenVector::From(*conv_bias_tensor); + eigen_conv_bias += EigenVector::From(*eltwise_y_in_tensor); + } else { + // add new conv_bias node + conv->Op()->SetInput( + "Bias", std::vector({eltwise_y_in_node->Name()})); + IR_NODE_LINK_TO(eltwise_y_in_node, conv); + } + conv->Op()->SetOutput("Output", + std::vector({bn_out->Name()})); + + GraphSafeRemoveNodes( + graph, + {conv_out, bn_scale, bn_bias, bn_mean, bn_variance, batch_norm, + bn_mean_out, bn_variance_out, bn_saved_mean, bn_saved_variance}); + + IR_NODE_LINK_TO(conv, bn_out); + found_conv_bn_count++; + } else { // fuse_option == FUSE_NATIVE + // create an elementwise add node. + OpDesc desc; + desc.SetInput("X", std::vector({conv_out->Name()})); + desc.SetInput("Y", std::vector({eltwise_y_in_node->Name()})); + desc.SetOutput("Out", std::vector({bn_out->Name()})); + desc.SetType("elementwise_add"); + desc.SetAttr("axis", 1); + auto eltwise_op = g->CreateOpNode(&desc); // OpDesc will be copied. + + GraphSafeRemoveNodes(graph, {bn_scale, bn_bias, bn_mean, bn_variance, + batch_norm, bn_mean_out, bn_variance_out, + bn_saved_mean, bn_saved_variance}); + + IR_NODE_LINK_TO(conv_out, eltwise_op); + IR_NODE_LINK_TO(eltwise_y_in_node, eltwise_op); + IR_NODE_LINK_TO(eltwise_op, bn_out); + found_conv_bn_count++; + } }; - gpd(graph.get(), handler); + gpd(graph, handler); AddStatis(found_conv_bn_count); - return graph; } -std::unique_ptr ConvEltwiseAddBNFusePass::ApplyImpl( - std::unique_ptr graph) const { - PADDLE_ENFORCE(graph.get()); - FusePassBase::Init(name_scope_, graph.get()); +void ConvEltwiseAddBNFusePass::ApplyImpl(ir::Graph* graph) const { + PADDLE_ENFORCE(graph); + FusePassBase::Init(name_scope_, graph); auto* scope = param_scope(); PADDLE_ENFORCE(scope); @@ -233,20 +270,18 @@ std::unique_ptr ConvEltwiseAddBNFusePass::ApplyImpl( eltwise->Op()->SetOutput("Out", std::vector({bn_out->Name()})); GraphSafeRemoveNodes( - graph.get(), + graph, {bn_scale, bn_bias, bn_mean, bn_variance, batch_norm, bn_mean_out, bn_variance_out, bn_saved_mean, bn_saved_variance, eltwise_out}); - PADDLE_ENFORCE(subgraph.count(conv_input)); IR_NODE_LINK_TO(eltwise, bn_out); found_conv_bn_count++; }; - gpd(graph.get(), handler); + gpd(graph, handler); AddStatis(found_conv_bn_count); - return graph; } } // namespace ir diff --git a/paddle/fluid/framework/ir/conv_bn_fuse_pass.h b/paddle/fluid/framework/ir/conv_bn_fuse_pass.h index 2c9eb574fe8e054e0ae221f08f664b91f05d95c9..837a48ed7305f4176fc709ab2cb4edf68aeb9fa1 100644 --- a/paddle/fluid/framework/ir/conv_bn_fuse_pass.h +++ b/paddle/fluid/framework/ir/conv_bn_fuse_pass.h @@ -31,7 +31,7 @@ class ConvBNFusePass : public FusePassBase { virtual ~ConvBNFusePass() {} protected: - std::unique_ptr ApplyImpl(std::unique_ptr graph) const; + void ApplyImpl(ir::Graph* graph) const override; const std::string name_scope_{"conv_bn_fuse"}; }; @@ -40,7 +40,7 @@ class ConvEltwiseAddBNFusePass : public FusePassBase { virtual ~ConvEltwiseAddBNFusePass() {} protected: - std::unique_ptr ApplyImpl(std::unique_ptr graph) const; + void ApplyImpl(ir::Graph* graph) const override; const std::string name_scope_{"conv_eltwiseadd_bn_fuse"}; }; diff --git a/paddle/fluid/framework/ir/conv_elementwise_add2_act_fuse.cc b/paddle/fluid/framework/ir/conv_elementwise_add2_act_fuse.cc new file mode 100644 index 0000000000000000000000000000000000000000..99bc5fe8c506bb69c0fefcfb9af6747ea7db38d7 --- /dev/null +++ b/paddle/fluid/framework/ir/conv_elementwise_add2_act_fuse.cc @@ -0,0 +1,104 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include "paddle/fluid/framework/ir/conv_elementwise_add2_act_fuse_pass.h" + +namespace paddle { +namespace framework { +namespace ir { + +#define GET_IR_NODE(node__) GET_IR_NODE_FROM_SUBGRAPH(node__, node__, pattern); +#define GET_NODES \ + GET_IR_NODE(conv_op); \ + GET_IR_NODE(conv_out); \ + GET_IR_NODE(conv_filter); \ + GET_IR_NODE(elementwise_add_op); \ + GET_IR_NODE(elementwise_add_in_y); \ + GET_IR_NODE(elementwise_add_out); \ + GET_IR_NODE(elementwise_add_op_1); \ + GET_IR_NODE(elementwise_add_in_y_1); \ + GET_IR_NODE(elementwise_add_out_1); \ + GET_IR_NODE(act_op); \ + GET_IR_NODE(act_out); + +// Inherient the basic infomation from `base_desc`, and modify some fields. +framework::proto::OpDesc PrepareOpDesc( + const framework::proto::OpDesc& base_desc, const std::string& bias, + const std::string& bias1, const std::string& activation, + const std::string& output) { + auto proto = base_desc; + framework::OpDesc desc(proto, nullptr); + desc.SetInput("Bias", {bias}); + desc.SetInput("ResidualData", {bias1}); + desc.SetAttr("activation", activation); + desc.SetOutput("Output", {output}); + desc.SetAttr("is_test", true); + desc.SetAttr("use_cudnn", false); + + return *desc.Proto(); +} + +void ConvElementwiseAddActFusePass::ApplyImpl(ir::Graph* graph) const { + const std::string pattern_name = "conv_elementwise_add_act_fuse"; + FusePassBase::Init(pattern_name, graph); + + GraphPatternDetector gpd; + auto* x = gpd.mutable_pattern()->NewNode("x")->AsInput()->assert_is_op_input( + "conv2d", "Input"); + + patterns::ConvElementwiseaddAct pattern(gpd.mutable_pattern(), pattern_name); + pattern(x); + + auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* g) { + GET_NODES; + + auto base_op_desc = *conv_op->Op()->Proto(); + std::string bias_name = elementwise_add_in_y->Name(); + std::string bias1_name = elementwise_add_in_y_1->Name(); + std::string act_op_type = act_op->Op()->Type(); + std::string act_op_out = act_out->Name(); + + auto new_op_proto = PrepareOpDesc(base_op_desc, bias_name, bias1_name, + act_op_type, act_op_out); + framework::OpDesc new_op_desc(new_op_proto, nullptr); + + // Create a new node for the fused op. + auto new_conv_op = graph->CreateOpNode(&new_op_desc); + + // Link inputs and outputs. + PADDLE_ENFORCE(subgraph.count(x)); + auto* conv_in_node = subgraph.at(x); + + IR_NODE_LINK_TO(conv_in_node, new_conv_op); // Input + IR_NODE_LINK_TO(conv_filter, new_conv_op); // Filter + IR_NODE_LINK_TO(elementwise_add_in_y, new_conv_op); // Bias + IR_NODE_LINK_TO(elementwise_add_in_y_1, new_conv_op); // ResidualData + IR_NODE_LINK_TO(new_conv_op, act_out); // Output + + // Delete the unneeded nodes. + GraphSafeRemoveNodes(graph.get(), + {conv_op, elementwise_add_op, elementwise_add_op_1, + elementwise_add_out}); + }; + gpd(graph.get(), handler); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(conv_elementwise_add2_act_fuse_pass, + paddle::framework::ir::ConvElementwiseAdd2ActFusePass); diff --git a/paddle/fluid/framework/ir/conv_elementwise_add2_act_fuse_pass.cc b/paddle/fluid/framework/ir/conv_elementwise_add2_act_fuse_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..b4d6f683ce747a35aea7b431165911d942bcf092 --- /dev/null +++ b/paddle/fluid/framework/ir/conv_elementwise_add2_act_fuse_pass.cc @@ -0,0 +1,105 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/conv_elementwise_add2_act_fuse_pass.h" +#include + +namespace paddle { +namespace framework { +namespace ir { + +#define GET_IR_NODE(node__) GET_IR_NODE_FROM_SUBGRAPH(node__, node__, pattern); +#define GET_NODES \ + GET_IR_NODE(conv_op); \ + GET_IR_NODE(conv_out); \ + GET_IR_NODE(conv_filter); \ + GET_IR_NODE(elementwise_add_op); \ + GET_IR_NODE(elementwise_add_in_y); \ + GET_IR_NODE(elementwise_add_out); \ + GET_IR_NODE(elementwise_add_op_1); \ + GET_IR_NODE(elementwise_add_in_y_1); \ + GET_IR_NODE(elementwise_add_out_1); \ + GET_IR_NODE(act_op); \ + GET_IR_NODE(act_out); + +// Inherient the basic infomation from `base_desc`, and modify some fields. +framework::proto::OpDesc PrepareOpDesc( + const framework::proto::OpDesc& base_desc, const std::string& bias, + const std::string& bias1, const std::string& activation, + const std::string& output) { + auto proto = base_desc; + framework::OpDesc desc(proto, nullptr); + desc.SetType("conv2d_fusion"); + desc.SetInput("Bias", {bias}); + desc.SetInput("ResidualData", {bias1}); + desc.SetAttr("activation", activation); + desc.SetOutput("Output", {output}); + desc.SetAttr("is_test", true); + desc.SetAttr("use_cudnn", false); + desc.Flush(); + return *desc.Proto(); +} + +void ConvElementwiseAdd2ActFusePass::ApplyImpl(ir::Graph* graph) const { + const std::string pattern_name = "conv_elementwise_add2_act_fuse"; + FusePassBase::Init(pattern_name, graph); + + GraphPatternDetector gpd; + auto* x = gpd.mutable_pattern()->NewNode("x")->AsInput()->assert_is_op_input( + "conv2d", "Input"); + + patterns::ConvElementwiseadd2Act pattern(gpd.mutable_pattern(), pattern_name); + pattern(x); + + auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* g) { + GET_NODES; + + auto base_op_desc = *conv_op->Op()->Proto(); + std::string bias_name = elementwise_add_in_y->Name(); + std::string bias1_name = elementwise_add_in_y_1->Name(); + std::string act_op_type = act_op->Op()->Type(); + std::string act_op_out = act_out->Name(); + + auto new_op_proto = PrepareOpDesc(base_op_desc, bias_name, bias1_name, + act_op_type, act_op_out); + framework::OpDesc new_op_desc(new_op_proto, nullptr); + + // Create a new node for the fused op. + auto* new_conv_op = graph->CreateOpNode(&new_op_desc); + + // Link inputs and outputs. + PADDLE_ENFORCE(subgraph.count(x)); + auto* conv_in_node = subgraph.at(x); + + IR_NODE_LINK_TO(conv_in_node, new_conv_op); // Input + IR_NODE_LINK_TO(conv_filter, new_conv_op); // Filter + IR_NODE_LINK_TO(elementwise_add_in_y, new_conv_op); // Bias + IR_NODE_LINK_TO(elementwise_add_in_y_1, new_conv_op); // Bias + IR_NODE_LINK_TO(new_conv_op, act_out); // Output + + // Delete the unneeded nodes. + GraphSafeRemoveNodes( + graph, {conv_op, conv_out, elementwise_add_op, elementwise_add_op_1, + elementwise_add_out, elementwise_add_out_1, act_op}); + }; + gpd(graph, handler); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(conv_elementwise_add2_act_fuse_pass, + paddle::framework::ir::ConvElementwiseAdd2ActFusePass); diff --git a/paddle/fluid/framework/ir/conv_elementwise_add2_act_fuse_pass.h b/paddle/fluid/framework/ir/conv_elementwise_add2_act_fuse_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..ea9e465d8d765a298215db29c77aa58e727fd15e --- /dev/null +++ b/paddle/fluid/framework/ir/conv_elementwise_add2_act_fuse_pass.h @@ -0,0 +1,33 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" + +namespace paddle { +namespace framework { +namespace ir { + +class ConvElementwiseAdd2ActFusePass : public FusePassBase { + public: + virtual ~ConvElementwiseAdd2ActFusePass() {} + + protected: + void ApplyImpl(ir::Graph* graph) const override; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/conv_elementwise_add_act_fuse_pass.cc b/paddle/fluid/framework/ir/conv_elementwise_add_act_fuse_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..ba0a2fb96458bd70105fa4d97114b609657b62f6 --- /dev/null +++ b/paddle/fluid/framework/ir/conv_elementwise_add_act_fuse_pass.cc @@ -0,0 +1,102 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/conv_elementwise_add_act_fuse_pass.h" +#include +#include "paddle/fluid/framework/ir/graph_viz_pass.h" + +namespace paddle { +namespace framework { +namespace ir { + +#define GET_IR_NODE(node__) GET_IR_NODE_FROM_SUBGRAPH(node__, node__, pattern); +#define GET_NODES \ + GET_IR_NODE(conv_op); \ + GET_IR_NODE(conv_out); \ + GET_IR_NODE(conv_filter); \ + GET_IR_NODE(elementwise_add_op); \ + GET_IR_NODE(elementwise_add_in_y); \ + GET_IR_NODE(elementwise_add_out); \ + GET_IR_NODE(act_op); \ + GET_IR_NODE(act_out); + +// Inherient the basic infomation from `base_desc`, and modify some fields. +framework::proto::OpDesc PrepareOpDesc( + const framework::proto::OpDesc& base_desc, const std::string& bias, + const std::string& activation, const std::string& output) { + auto proto = base_desc; + framework::OpDesc desc(proto, nullptr); + desc.SetType("conv2d_fusion"); + desc.SetInput("Bias", {bias}); + desc.SetInput("ResidualData", {}); + desc.SetAttr("activation", activation); + desc.SetOutput("Output", {output}); + desc.SetAttr("is_test", true); + desc.SetAttr("use_cudnn", false); + desc.Flush(); + return *desc.Proto(); +} + +void ConvElementwiseAddActFusePass::ApplyImpl(ir::Graph* graph) const { + const std::string pattern_name = "conv_elementwise_add_act_fuse"; + FusePassBase::Init(pattern_name, graph); + + GraphPatternDetector gpd; + auto* x = gpd.mutable_pattern() + ->NewNode("x") + ->assert_is_op_input("conv2d", "Input") + ->AsInput(); + + patterns::ConvElementwiseaddAct pattern(gpd.mutable_pattern(), pattern_name); + pattern(x); + + auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* g) { + GET_NODES; + + auto base_op_desc = *conv_op->Op()->Proto(); + std::string bias_name = elementwise_add_in_y->Name(); + std::string act_op_type = act_op->Op()->Type(); + std::string act_op_out = act_out->Name(); + + auto new_op_proto = + PrepareOpDesc(base_op_desc, bias_name, act_op_type, act_op_out); + framework::OpDesc new_op_desc(new_op_proto, nullptr); + + // Create a new node for the fused op. + auto* new_conv_op = graph->CreateOpNode(&new_op_desc); + + // Link inputs and outputs. + PADDLE_ENFORCE(subgraph.count(x)); + auto* conv_in_node = subgraph.at(x); + + IR_NODE_LINK_TO(conv_in_node, new_conv_op); // Input + IR_NODE_LINK_TO(conv_filter, new_conv_op); // Filter + IR_NODE_LINK_TO(elementwise_add_in_y, new_conv_op); // Bias + IR_NODE_LINK_TO(new_conv_op, act_out); // Output + + // Delete the unneeded nodes. + GraphSafeRemoveNodes(graph, {conv_op, conv_out, elementwise_add_op, + elementwise_add_out, act_op}); + }; + + gpd(graph, handler); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(conv_elementwise_add_act_fuse_pass, + paddle::framework::ir::ConvElementwiseAddActFusePass); diff --git a/paddle/fluid/framework/ir/conv_elementwise_add_act_fuse_pass.h b/paddle/fluid/framework/ir/conv_elementwise_add_act_fuse_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..8b34c3551d8f9b54f01e52cc0fc896901cd7df99 --- /dev/null +++ b/paddle/fluid/framework/ir/conv_elementwise_add_act_fuse_pass.h @@ -0,0 +1,33 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" + +namespace paddle { +namespace framework { +namespace ir { + +class ConvElementwiseAddActFusePass : public FusePassBase { + public: + virtual ~ConvElementwiseAddActFusePass() {} + + protected: + void ApplyImpl(ir::Graph* graph) const override; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/conv_elementwise_add_fuse_pass.cc b/paddle/fluid/framework/ir/conv_elementwise_add_fuse_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..8c491d4f58b4d3a1d93fe075fd0d118feeb6f8c2 --- /dev/null +++ b/paddle/fluid/framework/ir/conv_elementwise_add_fuse_pass.cc @@ -0,0 +1,89 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include + +#include "paddle/fluid/framework/ir/conv_elementwise_add_fuse_pass.h" +#include "paddle/fluid/framework/ir/graph_viz_pass.h" + +namespace paddle { +namespace framework { +namespace ir { + +#define GET_IR_NODE(node__) GET_IR_NODE_FROM_SUBGRAPH(node__, node__, pattern); +#define GET_NODES \ + GET_IR_NODE(conv_op); \ + GET_IR_NODE(conv_out); \ + GET_IR_NODE(conv_filter); \ + GET_IR_NODE(elementwise_add_op); \ + GET_IR_NODE(elementwise_add_in_y); \ + GET_IR_NODE(elementwise_add_out); + +void ConvElementwiseAddFusePass::ApplyImpl(ir::Graph* graph) const { + const std::string pattern_name = "conv_elementwise_add_fuse"; + FusePassBase::Init(pattern_name, graph); + + GraphPatternDetector gpd; + auto* x = gpd.mutable_pattern() + ->NewNode("x") + ->assert_is_op_input("conv2d", "Input") + ->AsInput(); + + patterns::ConvElementwiseadd pattern(gpd.mutable_pattern(), pattern_name); + pattern(x); + + auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* g) { + GET_NODES; + + auto base_op_desc = *conv_op->Op()->Proto(); + std::string bias_name = elementwise_add_in_y->Name(); + std::string output_name = elementwise_add_out->Name(); + + std::string act_type = "identity"; + framework::OpDesc new_op_desc(base_op_desc, nullptr); + new_op_desc.SetType("conv2d_fusion"); + new_op_desc.SetInput("Bias", {bias_name}); + new_op_desc.SetInput("ResidualData", {}); + new_op_desc.SetAttr("activation", act_type); + new_op_desc.SetOutput("Output", {output_name}); + new_op_desc.SetAttr("is_test", true); + new_op_desc.SetAttr("use_cudnn", false); + new_op_desc.Flush(); + + // Create a new node for the fused op. + auto* new_conv_op = graph->CreateOpNode(&new_op_desc); + + // Link inputs and outputs. + PADDLE_ENFORCE(subgraph.count(x)); + auto* conv_in_node = subgraph.at(x); + + IR_NODE_LINK_TO(conv_in_node, new_conv_op); // Input + IR_NODE_LINK_TO(conv_filter, new_conv_op); // Filter + IR_NODE_LINK_TO(elementwise_add_in_y, new_conv_op); // Bias + IR_NODE_LINK_TO(new_conv_op, elementwise_add_out); // Output + + // Delete the unneeded nodes. + GraphSafeRemoveNodes(graph, {conv_op, conv_out, elementwise_add_op}); + }; + + gpd(graph, handler); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(conv_elementwise_add_fuse_pass, + paddle::framework::ir::ConvElementwiseAddFusePass); diff --git a/paddle/fluid/framework/ir/conv_elementwise_add_fuse_pass.h b/paddle/fluid/framework/ir/conv_elementwise_add_fuse_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..66a562cdd1948980a6792a53713cac947d72e7d6 --- /dev/null +++ b/paddle/fluid/framework/ir/conv_elementwise_add_fuse_pass.h @@ -0,0 +1,33 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" + +namespace paddle { +namespace framework { +namespace ir { + +class ConvElementwiseAddFusePass : public FusePassBase { + public: + virtual ~ConvElementwiseAddFusePass() {} + + protected: + void ApplyImpl(ir::Graph* graph) const override; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass.cc b/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass.cc deleted file mode 100644 index d7df6389cfd595324e284e0da10f65213ccee80f..0000000000000000000000000000000000000000 --- a/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass.cc +++ /dev/null @@ -1,72 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include "paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass.h" -#include -#include -#include "paddle/fluid/platform/enforce.h" - -namespace paddle { -namespace framework { -namespace ir { - -std::unique_ptr ConvReLUFusePass::ApplyImpl( - std::unique_ptr graph) const { - PADDLE_ENFORCE(graph.get()); - FusePassBase::Init("conv_relu_mkldnn_fuse", graph.get()); - - GraphPatternDetector gpd; - auto* conv_input = gpd.mutable_pattern() - ->NewNode("conv_relu_mkldnn_fuse/conv_input") - ->AsInput() - ->assert_is_op_input("conv2d", "Input"); - patterns::ConvReLU conv_relu_pattern(gpd.mutable_pattern(), - "conv_relu_mkldnn_fuse"); - conv_relu_pattern(conv_input); - - int found_conv_relu_count = 0; - auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, - Graph* g) { - VLOG(4) << "handle ConvReLU fuse"; - GET_IR_NODE_FROM_SUBGRAPH(conv_weight, conv_weight, - conv_relu_pattern); // Filter - GET_IR_NODE_FROM_SUBGRAPH(conv_out, conv_out, conv_relu_pattern); // tmp - GET_IR_NODE_FROM_SUBGRAPH(conv, conv, conv_relu_pattern); // CONV op - GET_IR_NODE_FROM_SUBGRAPH(relu_out, relu_out, conv_relu_pattern); // Out - GET_IR_NODE_FROM_SUBGRAPH(relu, relu, conv_relu_pattern); // ReLU op - - // Transform Conv node into ConvReLU node. - OpDesc* desc = conv->Op(); - desc->SetOutput("Output", std::vector({relu_out->Name()})); - desc->SetAttr("fuse_relu", true); - GraphSafeRemoveNodes(graph.get(), {relu, conv_out}); - - PADDLE_ENFORCE(subgraph.count(conv_input)); - IR_NODE_LINK_TO(conv, relu_out); - - found_conv_relu_count++; - }; - - gpd(graph.get(), handler); - - AddStatis(found_conv_relu_count); - return graph; -} - -} // namespace ir -} // namespace framework -} // namespace paddle - -REGISTER_PASS(conv_relu_mkldnn_fuse_pass, - paddle::framework::ir::ConvReLUFusePass); diff --git a/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass.h b/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass.h deleted file mode 100644 index b5de0d548713772e7ad41cfb6d8b3e9460683efb..0000000000000000000000000000000000000000 --- a/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass.h +++ /dev/null @@ -1,39 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#pragma once - -#include "paddle/fluid/framework/ir/fuse_pass_base.h" -#include "paddle/fluid/framework/ir/graph.h" -#include "paddle/fluid/framework/ir/graph_pattern_detector.h" -#include "paddle/fluid/framework/ir/pass.h" - -namespace paddle { -namespace framework { -namespace ir { - -/* - * Fuse the CONV and ReLU to a ConvReLUOp. - */ -class ConvReLUFusePass : public FusePassBase { - public: - virtual ~ConvReLUFusePass() {} - - protected: - std::unique_ptr ApplyImpl(std::unique_ptr graph) const; -}; - -} // namespace ir -} // namespace framework -} // namespace paddle diff --git a/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass_tester.cc b/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass_tester.cc deleted file mode 100644 index 9dd780ec89ab991d6d99cb66fa2a9b683be2b9ca..0000000000000000000000000000000000000000 --- a/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass_tester.cc +++ /dev/null @@ -1,105 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include "paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass.h" - -#include - -namespace paddle { -namespace framework { -namespace ir { - -void SetOp(ProgramDesc* prog, const std::string& type, - const std::vector& inputs, - const std::vector& outputs) { - auto* op = prog->MutableBlock(0)->AppendOp(); - op->SetType(type); - if (type == "conv2d") { - op->SetAttr("use_mkldnn", true); - op->SetInput("Input", {inputs[0]}); - op->SetInput("Filter", {inputs[1]}); - op->SetInput("Bias", {inputs[2]}); - } else if (type == "relu") { - op->SetInput("X", inputs); - } - op->SetOutput("Out", outputs); -} - -// a->OP0->b -// b->OP1->c -// (c, weights, bias)->conv->f -// (f)->relu->g -ProgramDesc BuildProgramDesc() { - ProgramDesc prog; - for (auto& v : - std::vector({"a", "b", "c", "weights", "bias", "f", "g"})) { - auto* var = prog.MutableBlock(0)->Var(v); - var->SetType(proto::VarType::SELECTED_ROWS); - if (v == "weights" || v == "bias") { - var->SetPersistable(true); - } - } - - SetOp(&prog, "OP0", std::vector({"a"}), - std::vector({"b"})); - SetOp(&prog, "OP1", std::vector({"b"}), - std::vector({"c"})); - SetOp(&prog, "conv2d", std::vector({"c", "weights", "bias"}), - std::vector({"f"})); - SetOp(&prog, "relu", std::vector({"f"}), - std::vector({"g"})); - - return prog; -} - -TEST(ConvReLUFusePass, basic) { - auto prog = BuildProgramDesc(); - - std::unique_ptr graph(new ir::Graph(prog)); - - auto pass = PassRegistry::Instance().Get("conv_relu_mkldnn_fuse_pass"); - - int original_nodes_num = graph->Nodes().size(); - - graph = pass->Apply(std::move(graph)); - - int current_nodes_num = graph->Nodes().size(); - - // Remove 3 Nodes: CONV, RELU, conv_out - // Add 1 Node: ConvReLU - EXPECT_EQ(original_nodes_num - 2, current_nodes_num); - - // Assert conv_relu op in newly generated graph - int conv_relu_count = 0; - - for (auto* node : graph->Nodes()) { - if (node->IsOp() && node->Op()->Type() == "conv2d") { - auto* op = node->Op(); - ASSERT_TRUE(op->HasAttr("use_mkldnn")); - EXPECT_TRUE(boost::get(op->GetAttr("use_mkldnn"))); - ASSERT_TRUE(op->HasAttr("fuse_relu")); - bool fuse_relu = boost::get(op->GetAttr("fuse_relu")); - if (fuse_relu) { - ++conv_relu_count; - } - } - } - EXPECT_EQ(conv_relu_count, 1); -} - -} // namespace ir -} // namespace framework -} // namespace paddle - -USE_PASS(conv_relu_mkldnn_fuse_pass); diff --git a/paddle/fluid/framework/ir/embedding_fc_lstm_fuse_pass.cc b/paddle/fluid/framework/ir/embedding_fc_lstm_fuse_pass.cc index ba11f19c9273650113096be3fa23ca077bbc7dd9..3a6bbe65b369341c2a142dfcb261f5646d782796 100644 --- a/paddle/fluid/framework/ir/embedding_fc_lstm_fuse_pass.cc +++ b/paddle/fluid/framework/ir/embedding_fc_lstm_fuse_pass.cc @@ -15,6 +15,8 @@ #include "paddle/fluid/framework/ir/embedding_fc_lstm_fuse_pass.h" #include #include +#include +#include #include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/operators/math/blas.h" @@ -201,7 +203,7 @@ static int BuildFusion(Graph* graph, const std::string& name_scope, // Remove unneeded nodes. // TODO(jczaja): Proper removing of lookup table std::unordered_set marked_nodes( - //{lookup_table, mul, lstm, elementwise_add, fc_bias, W}); + // {lookup_table, mul, lstm, elementwise_add, fc_bias, W}); {mul, lstm, elementwise_add, fc_bias}); GraphSafeRemoveNodes(graph, marked_nodes); } else { @@ -224,15 +226,13 @@ static int BuildFusion(Graph* graph, const std::string& name_scope, return fusion_count; } -std::unique_ptr EmbeddingFCLSTMFusePass::ApplyImpl( - std::unique_ptr graph) const { - FusePassBase::Init(name_scope_, graph.get()); +void EmbeddingFCLSTMFusePass::ApplyImpl(ir::Graph* graph) const { + FusePassBase::Init(name_scope_, graph); - int fusion_count = BuildFusion(graph.get(), name_scope_, param_scope(), - true /*with_fc_bias*/); + int fusion_count = + BuildFusion(graph, name_scope_, param_scope(), true /*with_fc_bias*/); AddStatis(fusion_count); - return graph; } } // namespace ir diff --git a/paddle/fluid/framework/ir/embedding_fc_lstm_fuse_pass.h b/paddle/fluid/framework/ir/embedding_fc_lstm_fuse_pass.h index e5ad3067ec4060e41f1464395f3fc76183de3e66..65cb4439727b466506af35df1bed609b18c06ee0 100644 --- a/paddle/fluid/framework/ir/embedding_fc_lstm_fuse_pass.h +++ b/paddle/fluid/framework/ir/embedding_fc_lstm_fuse_pass.h @@ -14,6 +14,8 @@ #pragma once +#include + #include "paddle/fluid/framework/ir/fuse_pass_base.h" #include "paddle/fluid/framework/ir/graph.h" #include "paddle/fluid/framework/ir/graph_pattern_detector.h" @@ -30,7 +32,7 @@ class EmbeddingFCLSTMFusePass : public FusePassBase { virtual ~EmbeddingFCLSTMFusePass() {} protected: - std::unique_ptr ApplyImpl(std::unique_ptr graph) const; + void ApplyImpl(ir::Graph* graph) const override; const std::string name_scope_{"embedding_fc_lstm_fuse"}; }; diff --git a/paddle/fluid/framework/ir/expected_kernel_cache_pass.cc b/paddle/fluid/framework/ir/expected_kernel_cache_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..4a99d4c1a9c0f0bd973097d281e380341fe88515 --- /dev/null +++ b/paddle/fluid/framework/ir/expected_kernel_cache_pass.cc @@ -0,0 +1,37 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/ir/expected_kernel_cache_pass.h" +#include +#include "paddle/fluid/framework/operator.h" + +namespace paddle { +namespace framework { +namespace ir { + +void ExpectedKernelCachePass::ApplyImpl(ir::Graph* graph) const { + VLOG(3) << "Applies Expected Kernel Cache strategy."; + for (const Node* n : graph->Nodes()) { + if (n->IsOp() && n->Op()) { + n->Op()->SetAttr(kEnableCacheExpectedKernel, true); + } + } +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(expected_kernel_cache_pass, + paddle::framework::ir::ExpectedKernelCachePass); diff --git a/paddle/fluid/framework/ir/expected_kernel_cache_pass.h b/paddle/fluid/framework/ir/expected_kernel_cache_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..bf0907d3feb7bccd163363da65505e0af3fb9bf6 --- /dev/null +++ b/paddle/fluid/framework/ir/expected_kernel_cache_pass.h @@ -0,0 +1,31 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include "paddle/fluid/framework/ir/pass.h" + +namespace paddle { +namespace framework { +namespace ir { + +class ExpectedKernelCachePass : public Pass { + protected: + void ApplyImpl(ir::Graph* graph) const override; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/fc_fuse_pass.cc b/paddle/fluid/framework/ir/fc_fuse_pass.cc index ca704c7f5631bbaa88f1bc2caaa22fd021de11c4..ca008763bff8ff89d5dba02e483090f2bec77592 100644 --- a/paddle/fluid/framework/ir/fc_fuse_pass.cc +++ b/paddle/fluid/framework/ir/fc_fuse_pass.cc @@ -14,17 +14,18 @@ #include "paddle/fluid/framework/ir/fc_fuse_pass.h" #include +#include #include +#include "paddle/fluid/framework/ir/graph_helper.h" #include "paddle/fluid/platform/enforce.h" namespace paddle { namespace framework { namespace ir { -std::unique_ptr FCFusePass::ApplyImpl( - std::unique_ptr graph) const { - PADDLE_ENFORCE(graph.get()); - FusePassBase::Init("fc_fuse", graph.get()); +void FCFusePass::ApplyImpl(ir::Graph* graph) const { + PADDLE_ENFORCE(graph); + FusePassBase::Init("fc_fuse", graph); std::unordered_set nodes2delete; @@ -57,9 +58,10 @@ std::unique_ptr FCFusePass::ApplyImpl( desc.SetInput("W", std::vector({fc_Y_in})); desc.SetInput("Bias", std::vector({fc_bias_in})); desc.SetOutput("Out", std::vector({fc_out_out})); + desc.SetAttr("in_num_col_dims", mul->Op()->GetAttr("x_num_col_dims")); desc.SetType("fc"); auto fc_node = g->CreateOpNode(&desc); // OpDesc will be copied. - GraphSafeRemoveNodes(graph.get(), {mul, elementwise_add, mul_out}); + GraphSafeRemoveNodes(graph, {mul, elementwise_add, mul_out}); PADDLE_ENFORCE(subgraph.count(x)); IR_NODE_LINK_TO(subgraph.at(x), fc_node); @@ -70,10 +72,9 @@ std::unique_ptr FCFusePass::ApplyImpl( found_fc_count++; }; - gpd(graph.get(), handler); + gpd(graph, handler); AddStatis(found_fc_count); - return graph; } } // namespace ir diff --git a/paddle/fluid/framework/ir/fc_fuse_pass.h b/paddle/fluid/framework/ir/fc_fuse_pass.h index 6c69539d1e48268afc2435f8f73b3818d13107cd..0a0fcd2da8542b83e6b1239f9d822eb8637b8f5b 100644 --- a/paddle/fluid/framework/ir/fc_fuse_pass.h +++ b/paddle/fluid/framework/ir/fc_fuse_pass.h @@ -12,6 +12,8 @@ // See the License for the specific language governing permissions and // limitations under the License. +#pragma once + #include "paddle/fluid/framework/ir/fuse_pass_base.h" #include "paddle/fluid/framework/ir/graph.h" #include "paddle/fluid/framework/ir/graph_pattern_detector.h" @@ -29,7 +31,7 @@ class FCFusePass : public FusePassBase { virtual ~FCFusePass() {} protected: - std::unique_ptr ApplyImpl(std::unique_ptr graph) const; + void ApplyImpl(ir::Graph* graph) const override; }; } // namespace ir diff --git a/paddle/fluid/framework/ir/fc_fuse_pass_tester.cc b/paddle/fluid/framework/ir/fc_fuse_pass_tester.cc index 06286a109d01af638e74e06ccc83e2a5500663ea..affe506910bbefc6244d85ff8c88cb33e05f8fe5 100644 --- a/paddle/fluid/framework/ir/fc_fuse_pass_tester.cc +++ b/paddle/fluid/framework/ir/fc_fuse_pass_tester.cc @@ -15,6 +15,7 @@ #include "paddle/fluid/framework/ir/fc_fuse_pass.h" #include +#include "paddle/fluid/framework/op_proto_maker.h" namespace paddle { namespace framework { @@ -28,10 +29,13 @@ void SetOp(ProgramDesc* prog, const std::string& type, if (type == "mul") { op->SetInput("X", {inputs[0]}); op->SetInput("Y", {inputs[1]}); + op->SetAttr("x_num_col_dims", {1}); } else if (type == "elementwise_add") { op->SetInput("X", inputs); } op->SetOutput("Out", outputs); + op->SetAttr(OpProtoAndCheckerMaker::OpRoleAttrName(), + static_cast(OpRole::kForward)); } // a->OP0->b @@ -69,7 +73,7 @@ TEST(FCFusePass, basic) { int pre_nodes = graph->Nodes().size(); - graph = pass->Apply(std::move(graph)); + graph.reset(pass->Apply(graph.release())); int after_nodes = graph->Nodes().size(); diff --git a/paddle/fluid/framework/ir/fc_gru_fuse_pass.cc b/paddle/fluid/framework/ir/fc_gru_fuse_pass.cc index a902b0b50cf27ff84877053aca2ff921cd00b833..5f660c6d366fe094aed84ed2aa2f05adcbebbc43 100644 --- a/paddle/fluid/framework/ir/fc_gru_fuse_pass.cc +++ b/paddle/fluid/framework/ir/fc_gru_fuse_pass.cc @@ -14,6 +14,7 @@ #include "paddle/fluid/framework/ir/fc_gru_fuse_pass.h" #include +#include #include "paddle/fluid/framework/lod_tensor.h" namespace paddle { @@ -39,7 +40,6 @@ static int BuildFusion(Graph* graph, const std::string& name_scope, // Create New OpDesc auto gru_creater = [&](Node* gru, Node* x, Node* weight_x, Node* weight_h, Node* bias, Node* hidden, Node* fc_bias) { - OpDesc op_desc; op_desc.SetType("fusion_gru"); @@ -155,26 +155,22 @@ static int BuildFusion(Graph* graph, const std::string& name_scope, return fusion_count; } -std::unique_ptr MulGRUFusePass::ApplyImpl( - std::unique_ptr graph) const { - FusePassBase::Init(name_scope_, graph.get()); +void MulGRUFusePass::ApplyImpl(ir::Graph* graph) const { + FusePassBase::Init(name_scope_, graph); - int fusion_count = BuildFusion(graph.get(), name_scope_, param_scope(), - false /*with_fc_bias*/); + int fusion_count = + BuildFusion(graph, name_scope_, param_scope(), false /*with_fc_bias*/); AddStatis(fusion_count); - return graph; } -std::unique_ptr FCGRUFusePass::ApplyImpl( - std::unique_ptr graph) const { - FusePassBase::Init(name_scope_, graph.get()); +void FCGRUFusePass::ApplyImpl(ir::Graph* graph) const { + FusePassBase::Init(name_scope_, graph); - int fusion_count = BuildFusion(graph.get(), name_scope_, param_scope(), - true /*with_fc_bias*/); + int fusion_count = + BuildFusion(graph, name_scope_, param_scope(), true /*with_fc_bias*/); AddStatis(fusion_count); - return graph; } } // namespace ir diff --git a/paddle/fluid/framework/ir/fc_gru_fuse_pass.h b/paddle/fluid/framework/ir/fc_gru_fuse_pass.h index 63e1c72bfb2e2641ae5d44858b342d5e427e9045..e11cdac7ea95219444c35bb8deef630fe29d3734 100644 --- a/paddle/fluid/framework/ir/fc_gru_fuse_pass.h +++ b/paddle/fluid/framework/ir/fc_gru_fuse_pass.h @@ -30,7 +30,7 @@ class FCGRUFusePass : public FusePassBase { virtual ~FCGRUFusePass() {} protected: - std::unique_ptr ApplyImpl(std::unique_ptr graph) const; + void ApplyImpl(ir::Graph* graph) const override; const std::string name_scope_{"fc_gru_fuse"}; }; @@ -41,7 +41,7 @@ class MulGRUFusePass : public FusePassBase { virtual ~MulGRUFusePass() {} protected: - std::unique_ptr ApplyImpl(std::unique_ptr graph) const; + void ApplyImpl(ir::Graph* graph) const override; const std::string name_scope_{"fc_nobias_gru_fuse"}; }; diff --git a/paddle/fluid/framework/ir/fc_lstm_fuse_pass.cc b/paddle/fluid/framework/ir/fc_lstm_fuse_pass.cc index f5c286486520391906a6cd7545041c8a7df614ea..babeba96149247fda20a1621a580cdcdbc2750d1 100644 --- a/paddle/fluid/framework/ir/fc_lstm_fuse_pass.cc +++ b/paddle/fluid/framework/ir/fc_lstm_fuse_pass.cc @@ -14,6 +14,7 @@ #include "paddle/fluid/framework/ir/fc_lstm_fuse_pass.h" #include +#include #include "paddle/fluid/framework/lod_tensor.h" namespace paddle { @@ -157,26 +158,22 @@ int BuildFusion(Graph* graph, const std::string& name_scope, Scope* scope, return fusion_count; } -std::unique_ptr MulLstmFusePass::ApplyImpl( - std::unique_ptr graph) const { - FusePassBase::Init(name_scope_, graph.get()); +void MulLstmFusePass::ApplyImpl(ir::Graph* graph) const { + FusePassBase::Init(name_scope_, graph); - int fusion_count = BuildFusion(graph.get(), name_scope_, param_scope(), - false /*with_fc_bias*/); + int fusion_count = + BuildFusion(graph, name_scope_, param_scope(), false /*with_fc_bias*/); AddStatis(fusion_count); - return graph; } -std::unique_ptr FCLstmFusePass::ApplyImpl( - std::unique_ptr graph) const { - FusePassBase::Init(name_scope_, graph.get()); +void FCLstmFusePass::ApplyImpl(ir::Graph* graph) const { + FusePassBase::Init(name_scope_, graph); - int fusion_count = BuildFusion(graph.get(), name_scope_, param_scope(), - true /*with_fc_bias*/); + int fusion_count = + BuildFusion(graph, name_scope_, param_scope(), true /*with_fc_bias*/); AddStatis(fusion_count); - return graph; } } // namespace ir diff --git a/paddle/fluid/framework/ir/fc_lstm_fuse_pass.h b/paddle/fluid/framework/ir/fc_lstm_fuse_pass.h index 3ee32c63a46fcc34bdccd1e14d4bbaf9668c49e9..5dea7c91a860f0b9622610f12f195eafb9849555 100644 --- a/paddle/fluid/framework/ir/fc_lstm_fuse_pass.h +++ b/paddle/fluid/framework/ir/fc_lstm_fuse_pass.h @@ -14,6 +14,8 @@ #pragma once +#include + #include "paddle/fluid/framework/ir/fuse_pass_base.h" #include "paddle/fluid/framework/ir/graph.h" #include "paddle/fluid/framework/ir/graph_pattern_detector.h" @@ -30,7 +32,7 @@ class FCLstmFusePass : public FusePassBase { virtual ~FCLstmFusePass() {} protected: - std::unique_ptr ApplyImpl(std::unique_ptr graph) const; + void ApplyImpl(ir::Graph* graph) const override; const std::string name_scope_{"fc_lstm_fuse"}; }; @@ -40,7 +42,7 @@ class MulLstmFusePass : public FusePassBase { virtual ~MulLstmFusePass() {} protected: - std::unique_ptr ApplyImpl(std::unique_ptr graph) const; + void ApplyImpl(ir::Graph* graph) const override; const std::string name_scope_{"fc_nobias_lstm_fuse"}; }; diff --git a/paddle/fluid/framework/ir/fillconstant_elementwisemul_fuse.cc b/paddle/fluid/framework/ir/fillconstant_elementwisemul_fuse.cc new file mode 100644 index 0000000000000000000000000000000000000000..915a2f62bafa2baf98b7407cd87d3e69f20b44d2 --- /dev/null +++ b/paddle/fluid/framework/ir/fillconstant_elementwisemul_fuse.cc @@ -0,0 +1,83 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include + +#include "paddle/fluid/framework/ir/fillconstant_elementwisemul_fuse.h" +#include "paddle/fluid/framework/ir/graph_viz_pass.h" + +namespace paddle { +namespace framework { +namespace ir { + +#define GET_IR_NODE(node__) GET_IR_NODE_FROM_SUBGRAPH(node__, node__, pattern); +#define GET_NODES \ + GET_IR_NODE(fill_constant); \ + GET_IR_NODE(fill_constant_out); \ + GET_IR_NODE(elementwise_mul); \ + GET_IR_NODE(elementwise_mul_out); + +void FillconstantElementwisemulFuse::ApplyImpl(ir::Graph* graph) const { + const std::string pattern_name = "fillconstant_elementwisemul_fuse"; + FusePassBase::Init(pattern_name, graph); + + GraphPatternDetector gpd; + auto* x = gpd.mutable_pattern() + ->NewNode("x") + ->assert_is_op_input("elementwise_mul", "X") + ->AsInput(); + + patterns::FillConstantElementWiseMulFuse pattern(gpd.mutable_pattern(), + pattern_name); + pattern(x); + + auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* g) { + GET_NODES; + + PADDLE_ENFORCE(subgraph.count(x)); + auto* elementwise_in = subgraph.at(x); + float constant_value = + boost::get(fill_constant->Op()->GetAttr("value")); + + framework::OpDesc new_op_desc; + new_op_desc.SetType("scale"); + new_op_desc.SetInput("X", {elementwise_in->Name()}); + new_op_desc.SetAttr("scale", constant_value); + new_op_desc.SetAttr("bias", static_cast(0.0)); + new_op_desc.SetAttr("bias_after_scale", true); + new_op_desc.SetOutput("Out", {elementwise_mul_out->Name()}); + new_op_desc.Flush(); + + // Create a new node for the fused op. + auto* scale_op = graph->CreateOpNode(&new_op_desc); + + IR_NODE_LINK_TO(elementwise_in, scale_op); // Input + IR_NODE_LINK_TO(scale_op, elementwise_mul_out); // Output + + // Delete the unneeded nodes. + GraphSafeRemoveNodes(graph, + {fill_constant, fill_constant_out, elementwise_mul}); + }; + + gpd(graph, handler); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(fillconstant_elementwisemul_fuse, + paddle::framework::ir::FillconstantElementwisemulFuse); diff --git a/paddle/fluid/framework/ir/fillconstant_elementwisemul_fuse.h b/paddle/fluid/framework/ir/fillconstant_elementwisemul_fuse.h new file mode 100644 index 0000000000000000000000000000000000000000..ab66fb4a46a8a5b60b3bf95e27ae24c7217a5a3a --- /dev/null +++ b/paddle/fluid/framework/ir/fillconstant_elementwisemul_fuse.h @@ -0,0 +1,34 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" + +namespace paddle { +namespace framework { +namespace ir { + +class FillconstantElementwisemulFuse : public FusePassBase { + public: + virtual ~FillconstantElementwisemulFuse() {} + + protected: + void ApplyImpl(ir::Graph* graph) const override; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/fuse_elewise_add_act_pass.cc b/paddle/fluid/framework/ir/fuse_elewise_add_act_pass.cc index 648acc4a759417240d9a39749b059289182ebb1e..bd49673168377486cd81726ce623e7196270d6a0 100644 --- a/paddle/fluid/framework/ir/fuse_elewise_add_act_pass.cc +++ b/paddle/fluid/framework/ir/fuse_elewise_add_act_pass.cc @@ -15,6 +15,8 @@ #include "paddle/fluid/framework/ir/fuse_elewise_add_act_pass.h" #include #include +#include +#include #include #include "paddle/fluid/framework/operator.h" #include "paddle/fluid/platform/enforce.h" @@ -23,29 +25,25 @@ namespace paddle { namespace framework { namespace ir { -std::unique_ptr FuseElewiseAddActPass::ApplyImpl( - std::unique_ptr graph) const { +void FuseElewiseAddActPass::ApplyImpl(ir::Graph *graph) const { std::unordered_set act_types = {"relu", "scale"}; - graph = FuseActElewiseAdd(std::move(graph), act_types); - graph = FuseElewiseAddAct(std::move(graph), act_types); + graph = FuseActElewiseAdd(graph, act_types); + graph = FuseElewiseAddAct(graph, act_types); // backward { std::unordered_set in_place_act_types = {"relu_grad"}; - graph = FuseElewiseAddActInplaceGrad(std::move(graph), in_place_act_types); + graph = FuseElewiseAddActInplaceGrad(graph, in_place_act_types); } // Remove the removable intermediate_out. - RemoveIntermediateOut(graph.get()); - - return graph; + RemoveIntermediateOut(graph); } // ele_add(x, act(y)) -std::unique_ptr FuseElewiseAddActPass::FuseElewiseAddAct( - std::unique_ptr graph, - const std::unordered_set &act_types) const { - PADDLE_ENFORCE(graph.get()); - FusePassBase::Init("elewise_add_act", graph.get()); +ir::Graph *FuseElewiseAddActPass::FuseElewiseAddAct( + ir::Graph *graph, const std::unordered_set &act_types) const { + PADDLE_ENFORCE(graph); + FusePassBase::Init("elewise_add_act", graph); GraphPatternDetector gpd; auto *x = gpd.mutable_pattern() @@ -86,18 +84,17 @@ std::unique_ptr FuseElewiseAddActPass::FuseElewiseAddAct( found_elewise_add_act_count++; }; - gpd(graph.get(), handler); + gpd(graph, handler); AddStatis(found_elewise_add_act_count); return graph; } // act(ele_add(x,y)) -std::unique_ptr FuseElewiseAddActPass::FuseActElewiseAdd( - std::unique_ptr graph, - const std::unordered_set &act_types) const { - PADDLE_ENFORCE(graph.get()); - FusePassBase::Init("act_elewise_add", graph.get()); +ir::Graph *FuseElewiseAddActPass::FuseActElewiseAdd( + ir::Graph *graph, const std::unordered_set &act_types) const { + PADDLE_ENFORCE(graph); + FusePassBase::Init("act_elewise_add", graph); GraphPatternDetector gpd; auto *x = gpd.mutable_pattern() @@ -137,7 +134,7 @@ std::unique_ptr FuseElewiseAddActPass::FuseActElewiseAdd( found_elewise_add_act_count++; }; - gpd(graph.get(), handler); + gpd(graph, handler); AddStatis(found_elewise_add_act_count); return graph; @@ -146,11 +143,10 @@ std::unique_ptr FuseElewiseAddActPass::FuseActElewiseAdd( // the backward of act(ele_add(x,y)) // act_grad: in["Out", "Out@GRAD"], out["X@GRAD"] // ele_add_grad: in["Y", "Out@GRAD"], out["X@GRAD", "Y@GRAD"] -std::unique_ptr FuseElewiseAddActPass::FuseElewiseAddActInplaceGrad( - std::unique_ptr graph, - const std::unordered_set &act_types) const { - PADDLE_ENFORCE(graph.get()); - FusePassBase::Init("elewise_add_act_grad", graph.get()); +ir::Graph *FuseElewiseAddActPass::FuseElewiseAddActInplaceGrad( + ir::Graph *graph, const std::unordered_set &act_types) const { + PADDLE_ENFORCE(graph); + FusePassBase::Init("elewise_add_act_grad", graph); GraphPatternDetector gpd; auto *d_act_out = gpd.mutable_pattern() @@ -217,7 +213,7 @@ std::unique_ptr FuseElewiseAddActPass::FuseElewiseAddActInplaceGrad( found_elewise_add_act_count++; }; - gpd(graph.get(), handler); + gpd(graph, handler); AddStatis(found_elewise_add_act_count); return graph; diff --git a/paddle/fluid/framework/ir/fuse_elewise_add_act_pass.h b/paddle/fluid/framework/ir/fuse_elewise_add_act_pass.h index b2fecc076efca333539fe81e67eee222873aee2a..dc73f1fda03e130c6876819d91897b497b8b321e 100644 --- a/paddle/fluid/framework/ir/fuse_elewise_add_act_pass.h +++ b/paddle/fluid/framework/ir/fuse_elewise_add_act_pass.h @@ -14,6 +14,8 @@ #pragma once #include +#include +#include #include #include "paddle/fluid/framework/ir/fuse_pass_base.h" #include "paddle/fluid/framework/ir/graph.h" @@ -32,19 +34,16 @@ class FuseElewiseAddActPass : public FusePassBase { virtual ~FuseElewiseAddActPass() {} protected: - std::unique_ptr ApplyImpl(std::unique_ptr graph) const; + void ApplyImpl(ir::Graph *graph) const override; - std::unique_ptr FuseElewiseAddAct( - std::unique_ptr graph, - const std::unordered_set &act_types) const; + ir::Graph *FuseElewiseAddAct( + ir::Graph *graph, const std::unordered_set &act_types) const; - std::unique_ptr FuseActElewiseAdd( - std::unique_ptr graph, - const std::unordered_set &act_types) const; + ir::Graph *FuseActElewiseAdd( + ir::Graph *graph, const std::unordered_set &act_types) const; - std::unique_ptr FuseElewiseAddActInplaceGrad( - std::unique_ptr graph, - const std::unordered_set &act_types) const; + ir::Graph *FuseElewiseAddActInplaceGrad( + ir::Graph *graph, const std::unordered_set &act_types) const; /** * Remove the removable intermediate_out. diff --git a/paddle/fluid/framework/ir/fuse_pass_base.cc b/paddle/fluid/framework/ir/fuse_pass_base.cc new file mode 100644 index 0000000000000000000000000000000000000000..d70010089e4b4fbb4542ef7748b8e9ece48d3942 --- /dev/null +++ b/paddle/fluid/framework/ir/fuse_pass_base.cc @@ -0,0 +1,62 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/fuse_pass_base.h" + +namespace paddle { +namespace framework { +namespace ir { + +void FusePassBase::Init(const std::string& repr, Graph* graph) const { + repr_ = repr; + graph_ = graph; +} + +Scope* FusePassBase::param_scope() const { + PADDLE_ENFORCE(graph_->Has(kParamScopeAttr)); + return graph_->Get(kParamScopeAttr); +} + +void FusePassBase::AddStatis(int count_of_fused) const { + PADDLE_ENFORCE(graph_); + PADDLE_ENFORCE(!repr_.empty()); + if (!graph_->Has(kFuseStatisAttr)) { + graph_->Set(kFuseStatisAttr, new std::unordered_map); + } + auto& info = + graph_->Get>(kFuseStatisAttr); + info[repr_] = count_of_fused; +} + +FuseOptions FusePassBase::FindFuseOption(const Node& node1, + const Node& node2) const { +#ifdef PADDLE_WITH_MKLDNN + bool node1_mkldnn = node1.Op()->HasAttr("use_mkldnn") && + boost::get(node1.Op()->GetAttr("use_mkldnn")); + bool node2_mkldnn = node2.Op()->HasAttr("use_mkldnn") && + boost::get(node2.Op()->GetAttr("use_mkldnn")); + if (node1_mkldnn && node2_mkldnn) + return FUSE_MKLDNN; + else if (!node1_mkldnn && !node2_mkldnn) + return FUSE_NATIVE; + else + return DO_NOT_FUSE; +#else + return FUSE_NATIVE; +#endif +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/fuse_pass_base.h b/paddle/fluid/framework/ir/fuse_pass_base.h index 877bbeb502252cac77095981641d7ce283ca1eb7..3a1022bbcbd671391fb034bdff7c3cf97952f84d 100644 --- a/paddle/fluid/framework/ir/fuse_pass_base.h +++ b/paddle/fluid/framework/ir/fuse_pass_base.h @@ -14,6 +14,7 @@ #pragma once +#include #include "paddle/fluid/framework/ir/graph.h" #include "paddle/fluid/framework/ir/pass.h" #include "paddle/fluid/framework/scope.h" @@ -24,33 +25,29 @@ namespace ir { static const char kParamScopeAttr[] = "__param_scope__"; static const char kFuseStatisAttr[] = "__fuse_statis__"; +// When we use trt or other third_party lib, the parameters are managed by +// the lib, but not the fluid. So we need to record them to avoid duplicate +// allocation. +static const char kRepetitiveParamAttr[] = "__repetitive_param__"; + +enum FuseOptions { + DO_NOT_FUSE, // fusing will not be done + FUSE_NATIVE, // fusing will be done without MKL-DNN + FUSE_MKLDNN // fusing will be done with MKL-DNN +}; class FusePassBase : public Pass { public: - void Init(const std::string& repr, Graph* graph) const { - repr_ = repr; - graph_ = graph; - } - - Scope* param_scope() const { - PADDLE_ENFORCE(graph_->Has(kParamScopeAttr)); - return graph_->Get(kParamScopeAttr); - } - - void AddStatis(int count_of_fused) const { - PADDLE_ENFORCE(graph_); - PADDLE_ENFORCE(!repr_.empty()); - if (!graph_->Has(kFuseStatisAttr)) { - graph_->Set(kFuseStatisAttr, new std::unordered_map); - } - auto& info = - graph_->Get>(kFuseStatisAttr); - info[repr_] = count_of_fused; - } + void Init(const std::string& repr, Graph* graph) const; + Scope* param_scope() const; + void AddStatis(int count_of_fused) const; virtual ~FusePassBase() {} protected: + virtual FuseOptions FindFuseOption(const Node& node1, + const Node& node2) const; + mutable Graph* graph_; mutable std::string repr_; }; diff --git a/paddle/fluid/framework/ir/fuse_relu_depthwise_conv_pass.cc b/paddle/fluid/framework/ir/fuse_relu_depthwise_conv_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..c4e6b6e6a52ec77c85c7c6162c4cbd006e47c502 --- /dev/null +++ b/paddle/fluid/framework/ir/fuse_relu_depthwise_conv_pass.cc @@ -0,0 +1,157 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/fuse_relu_depthwise_conv_pass.h" +#include +#include +#include +#include +#include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/platform/enforce.h" + +namespace paddle { +namespace framework { +namespace ir { + +void FuseReluDepthwiseConvPass::ApplyImpl(ir::Graph *graph) const { + graph = FuseReluDepthwiseConv(graph, true); + graph = FuseReluDepthwiseConv(graph, false); +} + +ir::Graph *FuseReluDepthwiseConvPass::FuseReluDepthwiseConv( + ir::Graph *graph, bool only_forward) const { + PADDLE_ENFORCE(graph); + if (only_forward) + FusePassBase::Init("relu_depthwise_conv_only_forward", graph); + else + FusePassBase::Init("relu_depthwise_conv", graph); + /* + x ---act--> y ---layer-> z + +----------+ + ↓ ↓ + x' <--act'--- y' <-layer'--- z' + + fuse to: + + x ---act-layer-> z + | + ↓ + x' <--act-layer'--- z' + + */ + + GraphPatternDetector gpd; + auto *pattern = gpd.mutable_pattern(); + std::string act_type = "relu"; + std::string layer_type = "depthwise_conv2d"; + auto *x = pattern->NewNode("x")->AsInput(); + auto *y = pattern->NewNode("y")->AsIntermediate(); + auto *z = pattern->NewNode("z")->AsOutput(); + PDNode *xg = nullptr; + PDNode *yg = nullptr; + PDNode *zg = nullptr; + if (!only_forward) { + xg = pattern->NewNode("xg")->AsOutput(); + yg = pattern->NewNode("yg")->AsIntermediate(); + zg = pattern->NewNode("zg")->AsInput(); + } + + PDNode *act_g = nullptr; + PDNode *layer_g = nullptr; + auto *act = pattern->NewNode("act")->assert_is_op(act_type); + auto *layer = pattern->NewNode("layer")->assert_is_op(layer_type); + if (!only_forward) { + act_g = pattern->NewNode("act_g")->assert_is_op(act_type + "_grad"); + layer_g = pattern->NewNode("layer_g")->assert_is_op(layer_type + "_grad"); + } + + act->LinksFrom({x}).LinksTo({y}); + layer->LinksFrom({y}).LinksTo({z}); + if (!only_forward) { + layer_g->LinksFrom({y, zg}).LinksTo({yg}); + act_g->LinksFrom({y, yg}).LinksTo({xg}); + } + + int count = 0; + std::unordered_set need_removed_nodes; + + auto handler = [&](const GraphPatternDetector::subgraph_t &subgraph, + Graph *g) { + VLOG(4) << "handle FuseReluDepthwiseConv fuse"; + // 1. turn on fuse option + auto *layer_op = subgraph.at(layer)->Op(); + layer_op->SetAttr("use_cudnn", false); + layer_op->SetAttr("fuse_relu_before_depthwise_conv", true); + + OpDesc *layer_g_op = nullptr; + if (!only_forward) { + layer_g_op = subgraph.at(layer_g)->Op(); + layer_g_op->SetAttr("use_cudnn", false); + layer_g_op->SetAttr("fuse_relu_before_depthwise_conv", true); + } + // 2. connect x to layer and layer_g, layer_g to xg + auto *y_var = subgraph.at(y)->Var(); + auto *x_var = subgraph.at(x)->Var(); + VarDesc *yg_var = nullptr; + VarDesc *xg_var = nullptr; + if (!only_forward) { + yg_var = subgraph.at(yg)->Var(); + xg_var = subgraph.at(xg)->Var(); + } + + PADDLE_ENFORCE_EQ(layer_op->Input("Input").size(), 1UL); + PADDLE_ENFORCE_EQ(layer_op->Input("Input")[0], y_var->Name()); + layer_op->SetInput("Input", {x_var->Name()}); + subgraph.at(layer)->inputs.push_back(subgraph.at(x)); + subgraph.at(x)->outputs.push_back(subgraph.at(layer)); + VLOG(4) << "replace " << y_var->Name() << " -> " << x_var->Name(); + + if (!only_forward) { + PADDLE_ENFORCE_EQ(layer_g_op->Input("Input").size(), 1UL); + PADDLE_ENFORCE_EQ(layer_g_op->Input("Input")[0], y_var->Name()); + layer_g_op->SetInput("Input", {x_var->Name()}); + subgraph.at(layer_g)->inputs.push_back(subgraph.at(x)); + subgraph.at(x)->outputs.push_back(subgraph.at(layer_g)); + + PADDLE_ENFORCE_EQ(layer_g_op->Output(GradVarName("Input")).size(), 1UL); + PADDLE_ENFORCE_EQ(layer_g_op->Output(GradVarName("Input"))[0], + yg_var->Name()); + layer_g_op->SetOutput(GradVarName("Input"), {xg_var->Name()}); + subgraph.at(layer_g)->outputs.push_back(subgraph.at(xg)); + subgraph.at(xg)->inputs.push_back(subgraph.at(layer_g)); + VLOG(4) << "replace " << yg_var->Name() << " -> " << xg_var->Name(); + } + + // 3. delete y, yg, act, act_g + + if (only_forward) { + need_removed_nodes.insert({subgraph.at(y), subgraph.at(act)}); + } else { + need_removed_nodes.insert({subgraph.at(y), subgraph.at(yg), + subgraph.at(act), subgraph.at(act_g)}); + } + count++; + }; + gpd(graph, handler); + GraphSafeRemoveNodes(graph, need_removed_nodes); + AddStatis(count); + return graph; +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(fuse_relu_depthwise_conv_pass, + paddle::framework::ir::FuseReluDepthwiseConvPass); diff --git a/paddle/fluid/framework/ir/fuse_relu_depthwise_conv_pass.h b/paddle/fluid/framework/ir/fuse_relu_depthwise_conv_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..d37c153dd2a05ecfc8f0626626bbc3ed2f85968b --- /dev/null +++ b/paddle/fluid/framework/ir/fuse_relu_depthwise_conv_pass.h @@ -0,0 +1,41 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. +#pragma once + +#include +#include +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" +#include "paddle/fluid/framework/ir/pass.h" + +namespace paddle { +namespace framework { +namespace ir { + +/* + * Fuse the relu and depthwise conv + */ +class FuseReluDepthwiseConvPass : public FusePassBase { + public: + virtual ~FuseReluDepthwiseConvPass() {} + + protected: + void ApplyImpl(ir::Graph* graph) const override; + ir::Graph* FuseReluDepthwiseConv(ir::Graph* graph, bool only_forward) const; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/graph.cc b/paddle/fluid/framework/ir/graph.cc index 398f7095968e62f92d610f560d7574b27706d13e..6a9340b870df324f7dea03181bdb2b097e13e705 100644 --- a/paddle/fluid/framework/ir/graph.cc +++ b/paddle/fluid/framework/ir/graph.cc @@ -13,7 +13,7 @@ See the License for the specific language governing permissions and limitations under the License. */ #include -#include +#include #include "paddle/fluid/framework/ir/graph.h" #include "paddle/fluid/framework/op_proto_maker.h" @@ -24,79 +24,21 @@ namespace paddle { namespace framework { namespace ir { -std::vector FindDistTrainSendVars( - const std::vector &nodes) { - std::vector send_vars; - // since parameters are all in block 0, - // it's enough to only scan send ops in block 0 - for (auto &node : nodes) { - auto op_vars = node->Op()->InputArgumentNames(); - send_vars.reserve(send_vars.size() + - std::distance(op_vars.begin(), op_vars.end())); - send_vars.insert(send_vars.end(), op_vars.begin(), op_vars.end()); - } - return send_vars; -} - -std::vector FindDistTrainRecvVars( - const std::vector &nodes) { - std::vector recv_vars; - for (auto &node : nodes) { - auto op_vars = node->Op()->OutputArgumentNames(); - recv_vars.reserve(recv_vars.size() + - std::distance(op_vars.begin(), op_vars.end())); - recv_vars.insert(recv_vars.end(), op_vars.begin(), op_vars.end()); - } - return recv_vars; -} - -bool IsDistTrainOp(ir::Node *node, const std::vector &send_vars, - const std::vector &recv_vars) { - if (send_vars.size() == 0 || recv_vars.size() == 0) { - return false; - } - - /** - * Check any of opvars contains `.block` and in sendvars - */ - auto checker = [](const std::vector &opvars, - const std::vector &rpc_vars) -> bool { - for (auto &var : opvars) { - // a variable name with the suffix `.block` means it's a splited - // variable by (DistributeTranspiler) - // [python/paddle/fluid/transpiler/distribute_transpiler.py] - if (var.find(".block") != std::string::npos && - std::find(rpc_vars.begin(), rpc_vars.end(), var) != rpc_vars.end()) { - return true; - } - } - return false; - }; - - std::vector input_var_names; - std::vector output_var_names; - for (ir::Node *input : node->inputs) { - input_var_names.push_back(input->Name()); - } - for (ir::Node *output : node->outputs) { - output_var_names.push_back(output->Name()); - } - - return checker(output_var_names, send_vars) || - checker(input_var_names, recv_vars); -} - Graph::Graph(const ProgramDesc &program) : program_(program) { - // Make the nodes id start from 0. - Node::ResetId(); + auto var_nodes = InitFromProgram(program_); + ResolveHazard(var_nodes); +} +std::map> Graph::InitFromProgram( + const ProgramDesc &program) { VLOG(3) << "block in program:" << program_.Size(); std::unordered_map all_vars; + // var nodes for each var name, will have multiple versions in SSA + std::map> var_nodes; for (auto *var : program.Block(0).AllVars()) { all_vars.emplace(var->Name(), var); } - std::map> var_nodes; for (auto *op : program.Block(0).AllOps()) { ir::Node *node = CreateOpNode(op); // For input args, reuse the same var name if it was created before. @@ -134,7 +76,14 @@ Graph::Graph(const ProgramDesc &program) : program_(program) { var->inputs.push_back(node); } } + Set>( + details::kStaleProgramOpDescs, + new std::vector(program.Block(0).AllOps())); + return var_nodes; +} +void Graph::ResolveHazard( + const std::map> &var_nodes) { /** * We should handle write after read(WAR) and write after write(WAW) here. * Because some of the operators of the program can be executed parallelly. @@ -153,11 +102,15 @@ Graph::Graph(const ProgramDesc &program) : program_(program) { auto it_old = versions.rbegin(); ++it_old; for (; it_old != versions.rend(); it_new = it_old, ++it_old) { + VLOG(3) << "deal with var: " << (*it_new)->Name(); ir::Node *write_op = (*it_new)->inputs.empty() ? nullptr : (*it_new)->inputs[0]; const auto &read_ops = (*it_old)->outputs; - PADDLE_ENFORCE(write_op, "The write_op should not be empty."); + PADDLE_ENFORCE( + write_op, + string::Sprintf("The write_op of var %s should not be empty.", + (*it_new)->Name())); // Add write after write dependence ir::Node *upstream_op = @@ -199,6 +152,39 @@ Graph::Graph(const ProgramDesc &program) : program_(program) { } } +std::shared_ptr Graph::Clone() { + auto cloned_graph = std::make_shared(this->program_); + cloned_graph->ReleaseNodes(); + cloned_graph->num_node_created_ = 0; + std::unordered_map origin_to_cloned; + for (auto *n : this->node_set_) { + ir::Node *cloned_node = nullptr; + if (n->IsCtrlVar()) { + cloned_node = cloned_graph->CreateControlDepVar(); + } else if (!n->var_desc_ && !n->op_desc_) { // empty node + cloned_node = cloned_graph->CreateEmptyNode(n->Name(), n->NodeType()); + } else if (n->IsVar()) { + cloned_node = cloned_graph->CreateVarNode(n->Var()); + } else if (n->IsOp()) { + cloned_node = cloned_graph->CreateOpNode(n->Op()); + } + if (cloned_node) { + origin_to_cloned[n] = cloned_node; + } else { + PADDLE_THROW("The cloned node's type is not supported!"); + } + } + for (auto *n : this->node_set_) { + for (auto it = n->inputs.begin(); it != n->inputs.end(); it++) { + origin_to_cloned[n]->inputs.push_back(origin_to_cloned[*it]); + } + for (auto it = n->outputs.begin(); it != n->outputs.end(); it++) { + origin_to_cloned[n]->outputs.push_back(origin_to_cloned[*it]); + } + } + return cloned_graph; +} + bool IsControlDepVar(const ir::Node &var) { return var.Name().find(ir::Node::kControlDepVarName) != std::string::npos; } diff --git a/paddle/fluid/framework/ir/graph.h b/paddle/fluid/framework/ir/graph.h index ab687e760a761d4e445726bd5149966adc2403d0..fff015d4a6f0c631017458ceb039ae3f1deb0e2c 100644 --- a/paddle/fluid/framework/ir/graph.h +++ b/paddle/fluid/framework/ir/graph.h @@ -17,6 +17,7 @@ limitations under the License. */ #include #include #include +#include #include #include "paddle/fluid/framework/ir/node.h" @@ -26,6 +27,14 @@ limitations under the License. */ namespace paddle { namespace framework { + +namespace details { + +// This attr is not recommended, because the graph should not dependence +// the program once it is built. +constexpr char kStaleProgramOpDescs[] = "stale_program_op_descs"; +} // namespace details + namespace ir { /* @@ -73,14 +82,21 @@ class Graph { } bool Has(const std::string &attr_name) const { - return attrs_.find(attr_name) != attrs_.end(); + return attrs_.count(attr_name) > 0; } template AttrType &Get(const std::string &attr_name) const { PADDLE_ENFORCE(Has(attr_name), "%s attr not registered for graph.", attr_name); - return *boost::any_cast(attrs_.at(attr_name)); + try { + return *boost::any_cast(attrs_.at(attr_name)); + } catch (boost::bad_any_cast &) { + PADDLE_THROW( + "Invalid attribute type of %s error, expected: %s, actual: %s", + attr_name, typeid(AttrType *).name(), + attrs_.at(attr_name).type().name()); + } } template @@ -102,18 +118,30 @@ class Graph { attr_dels_[attr_name] = []() {}; } + void Erase(const std::string &attr_name) { + PADDLE_ENFORCE(attrs_.count(attr_name) != 0, "%s not set in the graph", + attr_name); + attr_dels_[attr_name](); + attrs_.erase(attr_name); + attr_dels_.erase(attr_name); + } + const std::unordered_set &Nodes() const { return node_set_; } // Create a normal variable with non-null VarDesc. ir::Node *CreateVarNode(VarDesc *var_desc) { PADDLE_ENFORCE(var_desc); - return AddNode(new ir::Node(var_desc)); + auto *x = AddNode(new ir::Node(var_desc)); + x->SetId(num_node_created_++); + return x; } // Create a normal runnable operator with OpDesc. ir::Node *CreateOpNode(OpDesc *op_desc) { PADDLE_ENFORCE(op_desc); - return AddNode(new ir::Node(op_desc)); + auto *x = AddNode(new ir::Node(op_desc)); + x->SetId(num_node_created_++); + return x; } // Create a control dependency var that connects 2 operations. The @@ -122,14 +150,19 @@ class Graph { ir::Node *CreateControlDepVar() { // TODO(panyx0718): control var name should be really unique. const std::string name = string::Sprintf( - "%s@%llu", ir::Node::kControlDepVarName, node_set_.size()); - return AddNode(new ir::Node(name, ir::Node::Type::kVariable)); + "%s@%llu", static_cast(ir::Node::kControlDepVarName), + num_node_created_); + auto *x = AddNode(new ir::Node(name, ir::Node::Type::kVariable)); + x->SetId(num_node_created_++); + return x; } // A more free style way of creating a graph node. Mostly use for test // or "copy" from another node. Avoid using it if possible. ir::Node *CreateEmptyNode(const std::string &name, ir::Node::Type type) { - return AddNode(new ir::Node(name, type)); + auto *x = AddNode(new ir::Node(name, type)); + x->SetId(num_node_created_++); + return x; } // Clear all node information of the graph and return the ownership of the @@ -144,14 +177,17 @@ class Graph { return ret; } - void RemoveNode(ir::Node *node) { + std::unique_ptr RemoveNode(ir::Node *node) { PADDLE_ENFORCE(node_set_.find(node) != node_set_.end()); - node_set_.erase(node); + std::unique_ptr ret; + ret.reset(nodes_.at(node).release()); nodes_.erase(node); + node_set_.erase(node); + return ret; } // NOTE low performance, but simple and secure. - Node *RetriveNode(int id) { + Node *RetrieveNode(int id) { for (auto &node : nodes_) { if (node.second->id() == id) { return node.second.get(); @@ -160,7 +196,17 @@ class Graph { return nullptr; } - private: + // Returns reference to the original program. + // WARN: After a series of passes, the current graph can be quite + // different from OriginProgram. Caller shouldn't assume much from + // the returned OriginProgram. + const ProgramDesc &OriginProgram() const { + LOG(WARNING) << "WARN: After a series of passes, the current graph can be " + "quite different from OriginProgram. So, please avoid " + "using the `OriginProgram()` method!"; + return program_; + } + // This method takes ownership of `node`. ir::Node *AddNode(ir::Node *node) { PADDLE_ENFORCE(node_set_.find(node) == node_set_.end()); @@ -169,12 +215,24 @@ class Graph { return node; } + void ResolveHazard( + const std::map> &var_nodes); + + // Create a new and duplicated graph. + // WARN: The method only clones the graph structure, not its attributes. + std::shared_ptr Clone(); + + private: + std::map> InitFromProgram( + const ProgramDesc &program); + // NOTE: program_ shouldn't be exposed to user. const ProgramDesc program_; std::map attrs_; std::map> attr_dels_; std::map> nodes_; std::unordered_set node_set_; + size_t num_node_created_{0}; // help to generate a unique node id. }; bool IsControlDepVar(const ir::Node &var); diff --git a/paddle/fluid/framework/ir/graph_helper.cc b/paddle/fluid/framework/ir/graph_helper.cc index c54766d95a61ac1a4b61566c6de62cbc86685a1d..28a37f331c100695f0ffec7288db84f4493d68a0 100644 --- a/paddle/fluid/framework/ir/graph_helper.cc +++ b/paddle/fluid/framework/ir/graph_helper.cc @@ -15,7 +15,17 @@ limitations under the License. */ #include "paddle/fluid/framework/ir/graph_helper.h" #include #include +#include +#include +#include +#include +#include #include +#include "paddle/fluid/framework/ir/graph_traits.h" + +DEFINE_string(print_sub_graph_dir, "", + "FLAGS_print_sub_graph_dir is used " + "to print the nodes of sub_graphs."); namespace paddle { namespace framework { @@ -33,7 +43,7 @@ void SortHelper( } } - VLOG(3) << "topology sort insert: " << node->Name() + VLOG(5) << "topology sort insert: " << node->Name() << " " << reinterpret_cast(node) << " input " << node->inputs.size(); ret->push_back(node); } @@ -42,16 +52,29 @@ bool HasCircleHelper( ir::Node *node, const std::map> &adj_list, std::unordered_set *visited, - std::unordered_set *in_trace) { + std::unordered_set *in_trace, + std::vector> *circles) { if (visited->find(node) == visited->end()) { visited->insert(node); in_trace->insert(node); for (ir::Node *in : adj_list.at(node)) { if (visited->find(in) == visited->end() && - HasCircleHelper(in, adj_list, visited, in_trace)) { + HasCircleHelper(in, adj_list, visited, in_trace, circles)) { return true; } else if (in_trace->find(in) != in_trace->end()) { + if (circles != nullptr) { + std::vector circle; + circle.emplace_back(in); + ir::Node *p = in; + for (auto &adj : adj_list.at(p)) { + if (in_trace->count(adj)) { + circle.emplace_back(adj); + p = adj; + } + } + circles->emplace_back(circle); + } return true; } } @@ -61,11 +84,12 @@ bool HasCircleHelper( } bool HasCircleInternal( - const std::map> &adj_list) { + const std::map> &adj_list, + std::vector> *circles) { std::unordered_set visited; std::unordered_set in_trace; for (auto &adj : adj_list) { - if (HasCircleHelper(adj.first, adj_list, &visited, &in_trace)) { + if (HasCircleHelper(adj.first, adj_list, &visited, &in_trace, circles)) { return true; } } @@ -74,13 +98,18 @@ bool HasCircleInternal( } // namespace bool HasCircle(const Graph &graph) { - return HasCircleInternal(BuildOperationAdjList(graph)); + return HasCircleInternal(BuildOperationAdjList(graph), nullptr); +} + +bool FindCircleSubGraph(const Graph &graph, + std::vector> *circles) { + return HasCircleInternal(BuildOperationAdjList(graph), circles); } std::vector TopologySortOperations(const Graph &graph) { std::map> adj_list = BuildOperationAdjList(graph); - PADDLE_ENFORCE(!HasCircleInternal(adj_list)); + PADDLE_ENFORCE(!HasCircleInternal(adj_list, nullptr)); std::unordered_set visited; std::vector ret; for (auto adj : adj_list) { @@ -91,21 +120,51 @@ std::vector TopologySortOperations(const Graph &graph) { return ret; } +// Build operator inlink edge table. std::map> BuildOperationAdjList( const Graph &graph) { std::map> adj_list; for (auto &n : graph.Nodes()) { - if (n->NodeType() != ir::Node::Type::kOperation) continue; + if (!n->IsOp()) continue; if (adj_list.find(n) == adj_list.end()) { adj_list[n] = std::unordered_set(); } + std::vector nodes; for (auto &var : n->inputs) { for (auto &adj_n : var->inputs) { PADDLE_ENFORCE(adj_n->NodeType() == ir::Node::Type::kOperation); VLOG(4) << "adj " << adj_n->Name() << reinterpret_cast(adj_n) << " -> " << n->Name() << reinterpret_cast(n) << " via " << var->Name() << reinterpret_cast(var); + nodes.push_back(adj_n); + } + } + std::sort(nodes.begin(), nodes.end(), [](ir::Node *node1, ir::Node *node2) { + return node1->id() > node2->id(); + }); + adj_list[n].insert(std::make_move_iterator(nodes.begin()), + std::make_move_iterator(nodes.end())); + } + return adj_list; +} + +// Build operator outlink edge table. +std::map> BuildOperationOutAdjList( + const Graph &graph) { + std::map> adj_list; + + for (auto &n : graph.Nodes()) { + if (!n->IsOp()) continue; + if (adj_list.find(n) == adj_list.end()) { + adj_list[n] = std::unordered_set(); + } + for (auto &var : n->outputs) { + for (auto &adj_n : var->outputs) { + PADDLE_ENFORCE(adj_n->NodeType() == ir::Node::Type::kOperation); + VLOG(40) << "adj " << adj_n->Name() << reinterpret_cast(adj_n) + << " -> " << n->Name() << reinterpret_cast(n) + << " via " << var->Name() << reinterpret_cast(var); adj_list[n].insert(adj_n); } } @@ -113,26 +172,122 @@ std::map> BuildOperationAdjList( return adj_list; } +std::vector OpDFSSort(const Graph &graph) { + auto edge_table = BuildOperationOutAdjList(graph); + std::stack stack; + for (auto &ele : edge_table) { + if (ele.first->inputs.empty()) { + // find the input ops (those without input vars) + stack.push(ele.first); + } else { + // find the ops with only persistable vars as inputs. + bool all_persistable = true; + for (auto *input : ele.first->inputs) { + if (!(input->IsVar() && input->Var() && input->Var()->Persistable())) { + all_persistable = false; + } + } + if (all_persistable) { + stack.push(ele.first); + } + } + } + + std::vector res; + // start from the feed op and DFS + std::unordered_set unique_set; + while (!stack.empty()) { + // will start from the last feed by default. + auto cur = stack.top(); + stack.pop(); + unique_set.insert(cur); + res.push_back(cur); + + for (auto *op : edge_table[cur]) { + if (!unique_set.count(op)) { + stack.push(op); + } + } + } + return res; +} + +std::vector TopologyDfsSortOperations(const Graph &graph) { + std::vector nodes; + std::unordered_map in_degree; + + auto set_out_ops_ready = [&](Node *var) { + for (auto *op : var->outputs) { + --in_degree[op]; + } + }; + // build in_degree + for (auto *node : graph.Nodes()) { + if (node->IsOp()) { + in_degree[node] += node->inputs.size(); + } else if (node->IsVar() && node->inputs.empty()) { + // put all the inputs of the whole graph ready. + set_out_ops_ready(node); + } + } + + std::deque op_queue; + // first visit + for (auto &node : OpDFSSort(graph)) { + if (node->IsOp()) { + op_queue.push_back(node); + } + } + + // traverse the graph + int num_ops = op_queue.size(); + while (num_ops) { + for (auto it = op_queue.begin(); it != op_queue.end(); it++) { + auto *&cur_op = *it; + if (!cur_op || in_degree[cur_op] > 0) continue; + // visit this node + // put all the output var of this op valid. + for (auto *out_var : cur_op->outputs) { + if (!out_var) continue; + set_out_ops_ready(out_var); + } + VLOG(8) << "visit " << cur_op->Name(); + nodes.push_back(cur_op); + + cur_op = nullptr; + num_ops--; + } + } + + return nodes; +} + size_t GraphNum(const Graph &graph) { - std::unordered_set nodes = graph.Nodes(); + std::unordered_set nodes(graph.Nodes()); std::unordered_set visited_nodes; visited_nodes.reserve(nodes.size()); std::deque q_nodes; std::vector> graph_nodes; std::unordered_set g_nodes; + // q_set used to record records in the queue. + std::unordered_set q_set; size_t graph_count = 0; - auto traverse_nodes = [&visited_nodes, - &q_nodes](const std::vector &nodes) { - std::copy_if( - nodes.begin(), nodes.end(), std::back_inserter(q_nodes), - [&visited_nodes](Node *node) { return !visited_nodes.count(node); }); + auto traverse_nodes = [&visited_nodes, &q_nodes, + &q_set](const std::vector &nodes) { + for (auto n : nodes) { + if (visited_nodes.count(n) == 0 && q_set.count(n) == 0) { + q_nodes.push_back(n); + q_set.insert(n); + } + } }; while (visited_nodes.size() != nodes.size()) { if (!q_nodes.empty()) { auto cur_node = q_nodes.front(); q_nodes.pop_front(); + q_set.erase(cur_node); visited_nodes.insert(cur_node); g_nodes.insert(cur_node); traverse_nodes(cur_node->inputs); @@ -146,6 +301,7 @@ size_t GraphNum(const Graph &graph) { for (auto &n : nodes) { if (visited_nodes.count(n) == 0) { q_nodes.push_back(n); + q_set.insert(n); break; } } @@ -156,12 +312,15 @@ size_t GraphNum(const Graph &graph) { graph_nodes.emplace_back(g_nodes); } - if (VLOG_IS_ON(10)) { - VLOG(10) << "graph_num: " << graph_nodes.size(); - for (auto &g_n : graph_nodes) { - VLOG(10) << "graph_nodes: " << g_n.size(); - if (g_n.size() < 10) { - std::stringstream out; + if (FLAGS_print_sub_graph_dir.size()) { + if (graph_nodes.size() > 1) { + std::stringstream out; + for (auto &g_n : graph_nodes) { + out << "graph_nodes: " << g_n.size() << "\n"; + } + out << "\n\n"; + for (auto &g_n : graph_nodes) { + out << "graph_nodes: " << g_n.size(); for (auto &node : g_n) { out << "\nNode: " << node->Name() << " in ["; for (auto &n : node->inputs) { @@ -173,14 +332,41 @@ size_t GraphNum(const Graph &graph) { } out << "]"; } - VLOG(10) << out.str(); + out << "\n\n\n"; } + std::unique_ptr fout( + new std::ofstream(FLAGS_print_sub_graph_dir)); + PADDLE_ENFORCE(fout->good()); + *fout << out.str(); } } return graph_count; } +void CleanIndividualNodes(Graph *graph) { + std::unordered_set nodes2rm; + for (auto *node : graph->Nodes()) { + if (node->inputs.empty() && node->outputs.empty()) { + nodes2rm.insert(node); + } + } + + for (auto *node : nodes2rm) { + graph->RemoveNode(node); + } +} + +std::vector TopologyVarientSort(const Graph &graph, + SortKind sort_kind) { + switch (sort_kind) { + case SortKind::TS: + return framework::ir::TopologySortOperations(graph); + default: + return framework::ir::TopologyDfsSortOperations(graph); + } +} + } // namespace ir } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/ir/graph_helper.h b/paddle/fluid/framework/ir/graph_helper.h index ec46b38c01b8c369ab37b4fbd5497ec120d8db91..214de9ec7d85aee6021b18866295777e317aa79d 100644 --- a/paddle/fluid/framework/ir/graph_helper.h +++ b/paddle/fluid/framework/ir/graph_helper.h @@ -24,19 +24,51 @@ limitations under the License. */ namespace paddle { namespace framework { namespace ir { + // Test if the graph contains circle. bool HasCircle(const Graph &graph); +// Find All Circles for debugging, +// store all subgraph in circles. +bool FindCircleSubGraph(const Graph &graph, + std::vector> *circles); + size_t GraphNum(const Graph &graph); // Topology Sort the operations in the graph from inputs to outputs. // `graph` cannot contain circle. std::vector TopologySortOperations(const Graph &graph); +// Topological sort, but try to DFS. +std::vector TopologyDfsSortOperations(const Graph &graph); + +// Different kinds to sort the operators in a graph to a sequence. +enum class SortKind { + // Topological Search + TS = 0, + // Topological and Depth First Search + TDFS +}; + +// Several kinds of topological sort. +std::vector TopologyVarientSort(const Graph &graph, SortKind sort_kind); + +// Clean the nodes that doesn't connect to others. +void CleanIndividualNodes(Graph *graph); + // Build an adjacency list of operations for the `graph`. std::map> BuildOperationAdjList( const Graph &graph); +template +std::vector FilterByNodeWrapper(const Graph &graph) { + std::vector ret; + for (ir::Node *n : graph.Nodes()) { + if (n->IsWrappedBy()) ret.push_back(&n->Wrapper()); + } + return ret; +} + } // namespace ir } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/ir/graph_helper_test.cc b/paddle/fluid/framework/ir/graph_helper_test.cc index cea902809339f9d45b0e2525163f08a3c1c44c95..d8973d5aeda1a2e0650a506b4c916b4346f01e2d 100644 --- a/paddle/fluid/framework/ir/graph_helper_test.cc +++ b/paddle/fluid/framework/ir/graph_helper_test.cc @@ -195,20 +195,31 @@ void BuildTwoGraphs(Graph* g) { // v4->outputs.push_back(o5); } +TEST(GraphHelperTest, Circles) { + ProgramDesc prog; + + Graph g(prog); + BuildCircleGraph(&g); + + std::vector> circles; + ASSERT_TRUE(FindCircleSubGraph(g, &circles)); + ASSERT_EQ(circles.size(), 1UL); +} + TEST(GraphHelperTest, GraphNum) { ProgramDesc prog; Graph g(prog); BuildZeroGraph(&g); - ASSERT_EQ(GraphNum(g), 0); + ASSERT_EQ(GraphNum(g), 0UL); Graph g2(prog); BuildOneGraph(&g2); - ASSERT_EQ(GraphNum(g2), 1); + ASSERT_EQ(GraphNum(g2), 1UL); Graph g3(prog); BuildTwoGraphs(&g3); - ASSERT_EQ(GraphNum(g3), 2); + ASSERT_EQ(GraphNum(g3), 2UL); } } // namespace ir diff --git a/paddle/fluid/framework/ir/graph_pattern_detector.cc b/paddle/fluid/framework/ir/graph_pattern_detector.cc index 4664953c63ca0c4b86691995899e73eab2399740..8468f9ccc12a017ebe4fe73581e7bbce00dd626d 100644 --- a/paddle/fluid/framework/ir/graph_pattern_detector.cc +++ b/paddle/fluid/framework/ir/graph_pattern_detector.cc @@ -12,6 +12,7 @@ // See the License for the specific language governing permissions and // limitations under the License. +#include #include #include #include @@ -24,6 +25,7 @@ #include "paddle/fluid/platform/enforce.h" #include "paddle/fluid/string/pretty_log.h" #include "paddle/fluid/string/printf.h" + namespace paddle { namespace framework { namespace ir { @@ -36,7 +38,7 @@ size_t PDPattern::id_ = 0UL; PDNode *PDPattern::NewNode(const std::string &name) { if (!name.empty()) { - PADDLE_ENFORCE_EQ(node_map_.count(name), 0, + PADDLE_ENFORCE_EQ(node_map_.count(name), 0UL, "PDNode's name should be unique, get duplicate [%s]", name); } @@ -49,7 +51,7 @@ PDNode *PDPattern::NewNode(const std::string &name) { PDNode *PDPattern::NewNode(PDNode::teller_t &&teller, const std::string &name) { if (!name.empty()) { - PADDLE_ENFORCE_EQ(node_map_.count(name), 0, + PADDLE_ENFORCE_EQ(node_map_.count(name), 0UL, "PDNode's name should be unique, get duplicate [%s]", name); } @@ -88,7 +90,8 @@ void GraphPatternDetector::operator()(Graph *graph, ValidateByNodeRole(&subgraphs); if (subgraphs.empty()) return; - PrettyLogEndl(Style::detail(), "--- detect %d subgraphs", subgraphs.size()); + PrettyLogEndl(Style::detail(), "--- detected %d subgraphs", + subgraphs.size()); int id = 0; for (auto &g : subgraphs) { VLOG(3) << "optimizing #" << id++ << " subgraph"; @@ -103,7 +106,7 @@ bool GraphPatternDetector::MarkPDNodesInGraph(const ir::Graph &graph) { for (auto &node : GraphTraits::DFS(graph)) { for (const auto &pdnode : pattern_.nodes()) { if (pdnode->Tell(&node)) { - VLOG(4) << "pdnode " << pdnode->name() << " marked"; + VLOG(4) << "Node " << node.Name() << " marked as " << pdnode->name(); pdnodes2nodes_[pdnode.get()].insert(&node); } } @@ -115,11 +118,6 @@ bool GraphPatternDetector::MarkPDNodesInGraph(const ir::Graph &graph) { // return false; } } - for (auto &item : pdnodes2nodes_) { - for (auto &n : item.second) { - GetMarkedNodes(const_cast(&graph)).insert(n); - } - } VLOG(3) << pdnodes2nodes_.size() << " nodes marked"; return !pdnodes2nodes_.empty(); @@ -166,10 +164,12 @@ struct HitGroup { bool Match(Node *node, PDNode *pat) { if (nodes_.count(node)) { - if (!roles.count(pat)) return false; - return roles[pat] == node; + if (roles.count(pat) && roles[pat] == node) return true; + return false; + } else { + if (roles.count(pat) && roles[pat] != node) return false; + return true; } - return !roles.count(pat) || roles.at(pat) == node; } void Register(Node *node, PDNode *pat) { @@ -197,7 +197,6 @@ GraphPatternDetector::DetectPatterns() { std::vector result; std::vector init_groups; std::array, 2> bi_records; - // PADDLE_ENFORCE(!pattern_.edges().empty(), "At least one edge is needed"); auto *first_pnode = pattern_.edges().empty() ? pattern().nodes().front().get() : pattern_.edges().front().first; if (!pdnodes2nodes_.count(first_pnode)) return result; @@ -227,11 +226,12 @@ GraphPatternDetector::DetectPatterns() { VLOG(8) << "check " << source->id() << " -- " << target->id(); // TODO(Superjomn) add some prune strategies. for (const auto &group : pre_groups) { - HitGroup new_group = group; - if (IsNodesLink(source, target) && - new_group.Match(source, edge.first)) { - new_group.Register(source, edge.first); - if (new_group.Match(target, edge.second)) { + if (IsNodesLink(source, target)) { + HitGroup new_group = group; + bool flag = new_group.Match(source, edge.first) && + new_group.Match(target, edge.second); + if (flag) { + new_group.Register(source, edge.first); new_group.Register(target, edge.second); cur_groups.push_back(new_group); // TODO(Superjomn) need to unique @@ -259,6 +259,17 @@ GraphPatternDetector::DetectPatterns() { return result; } +struct GraphItemLessThan { + bool operator()(const std::pair &a, + const std::pair &b) { + if (a.first != b.first) { + return a.first < b.first; + } else { + return a.second < b.second; + } + } +}; + // TODO(Superjomn) enhance the function as it marks unique unique as duplicates // see https://github.com/PaddlePaddle/Paddle/issues/13550 void GraphPatternDetector::UniquePatterns( @@ -267,12 +278,16 @@ void GraphPatternDetector::UniquePatterns( std::vector result; std::unordered_set set; + std::hash hasher; for (auto &g : *subgraphs) { - size_t key = 0; - for (auto &item : g) { - key ^= std::hash{}(item.first); - key ^= std::hash{}(item.second); + // Sort the items in the sub-graph, and transform to a string key. + std::vector> sorted_keys(g.begin(), g.end()); + std::sort(sorted_keys.begin(), sorted_keys.end(), GraphItemLessThan()); + std::stringstream ss; + for (auto &item : sorted_keys) { + ss << item.first << ":" << item.second; } + auto key = hasher(ss.str()); if (!set.count(key)) { result.emplace_back(g); set.insert(key); @@ -584,10 +599,19 @@ bool VarLinksToOp(Node *node, const std::string &op_type) { bool IsNthInput(Node *var, Node *op, const std::string &argument, size_t nth) { PADDLE_ENFORCE(var->IsVar()); PADDLE_ENFORCE(op->IsOp()); - if (op->Op()->Input(argument).size() <= nth) return false; + if (!HasInput(op, argument) || op->Op()->Input(argument).size() <= nth) + return false; return var->Name() == op->Op()->Input(argument)[nth]; } +bool HasInput(Node *op, const std::string &argument) { + PADDLE_ENFORCE(op->IsOp()); + auto const &names = op->Op()->InputNames(); + if (std::find(names.begin(), names.end(), argument) == names.end()) + return false; + return true; +} + bool IsNthOutput(Node *var, Node *op, const std::string &argument, size_t nth) { PADDLE_ENFORCE(var->IsVar()); PADDLE_ENFORCE(op->IsOp()); @@ -761,6 +785,51 @@ PDNode *patterns::ConvReLU::operator()( return relu_out_var; } +PDNode *patterns::SeqConvEltAddRelu::operator()( + paddle::framework::ir::PDNode *seqconv_input) { + // Create Operators + seqconv_input->assert_is_op_input("sequence_conv", "X"); + auto *seqconv_op = pattern->NewNode(seqconv_repr()) + ->assert_is_op("sequence_conv") + ->assert_op_attr("paddingTrainable", false) + ->assert_op_attr("contextStride", 1); + + auto *eltadd_op = + pattern->NewNode(eltadd_repr())->assert_is_op("elementwise_add"); + auto *relu_op = pattern->NewNode(relu_repr())->assert_is_op("relu"); + // Create variables + // Filter + auto *seqconv_weight_var = + pattern->NewNode(seqconv_weight_repr()) + ->AsInput() + ->assert_is_persistable_var() + ->assert_is_op_input("sequence_conv", "Filter"); + // Bias + auto *eltadd_bias_var = pattern->NewNode(eltadd_bias_repr()) + ->AsInput() + ->assert_is_op_input("elementwise_add"); + // intermediate variable, will be removed in the IR after fuse. + auto *seqconv_out_var = pattern->NewNode(seqconv_out_repr()) + ->AsIntermediate() + ->assert_is_only_output_of_op("sequence_conv") + ->assert_is_op_input("elementwise_add"); + auto *eltadd_out_var = pattern->NewNode(eltadd_out_repr()) + ->AsIntermediate() + ->assert_is_only_output_of_op("elementwise_add") + ->assert_is_only_input_of_op("relu"); + // output + auto *relu_out_var = pattern->NewNode(relu_out_repr()) + ->AsOutput() + ->assert_is_op_output("relu"); + + seqconv_op->LinksFrom({seqconv_input, seqconv_weight_var}) + .LinksTo({seqconv_out_var}); + eltadd_op->LinksFrom({seqconv_out_var, eltadd_bias_var}) + .LinksTo({eltadd_out_var}); + relu_op->LinksFrom({eltadd_out_var}).LinksTo({relu_out_var}); + return relu_out_var; +} + PDNode *patterns::FC::operator()(paddle::framework::ir::PDNode *x, bool with_bias) { // Create shared nodes. @@ -966,6 +1035,678 @@ PDNode *patterns::ElewiseAddActInplaceGrad::operator()( return ele_add_grad; } +PDNode *patterns::ConvBias::operator()( + paddle::framework::ir::PDNode *conv_input, bool is_conv3d) { + std::string type = is_conv3d ? "conv3d" : "conv2d"; + // Create Operators + conv_input->assert_is_op_input(type, "Input"); + auto *conv_op = pattern->NewNode(conv_repr())->assert_is_op(type); + auto *eltiwse_op = + pattern->NewNode(eltwise_repr())->assert_is_op("elementwise_add"); + // Create variables + // Filter + auto *conv_weight_var = pattern->NewNode(conv_weight_repr()) + ->AsInput() + ->assert_is_persistable_var() + ->assert_is_op_input(type, "Filter"); + // intermediate variable, will be removed in the IR after fuse. + auto *conv_out_var = pattern->NewNode(conv_out_repr()) + ->AsIntermediate() + ->assert_is_only_output_of_op(type) + ->assert_is_op_input("elementwise_add"); + // Bias stored in elementwise_add + auto *eltwise_bias_var = pattern->NewNode(eltwise_bias_repr()) + ->AsInput() + ->assert_is_persistable_var() + ->assert_is_op_input("elementwise_add", "Y"); + // output + auto *eltwise_out_var = pattern->NewNode(eltwise_out_repr()) + ->AsOutput() + ->assert_is_op_output("elementwise_add"); + conv_op->LinksFrom({conv_input, conv_weight_var}).LinksTo({conv_out_var}); + eltiwse_op->LinksFrom({conv_out_var, eltwise_bias_var}) + .LinksTo({eltwise_out_var}); + return eltwise_out_var; +} + +PDNode *patterns::Conv::operator()() { + auto conv_op = pattern->NewNode(conv_op_repr())->assert_is_op("conv2d"); + + auto input_var = pattern->NewNode(conv_input_repr()) + ->AsInput() + ->assert_is_op_input("conv2d", "Input"); + + auto filter_var = pattern->NewNode(conv_filter_repr()) + ->AsInput() + ->assert_is_op_input("conv2d", "Filter"); + + auto output_var = pattern->NewNode(conv_output_repr()) + ->AsOutput() + ->assert_is_op_output("conv2d", "Output"); + + conv_op->LinksFrom({input_var, filter_var}).LinksTo({output_var}); + return output_var; +} + +PDNode *patterns::ConvResidual::operator()(bool with_residual_data) { + auto conv_op = pattern->NewNode(conv_op_repr())->assert_is_op("conv2d"); + + if (!with_residual_data) { + conv_op->assert_more([&](Node *x) { + auto node_names = x->Op()->InputNames(); + if (!HasInput(x, "ResidualData") || + x->Op()->Input("ResidualData").size() == 0) + return true; + return false; + }); + } + + auto input_var = pattern->NewNode(conv_input_repr()) + ->AsInput() + ->assert_is_op_input("conv2d", "Input"); + + auto filter_var = pattern->NewNode(conv_filter_repr()) + ->AsInput() + ->assert_is_op_input("conv2d", "Filter"); + + auto output_var = pattern->NewNode(conv_output_repr()) + ->AsOutput() + ->assert_is_op_output("conv2d", "Output"); + + std::vector links_from{input_var, filter_var}; + + if (with_residual_data) { + auto res_conn_var = pattern->NewNode(conv_residual_data_repr()) + ->AsInput() + ->assert_is_op_input("conv2d", "ResidualData"); + links_from.push_back(res_conn_var); + } + + conv_op->LinksFrom(links_from).LinksTo({output_var}); + return output_var; +} + +PDNode *patterns::Pool::operator()() { + auto pool_op = pattern->NewNode(pool_op_repr())->assert_is_op("pool2d"); + + auto input_var = pattern->NewNode(pool_input_repr()) + ->AsInput() + ->assert_is_op_input("pool2d", "X"); + + auto output_var = pattern->NewNode(pool_output_repr()) + ->AsOutput() + ->assert_is_op_output("pool2d", "Out"); + + pool_op->LinksFrom({input_var}).LinksTo({output_var}); + return output_var; +} + +PDNode *patterns::ElementwiseAdd::operator()(PDNode *x_var, PDNode *y_var) { + auto elementwise_add_op = pattern->NewNode(elementwise_add_op_repr()) + ->assert_is_op("elementwise_add"); + + x_var->AsInput()->assert_is_op_input("elementwise_add", "X"); + y_var->AsInput()->assert_is_op_input("elementwise_add", "Y"); + auto out_var = pattern->NewNode(elementwise_add_out_repr()) + ->AsOutput() + ->assert_is_op_output("elementwise_add", "Out"); + + elementwise_add_op->LinksFrom({x_var, y_var}); + elementwise_add_op->LinksTo({out_var}); + + return out_var; +} + +std::unordered_set conv_act_set({"identity", "relu"}); + +PDNode *patterns::ConvElementwiseaddAct::operator()(PDNode *conv_in) { + conv_in->AsInput(); + auto conv_op = pattern->NewNode(conv_op_repr())->assert_is_op("conv2d"); + auto conv_out = pattern->NewNode(conv_out_repr()) + ->assert_is_op_output("conv2d") + ->assert_is_op_input("elementwise_add", "X") + ->AsIntermediate(); + auto conv_filter = pattern->NewNode(conv_filter_repr()) + ->assert_is_op_input("conv2d", "Filter") + ->AsInput(); + auto elementwise_add_op = pattern->NewNode(elementwise_add_op_repr()) + ->assert_is_op("elementwise_add"); + auto elementwise_add_in_y = pattern->NewNode(elementwise_add_in_y_repr()) + ->assert_is_op_input("elementwise_add", "Y") + ->AsInput(); + auto elementwise_add_out = pattern->NewNode(elementwise_add_out_repr()) + ->assert_is_op_output("elementwise_add") + ->AsIntermediate(); + + auto act_op = pattern->NewNode(act_op_repr()) + ->assert_is_op() + ->assert_more([&](Node *node) { + auto op_type = node->Name(); + return conv_act_set.count(op_type); + }); + + auto act_out = pattern->NewNode(act_out_repr()) + ->assert_is_var() + // is activation op's output. + ->assert_more([&](Node *node) { + for (auto *in_op : node->inputs) { + if (conv_act_set.count(in_op->Name())) { + return true; + } + } + return false; + }) + ->AsOutput(); + + conv_op->LinksFrom({conv_in, conv_filter}); + conv_out->LinksFrom({conv_op}); + elementwise_add_op->LinksFrom({conv_out, elementwise_add_in_y}) + .LinksTo({elementwise_add_out}); + act_op->LinksFrom({elementwise_add_out}).LinksTo({act_out}); + + return act_out; +} + +PDNode *patterns::ConvElementwiseadd2Act::operator()(PDNode *conv_in) { + auto conv_op = pattern->NewNode(conv_op_repr())->assert_is_op("conv2d"); + auto conv_filter = pattern->NewNode(conv_filter_repr()) + ->assert_is_op_input("conv2d", "Filter") + ->AsInput(); + auto conv_out = pattern->NewNode(conv_out_repr()) + ->assert_is_op_output("conv2d") + ->assert_is_op_input("elementwise_add", "X") + ->AsIntermediate(); + auto elementwise_add_op = pattern->NewNode(elementwise_add_op_repr()) + ->assert_is_op("elementwise_add"); + auto elementwise_add_in_y = pattern->NewNode(elementwise_add_in_y_repr()) + ->assert_is_op_input("elementwise_add", "Y") + ->AsInput(); + auto elementwise_add_out = pattern->NewNode(elementwise_add_out_repr()) + ->assert_is_op_output("elementwise_add") + ->assert_is_op_input("elementwise_add", "Y") + ->AsIntermediate(); + + auto elementwise_add_op_1 = pattern->NewNode(elementwise_add_op_1_repr()) + ->assert_is_op("elementwise_add"); + auto elementwise_add_in_y_1 = pattern->NewNode(elementwise_add_in_y_1_repr()) + ->assert_is_op_input("elementwise_add", "X") + ->AsInput(); + auto elementwise_add_out_1 = pattern->NewNode(elementwise_add_out_1_repr()) + ->assert_is_op_output("elementwise_add") + ->AsIntermediate(); + + auto act_op = pattern->NewNode(act_op_repr()) + ->assert_is_op() + ->assert_more([&](Node *node) { + auto op_type = node->Name(); + return conv_act_set.count(op_type); + }); + auto act_out = pattern->NewNode(act_out_repr()) + ->assert_is_var() + // is activation op's output. + ->assert_more([&](Node *node) { + for (auto *in_op : node->inputs) { + if (conv_act_set.count(in_op->Name())) { + return true; + } + } + return false; + }) + ->AsOutput(); + + conv_op->LinksFrom({conv_in, conv_filter}).LinksTo({conv_out}); + elementwise_add_op->LinksFrom({conv_out, elementwise_add_in_y}) + .LinksTo({elementwise_add_out}); + elementwise_add_op_1->LinksFrom({elementwise_add_out, elementwise_add_in_y_1}) + .LinksTo({elementwise_add_out_1}); + act_op->LinksFrom({elementwise_add_out_1}).LinksTo({act_out}); + return act_out; +} + +PDNode *patterns::ConvElementwiseadd::operator()(PDNode *conv_in) { + conv_in->AsInput(); + auto conv_op = pattern->NewNode(conv_op_repr())->assert_is_op("conv2d"); + auto conv_out = pattern->NewNode(conv_out_repr()) + ->assert_is_op_output("conv2d") + ->assert_is_op_input("elementwise_add", "X") + ->AsIntermediate(); + auto conv_filter = pattern->NewNode(conv_filter_repr()) + ->assert_is_op_input("conv2d", "Filter") + ->AsInput(); + auto elementwise_add_op = pattern->NewNode(elementwise_add_op_repr()) + ->assert_is_op("elementwise_add"); + auto elementwise_add_in_y = pattern->NewNode(elementwise_add_in_y_repr()) + ->assert_is_op_input("elementwise_add", "Y") + ->AsInput(); + auto elementwise_add_out = pattern->NewNode(elementwise_add_out_repr()) + ->assert_is_op_output("elementwise_add") + ->AsOutput(); + + conv_op->LinksFrom({conv_in, conv_filter}); + conv_out->LinksFrom({conv_op}); + elementwise_add_op->LinksFrom({conv_out, elementwise_add_in_y}) + .LinksTo({elementwise_add_out}); + + return elementwise_add_out; +} + +PDNode *patterns::ConvAffineChannel::operator()( + paddle::framework::ir::PDNode *conv_input, bool with_eltwise_add) { + // Create Operators + conv_input->assert_is_op_input("conv2d", "Input"); + auto *conv_op = pattern->NewNode(conv_repr())->assert_is_op("conv2d"); + + PDNode *eltwise_op = nullptr; + if (with_eltwise_add) { + eltwise_op = + pattern->NewNode(eltwise_repr())->assert_is_op("elementwise_add"); + } + + auto *affine_channel_op = + pattern->NewNode(affine_channel_repr())->assert_is_op("affine_channel"); + // Create variables + // Conv Filter + auto *conv_weight_var = pattern->NewNode(conv_weight_repr()) + ->AsInput() + ->assert_is_persistable_var() + ->assert_is_op_input("conv2d", "Filter"); + + auto *conv_out_var = pattern->NewNode(conv_out_repr()) + ->AsIntermediate() + ->assert_is_only_output_of_op("conv2d"); + + PDNode *eltwise_y_in_var = nullptr; + PDNode *eltwise_out_var = nullptr; + if (with_eltwise_add) { + // Conv output as Bias input + conv_out_var->assert_is_op_input("elementwise_add", "X"); + // Bias + eltwise_y_in_var = pattern->NewNode(eltwise_y_in_repr()) + ->assert_is_op_input("elementwise_add", "Y") + ->AsInput(); + eltwise_out_var = pattern->NewNode(eltwise_out_repr()) + ->AsIntermediate() + ->assert_is_only_output_of_op("elementwise_add"); + } else { + // Conv output as AffineChannel input + conv_out_var->assert_is_op_input("affine_channel", "X"); + } + + // AC Scale + auto *ac_scale_var = pattern->NewNode(ac_scale_repr()) + ->AsInput() + ->assert_is_persistable_var() + ->assert_is_op_input("affine_channel", "Scale"); + // AC Bias + auto *ac_bias_var = pattern->NewNode(ac_bias_repr()) + ->AsInput() + ->assert_is_persistable_var() + ->assert_is_op_input("affine_channel", "Bias"); + + // AC output + auto *ac_out_var = pattern->NewNode(ac_out_repr()) + ->AsOutput() + ->assert_is_op_output("affine_channel"); + + conv_op->LinksFrom({conv_input, conv_weight_var}).LinksTo({conv_out_var}); + + if (with_eltwise_add) { + eltwise_op->LinksFrom({conv_out_var, eltwise_y_in_var}) + .LinksTo({eltwise_out_var}); + affine_channel_op->LinksFrom({eltwise_out_var, ac_scale_var, ac_bias_var}) + .LinksTo({ac_out_var}); + } else { + affine_channel_op->LinksFrom({conv_out_var, ac_scale_var, ac_bias_var}) + .LinksTo({ac_out_var}); + } + return ac_out_var; +} + +PDNode *patterns::DequantQuantAny::operator()() { + auto *dequant_in = pattern->NewNode(dequant_in_repr()) + ->AsInput() + ->assert_is_op_input("dequantize", "Input"); + + auto *dequant_op = + pattern->NewNode(dequant_op_repr())->assert_is_op("dequantize"); + + auto *dequant_out = pattern->NewNode(dequant_out_repr()) + ->AsOutput() + ->assert_is_op_output("dequantize", "Output"); + + auto *quant_op = pattern->NewNode(quant_op_repr()) + ->assert_is_op("quantize") + ->AsIntermediate(); + + auto *quant_out = pattern->NewNode(quant_out_repr()) + ->AsOutput() + ->assert_is_op_output("quantize"); + + auto *next_op = pattern->NewNode(next_op_repr())->assert_is_op(); + + dequant_op->LinksFrom({dequant_in}).LinksTo({dequant_out}); + quant_op->LinksFrom({dequant_out}).LinksTo({quant_out}); + next_op->LinksFrom({quant_out}); + + return quant_out; +} + +PDNode *patterns::DequantAny::operator()() { + auto *dequant_op = + pattern->NewNode(dequant_op_repr())->assert_is_op("dequantize"); + + auto *dequant_out = pattern->NewNode(dequant_out_repr()) + ->AsOutput() + ->assert_is_op_output("dequantize", "Output"); + + auto *next_op = pattern->NewNode(next_op_repr())->assert_is_op(); + + dequant_op->LinksTo({dequant_out}); + next_op->LinksFrom({dequant_out}); + + return dequant_out; +} + +// a -> transpose_op(1) -> transpose_out_a -> flatten_op(1) -> flatten_out_a +// b -> transpose_op(2) -> transpose_out_b -> flatten_op(2) -> flatten_out_b +// ... +// z -> transpose_op(n) -> transpose_out_z -> flatten_op(n) -> flatten_out_z +// flatten_out_a -> concat_op flatten_out_b -> concat_op ... flatten_out_z -> +// concat_op +PDNode *patterns::TransposeFlattenConcat::operator()( + std::vector conv_in, int times) { + // The times represents the repeat times of the + // {trans, trans_out, flatten, flatten_out} + const int kNumFields = 4; + const int kTransOutOffset = 1; + const int kFlattenOffset = 2; + const int kFlattenOutOffset = 3; + + std::vector nodes; + + for (int i = 0; i < times; i++) { + nodes.push_back( + pattern->NewNode(GetNodeName("transpose" + std::to_string(i))) + ->assert_is_op("transpose2")); + nodes.push_back( + pattern->NewNode(GetNodeName("transpose_out" + std::to_string(i))) + ->assert_is_op_output("transpose2") + ->assert_is_op_input("flatten2", "X") + ->AsIntermediate()); + nodes.push_back(pattern->NewNode(GetNodeName("flatten" + std::to_string(i))) + ->assert_is_op("flatten2")); + + nodes.push_back( + pattern->NewNode(GetNodeName("flatten_out" + std::to_string(i))) + ->assert_is_op_output("flatten2") + ->assert_is_op_nth_input("concat", "X", i) + ->AsIntermediate()); + } + + auto concat_op = pattern->NewNode(GetNodeName("concat")) + ->assert_is_op("concat") + ->assert_op_has_n_inputs("concat", times); + auto concat_out = pattern->NewNode(GetNodeName("concat_out")) + ->assert_is_op_output("concat") + ->AsOutput(); + + std::vector flatten_outs; + for (int i = 0; i < times; i++) { + conv_in[i]->AsInput(); + // trans + nodes[i * kNumFields]->LinksFrom({conv_in[i]}); + // trans_out + nodes[i * kNumFields + kTransOutOffset]->LinksFrom({nodes[i * kNumFields]}); + // flatten + nodes[i * kNumFields + kFlattenOffset]->LinksFrom( + {nodes[i * kNumFields + kTransOutOffset]}); + // flatten_out + nodes[i * kNumFields + kFlattenOutOffset]->LinksFrom( + {nodes[i * kNumFields + kFlattenOffset]}); + flatten_outs.push_back(nodes[i * kNumFields + kFlattenOutOffset]); + } + + concat_op->LinksFrom(flatten_outs).LinksTo({concat_out}); + return concat_out; +} + +PDNode *patterns::AnakinDetectionPattern::operator()( + std::vector conv_in, int times, std::string priorbox_type, + bool is_reshape) { + // The times represents the repeat times of the + // {prior_box, prior_box_loc_out, flatten, prior_box_var_out, reshape} + const int kNumFields = 7; + const int kPriorBoxLocOffset = 1; + const int kReshape1Offset = 2; + const int kReshape1OutOffset = 3; + const int kPriorBoxVarOffset = 4; + const int kReshape2Offset = 5; + const int kReshape2OutOffset = 6; + + const int kBoxCoderThirdInputOffset = times; + const int kMultiClassSecondInputNmsOffset = times + 1; + + std::vector nodes; + std::string op_after_priorbox = is_reshape ? "reshape2" : "flatten2"; + + for (int i = 0; i < times; i++) { + nodes.push_back( + pattern->NewNode(GetNodeName("prior_box" + std::to_string(i))) + ->assert_is_op(priorbox_type)); + nodes.push_back(pattern->NewNode(GetNodeName("box_out" + std::to_string(i))) + ->assert_is_op_output(priorbox_type, "Boxes") + ->assert_is_op_input(op_after_priorbox, "X") + ->AsIntermediate()); + nodes.push_back( + pattern->NewNode(GetNodeName("reshape1" + std::to_string(i))) + ->assert_is_op(op_after_priorbox)); + + nodes.push_back( + pattern->NewNode(GetNodeName("reshape1_out" + std::to_string(i))) + ->assert_is_op_output(op_after_priorbox) + ->assert_is_op_nth_input("concat", "X", i) + ->AsIntermediate()); + + nodes.push_back( + pattern->NewNode(GetNodeName("box_var_out" + std::to_string(i))) + ->assert_is_op_output(priorbox_type, "Variances") + ->assert_is_op_input(op_after_priorbox, "X") + ->AsIntermediate()); + nodes.push_back( + pattern->NewNode(GetNodeName("reshape2" + std::to_string(i))) + ->assert_is_op(op_after_priorbox)); + + nodes.push_back( + pattern->NewNode(GetNodeName("reshape2_out" + std::to_string(i))) + ->assert_is_op_output(op_after_priorbox) + ->assert_is_op_nth_input("concat", "X", i) + ->AsIntermediate()); + } + + auto concat_op1 = pattern->NewNode(GetNodeName("concat1")) + ->assert_is_op("concat") + ->assert_op_has_n_inputs("concat", times); + auto concat_out1 = pattern->NewNode(GetNodeName("concat1_out")) + ->assert_is_op_output("concat") + ->AsIntermediate(); + + auto concat_op2 = pattern->NewNode(GetNodeName("concat2")) + ->assert_is_op("concat") + ->assert_op_has_n_inputs("concat", times); + auto concat_out2 = pattern->NewNode(GetNodeName("concat2_out")) + ->assert_is_op_output("concat") + ->AsIntermediate(); + + auto box_coder_op = pattern->NewNode(GetNodeName("box_coder")) + ->assert_is_op("box_coder") + ->assert_op_has_n_inputs("box_coder", 3); + + auto box_coder_out = pattern->NewNode(GetNodeName("box_coder_out")) + ->assert_is_op_output("box_coder") + ->AsIntermediate(); + + auto transpose_before_nms = + pattern->NewNode(GetNodeName("transpose_before_nms")) + ->assert_is_op("transpose2"); + + auto transpose_before_nms_out = + pattern->NewNode(GetNodeName("transpose_before_nms_out")) + ->assert_is_op_output("transpose2") + ->assert_is_op_input("multiclass_nms", "Scores") + ->AsIntermediate(); + + auto multiclass_nms_op = pattern->NewNode(GetNodeName("multiclass_nms")) + ->assert_is_op("multiclass_nms") + ->assert_op_has_n_inputs("multiclass_nms", 2); + + auto multiclass_nms_out = pattern->NewNode(GetNodeName("multiclass_nms_out")) + ->assert_is_op_output("multiclass_nms") + ->AsOutput(); + + std::vector reshape1_outs; + std::vector reshape2_outs; + + for (int i = 0; i < times; i++) { + conv_in[i]->AsInput(); + // prior_box + nodes[i * kNumFields]->LinksFrom({conv_in[i]}); + // prior_box box out + nodes[i * kNumFields + kPriorBoxLocOffset]->LinksFrom( + {nodes[i * kNumFields]}); + // reshape + nodes[i * kNumFields + kReshape1Offset]->LinksFrom( + {nodes[i * kNumFields + kPriorBoxLocOffset]}); + // reshape_out + nodes[i * kNumFields + kReshape1OutOffset]->LinksFrom( + {nodes[i * kNumFields + kReshape1Offset]}); + + nodes[i * kNumFields + kPriorBoxVarOffset]->LinksFrom( + {nodes[i * kNumFields]}); + // reshape + nodes[i * kNumFields + kReshape2Offset]->LinksFrom( + {nodes[i * kNumFields + kPriorBoxVarOffset]}); + // reshape_out + nodes[i * kNumFields + kReshape2OutOffset]->LinksFrom( + {nodes[i * kNumFields + kReshape2Offset]}); + + reshape1_outs.push_back(nodes[i * kNumFields + kReshape1OutOffset]); + reshape2_outs.push_back(nodes[i * kNumFields + kReshape2OutOffset]); + } + + concat_op1->LinksFrom(reshape1_outs); + concat_op2->LinksFrom(reshape2_outs); + concat_out1->LinksFrom({concat_op1}); + concat_out2->LinksFrom({concat_op2}); + + conv_in[kBoxCoderThirdInputOffset]->AsInput(); + conv_in[kMultiClassSecondInputNmsOffset]->AsInput(); + + box_coder_op->LinksFrom( + {concat_out1, concat_out2, conv_in[kBoxCoderThirdInputOffset]}); + box_coder_out->LinksFrom({box_coder_op}); + + transpose_before_nms->LinksFrom({conv_in[kMultiClassSecondInputNmsOffset]}); + transpose_before_nms_out->LinksFrom({transpose_before_nms}); + + multiclass_nms_op->LinksFrom({box_coder_out, transpose_before_nms_out}) + .LinksTo({multiclass_nms_out}); + + return multiclass_nms_out; +} + +PDNode *patterns::FillConstantElementWiseMulFuse::operator()( + PDNode *elementwise_op_input) { + auto fill_constant = + pattern->NewNode(fill_constant_repr())->assert_is_op("fill_constant"); + + auto fill_constant_out = pattern->NewNode(fill_constant_out_repr()) + ->assert_is_op_output("fill_constant") + ->assert_is_op_input("elementwise_mul", "Y") + ->AsIntermediate(); + + auto elementwise_mul_op = + pattern->NewNode(elementwise_mul_repr())->assert_is_op("elementwise_mul"); + + auto elementwise_mul_out = pattern->NewNode(elementwise_mul_out_repr()) + ->assert_is_op_output("elementwise_mul") + ->AsOutput(); + + fill_constant_out->LinksFrom({fill_constant}); + elementwise_mul_op->LinksFrom({elementwise_op_input, fill_constant_out}); + elementwise_mul_out->LinksFrom({elementwise_mul_op}); + return elementwise_mul_out; +} + +void patterns::QuantDequantOpFuse::operator()(PDNode *quant_op_input, + const std::string &op_type, + const std::string &weight_name, + int times) { + const int kNumFields = 5; + const int kQuantizedWeightOffset = 0; + const int kQuantizedOpOffset = 1; + const int kQuantizedOpOutOffset = 2; + const int kDequantOpOffset = 3; + const int kDequantOpOutOffset = 4; + // the quant op always be one. + auto quant_op_in_scale = + pattern->NewNode(GetNodeName("quant_op_in_scale")) + ->assert_is_op_input("fake_quantize_range_abs_max", "InScale") + ->AsInput(); + auto quant_op = pattern->NewNode(GetNodeName("quant_op")) + ->assert_is_op("fake_quantize_range_abs_max"); + + auto quant_op_out_scale = + pattern->NewNode(GetNodeName("quant_op_out_scale")) + ->assert_is_op_output("fake_quantize_range_abs_max", "OutScale") + ->assert_is_op_input("fake_dequantize_max_abs", "Scale") + ->AsIntermediate(); + + auto quant_op_out = + pattern->NewNode(GetNodeName("quant_op_out")) + ->assert_is_op_output("fake_quantize_range_abs_max", "Out") + ->assert_is_op_input(op_type) + ->AsIntermediate(); + + // there are 'times' quantized and dequant op + std::vector nodes; + for (int i = 0; i < times; i++) { + nodes.push_back( + pattern->NewNode(GetNodeName("quantized_op_weight") + std::to_string(i)) + ->assert_is_op_input(op_type, weight_name) + ->AsInput()); + nodes.push_back( + pattern->NewNode(GetNodeName("quantized_op") + std::to_string(i)) + ->assert_is_op(op_type)); + + nodes.push_back( + pattern->NewNode(GetNodeName("quantized_op_out") + std::to_string(i)) + ->assert_is_op_output(op_type) + ->assert_is_op_input("fake_dequantize_max_abs", "X") + ->AsIntermediate()); + + nodes.push_back( + pattern->NewNode(GetNodeName("dequant_op") + std::to_string(i)) + ->assert_is_op("fake_dequantize_max_abs")); + nodes.push_back( + pattern->NewNode(GetNodeName("dequant_op_out") + std::to_string(i)) + ->assert_is_op_output("fake_dequantize_max_abs", "Out") + ->AsOutput()); + } + + quant_op->LinksFrom({quant_op_input, quant_op_in_scale}); + quant_op_out->LinksFrom({quant_op}); + for (int i = 0; i < times; i++) { + nodes[i * kNumFields + kQuantizedOpOffset]->LinksFrom( + {quant_op_out, nodes[i * kNumFields + kQuantizedWeightOffset]}); + nodes[i * kNumFields + kQuantizedOpOutOffset]->LinksFrom( + {nodes[i * kNumFields + kQuantizedOpOffset]}); + nodes[i * kNumFields + kDequantOpOffset]->LinksFrom( + {nodes[i * kNumFields + kQuantizedOpOutOffset], quant_op_out_scale}); + nodes[i * kNumFields + kDequantOpOutOffset]->LinksFrom( + {nodes[i * kNumFields + kDequantOpOffset]}); + } +} + } // namespace ir } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/ir/graph_pattern_detector.h b/paddle/fluid/framework/ir/graph_pattern_detector.h index cdd6413d968b065453177ff78b0aad641a09f6e7..a5ac3a0c3733cf610159c6367d04f3323b797c50 100644 --- a/paddle/fluid/framework/ir/graph_pattern_detector.h +++ b/paddle/fluid/framework/ir/graph_pattern_detector.h @@ -18,8 +18,11 @@ #include #endif +#include #include #include +#include +#include #include #include #include "paddle/fluid/framework/ir/graph.h" @@ -128,6 +131,15 @@ struct PDNode { const std::unordered_set& op_types, const std::string& argument, int nth); + template + PDNode* assert_op_attr(const std::string& attr_name, const T& attr) { + asserts_.emplace_back([=](Node* x) { + return x && x->IsOp() && x->Op()->HasAttr(attr_name) && + boost::get(x->Op()->GetAttr(attr_name)) == attr; + }); + return this; + } + private: PDNode(PDPattern* pattern, const std::string& name = "", Type type = Type::kVar) @@ -293,6 +305,9 @@ bool VarLinksFromOp(Node* node, const std::string& op_type); // Check whether a var node is a op node's nth input. bool IsNthInput(Node* var, Node* op, const std::string& argument, size_t nth); +// Check whether the op node has input of given name. +bool HasInput(Node* op, const std::string& argument); + // Tell whether a var node is a op node's nth output. bool IsNthOutput(Node* var, Node* op, const std::string& argument, size_t nth); @@ -301,8 +316,8 @@ void GraphSafeRemoveNodes(Graph* graph, const std::unordered_set& nodes); // Some pre-defined patterns those can be reused in multiple passes. -// The related Fluid Layer or Op should be one pattern here for better reusage -// accross different fusion. +// The related Fluid Layer or Op should be one pattern here for better re-usage +// across different fusion. namespace patterns { struct KeyCounter { @@ -434,6 +449,31 @@ struct ConvReLU : public PatternBase { PATTERN_DECL_NODE(relu_out); }; +// SEQCONV with Elementwise_Add ReLU +// op: seqconv + elementwise_add + relu +// named nodes: +// seqconv_input, seqconv_weight, +// seqconv_out, seqconv, +// elementwise_add_bias, elementwise_add_out, elementwise_add +// relu_out, relu +struct SeqConvEltAddRelu : public PatternBase { + SeqConvEltAddRelu(PDPattern* pattern, const std::string& name_scope) + : PatternBase(pattern, name_scope, "seqconv_eltadd_relu") {} + + PDNode* operator()(PDNode* seqconv_input); + + // declare operator node's name + PATTERN_DECL_NODE(seqconv); + PATTERN_DECL_NODE(eltadd); + PATTERN_DECL_NODE(relu); + // declare variable node's name + PATTERN_DECL_NODE(seqconv_weight); + PATTERN_DECL_NODE(seqconv_out); + PATTERN_DECL_NODE(eltadd_bias); + PATTERN_DECL_NODE(eltadd_out); + PATTERN_DECL_NODE(relu_out); +}; + // FC with bias // op: mul + elementwise_add // named nodes: @@ -578,6 +618,279 @@ struct ElewiseAddActInplaceGrad : public PatternBase { PATTERN_DECL_NODE(d_ele_y); PATTERN_DECL_NODE(ele_y); }; + +// Conv with Elementwise_add as bias +// op: conv + elementwise_add +// named nodes: +// conv_input, conv_weight, +// conv_out, conv, +// eltwise_bias, eltwise_out, +// elementwise_add +struct ConvBias : public PatternBase { + ConvBias(PDPattern* pattern, const std::string& name_scope) + : PatternBase(pattern, name_scope, "conv_bias") {} + PDNode* operator()(PDNode* conv_input, bool is_conv3d = false); + // declare operator node's name + PATTERN_DECL_NODE(conv); + PATTERN_DECL_NODE(eltwise); + // declare variable node's name + PATTERN_DECL_NODE(conv_weight); + PATTERN_DECL_NODE(conv_out); + PATTERN_DECL_NODE(eltwise_bias); + PATTERN_DECL_NODE(eltwise_out); +}; + +// Convolution op +// Forward pass for convolution. +// conv_input, conv_bias and conv_filter are inputs. +// conv_output is a result of the operator. +// residual_data is data used by skip connection. +// If residual connection fusion is on, the formula is: +// conv_output = conv_op(conv_filter, conv_input, conv_bias) +// + conv_residual_data +// If the fusion is off, conv_residual_data is not added. +struct Conv : public PatternBase { + Conv(PDPattern* pattern, const std::string& name_scope) + : PatternBase(pattern, name_scope, "convolution") {} + + PDNode* operator()(); + + PATTERN_DECL_NODE(conv_op); + PATTERN_DECL_NODE(conv_input); + PATTERN_DECL_NODE(conv_filter); + PATTERN_DECL_NODE(conv_residual_data); + PATTERN_DECL_NODE(conv_output); +}; + +// Convolution op with residual data +struct ConvResidual : public PatternBase { + ConvResidual(PDPattern* pattern, const std::string& name_scope) + : PatternBase(pattern, name_scope, "conv_residual") {} + + PDNode* operator()(bool with_residual_data); + + PATTERN_DECL_NODE(conv_op); + PATTERN_DECL_NODE(conv_input); + PATTERN_DECL_NODE(conv_filter); + PATTERN_DECL_NODE(conv_residual_data); + PATTERN_DECL_NODE(conv_output); +}; + +// Pool op +// Forward pass for pooling. +// pool_input is the input. +// pool_output is a result of the operator. +struct Pool : public PatternBase { + Pool(PDPattern* pattern, const std::string& name_scope) + : PatternBase(pattern, name_scope, "pooling") {} + + PDNode* operator()(); + + PATTERN_DECL_NODE(pool_op); + PATTERN_DECL_NODE(pool_input); + PATTERN_DECL_NODE(pool_output); +}; + +// ElementwiseAdd used in residual connections. +// y_var is used and convolution output. +// The operator is removed, when residual +// connection fusion is on. +struct ElementwiseAdd : public PatternBase { + ElementwiseAdd(PDPattern* pattern, const std::string& name_scope) + : PatternBase(pattern, name_scope, "elementwise_add") {} + + PDNode* operator()(PDNode* x_var, PDNode* y_var); + + PATTERN_DECL_NODE(elementwise_add_op); + PATTERN_DECL_NODE(elementwise_add_x); + PATTERN_DECL_NODE(elementwise_add_y); + PATTERN_DECL_NODE(elementwise_add_out); +}; + +// Conv + ElementwiseAdd + an activation +// This pattern can futher fuse the conv related ops after the conv+bn fusion. +struct ConvElementwiseaddAct : public PatternBase { + ConvElementwiseaddAct(PDPattern* pattern, const std::string& name_scope) + : PatternBase(pattern, name_scope, "conv_elementwiseadd_act") {} + + PDNode* operator()(PDNode* conv_in); + + PATTERN_DECL_NODE(conv_op); + PATTERN_DECL_NODE(conv_out); + PATTERN_DECL_NODE(conv_filter); + + PATTERN_DECL_NODE(elementwise_add_op); + PATTERN_DECL_NODE(elementwise_add_in_y); // input + PATTERN_DECL_NODE(elementwise_add_out); + + PATTERN_DECL_NODE(act_op); + PATTERN_DECL_NODE(act_out); +}; + +// Conv + ElementwiseAdd + ElementwiseAdd + Activation +struct ConvElementwiseadd2Act : public PatternBase { + ConvElementwiseadd2Act(PDPattern* pattern, const std::string& name_scope) + : PatternBase(pattern, name_scope, + "conv_elementwiseadd2_elementwiseadd_act") {} + + PDNode* operator()(PDNode* conv_in); + + PATTERN_DECL_NODE(conv_op); + PATTERN_DECL_NODE(conv_filter); + PATTERN_DECL_NODE(conv_out); + + PATTERN_DECL_NODE(elementwise_add_op); + PATTERN_DECL_NODE(elementwise_add_in_y); // input + PATTERN_DECL_NODE(elementwise_add_out); + + PATTERN_DECL_NODE(elementwise_add_op_1); + PATTERN_DECL_NODE(elementwise_add_in_y_1); // input + PATTERN_DECL_NODE(elementwise_add_out_1); + + PATTERN_DECL_NODE(act_op); + PATTERN_DECL_NODE(act_out); +}; + +// Conv + ElementwiseAdd +// This pattern should be used after ConvElementwiseadd2Act or +// ConvElementwiseadd pass +struct ConvElementwiseadd : public PatternBase { + ConvElementwiseadd(PDPattern* pattern, const std::string& name_scope) + : PatternBase(pattern, name_scope, "conv_elementwiseadd") {} + + PDNode* operator()(PDNode* conv_in); + + PATTERN_DECL_NODE(conv_op); + PATTERN_DECL_NODE(conv_out); + PATTERN_DECL_NODE(conv_filter); + + PATTERN_DECL_NODE(elementwise_add_op); + PATTERN_DECL_NODE(elementwise_add_in_y); + PATTERN_DECL_NODE(elementwise_add_out); +}; + +// Conv with affine_channel +// op: conv + (elementwise_add +) affine_channel +// named nodes: +// conv_weight, conv_out, conv, +// ac_x, ac_scale, ac_bias +// affine_channel, ac_out +struct ConvAffineChannel : public PatternBase { + ConvAffineChannel(PDPattern* pattern, const std::string& name_scope) + : PatternBase(pattern, name_scope, "conv_affine_channel") {} + + PDNode* operator()(PDNode* conv_input, bool with_eltwise_add); + + // declare operator node's name + PATTERN_DECL_NODE(conv); + PATTERN_DECL_NODE(affine_channel); + PATTERN_DECL_NODE(eltwise); // ELEMENTWISE_ADD + // CONV inputs + PATTERN_DECL_NODE(conv_weight); // Filter + // CONV outputs + PATTERN_DECL_NODE(conv_out); // tmp + // ELTWISE inputs + PATTERN_DECL_NODE(eltwise_y_in); + // ELTWISE outputs + PATTERN_DECL_NODE(eltwise_out); // tmp + + // AC(Affine_Channel) inputs + PATTERN_DECL_NODE(ac_scale); + PATTERN_DECL_NODE(ac_bias); + // AC outputs + PATTERN_DECL_NODE(ac_out); // Out +}; + +// Dequantize + Quantize + anyOP +// This pattern is used for squashing the dequantize-quantize pairs. +struct DequantQuantAny : public PatternBase { + DequantQuantAny(PDPattern* pattern, const std::string& name_scope) + : PatternBase(pattern, name_scope, "dequant_quant_any") {} + PDNode* operator()(); + + PATTERN_DECL_NODE(dequant_in); + PATTERN_DECL_NODE(dequant_op); + PATTERN_DECL_NODE(dequant_out); + PATTERN_DECL_NODE(quant_op); + PATTERN_DECL_NODE(quant_out); + PATTERN_DECL_NODE(next_op); +}; + +// Dequantize + anyOP +// This quantize is used for getting number of ops the Dequantize's +// output is an input to. +struct DequantAny : public PatternBase { + DequantAny(PDPattern* pattern, const std::string& name_scope) + : PatternBase(pattern, name_scope, "dequant_any") {} + PDNode* operator()(); + + PATTERN_DECL_NODE(dequant_op); + PATTERN_DECL_NODE(dequant_out); + PATTERN_DECL_NODE(next_op); +}; + +struct TransposeFlattenConcat : public PatternBase { + TransposeFlattenConcat(PDPattern* pattern, const std::string& name_scope) + : PatternBase(pattern, name_scope, "transpose_flatten_concat") {} + + PDNode* operator()(std::vector conv_inputs, int times); + + std::string GetNodeName(const std::string& op_type) { + return PDNodeName(name_scope_, repr_, id_, op_type); + } + + PDNode* GetPDNode(const std::string& op_type) { + return pattern->RetrieveNode(GetNodeName(op_type)); + } +}; + +struct AnakinDetectionPattern : public PatternBase { + AnakinDetectionPattern(PDPattern* pattern, const std::string& name_scope) + : PatternBase(pattern, name_scope, "anakin_detect_pattern") {} + + PDNode* operator()(std::vector conv_inputs, int times, + std::string priorbox_type, bool is_reshape); + + std::string GetNodeName(const std::string& op_type) { + return PDNodeName(name_scope_, repr_, id_, op_type); + } + + PDNode* GetPDNode(const std::string& op_type) { + return pattern->RetrieveNode(GetNodeName(op_type)); + } +}; + +struct FillConstantElementWiseMulFuse : public PatternBase { + FillConstantElementWiseMulFuse(PDPattern* pattern, + const std::string& name_scope) + : PatternBase(pattern, name_scope, + "anakin_fillconstant_elementwisemul_fuse") {} + + PDNode* operator()(PDNode* elementwise_op_input); + + // declare operator node's name + PATTERN_DECL_NODE(fill_constant); + PATTERN_DECL_NODE(fill_constant_out); + PATTERN_DECL_NODE(elementwise_mul); + PATTERN_DECL_NODE(elementwise_mul_out); +}; + +struct QuantDequantOpFuse : public PatternBase { + QuantDequantOpFuse(PDPattern* pattern, const std::string& name_scope) + : PatternBase(pattern, name_scope, "quant_dequant_fuse") {} + + void operator()(PDNode* quant_op_input, const std::string& op_name, + const std::string& weight_name, int times = 1); + + std::string GetNodeName(const std::string& op_type) { + return PDNodeName(name_scope_, repr_, id_, op_type); + } + + PDNode* GetPDNode(const std::string& op_type) { + return pattern->RetrieveNode(GetNodeName(op_type)); + } +}; + } // namespace patterns // Link two ir::Nodes from each other. diff --git a/paddle/fluid/framework/ir/graph_test.cc b/paddle/fluid/framework/ir/graph_test.cc index cadda49c399a6d65079cacedfea61f4fd580a69a..a95588a57b434763fb0f01e33528ef15fd1aa42b 100644 --- a/paddle/fluid/framework/ir/graph_test.cc +++ b/paddle/fluid/framework/ir/graph_test.cc @@ -43,20 +43,20 @@ class SumOpMaker : public OpProtoAndCheckerMaker { class SumOpVarTypeInference : public VarTypeInference { public: - void operator()(const OpDesc &op_desc, BlockDesc *block) const override { - auto &inputs = op_desc.Input("X"); + void operator()(InferVarTypeContext *ctx) const override { + auto &inputs = ctx->Input("X"); auto default_var_type = proto::VarType::SELECTED_ROWS; bool any_input_is_lod_tensor = std::any_of( - inputs.begin(), inputs.end(), [block](const std::string &name) { - return block->Var(name)->GetType() == proto::VarType::LOD_TENSOR; + inputs.begin(), inputs.end(), [&ctx](const std::string &name) { + return ctx->GetType(name) == proto::VarType::LOD_TENSOR; }); if (any_input_is_lod_tensor) { default_var_type = proto::VarType::LOD_TENSOR; } - auto out_var_name = op_desc.Output("Out").front(); - block->Var(out_var_name)->SetType(default_var_type); + auto out_var_name = ctx->Output("Out").front(); + ctx->SetType(out_var_name, default_var_type); } }; @@ -71,7 +71,7 @@ class DummyOpMaker : public OpProtoAndCheckerMaker { class DummyOpVarTypeInference : public VarTypeInference { public: - void operator()(const OpDesc &op_desc, BlockDesc *block) const override {} + void operator()(framework::InferVarTypeContext *ctx) const override {} }; } // namespace framework } // namespace paddle @@ -124,7 +124,7 @@ TEST(GraphTest, Basic) { ASSERT_EQ(n->outputs.size(), 0UL); } } - ASSERT_EQ(nodes.size(), 5); + ASSERT_EQ(nodes.size(), 5UL); } TEST(GraphTest, WriteAfterRead) { diff --git a/paddle/fluid/framework/ir/graph_to_program_pass.cc b/paddle/fluid/framework/ir/graph_to_program_pass.cc index 414d8f79b15de091c62af5fe099ffae144156e4e..b0d056f2c0f8286caadfbfed3b55b19fcef34402 100644 --- a/paddle/fluid/framework/ir/graph_to_program_pass.cc +++ b/paddle/fluid/framework/ir/graph_to_program_pass.cc @@ -15,49 +15,67 @@ limitations under the License. */ #include "paddle/fluid/framework/ir/graph_to_program_pass.h" #include +#include #include +#include #include #include "paddle/fluid/framework/ir/graph.h" #include "paddle/fluid/framework/ir/graph_helper.h" - #include "paddle/fluid/framework/program_desc.h" namespace paddle { namespace framework { namespace ir { -std::unique_ptr GraphToProgramPass::ApplyImpl( - std::unique_ptr graph) const { +void GraphToProgramPass::ApplyImpl(ir::Graph* graph) const { + // Remove the unneeded variables after memory optimization. + std::unordered_set vars2remove; + if (graph->Has(kGraphToProgramVarsToRemove)) { + vars2remove = graph->Get>( + kGraphToProgramVarsToRemove); + VLOG(2) << "graph to program remove " << vars2remove.size() << " nodes"; + } + ProgramDesc& program = Get("program"); std::unique_ptr program_pb( new proto::ProgramDesc(*program.Proto())); auto block = program_pb->mutable_blocks(kRootBlockIndex); + block->set_idx(kRootBlockIndex); block->clear_vars(); std::unordered_set visited_vars; for (ir::Node* n : graph->Nodes()) { - if (n->NodeType() == ir::Node::Type::kVariable) { - if (n->Var() && visited_vars.count(n->Var()->Name()) == 0) { + if (n->IsVar()) { + if (n->Var() && visited_vars.count(n->Var()->Name()) == 0 && + !vars2remove.count(n->Var()->Name())) { visited_vars.insert(n->Var()->Name()); block->add_vars()->MergeFrom(*n->Var()->Proto()); } } } - block->clear_ops(); - std::vector nodes = TopologySortOperations(*graph); + + std::vector nodes; + if (Has(kGraphToProgramSortKind)) { + // Inference Memory Optimize relays on this branch. + int sort_kind = Get(kGraphToProgramSortKind); + nodes = TopologyVarientSort( + *graph, static_cast(sort_kind)); + } else { + nodes = TopologySortOperations(*graph); + } + for (ir::Node* n : nodes) { - if (!n->Op()) { - continue; - } + if (!n->Op()) continue; + block->add_ops()->MergeFrom(*n->Op()->Proto()); } program.CopyFrom(*program_pb); - return graph; } + } // namespace ir } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/ir/graph_to_program_pass.h b/paddle/fluid/framework/ir/graph_to_program_pass.h index 124ec5a8e771fb768b31fa2e9f5143db96154490..52c8f4e0fcafcd42647b323a20fee7c7cf167b3a 100644 --- a/paddle/fluid/framework/ir/graph_to_program_pass.h +++ b/paddle/fluid/framework/ir/graph_to_program_pass.h @@ -20,9 +20,13 @@ namespace paddle { namespace framework { namespace ir { +const char kGraphToProgramVarsToRemove[] = + "__graph_to_program_vars_to_remove__"; +const char kGraphToProgramSortKind[] = "__graph_to_program_sort_kind__"; + class GraphToProgramPass : public Pass { protected: - std::unique_ptr ApplyImpl(std::unique_ptr graph) const override; + void ApplyImpl(ir::Graph* graph) const override; }; } // namespace ir diff --git a/paddle/fluid/framework/ir/graph_to_program_pass_test.cc b/paddle/fluid/framework/ir/graph_to_program_pass_test.cc index 5d51d9751a28d2b1549096b1984d67b55f913da6..5ee6b8a5f1e4e7415adfac6b51e9d3ae8e3062a9 100644 --- a/paddle/fluid/framework/ir/graph_to_program_pass_test.cc +++ b/paddle/fluid/framework/ir/graph_to_program_pass_test.cc @@ -14,7 +14,9 @@ limitations under the License. */ #include "paddle/fluid/framework/ir/graph_to_program_pass.h" +#include #include +#include #include #include "gtest/gtest.h" #include "paddle/fluid/framework/program_desc.h" @@ -84,7 +86,7 @@ TEST(GraphToProgramPass, Basic) { ProgramDesc compiled_prog; pass->SetNotOwned("program", &compiled_prog); - pass->Apply(std::move(g)); + pass->Apply(g.get()); std::vector ops = compiled_prog.Block(0).AllOps(); EXPECT_EQ(ops[0]->Type(), "op1"); EXPECT_EQ(ops[1]->Type(), "op2"); diff --git a/paddle/fluid/framework/ir/graph_traits.cc b/paddle/fluid/framework/ir/graph_traits.cc index 084a4ba2def87eaa8badb3ca2c39865c6e5cb981..929d9edc34ffb92f468d5b7af54a0b8da4121543 100644 --- a/paddle/fluid/framework/ir/graph_traits.cc +++ b/paddle/fluid/framework/ir/graph_traits.cc @@ -14,6 +14,7 @@ #include "paddle/fluid/framework/ir/graph_traits.h" +#include #include namespace paddle { @@ -66,6 +67,76 @@ NodesDFSIterator &NodesDFSIterator::operator=(const NodesDFSIterator &other) { } Node *NodesDFSIterator::operator->() { return stack_.top(); } +inline bool CheckNodeIndegreeEquals(const Node &node, size_t n) { + return node.inputs.size() == n; +} + +NodesTSIterator::NodesTSIterator(const std::vector &source) { + PADDLE_ENFORCE(!source.empty(), + "Start points of topological sorting should not be empty!"); + // CHECK all the inputs' in-degree is 0 + for (auto *node : source) { + PADDLE_ENFORCE(CheckNodeIndegreeEquals(*node, 0)); + } + + std::unordered_set visited; + std::set to_visit{source.begin(), source.end()}; + + std::vector inlink_visited; + while (!to_visit.empty()) { + std::vector queue(to_visit.begin(), to_visit.end()); + for (auto *p : queue) { + inlink_visited.clear(); + + std::copy_if(p->inputs.begin(), p->inputs.end(), + std::back_inserter(inlink_visited), + [&](Node *x) -> bool { return visited.count(x) != 0; }); + + if (inlink_visited.size() == p->inputs.size()) { + sorted_.push_back(p); + for (auto *_ : p->outputs) { + if (!visited.count(_)) { + to_visit.insert(_); + } + } + + to_visit.erase(p); + visited.insert(p); + } + } + } +} + +NodesTSIterator::NodesTSIterator(const NodesTSIterator &other) + : sorted_(other.sorted_), cursor_(other.cursor_) {} + +Node &NodesTSIterator::operator*() { + PADDLE_ENFORCE_LT(cursor_, sorted_.size()); + return *sorted_[cursor_]; +} + +NodesTSIterator &NodesTSIterator::operator++() { + if (++cursor_ >= sorted_.size()) { + sorted_.clear(); + cursor_ = 0; + } + return *this; +} +NodesTSIterator &NodesTSIterator::operator=(const NodesTSIterator &other) { + cursor_ = other.cursor_; + sorted_ = other.sorted_; + return *this; +} + +bool NodesTSIterator::operator==(const NodesTSIterator &other) { + return sorted_ == other.sorted_ && cursor_ == other.cursor_; +} + +Node *NodesTSIterator::operator->() { + PADDLE_ENFORCE_LT(cursor_, sorted_.size()); + return sorted_[cursor_]; +} + } // namespace ir } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/ir/graph_traits.h b/paddle/fluid/framework/ir/graph_traits.h index f42bab20ed97e372d2da0c4a492a4458ab94e0a0..f6772f9a37567c83c49bd44d551481edda1a74ae 100644 --- a/paddle/fluid/framework/ir/graph_traits.h +++ b/paddle/fluid/framework/ir/graph_traits.h @@ -62,6 +62,32 @@ struct NodesDFSIterator std::unordered_set visited_; }; +// Topological sorting iterator on nodes. +struct NodesTSIterator + : public std::iterator { + NodesTSIterator() = default; + NodesTSIterator(const std::vector &source); + NodesTSIterator(NodesTSIterator &&other) + : sorted_(std::move(other.sorted_)), cursor_(other.cursor_) { + other.cursor_ = 0; + } + NodesTSIterator(const NodesTSIterator &other); + + Node &operator*(); + NodesTSIterator &operator++(); + // TODO(Superjomn) current implementation just compare the first + // element, need to compare the graph and all the elements in the queue and + // set. + NodesTSIterator &operator=(const NodesTSIterator &other); + bool operator==(const NodesTSIterator &other); + bool operator!=(const NodesTSIterator &other) { return !(*this == other); } + Node *operator->(); + + private: + std::vector sorted_; + size_t cursor_{0}; +}; + /* * GraphTraits contains some graph traversal algorithms. * @@ -76,6 +102,14 @@ struct GraphTraits { NodesDFSIterator()); } + static iterator_range TS(const Graph &g) { + auto start_points = ExtractStartPoints(g); + PADDLE_ENFORCE(!start_points.empty()); + NodesTSIterator x(start_points); + return iterator_range(NodesTSIterator(start_points), + NodesTSIterator()); + } + private: // The nodes those have no input will be treated as start points. static std::vector ExtractStartPoints(const Graph &g) { diff --git a/paddle/fluid/framework/ir/graph_viz_pass.cc b/paddle/fluid/framework/ir/graph_viz_pass.cc index 31ed98db72c8fd4af8c970861d386687962001ce..f4df4cfeba66889f3bf547d989d27aa76587e6be 100644 --- a/paddle/fluid/framework/ir/graph_viz_pass.cc +++ b/paddle/fluid/framework/ir/graph_viz_pass.cc @@ -12,10 +12,10 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ +#include "paddle/fluid/framework/ir/graph_viz_pass.h" #include +#include #include - -#include "paddle/fluid/framework/ir/graph_viz_pass.h" #include "paddle/fluid/framework/op_proto_maker.h" #include "paddle/fluid/inference/analysis/dot.h" #include "paddle/fluid/string/printf.h" @@ -38,8 +38,7 @@ std::string FormatName(const Node* node) { } } // namespace -std::unique_ptr GraphVizPass::ApplyImpl( - std::unique_ptr graph) const { +void GraphVizPass::ApplyImpl(ir::Graph* graph) const { const std::string graph_viz_path = Get(kGraphVizPath); VLOG(3) << "draw IR graph viz to " << graph_viz_path; std::unique_ptr fout(new std::ofstream(graph_viz_path)); @@ -82,7 +81,7 @@ std::unique_ptr GraphVizPass::ApplyImpl( {Dot::Attr("style", "filled,rounded"), Dot::Attr("shape", "box"), Dot::Attr("fillcolor", "yellow")}); - auto marked_nodes = ConsumeMarkedNodes(graph.get()); + auto marked_nodes = ConsumeMarkedNodes(graph); // Create nodes for (const Node* n : graph->Nodes()) { std::string node_id = FormatName(n) + "(" + std::to_string(n->id()) + ")"; @@ -115,8 +114,6 @@ std::unique_ptr GraphVizPass::ApplyImpl( } sout << dot.Build(); - - return graph; } GraphVizPass::marked_nodes_t GraphVizPass::ConsumeMarkedNodes( diff --git a/paddle/fluid/framework/ir/graph_viz_pass.h b/paddle/fluid/framework/ir/graph_viz_pass.h index e64916a5bb662e3b00cfe212f0bbbc537c7bc2cc..7091aa6a95bd9ebde10bfbd45c98f8757b9d06c4 100644 --- a/paddle/fluid/framework/ir/graph_viz_pass.h +++ b/paddle/fluid/framework/ir/graph_viz_pass.h @@ -18,6 +18,7 @@ limitations under the License. */ #include #include #include +#include #include #include "paddle/fluid/framework/ir/graph.h" @@ -34,8 +35,7 @@ class GraphVizPass : public Pass { using marked_nodes_t = std::unordered_set; protected: - std::unique_ptr ApplyImpl( - std::unique_ptr graph) const override; + void ApplyImpl(ir::Graph* graph) const override; // Tell whether there are any marked nodes in the graph. Consume the // corresponding attribute. diff --git a/paddle/fluid/framework/ir/identity_scale_op_clean_pass.cc b/paddle/fluid/framework/ir/identity_scale_op_clean_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..a39901e63bf65f7c314595a5fb2cc31d00959bd5 --- /dev/null +++ b/paddle/fluid/framework/ir/identity_scale_op_clean_pass.cc @@ -0,0 +1,82 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/identity_scale_op_clean_pass.h" +#include +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" + +namespace paddle { +namespace framework { +namespace ir { + +void IdentityScaleOpCleanPass::ApplyImpl(ir::Graph* graph) const { + FusePassBase::Init("identity_scale_op_clean", graph); + + // pre_op -> scale_in -> scale_op -> scale_out + // -> + // pre_op -> scale_out + GraphPatternDetector detector; + auto pre_op = detector.mutable_pattern()->NewNode("pre_op")->assert_is_op(); + auto scale_in = detector.mutable_pattern() + ->NewNode("scale_in") + ->assert_is_op_input("scale") + ->AsIntermediate(); + auto scale_op = detector.mutable_pattern() + ->NewNode("scale_fuse") + ->assert_is_op("scale") + ->assert_op_attr("scale", 1.) + ->assert_op_attr("bias", 0.); + auto scale_out = + detector.mutable_pattern() + ->NewNode("scale_out") + ->assert_is_op_output("scale") + // scale's output var should has only one consumer, or it can't be + // removed. + ->assert_more([](Node* x) { return x->outputs.size() == 1UL; }); + + pre_op->LinksTo({scale_in}); + scale_op->LinksFrom({scale_in}).LinksTo({scale_out}); + + GraphPatternDetector::handle_t handler = [&]( + const GraphPatternDetector::subgraph_t& subgraph, Graph* graph) { + Node* scale_op_var = subgraph.at(scale_op); + Node* scale_in_var = subgraph.at(scale_in); + Node* scale_out_var = subgraph.at(scale_out); + Node* pre_op_var = subgraph.at(pre_op); + // Link pre_op directly to scale_out + const std::string scale_in_name = scale_in_var->Name(); + const std::string scale_out_name = scale_out_var->Name(); + // Remove links in graph + GraphSafeRemoveNodes(graph, {scale_in_var, scale_op_var}); + // Modify proto message + auto* pre_op_desc = pre_op_var->Op(); + for (auto& parameter : *pre_op_desc->Proto()->mutable_outputs()) { + auto* arguments = parameter.mutable_arguments(); + auto it = std::find(arguments->begin(), arguments->end(), scale_in_name); + PADDLE_ENFORCE(it != arguments->end()); + *it = scale_out_name; + } + + IR_NODE_LINK_TO(pre_op_var, scale_out_var); + }; + + detector(graph, handler); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(identity_scale_op_clean_pass, + paddle::framework::ir::IdentityScaleOpCleanPass); diff --git a/paddle/fluid/framework/ir/identity_scale_op_clean_pass.h b/paddle/fluid/framework/ir/identity_scale_op_clean_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..d66b411257e530fa5188091702b0b309652ffaa4 --- /dev/null +++ b/paddle/fluid/framework/ir/identity_scale_op_clean_pass.h @@ -0,0 +1,33 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/framework/ir/fuse_pass_base.h" + +namespace paddle { +namespace framework { +namespace ir { + +class IdentityScaleOpCleanPass : public FusePassBase { + protected: + void ApplyImpl(ir::Graph* graph) const override; + + private: + virtual ~IdentityScaleOpCleanPass() = default; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/infer_clean_graph_pass.cc b/paddle/fluid/framework/ir/infer_clean_graph_pass.cc index 7713ed1eab88ee4fa16d52e7425075ae66f721a3..d76924116f6d6202557a0d76cfcdadba0a3a6de6 100644 --- a/paddle/fluid/framework/ir/infer_clean_graph_pass.cc +++ b/paddle/fluid/framework/ir/infer_clean_graph_pass.cc @@ -26,9 +26,9 @@ class InferCleanGraphPass : public FusePassBase { virtual ~InferCleanGraphPass() {} protected: - std::unique_ptr ApplyImpl(std::unique_ptr graph) const { - FusePassBase::Init("original_graph", graph.get()); - PADDLE_ENFORCE(graph.get()); + void ApplyImpl(ir::Graph* graph) const { + FusePassBase::Init("original_graph", graph); + PADDLE_ENFORCE(graph); auto is_valid_node = [](Node* x) { return x && IsControlDepVar(*x) && x->IsVar() && !x->Var(); @@ -37,6 +37,7 @@ class InferCleanGraphPass : public FusePassBase { std::unordered_set invalid_nodes; int valid_op = 0; for (auto* node : graph->Nodes()) { + PADDLE_ENFORCE_NOT_NULL(node); if (is_valid_node(node)) { invalid_nodes.insert(node); } else if (node->IsOp()) { @@ -45,11 +46,9 @@ class InferCleanGraphPass : public FusePassBase { } } - GraphSafeRemoveNodes(graph.get(), invalid_nodes); + GraphSafeRemoveNodes(graph, invalid_nodes); AddStatis(valid_op); - - return graph; } void CleanEdges(std::vector* nodes, diff --git a/paddle/fluid/framework/ir/is_test_pass.cc b/paddle/fluid/framework/ir/is_test_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..bf6fe999c1e68c35bc2c19fe38646da93bb1e204 --- /dev/null +++ b/paddle/fluid/framework/ir/is_test_pass.cc @@ -0,0 +1,55 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/ir/is_test_pass.h" +#include +#include + +namespace paddle { +namespace framework { +namespace ir { + +void IsTestPass::ApplyImpl(ir::Graph* graph) const { + VLOG(3) << "Sets is_test attrbiute to true and if it is missing, inserts it " + "for activations and pooling."; + auto op_list = {"pool2d", "sigmoid", "logsigmoid", + "softshrink", "exp", "brelu", + "pow", "leaky_relu", "stanh", + "relu", "tanh", "tanh_shrink", + "sqrt", "abs", "ceil", + "elu", "floor", "cos", + "sin", "round", "reciprocal", + "hard_shrink", "hard_sigmoid", "relu6", + "soft_relu", "swish", "thresholded_relu", + "log", "square", "softplus", + "softsign"}; + for (const Node* n : graph->Nodes()) { + if (n->IsOp()) { + auto* op = n->Op(); + if (op->HasAttr("is_test") || op->HasProtoAttr("is_test")) { + op->SetAttr("is_test", true); + } else if (std::find(begin(op_list), end(op_list), op->Type()) != + end(op_list)) { + op->MutableAttrMap()->insert( + std::pair("is_test", true)); + } + } + } +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(is_test_pass, paddle::framework::ir::IsTestPass); diff --git a/paddle/fluid/framework/ir/is_test_pass.h b/paddle/fluid/framework/ir/is_test_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..80cedbf9f850f6fe31c9f2898264e19ebf931c72 --- /dev/null +++ b/paddle/fluid/framework/ir/is_test_pass.h @@ -0,0 +1,30 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "paddle/fluid/framework/ir/pass.h" + +namespace paddle { +namespace framework { +namespace ir { + +class IsTestPass : public Pass { + protected: + void ApplyImpl(ir::Graph* graph) const override; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/is_test_pass_tester.cc b/paddle/fluid/framework/ir/is_test_pass_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..3fa543c6221ae6ada8afddcf4563c1174127c221 --- /dev/null +++ b/paddle/fluid/framework/ir/is_test_pass_tester.cc @@ -0,0 +1,120 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/is_test_pass.h" + +#include +#ifdef _WIN32 +#undef FALSE +#undef TRUE +#endif +namespace paddle { +namespace framework { +namespace ir { + +enum class ISTEST_STATE { FALSE, TRUE, UNSET }; + +void SetOp(ProgramDesc* prog, const std::string& type, const std::string& name, + const std::vector& inputs, + const std::vector& outputs, bool use_mkldnn = false, + ISTEST_STATE is_test = ISTEST_STATE::UNSET) { + auto* op = prog->MutableBlock(0)->AppendOp(); + op->SetType(type); + op->SetAttr("name", name); + op->SetInput("X", inputs); + op->SetOutput("Out", outputs); + op->SetAttr("use_mkldnn", use_mkldnn); + if (is_test == ISTEST_STATE::UNSET) + op->MutableAttrMap()->erase("is_test"); + else if (is_test == ISTEST_STATE::FALSE) + op->SetAttr("is_test", false); + else + op->SetAttr("is_test", true); +} + +// a->pool2d->b +// b->relu->c +// c,weights1)->conv2d->d +// +// d->pool2d->e +// e->hard_sigmoid->f +// (f,weights2)->conv2d->g +// +// g->pool2d->h +// h->tanh->i +// (i,weights3)->conv2d->j +ProgramDesc BuildProgramDesc() { + ProgramDesc prog; + for (auto& v : + std::vector({"a", "b", "c", "d", "e", "f", "g", "h", "i", + "j", "weights1", "weights2", "weights3"})) { + auto* var = prog.MutableBlock(0)->Var(v); + var->SetType(proto::VarType::SELECTED_ROWS); + if (v == "weights1" || v == "weights2" || v == "weights3") { + var->SetPersistable(true); + } + } + + SetOp(&prog, "pool2d", "pooling1", std::vector({"a"}), + std::vector({"b"}), true, ISTEST_STATE::TRUE); + SetOp(&prog, "relu", "activation1", std::vector({"b"}), + std::vector({"c"}), true, ISTEST_STATE::TRUE); + SetOp(&prog, "conv2d", "conv1", std::vector({"c", "weights1"}), + std::vector({"d"}), true, ISTEST_STATE::TRUE); + + SetOp(&prog, "pool2d", "pooling2", std::vector({"d"}), + std::vector({"e"}), false, ISTEST_STATE::FALSE); + SetOp(&prog, "hard_sigmoid", "activation2", std::vector({"e"}), + std::vector({"f"}), false, ISTEST_STATE::FALSE); + SetOp(&prog, "conv2d", "conv2", std::vector({"f", "weights2"}), + std::vector({"g"}), false, ISTEST_STATE::FALSE); + + SetOp(&prog, "pool2d", "pooling3", std::vector({"g"}), + std::vector({"h"}), false, ISTEST_STATE::UNSET); + SetOp(&prog, "tanh", "activation3", std::vector({"h"}), + std::vector({"i"}), true, ISTEST_STATE::UNSET); + SetOp(&prog, "conv2d", "conv3", std::vector({"i", "weights3"}), + std::vector({"j"}), false, ISTEST_STATE::UNSET); + + return prog; +} + +TEST(IsTestPass, basic) { + auto prog = BuildProgramDesc(); + + std::unique_ptr graph(new ir::Graph(prog)); + + auto pass = PassRegistry::Instance().Get("is_test_pass"); + + graph.reset(pass->Apply(graph.release())); + + for (auto* node : graph->Nodes()) { + if (node->IsOp()) { + auto* op = node->Op(); + auto op_name = boost::get(op->GetAttr("name")); + if (op_name == "conv3") { + ASSERT_FALSE(op->HasAttr("is_test")); + } else { + ASSERT_TRUE(op->HasAttr("is_test")); + EXPECT_TRUE(boost::get(op->GetAttr("is_test"))); + } + } + } +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +USE_PASS(is_test_pass); diff --git a/paddle/fluid/framework/ir/lock_free_optimize_pass.cc b/paddle/fluid/framework/ir/lock_free_optimize_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..05d23961a8b180381eef6372f7049bed2b530db7 --- /dev/null +++ b/paddle/fluid/framework/ir/lock_free_optimize_pass.cc @@ -0,0 +1,355 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/lock_free_optimize_pass.h" + +#include +#include +#include + +#include "paddle/fluid/framework/ir/node.h" +#include "paddle/fluid/framework/op_proto_maker.h" +#include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/platform/enforce.h" + +namespace paddle { +namespace framework { +namespace ir { + +const char kSumGradOpName[] = "sum"; +// TODO(minqiyang): only support sgd at current time, please add +// other optimizers later. +const char kOptimizerType[] = "sgd"; + +void LockFreeOptimizePass::ApplyImpl(ir::Graph* graph) const { + PADDLE_ENFORCE(graph); + + // We could collect all weights' name from SGD, where + // W1 <- SGD(W0, Grad0) + std::unordered_set weight_var_set; + for (auto* node : graph->Nodes()) { + if (IsOpNamed(node, kOptimizerType)) { + auto& param_out_vars = node->Op()->Output("ParamOut"); + PADDLE_ENFORCE(param_out_vars.size() == 1u); + weight_var_set.insert(param_out_vars[0]); + } + } + + // find all grad's merge op via weight name, where + // Grad0 <- SUM(Grad1, Grad2, Grad3 ...) + std::unordered_set grad_sum_op_set; + for (ir::Node* node : graph->Nodes()) { + if (IsOpNamed(node, kSumGradOpName)) { + for (ir::Node* output : node->outputs) { + // strip the last grad suffix @GRAD + std::string var_name = output->Name(); + const std::string suffix(kGradVarSuffix); + if (var_name != suffix && var_name.size() > suffix.size() && + var_name.substr(var_name.size() - suffix.size()) == suffix) { + // if so then strip them off + var_name = var_name.substr(0, var_name.size() - suffix.size()); + if (weight_var_set.find(var_name) != weight_var_set.end()) { + grad_sum_op_set.insert(node); + break; + } + } + } + } + } + + // get the forward op and backward op pairs, where + // out <- forward(X, W) + // Grad1 <- backward(out, X') + // Grad0 <- SUM(Grad1, Grad2, Grad3 ...) + // W0 <- SGD(W1, Grad0) + for (ir::Node* node : grad_sum_op_set) { + for (ir::Node* merged_grad_var : node->outputs) { + // find the optimizers connected with sum op + if (IsVarNameEndsWith(merged_grad_var, kGradVarSuffix) && + merged_grad_var->outputs.size() == 1u) { + ir::Node* opt_node = merged_grad_var->outputs[0]; + VLOG(3) << "Found opt node " << opt_node->Name(); + + // find the backward op connected with sum op + for (ir::Node* unmerged_grad_var : node->inputs) { + if (IsVarNameContains(unmerged_grad_var, kGradVarSuffix) && + unmerged_grad_var->inputs.size() == 1u) { + ir::Node* backward_op = unmerged_grad_var->inputs[0]; + + VLOG(3) << "Found backward_op " << backward_op->Name(); + + // find the forward op related to the backward op + ir::Node* forward_op = + FindForwardOpViaBackwardOp(graph, backward_op); + + VLOG(3) << "Found forward_op " << forward_op->Name(); + + PADDLE_ENFORCE(forward_op); + + Node* new_optimizer_node = CreateNewSGDNode( + graph, forward_op, backward_op, node, opt_node); + + PADDLE_ENFORCE(new_optimizer_node); + } + } + } + } + } + + // Remove the sum_op and its' outputs and connected Optimizers + for (Node* sum_op : grad_sum_op_set) { + for (Node* sum_op_output : sum_op->outputs) { + for (Node* optimize_op : sum_op_output->outputs) { + if (optimize_op->NodeType() == Node::Type::kOperation && + optimize_op->Name() == kOptimizerType) { + VLOG(3) << "remove optimize_op: " << optimize_op->Name() << "_" + << optimize_op->id(); + graph->RemoveNode(optimize_op); + } + } + VLOG(3) << "remove sum_op_output: " << sum_op_output->Name() << "_" + << sum_op_output->id(); + graph->RemoveNode(sum_op_output); + } + VLOG(3) << "remove sum_op: " << sum_op->Name() << "_" << sum_op->id(); + graph->RemoveNode(sum_op); + } + + for (auto* node : graph->Nodes()) { + for (Node* output_node : node->outputs) { + if (output_node->Name() == "sgd") { + VLOG(3) << "Node link to SGD: " << node->Name() << "_" << node->id() + << " --> " << output_node->Name() << "_" << output_node->id(); + for (Node* input_node : node->inputs) { + VLOG(3) << "SGD Input link: " << input_node->Name() << "_" + << input_node->id() << " --> " << node->Name() << "_" + << node->id(); + } + } + } + } +} + +ir::Node* LockFreeOptimizePass::CreateNewSGDNode( + ir::Graph* graph, ir::Node* forward_node, ir::Node* backward_node, + ir::Node* grad_sum_node, ir::Node* optimize_node) const { + PADDLE_ENFORCE(graph); + PADDLE_ENFORCE(forward_node); + PADDLE_ENFORCE(backward_node); + PADDLE_ENFORCE(grad_sum_node); + PADDLE_ENFORCE(optimize_node); + + // find the grad var node between the grad sum node and backward_node + std::vector grad_vars = + FindConnectedNode(backward_node, grad_sum_node); + ir::Node* grad_node = nullptr; + for (ir::Node* node : grad_vars) { + if (!ir::IsControlDepVar(*node)) { + grad_node = node; + } + } + PADDLE_ENFORCE(grad_node); + + // create a new SGD node + OpDesc* old_desc = optimize_node->Op(); + // keep with the same block between new optimizer and the old one + OpDesc new_desc(*old_desc, old_desc->Block()); + new_desc.SetInput("Param", old_desc->Input("Param")); + new_desc.SetInput("LearningRate", old_desc->Input("LearningRate")); + new_desc.SetInput("Grad", std::vector({grad_node->Name()})); + new_desc.SetOutput("ParamOut", old_desc->Output("ParamOut")); + + std::vector op_role_vars = boost::get>( + new_desc.GetAttr(framework::OpProtoAndCheckerMaker::OpRoleVarAttrName())); + // replace the second op role var, because the grad name was + // changed in new optimizer + op_role_vars.pop_back(); + op_role_vars.push_back(grad_node->Name()); + new_desc.SetAttr(framework::OpProtoAndCheckerMaker::OpRoleVarAttrName(), + op_role_vars); + new_desc.SetType(kOptimizerType); + + // set backward op's op role var, this will be used to + // set device_id in multi_device_pass + backward_node->Op()->SetAttr( + framework::OpProtoAndCheckerMaker::OpRoleVarAttrName(), op_role_vars); + // backward_node->Op()->SetAttr( + // framework::OpProtoAndCheckerMaker::OpRoleVarAttrName(), {}); + + // keep with the same output nodes between new optimizer and the + // old one + Node* sgd_node = graph->CreateOpNode(&new_desc); + + // change all outputs of the optimize_node to the new one + ReplaceAllDownstreamNode(optimize_node, sgd_node); + + // find connected node between forward node and optimize node + // and replace the optimize node to new sgd node + std::vector forward_opt_connected_nodes = + FindConnectedNode(forward_node, optimize_node); + for (ir::Node* node : forward_opt_connected_nodes) { + ReplaceUpstreamNode(node, optimize_node, sgd_node); + } + + // find connected node between backward node and optimize node + // and replace the optimize node to new sgd node + std::vector backward_opt_connected_nodes = + FindConnectedNode(backward_node, optimize_node); + for (ir::Node* node : backward_opt_connected_nodes) { + ReplaceUpstreamNode(node, optimize_node, sgd_node); + } + + // SGD must have only one param and LR in + PADDLE_ENFORCE(old_desc->Input("LearningRate").size() == 1u); + PADDLE_ENFORCE(old_desc->Input("Param").size() == 1u); + + // LR and weight nodes should be copied + for (Node* upstream_node : optimize_node->inputs) { + if (upstream_node->Name() == old_desc->Input("LearningRate")[0] || + upstream_node->Name() == old_desc->Input("Param")[0]) { + ReplaceUpstreamNode(upstream_node, optimize_node, sgd_node); + } + } + + VLOG(3) << "Create new opt node" << sgd_node->Name() << "_" << sgd_node->id(); + + return sgd_node; +} + +std::vector LockFreeOptimizePass::FindConnectedNode( + ir::Node* upstream_node, ir::Node* downstream_node) const { + std::vector result; + for (ir::Node* out_node : upstream_node->outputs) { + for (ir::Node* in_node : downstream_node->inputs) { + if (in_node == out_node) { + result.push_back(in_node); + } + } + } + + return result; +} + +void LockFreeOptimizePass::ReplaceUpstreamNode( + ir::Node* upstream_node, ir::Node* old_optimizer_node, + ir::Node* new_optimizer_node) const { + PADDLE_ENFORCE(upstream_node); + PADDLE_ENFORCE(old_optimizer_node); + PADDLE_ENFORCE(new_optimizer_node); + + // Remove the old_optimizer_node from upstream_node's outputs vector + auto& output_node_vec = upstream_node->outputs; + for (auto output_node_iter = output_node_vec.begin(); + output_node_iter != output_node_vec.end();) { + if (*output_node_iter == old_optimizer_node) { + output_node_vec.erase(output_node_iter); + break; + } else { + ++output_node_iter; + } + } + + // Add the new_optimizer_node to upstream_node's outputs vector + output_node_vec.emplace_back(new_optimizer_node); + new_optimizer_node->inputs.emplace_back(upstream_node); +} + +void LockFreeOptimizePass::ReplaceAllDownstreamNode( + ir::Node* old_optimizer_node, ir::Node* new_optimizer_node) const { + PADDLE_ENFORCE(old_optimizer_node); + PADDLE_ENFORCE(new_optimizer_node); + + for (ir::Node* downstream_node : old_optimizer_node->outputs) { + // Remove the old_optimizer_node from downstream_node's inputs vector + auto& input_node_vec = downstream_node->inputs; + for (auto input_node_iter = input_node_vec.begin(); + input_node_iter != input_node_vec.end();) { + if (*input_node_iter == old_optimizer_node) { + input_node_vec.erase(input_node_iter); + break; + } else { + ++input_node_iter; + } + } + + // Add the new_optimizer_node to downstream_node's inputs vector + input_node_vec.emplace_back(new_optimizer_node); + new_optimizer_node->outputs.emplace_back(downstream_node); + } +} + +ir::Node* LockFreeOptimizePass::FindForwardOpViaBackwardOp( + ir::Graph* graph, ir::Node* backward_node) const { + PADDLE_ENFORCE(graph); + PADDLE_ENFORCE(backward_node); + + // strip the suffix _grad of backward_node's name + std::string forward_op_name = backward_node->Name(); + const std::string suffix("_grad"); + if (forward_op_name != suffix && forward_op_name.size() > suffix.size() && + forward_op_name.substr(forward_op_name.size() - suffix.size()) == + suffix) { + // if so then strip them off + forward_op_name = + forward_op_name.substr(0, forward_op_name.size() - suffix.size()); + } else { + LOG(WARNING) << "Illegal backward node's name " << backward_node->Name() + << " id " << backward_node->id(); + + return nullptr; + } + + for (ir::Node* node : graph->Nodes()) { + if (node->Name() == forward_op_name) { + if (node->outputs.size() == 0u) { + // if forward_node has no output, then it has NO grad op + continue; + } + + // check whether all inputs of the backward_op that ends_with @GRAD + // comes from the output of forward_op is the input of the backward_op + bool is_related_forward_node = true; + for (ir::Node* backward_input : backward_node->inputs) { + if (IsVarNameEndsWith(backward_input, kGradVarSuffix)) { + bool meets_correct_output = false; + for (ir::Node* forward_output : node->outputs) { + if (forward_output->Name() + kGradVarSuffix == + backward_input->Name()) { + meets_correct_output = true; + break; + } + } + + if (!meets_correct_output) { + is_related_forward_node = false; + break; + } + } + } + + if (is_related_forward_node) { + return node; + } + } + } + + return nullptr; +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(lock_free_optimize_pass, + paddle::framework::ir::LockFreeOptimizePass); diff --git a/paddle/fluid/framework/ir/lock_free_optimize_pass.h b/paddle/fluid/framework/ir/lock_free_optimize_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..d1718857a5d84304c3c02e74c7ca79c24f367f8c --- /dev/null +++ b/paddle/fluid/framework/ir/lock_free_optimize_pass.h @@ -0,0 +1,130 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#ifndef PADDLE_FLUID_FRAMEWORK_IR_LOCK_FREE_OPTIMIZE_PASS_H_ +#define PADDLE_FLUID_FRAMEWORK_IR_LOCK_FREE_OPTIMIZE_PASS_H_ + +#include +#include + +#include + +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/pass.h" + +namespace paddle { +namespace framework { +namespace ir { + +class Node; + +/* +* Remove the sum op of all gradients of the backward op. +* And remove the dependecies of the optimizer related to the +* same backward op. +* +* Before this pass: +* +* forward_op1 forward_op2 +* | | +* grad_op1 grad_op2 +* \ / +* \ / +* sum_op +* | +* sgd_op +* +* After this pass: +* forward_op1 forward_op2 +* | | +* grad_op1 grad_op2 +* | | +* sgd_op1 sgd_op2 +* +* sgd_op1 and sgd_op2 will update the same weight which holds the same +* memory, so we could benefits from the acceleration +*/ +class LockFreeOptimizePass : public Pass { + public: + virtual ~LockFreeOptimizePass() {} + + protected: + void ApplyImpl(ir::Graph* graph) const override; + + private: + // Create a new sgd node via current optimizer node + ir::Node* CreateNewSGDNode(ir::Graph* graph, ir::Node* forward_node, + ir::Node* backward_node, ir::Node* grad_sum_node, + ir::Node* optimize_node) const; + + // Replace the input weight's optimizers + void ReplaceUpstreamNode(ir::Node* upstream_node, + ir::Node* old_optimizer_node, + ir::Node* new_optimizer_node) const; + + // Replace the output weight's optimizers + void ReplaceAllDownstreamNode(ir::Node* old_optimizer_node, + ir::Node* new_optimizer_node) const; + + // Find all weight variables in graph + bool FindAllWeightVars(ir::Graph* graph) const; + + // Find the forward_op node via the backward_op node + ir::Node* FindForwardOpViaBackwardOp(ir::Graph* graph, + ir::Node* backward_node) const; + + std::vector FindConnectedNode(ir::Node* upstream_node, + ir::Node* downstream_node) const; + + inline bool IsOpNamed(ir::Node* node, const std::string& name) const { + PADDLE_ENFORCE(node); + + return node->NodeType() == Node::Type::kOperation && node->Name() == name; + } + + inline bool IsVarNamed(ir::Node* node, const std::string& name) const { + PADDLE_ENFORCE(node); + + return node->NodeType() == Node::Type::kVariable && node->Name() == name; + } + + inline bool IsVarNameEndsWith(ir::Node* node, const std::string& name) const { + PADDLE_ENFORCE(node); + + return node->NodeType() == Node::Type::kVariable && + boost::algorithm::ends_with(node->Name(), name); + } + + inline bool IsVarNameContains(ir::Node* node, const std::string& name) const { + PADDLE_ENFORCE(node); + + return node->NodeType() == Node::Type::kVariable && + node->Name().find(name) != std::string::npos; + } + + inline bool IsControlDepFrom(ir::Node* ctrl_dep_node, ir::Node* node) const { + PADDLE_ENFORCE(ctrl_dep_node); + PADDLE_ENFORCE(node); + + return IsControlDepVar(*ctrl_dep_node) && + ctrl_dep_node->inputs.size() >= 1u && + ctrl_dep_node->inputs[0] == node; + } +}; + +} // namespace ir +} // namespace framework +} // namespace paddle + +#endif // PADDLE_FLUID_FRAMEWORK_IR_LOCK_FREE_OPTIMIZE_PASS_H_ diff --git a/paddle/fluid/framework/ir/mkldnn/conv_bias_mkldnn_fuse_pass.cc b/paddle/fluid/framework/ir/mkldnn/conv_bias_mkldnn_fuse_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..8ef3993b065bcd37dcd571ba5a284cd35cfe052d --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn/conv_bias_mkldnn_fuse_pass.cc @@ -0,0 +1,139 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/mkldnn/conv_bias_mkldnn_fuse_pass.h" +#include +#include +#include +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/platform/enforce.h" + +namespace paddle { +namespace framework { +namespace ir { + +template +LoDTensor tensor_apply_eltwise(const LoDTensor& vec_a, const LoDTensor& vec_b, + BinaryOperation f) { + PADDLE_ENFORCE_EQ(vec_a.dims(), vec_b.dims()); + LoDTensor vec_y; + vec_y.Resize(vec_a.dims()); + const float* a = vec_a.data(); + const float* b = vec_b.data(); + float* y = vec_y.mutable_data(platform::CPUPlace()); + for (int i = 0; i < vec_a.numel(); i++) { + y[i] = f(a[i], b[i]); + } + return vec_y; +} + +void ConvBiasFusePass::ApplyImpl(ir::Graph* graph) const { + PADDLE_ENFORCE(graph); + FusePassBase::Init(name_scope_, graph); + + auto* scope = param_scope(); + PADDLE_ENFORCE(scope); + + std::string type = is_conv3d() ? "conv3d" : "conv2d"; + + GraphPatternDetector gpd; + auto* conv_input = + gpd.mutable_pattern() + ->NewNode(patterns::PDNodeName(name_scope_, "conv_input")) + ->AsInput() + ->assert_is_op_input(type, "Input"); + patterns::ConvBias conv_bias_pattern(gpd.mutable_pattern(), name_scope_); + conv_bias_pattern(conv_input, is_conv3d()); + int found_conv_bias_count = 0; + auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* g) { + VLOG(4) << "handle ConvBias fuse"; + GET_IR_NODE_FROM_SUBGRAPH(conv_weight, conv_weight, + conv_bias_pattern); // Filter + GET_IR_NODE_FROM_SUBGRAPH(conv_out, conv_out, conv_bias_pattern); // tmp + GET_IR_NODE_FROM_SUBGRAPH(conv, conv, conv_bias_pattern); // CONV op + // bias + GET_IR_NODE_FROM_SUBGRAPH(eltwise_bias, eltwise_bias, conv_bias_pattern); + // output + GET_IR_NODE_FROM_SUBGRAPH(eltwise_out, eltwise_out, conv_bias_pattern); + // elementwise_add op + GET_IR_NODE_FROM_SUBGRAPH(eltwise, eltwise, conv_bias_pattern); + + PADDLE_ENFORCE(subgraph.count(conv_input)); + + // check if fuse can be done and if MKL-DNN should be used + FuseOptions fuse_option = FindFuseOption(*conv, *eltwise); + if (fuse_option == DO_NOT_FUSE || fuse_option == FUSE_NATIVE) { + VLOG(3) << "do not perform conv+bias fuse"; + return; + } + + auto* eltwise_bias_tensor = + scope->FindVar(eltwise_bias->Name())->GetMutable(); + + auto input_names = conv->Op()->InputNames(); + bool has_bias = std::find(input_names.begin(), input_names.end(), "Bias") != + input_names.end(); + if (has_bias && conv->Op()->Input("Bias").size() > 0) { + auto conv_bias_names = conv->Op()->Input("Bias"); + // add eltwise bias to existing conv bias + PADDLE_ENFORCE_EQ(conv_bias_names.size(), 1); + auto* conv_bias_var = scope->FindVar(conv_bias_names[0]); + auto* conv_bias_tensor = conv_bias_var->GetMutable(); + PADDLE_ENFORCE_EQ(conv_bias_tensor->dims(), eltwise_bias_tensor->dims()); + *conv_bias_tensor = tensor_apply_eltwise( + *conv_bias_tensor, *eltwise_bias_tensor, std::plus()); + + conv->Op()->SetOutput("Output", + std::vector({eltwise_out->Name()})); + + GraphSafeRemoveNodes(graph, {eltwise, conv_out}); + + IR_NODE_LINK_TO(conv, eltwise_out); + } else { + // take eltwise bias as conv bias + OpDesc desc; + + desc.SetInput( + "Input", std::vector({subgraph.at(conv_input)->Name()})); + desc.SetInput("Filter", std::vector({conv_weight->Name()})); + desc.SetInput("Bias", std::vector({eltwise_bias->Name()})); + desc.SetOutput("Output", std::vector({eltwise_out->Name()})); + desc.SetType(type); + + for (auto& attr : conv->Op()->GetAttrMap()) { + desc.SetAttr(attr.first, attr.second); + } + auto conv_bias_node = g->CreateOpNode(&desc); + + IR_NODE_LINK_TO(subgraph.at(conv_input), conv_bias_node); + IR_NODE_LINK_TO(conv_weight, conv_bias_node); + IR_NODE_LINK_TO(eltwise_bias, conv_bias_node); + IR_NODE_LINK_TO(conv_bias_node, eltwise_out); + + GraphSafeRemoveNodes(graph, {conv, eltwise, conv_out}); + } + + found_conv_bias_count++; + }; + gpd(graph, handler); + AddStatis(found_conv_bias_count); +} +} // namespace ir +} // namespace framework +} // namespace paddle +REGISTER_PASS(conv_bias_mkldnn_fuse_pass, + paddle::framework::ir::ConvBiasFusePass); +REGISTER_PASS(conv3d_bias_mkldnn_fuse_pass, + paddle::framework::ir::Conv3DBiasFusePass); diff --git a/paddle/fluid/framework/ir/mkldnn/conv_bias_mkldnn_fuse_pass.h b/paddle/fluid/framework/ir/mkldnn/conv_bias_mkldnn_fuse_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..84106d0655d5578338da3b5993f3d2ec191542fd --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn/conv_bias_mkldnn_fuse_pass.h @@ -0,0 +1,44 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. +#pragma once +#include +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" +#include "paddle/fluid/framework/ir/pass.h" +namespace paddle { +namespace framework { +namespace ir { +/* +* Fuse the Conv and Elementwise_add to a ConvBiasOp. +*/ +class ConvBiasFusePass : public FusePassBase { + public: + virtual ~ConvBiasFusePass() {} + virtual bool is_conv3d() const { return false; } + + protected: + void ApplyImpl(ir::Graph* graph) const override; + const std::string name_scope_{"conv_bias_mkldnn_fuse"}; +}; +/* +* Fuse the Conv3D and Elementwise_add to a Conv3DBiasOp. +*/ +class Conv3DBiasFusePass : public ConvBiasFusePass { + public: + bool is_conv3d() const override { return true; } +}; +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/mkldnn/conv_bias_mkldnn_fuse_pass_tester.cc b/paddle/fluid/framework/ir/mkldnn/conv_bias_mkldnn_fuse_pass_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..ff7f9190fdeb1648a7ff2c59a07bad399a03bf3f --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn/conv_bias_mkldnn_fuse_pass_tester.cc @@ -0,0 +1,151 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/mkldnn/conv_bias_mkldnn_fuse_pass.h" +#include +#include "paddle/fluid/framework/naive_executor.h" +#include "paddle/fluid/platform/place.h" + +#include "paddle/fluid/framework/op_proto_maker.h" + +namespace paddle { +namespace framework { +namespace ir { + +void SetOp(ProgramDesc* prog, const std::string& type, const std::string& name, + const std::vector& inputs, + const std::vector& outputs) { + auto* op = prog->MutableBlock(0)->AppendOp(); + op->SetType(type); + if (type == "conv2d") { + op->SetAttr("use_mkldnn", true); + op->SetAttr("name", name); + op->SetInput("Input", {inputs[0]}); + op->SetInput("Filter", {inputs[1]}); + if (inputs.size() > 2) + op->SetInput("Bias", {inputs[2]}); + else + op->SetInput("Bias", {}); + } else if (type == "elementwise_add") { + op->SetAttr("use_mkldnn", true); + op->SetInput("X", {inputs[0]}); + op->SetInput("Y", {inputs[1]}); + } + op->SetOutput("Out", outputs); + op->SetAttr(OpProtoAndCheckerMaker::OpRoleAttrName(), + static_cast(OpRole::kForward)); +} + +// (c, weights)->conv->f +// (f)->elementwise_add->g +ProgramDesc BuildProgramDesc(bool convWithExistingBias) { + ProgramDesc prog; + std::vector nodes{"c", "weights", "f", "eltwise_bias", "g"}; + if (convWithExistingBias) nodes.push_back("conv_bias"); + for (auto& v : nodes) { + auto* var = prog.MutableBlock(0)->Var(v); + var->SetType(proto::VarType::LOD_TENSOR); + if (v == "weights" || v == "conv_bias" || v == "eltwise_bias") { + var->SetPersistable(true); + } + } + + // conv+bias, both with MKL-DNN + if (convWithExistingBias) { + SetOp(&prog, "conv2d", "conv", + std::vector({"c", "weights", "conv_bias"}), + std::vector({"f"})); + } else { + SetOp(&prog, "conv2d", "conv", std::vector({"c", "weights"}), + std::vector({"f"})); + } + SetOp(&prog, "elementwise_add", "eltwise", + std::vector({"f", "eltwise_bias"}), + std::vector({"g"})); + + return prog; +} + +void InitTensorHolder(Scope* scope, const paddle::platform::Place& place, + const char* var_name) { + auto x = scope->Var(var_name); + auto tensor = x->GetMutable(); + tensor->mutable_data(place, proto::VarType::FP32, + ::paddle::memory::Allocator::kDefault, 1); +} + +void MainTest(bool convWithExistingBias) { + auto prog = BuildProgramDesc(convWithExistingBias); + std::unique_ptr graph(new ir::Graph(prog)); + auto place = paddle::platform::CPUPlace(); + NaiveExecutor exe{place}; + Scope scope; + // Init scope, as it is used in pass + exe.CreateVariables(prog, 0, true, &scope); + if (convWithExistingBias) { + InitTensorHolder(&scope, place, "conv_bias"); + InitTensorHolder(&scope, place, "eltwise_bias"); + } + graph->Set(kParamScopeAttr, new framework::Scope*(&scope)); + + auto pass = PassRegistry::Instance().Get("conv_bias_mkldnn_fuse_pass"); + + int original_nodes_num = graph->Nodes().size(); + + graph.reset(pass->Apply(graph.release())); + + int current_nodes_num = graph->Nodes().size(); + + // Remove 3 Nodes: Conv, Bias, conv_out + // Add 1 Node: ConvBias + EXPECT_EQ(original_nodes_num - 2, current_nodes_num); + + // Assert conv_bias op in newly generated graph + int conv_bias_count = 0; + + for (auto* node : graph->Nodes()) { + if (node->IsOp() && node->Op()->Type() == "conv2d") { + auto* op = node->Op(); + ASSERT_TRUE(op->HasAttr("use_mkldnn")); + EXPECT_TRUE(boost::get(op->GetAttr("use_mkldnn"))); + // check if "conv" convolution is fused + auto op_name = boost::get(op->GetAttr("name")); + if (op_name == "conv") { + auto input_names = op->InputNames(); + ASSERT_TRUE(std::find(input_names.begin(), input_names.end(), "Bias") != + input_names.end()); + auto bias = boost::get>(op->Input("Bias")); + if (bias.size()) { + ++conv_bias_count; + } + } + } + } + EXPECT_EQ(conv_bias_count, 1); +} + +TEST(ConvBiasFusePass, bias_free_conv) { MainTest(false); } + +TEST(ConvBiasFusePass, conv_with_existing_bias) { MainTest(true); } + +TEST(ConvBiasFusePass, conv3d) { + Conv3DBiasFusePass pass; + ASSERT_TRUE(pass.is_conv3d()); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +USE_PASS(conv_bias_mkldnn_fuse_pass); diff --git a/paddle/fluid/framework/ir/mkldnn/conv_elementwise_add_mkldnn_fuse_pass.cc b/paddle/fluid/framework/ir/mkldnn/conv_elementwise_add_mkldnn_fuse_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..ef7874c1c0b21f7c4ce4a2883e6b8e3ba49bf2f7 --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn/conv_elementwise_add_mkldnn_fuse_pass.cc @@ -0,0 +1,345 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/mkldnn/conv_elementwise_add_mkldnn_fuse_pass.h" +#include +#include +#include +#include +#include +#include "paddle/fluid/framework/ir/graph_traits.h" + +namespace paddle { +namespace framework { +namespace ir { + +bool IsReachable(ir::Graph* graph, Node* from, Node* to) { + auto find_node = [](ir::Graph* graph, const Node* node) -> Node* { + for (auto n : graph->Nodes()) { + if (n == node) { + return n; + } + } + + return nullptr; + }; + + if (from == to) { + return true; + } + + std::map visited; + + for (auto& node : GraphTraits::DFS(*graph)) { + visited[&node] = false; + } + + visited[from] = true; + + std::list queue; + queue.push_back(from); + + while (!queue.empty()) { + auto cur = find_node(graph, queue.front()); + queue.pop_front(); + + if (!cur) return false; + + for (auto n : cur->outputs) { + if (n == to) { + return true; + } + + if (!visited[n]) { + visited[n] = true; + queue.push_back(n); + } + } + } + return false; +} + +template +boost::optional HasAttribute(const Node& op, const std::string& attr) { + if (op.Op()->HasAttr(attr)) + return boost::get(op.Op()->GetAttr(attr)); + else + return boost::none; +} + +ResidualConnectionMKLDNNFusePass::IdentityFuseHandle::IdentityFuseHandle( + const ResidualConnectionMKLDNNFusePass::CanFuseFunc& can_fuse_func, + const ResidualConnectionMKLDNNFusePass::IdentityConvFunc& + get_node_from_conv_op, + const ResidualConnectionMKLDNNFusePass::IdentityElementwiseAddFunc& + get_node_from_elementwise_add_op) + : fusion_stats{std::make_shared(0)}, + can_fuse_func{can_fuse_func}, + get_node_from_conv_op{get_node_from_conv_op}, + get_node_from_elementwise_add_op{get_node_from_elementwise_add_op} {} + +void ResidualConnectionMKLDNNFusePass::IdentityFuseHandle::operator()( + const GraphPatternDetector::subgraph_t& subgraph, Graph* graph) { + Node* conv_op; + Node* conv_input; + Node* conv_filter; + Node* conv_output; + + Node* elementwise_add_op; + Node* elementwise_add_identity; + Node* elementwise_add_out; + + std::tie(conv_op, conv_input, conv_filter, conv_output) = + get_node_from_conv_op(subgraph); + std::tie(elementwise_add_op, elementwise_add_identity, elementwise_add_out) = + get_node_from_elementwise_add_op(subgraph); + + if (!can_fuse_func(conv_op, elementwise_add_op)) return; + + if (!IsReachable(graph, elementwise_add_identity, conv_output)) return; + + auto fuse_relu = HasAttribute(*conv_op, "fuse_relu"); + if (fuse_relu && *fuse_relu) return; + + conv_op->Op()->SetInput("ResidualData", {elementwise_add_identity->Name()}); + conv_op->Op()->SetOutput("Output", {elementwise_add_out->Name()}); + conv_op->Op()->SetAttr("fuse_residual_connection", true); + + GraphSafeRemoveNodes(graph, {conv_output, elementwise_add_op}); + + IR_NODE_LINK_TO(elementwise_add_identity, conv_op); + IR_NODE_LINK_TO(conv_op, elementwise_add_out); + + (*fusion_stats)++; +} + +ResidualConnectionMKLDNNFusePass::ProjectionFuseHandle::ProjectionFuseHandle( + const ResidualConnectionMKLDNNFusePass::CanFuseFunc& can_fuse_func, + const ResidualConnectionMKLDNNFusePass::ProjectionConvFunc& + get_node_from_conv_x_op, + const ResidualConnectionMKLDNNFusePass::ProjectionConvFunc& + get_node_from_conv_y_op, + const ResidualConnectionMKLDNNFusePass::ProjectionElementwiseAddFunc& + get_node_from_elementwise_add_op) + : fusion_stats{std::make_shared(0)}, + can_fuse_func{can_fuse_func}, + get_node_from_conv_x_op{get_node_from_conv_x_op}, + get_node_from_conv_y_op{get_node_from_conv_y_op}, + get_node_from_elementwise_add_op{get_node_from_elementwise_add_op} {} + +void ResidualConnectionMKLDNNFusePass::ProjectionFuseHandle::operator()( + const GraphPatternDetector::subgraph_t& subgraph, Graph* graph) { + Node* conv_x_op; + Node* conv_x_input; + Node* conv_x_filter; + Node* conv_x_output; + + Node* conv_y_op; + Node* conv_y_input; + Node* conv_y_filter; + Node* conv_y_output; + + Node* elementwise_add_op; + Node* elementwise_add_out; + + std::tie(conv_x_op, conv_x_input, conv_x_filter, conv_x_output) = + get_node_from_conv_x_op(subgraph); + std::tie(conv_y_op, conv_y_input, conv_y_filter, conv_y_output) = + get_node_from_conv_y_op(subgraph); + std::tie(elementwise_add_op, elementwise_add_out) = + get_node_from_elementwise_add_op(subgraph); + + if (!can_fuse_func(conv_x_op, elementwise_add_op)) return; + if (!can_fuse_func(conv_y_op, elementwise_add_op)) return; + + Node* projection_node; + Node* residual_conv_op; + Node* residual_conv_output; + + if (IsReachable(graph, conv_x_input, conv_y_output)) { + projection_node = conv_x_output; + residual_conv_op = conv_y_op; + residual_conv_output = conv_y_output; + } else if (IsReachable(graph, conv_y_input, conv_x_output)) { + projection_node = conv_y_output; + residual_conv_op = conv_x_op; + residual_conv_output = conv_x_output; + } else { + return; + } + + auto fuse_relu = HasAttribute(*residual_conv_op, "fuse_relu"); + if (fuse_relu && *fuse_relu) return; + + residual_conv_op->Op()->SetInput("ResidualData", {projection_node->Name()}); + residual_conv_op->Op()->SetOutput("Output", {elementwise_add_out->Name()}); + + residual_conv_op->Op()->SetAttr("fuse_residual_connection", true); + + GraphSafeRemoveNodes(graph, {residual_conv_output, elementwise_add_op}); + + IR_NODE_LINK_TO(projection_node, residual_conv_op); + IR_NODE_LINK_TO(residual_conv_op, elementwise_add_out); + + (*fusion_stats)++; +} + +std::tuple +ResidualConnectionMKLDNNFusePass::GetNodesFromConv( + const patterns::Conv& conv_pattern, + const GraphPatternDetector::subgraph_t& subgraph) const { + GET_IR_NODE_FROM_SUBGRAPH(conv_op, conv_op, conv_pattern); + GET_IR_NODE_FROM_SUBGRAPH(conv_input, conv_input, conv_pattern); + GET_IR_NODE_FROM_SUBGRAPH(conv_filter, conv_filter, conv_pattern); + GET_IR_NODE_FROM_SUBGRAPH(conv_output, conv_output, conv_pattern); + + return std::make_tuple(conv_op, conv_input, conv_filter, conv_output); +} + +GraphWithStats ResidualConnectionMKLDNNFusePass::FuseConvAsX( + const std::string& name_scope, + const GraphWithStats& graph_with_stats) const { + ir::Graph* graph; + int stats; + + std::tie(graph, stats) = graph_with_stats; + + GraphPatternDetector gpd; + auto pattern = gpd.mutable_pattern(); + + patterns::Conv conv_pattern{pattern, name_scope}; + auto conv_output = conv_pattern(); + + patterns::ElementwiseAdd elementwise_add_pattern{pattern, name_scope}; + elementwise_add_pattern( + conv_output, + pattern->NewNode(elementwise_add_pattern.elementwise_add_y_repr())); + conv_output->AsIntermediate(); + + auto get_node_from_elementwise_add = [&elementwise_add_pattern]( + const GraphPatternDetector::subgraph_t& subgraph) + -> std::tuple { + GET_IR_NODE_FROM_SUBGRAPH(elementwise_add_op, elementwise_add_op, + elementwise_add_pattern); + GET_IR_NODE_FROM_SUBGRAPH(elementwise_add_y, elementwise_add_y, + elementwise_add_pattern); + GET_IR_NODE_FROM_SUBGRAPH(elementwise_add_out, elementwise_add_out, + elementwise_add_pattern); + + return std::make_tuple(elementwise_add_op, elementwise_add_y, + elementwise_add_out); + }; + + return ExecuteHandleOnGraph( + &gpd, graph_with_stats, + [this, &conv_pattern](const GraphPatternDetector::subgraph_t& subgraph) { + return GetNodesFromConv(conv_pattern, subgraph); + }, + get_node_from_elementwise_add); +} + +GraphWithStats ResidualConnectionMKLDNNFusePass::FuseConvAsY( + const std::string& name_scope, + const GraphWithStats& graph_with_stats) const { + GraphPatternDetector gpd; + auto pattern = gpd.mutable_pattern(); + + patterns::Conv conv_pattern{pattern, name_scope}; + auto conv_output = conv_pattern(); + + patterns::ElementwiseAdd elementwise_add_pattern{pattern, name_scope}; + elementwise_add_pattern( + pattern->NewNode(elementwise_add_pattern.elementwise_add_x_repr()), + conv_output); + conv_output->AsIntermediate(); + + auto get_node_from_elementwise_add = [&elementwise_add_pattern]( + const GraphPatternDetector::subgraph_t& subgraph) + -> std::tuple { + GET_IR_NODE_FROM_SUBGRAPH(elementwise_add_op, elementwise_add_op, + elementwise_add_pattern); + GET_IR_NODE_FROM_SUBGRAPH(elementwise_add_x, elementwise_add_x, + elementwise_add_pattern); + GET_IR_NODE_FROM_SUBGRAPH(elementwise_add_out, elementwise_add_out, + elementwise_add_pattern); + + return std::make_tuple(elementwise_add_op, elementwise_add_x, + elementwise_add_out); + }; + + return ExecuteHandleOnGraph( + &gpd, graph_with_stats, + [this, &conv_pattern](const GraphPatternDetector::subgraph_t& subgraph) { + return GetNodesFromConv(conv_pattern, subgraph); + }, + get_node_from_elementwise_add); +} + +GraphWithStats ResidualConnectionMKLDNNFusePass::FuseProjectionConv( + const std::string& name_scope, + const GraphWithStats& graph_with_stats) const { + GraphPatternDetector gpd; + auto pattern = gpd.mutable_pattern(); + + patterns::Conv conv_x_pattern{pattern, name_scope}; + auto conv_x_output = conv_x_pattern(); + + patterns::Conv conv_y_pattern{pattern, name_scope}; + auto conv_y_output = conv_y_pattern(); + + patterns::ElementwiseAdd elementwise_add_pattern{pattern, name_scope}; + elementwise_add_pattern(conv_x_output, conv_y_output); + conv_x_output->AsIntermediate(); + conv_y_output->AsIntermediate(); + + auto get_node_from_elementwise_add = [&elementwise_add_pattern]( + const GraphPatternDetector::subgraph_t& subgraph) + -> std::tuple { + GET_IR_NODE_FROM_SUBGRAPH(elementwise_add_op, elementwise_add_op, + elementwise_add_pattern); + GET_IR_NODE_FROM_SUBGRAPH(elementwise_add_out, elementwise_add_out, + elementwise_add_pattern); + + return std::make_tuple(elementwise_add_op, elementwise_add_out); + }; + + return ExecuteHandleOnGraph( + &gpd, graph_with_stats, + [this, + &conv_x_pattern](const GraphPatternDetector::subgraph_t& subgraph) { + return GetNodesFromConv(conv_x_pattern, subgraph); + }, + [this, + &conv_y_pattern](const GraphPatternDetector::subgraph_t& subgraph) { + return GetNodesFromConv(conv_y_pattern, subgraph); + }, + get_node_from_elementwise_add); +} + +void ResidualConnectionMKLDNNFusePass::ApplyImpl(graph_ptr graph) const { + FusePassBase::Init(name_scope_, graph); + auto fused_graph_with_stats = FuseConvAsY( + name_scope_, + FuseConvAsX(name_scope_, + FuseProjectionConv(name_scope_, std::make_pair(graph, 0)))); + + std::cout << "Fused graph " << fused_graph_with_stats.second << std::endl; + AddStatis(fused_graph_with_stats.second); +} +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(conv_elementwise_add_mkldnn_fuse_pass, + paddle::framework::ir::ResidualConnectionMKLDNNFusePass); diff --git a/paddle/fluid/framework/ir/mkldnn/conv_elementwise_add_mkldnn_fuse_pass.h b/paddle/fluid/framework/ir/mkldnn/conv_elementwise_add_mkldnn_fuse_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..9bf1ae607937f0cae2fd312b0f6c7f7e14bd8fbf --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn/conv_elementwise_add_mkldnn_fuse_pass.h @@ -0,0 +1,134 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" + +#include + +namespace paddle { +namespace framework { +namespace ir { + +using graph_ptr = ir::Graph*; +using GraphWithStats = std::pair; + +void CorrectGraphEdges(Graph* graph, Node* from, Node* to); +bool IsReachable(ir::Graph* graph, Node* from, Node* to); +boost::optional HasBias(const Node& op, const std::string& bias_name); + +class ResidualConnectionMKLDNNFusePass : public FusePassBase { + private: + GraphWithStats FuseConvAsX(const std::string& name_scope, + const GraphWithStats& graph_with_stats) const; + GraphWithStats FuseConvAsY(const std::string& name_scope, + const GraphWithStats& graph_with_stats) const; + GraphWithStats FuseProjectionConv( + const std::string& name_scope, + const GraphWithStats& graph_with_stats) const; + + template + using GetNodeFunc = + std::function; + using IdentityConvFunc = GetNodeFunc>; + using IdentityElementwiseAddFunc = + GetNodeFunc>; + + using ProjectionConvFunc = IdentityConvFunc; + using ProjectionElementwiseAddFunc = GetNodeFunc>; + + using CanFuseFunc = std::function; + + std::tuple GetNodesFromConv( + const patterns::Conv& conv_pattern, + const GraphPatternDetector::subgraph_t& subgraph) const; + + std::tuple GetNodesFromProjectionConv( + const patterns::Conv& conv_pattern, + const GraphPatternDetector::subgraph_t& subgraph) const; + + template + GraphWithStats ExecuteHandleOnGraph(GraphPatternDetector* gpd, + const GraphWithStats& graph_with_stats, + OpFuncs&&... op_funcs) const { + ir::Graph* graph; + int stats; + + std::tie(graph, stats) = graph_with_stats; + + auto can_fuse = [this](Node* op1, Node* op2) -> bool { + return this->FindFuseOption(*op1, *op2) == FUSE_MKLDNN; + }; + + auto fuse_handle = HandleType{can_fuse, std::forward(op_funcs)...}; + + (*gpd)(graph, fuse_handle); + + return std::make_pair(graph, stats + fuse_handle.get_stats()); + } + + struct IdentityFuseHandle { + IdentityFuseHandle( + const CanFuseFunc& can_fuse_func, + const IdentityConvFunc& get_node_from_conv_op, + const IdentityElementwiseAddFunc& get_node_from_elementwise_add_op); + + void operator()(const GraphPatternDetector::subgraph_t& subgraph, + Graph* graph); + int get_stats() const { return *fusion_stats; } + + private: + std::shared_ptr fusion_stats; + CanFuseFunc can_fuse_func; + IdentityConvFunc get_node_from_conv_op; + IdentityElementwiseAddFunc get_node_from_elementwise_add_op; + }; + + struct ProjectionFuseHandle { + ProjectionFuseHandle( + const CanFuseFunc& can_fuse_func, + const ProjectionConvFunc& get_node_from_conv_x_op, + const ProjectionConvFunc& get_node_from_conv_y_op, + const ProjectionElementwiseAddFunc& get_node_from_elementwise_add_op); + + void operator()(const GraphPatternDetector::subgraph_t& subgraph, + Graph* graph); + int get_stats() const { return *fusion_stats; } + + private: + std::shared_ptr fusion_stats; + CanFuseFunc can_fuse_func; + ProjectionConvFunc get_node_from_conv_x_op; + ProjectionConvFunc get_node_from_conv_y_op; + ProjectionElementwiseAddFunc get_node_from_elementwise_add_op; + }; + + public: + virtual ~ResidualConnectionMKLDNNFusePass() {} + + protected: + void ApplyImpl(graph_ptr graph) const; + + const std::string name_scope_{"residual_connection_fuse_pass"}; +}; +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/mkldnn/conv_elementwise_add_mkldnn_fuse_pass_tester.cc b/paddle/fluid/framework/ir/mkldnn/conv_elementwise_add_mkldnn_fuse_pass_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..8a13596cd50087475bf12b6cfa5920b82e24de31 --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn/conv_elementwise_add_mkldnn_fuse_pass_tester.cc @@ -0,0 +1,274 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include + +#include "paddle/fluid/framework/ir/graph_traits.h" +#include "paddle/fluid/framework/ir/mkldnn/conv_elementwise_add_mkldnn_fuse_pass.h" + +namespace paddle { +namespace framework { +namespace ir { + +namespace { +constexpr int nodes_removed = 3; +constexpr int nodes_added = 1; + +void SetOp(ProgramDesc* prog, const std::string& type, + const std::vector>& inputs, + const std::pair& output) { + auto op = prog->MutableBlock(0)->AppendOp(); + op->SetType(type); + op->SetAttr("use_mkldnn", true); + + for (const auto& input : inputs) { + op->SetInput(input.first, {input.second}); + } + + op->SetOutput(output.first, {output.second}); +} + +struct TestIsReachable { + using func = std::function; + + auto operator()(const std::unique_ptr& graph) -> func { + auto hash = [](const Node* node) -> std::string { + return node->Name() + std::to_string(node->id()); + }; + + auto find_node = [&](const std::unique_ptr& graph, + const std::string& name) -> Node* { + for (auto& node : GraphTraits::DFS(*graph)) { + if (name == hash(&node)) { + return &node; + } + } + + return nullptr; + }; + + // update the from and to strings to hashed equivs in loop from graph traits + return [&](std::string from, std::string to) -> bool { + if (from == to) return true; + + std::map visited; + + for (auto& node : GraphTraits::DFS(*graph)) { + auto hashed = hash(&node); + if (node.Name() == from) from = hashed; + if (node.Name() == to) to = hashed; + visited[hashed] = false; + } + + visited[from] = true; + + std::list queue; + queue.push_back(from); + + while (!queue.empty()) { + auto cur = find_node(graph, queue.front()); + queue.pop_front(); + if (cur == nullptr) return false; + + for (auto n : cur->outputs) { + auto hashed_name = hash(n); + if (hashed_name == to) return true; + + if (!visited[hashed_name]) { + visited[hashed_name] = true; + queue.push_back(hashed_name); + } + } + } + return false; + }; + } +}; + +void AssertOpsCount(const std::unique_ptr& graph, + int expected_conv_count, + int expected_elementwise_add_count = 0) { + int conv_count = 0; + int elementwise_add_count = 0; + + for (auto* node : graph->Nodes()) { + if (node->IsOp() && node->Op()->Type() == "conv2d") { + ++conv_count; + } + if (node->IsOp() && node->Op()->Type() == "elementwise_add") { + ++elementwise_add_count; + } + } + EXPECT_EQ(conv_count, expected_conv_count); + EXPECT_EQ(elementwise_add_count, expected_elementwise_add_count); +} + +ProgramDesc BuildProgramDesc(const std::vector& transient_vars, + const std::vector& persistent_vars) { + ProgramDesc prog; + + auto add_var_to_prog = [&prog](const std::string& var_name) -> VarDesc* { + auto var = prog.MutableBlock(0)->Var(var_name); + var->SetType(proto::VarType::LOD_TENSOR); + + return var; + }; + + for (const auto& v : transient_vars) { + add_var_to_prog(v); + } + + for (const auto& v : persistent_vars) { + auto var = add_var_to_prog(v); + var->SetPersistable(true); + } + + return prog; +} + +void RunPassAndAssert(ProgramDesc* prog, const std::string& from, + const std::string& to, int expected_conv_num) { + std::unique_ptr graph(new ir::Graph(*prog)); + + TestIsReachable is_reachable; + EXPECT_TRUE(is_reachable(graph)(from, to)); + + auto pass = + PassRegistry::Instance().Get("conv_elementwise_add_mkldnn_fuse_pass"); + int original_nodes_num = graph->Nodes().size(); + graph.reset(pass->Apply(graph.release())); + int current_nodes_num = graph->Nodes().size(); + + EXPECT_TRUE(is_reachable(graph)(from, to)); + + EXPECT_EQ(original_nodes_num - nodes_removed + nodes_added, + current_nodes_num); + + AssertOpsCount(graph, expected_conv_num); +} +} // namespace + +TEST(ConvElementwiseAddMKLDNNFusePass, ConvolutionAsYWithElementwiseAddRelu) { + auto prog = BuildProgramDesc({"a", "b", "c", "d", "e"}, {"bias", "weights"}); + + SetOp(&prog, "sigmoid", {{"X", "a"}}, {"Out", "b"}); + SetOp(&prog, "conv2d", + {{"Input", "b"}, {"Bias", "bias"}, {"Filter", "weights"}}, + {"Output", "c"}); + + SetOp(&prog, "elementwise_add", {{"X", "a"}, {"Y", "c"}}, {"Out", "d"}); + SetOp(&prog, "relu", {{"X", "d"}}, {"Out", "e"}); + + RunPassAndAssert(&prog, "a", "relu", 1); +} + +TEST(ConvElementwiseAddMKLDNNFusePass, + ConvolutionProjectionAsYWithElementwiseAddRelu) { + auto prog = BuildProgramDesc({"a", "b", "c", "d", "e", "f"}, + {"bias", "weights", "bias2", "weights2"}); + + SetOp(&prog, "sigmoid", {{"X", "a"}}, {"Out", "b"}); + // right branch + SetOp(&prog, "conv2d", + {{"Input", "b"}, {"Bias", "bias"}, {"Filter", "weights"}}, + {"Output", "c"}); + + // left branch + SetOp(&prog, "conv2d", + {{"Input", "a"}, {"Bias", "bias2"}, {"Filter", "weights2"}}, + {"Output", "f"}); + + SetOp(&prog, "elementwise_add", {{"X", "f"}, {"Y", "c"}}, {"Out", "d"}); + SetOp(&prog, "relu", {{"X", "d"}}, {"Out", "e"}); + + RunPassAndAssert(&prog, "a", "relu", 2); +} + +TEST(ConvElementwiseAddMKLDNNFusePass, + ConvolutionAsYWithElementwiseAddReluNoBias) { + auto prog = BuildProgramDesc({"a", "b", "c", "d", "e"}, {"weights"}); + + SetOp(&prog, "sigmoid", {{"X", "a"}}, {"Out", "b"}); + SetOp(&prog, "conv2d", {{"Input", "b"}, {"Filter", "weights"}}, + {"Output", "c"}); + SetOp(&prog, "elementwise_add", {{"X", "a"}, {"Y", "c"}}, {"Out", "d"}); + SetOp(&prog, "relu", {{"X", "d"}}, {"Out", "e"}); + + RunPassAndAssert(&prog, "a", "relu", 1); +} + +TEST(ConvElementwiseAddMKLDNNFusePass, ConvolutionAsXWithElementwiseAddRelu) { + auto prog = BuildProgramDesc({"a", "b", "c", "d", "e"}, {"bias", "weights"}); + + SetOp(&prog, "sigmoid", {{"X", "a"}}, {"Out", "b"}); + SetOp(&prog, "conv2d", + {{"Input", "b"}, {"Bias", "bias"}, {"Filter", "weights"}}, + {"Output", "c"}); + + SetOp(&prog, "elementwise_add", {{"X", "c"}, {"Y", "a"}}, {"Out", "d"}); + SetOp(&prog, "relu", {{"X", "d"}}, {"Out", "e"}); + + RunPassAndAssert(&prog, "a", "relu", 1); +} + +TEST(ConvElementwiseAddMKLDNNFusePass, + ConvolutionAsXWithElementwiseAddReluNoBias) { + auto prog = BuildProgramDesc({"a", "b", "c", "d", "e"}, {"weights"}); + + SetOp(&prog, "sigmoid", {{"X", "a"}}, {"Out", "b"}); + SetOp(&prog, "conv2d", {{"Input", "b"}, {"Filter", "weights"}}, + {"Output", "c"}); + SetOp(&prog, "elementwise_add", {{"X", "c"}, {"Y", "a"}}, {"Out", "d"}); + SetOp(&prog, "relu", {{"X", "d"}}, {"Out", "e"}); + + RunPassAndAssert(&prog, "a", "relu", 1); +} + +TEST(ConvElementwiseAddMKLDNNFusePass, NoFusion) { + auto prog = + BuildProgramDesc({"a", "b", "c", "d", "e", "f", "g"}, {"weights"}); + + SetOp(&prog, "sigmoid", {{"X", "a"}}, {"Out", "b"}); + SetOp(&prog, "conv2d", {{"Input", "b"}, {"Filter", "weights"}}, + {"Output", "c"}); + + SetOp(&prog, "conv2d", {{"Input", "d"}, {"Filter", "weights"}}, + {"Output", "e"}); + + SetOp(&prog, "elementwise_add", {{"X", "c"}, {"Y", "e"}}, {"Out", "f"}); + SetOp(&prog, "relu", {{"X", "f"}}, {"Out", "g"}); + + std::unique_ptr graph(new ir::Graph(prog)); + + TestIsReachable is_reachable; + EXPECT_TRUE(is_reachable(graph)("a", "g")); + + auto pass = + PassRegistry::Instance().Get("conv_elementwise_add_mkldnn_fuse_pass"); + int original_nodes_num = graph->Nodes().size(); + graph.reset(pass->Apply(graph.release())); + int current_nodes_num = graph->Nodes().size(); + + EXPECT_TRUE(is_reachable(graph)("a", "g")); + EXPECT_EQ(original_nodes_num, current_nodes_num); + + AssertOpsCount(graph, 2, 1); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +USE_PASS(conv_elementwise_add_mkldnn_fuse_pass); diff --git a/paddle/fluid/framework/ir/mkldnn/conv_relu_mkldnn_fuse_pass.cc b/paddle/fluid/framework/ir/mkldnn/conv_relu_mkldnn_fuse_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..dd0fb456040fcf4e135333f938f8e3bdb18b7bcf --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn/conv_relu_mkldnn_fuse_pass.cc @@ -0,0 +1,76 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/mkldnn/conv_relu_mkldnn_fuse_pass.h" +#include +#include +#include "paddle/fluid/platform/enforce.h" + +namespace paddle { +namespace framework { +namespace ir { + +void ConvReLUFusePass::ApplyImpl(ir::Graph* graph) const { + PADDLE_ENFORCE(graph); + FusePassBase::Init("conv_relu_mkldnn_fuse", graph); + + GraphPatternDetector gpd; + auto* conv_input = gpd.mutable_pattern() + ->NewNode("conv_relu_mkldnn_fuse/conv_input") + ->AsInput() + ->assert_is_op_input("conv2d", "Input"); + patterns::ConvReLU conv_relu_pattern(gpd.mutable_pattern(), + "conv_relu_mkldnn_fuse"); + conv_relu_pattern(conv_input); + + int found_conv_relu_count = 0; + auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* g) { + VLOG(4) << "handle ConvReLU fuse"; + GET_IR_NODE_FROM_SUBGRAPH(conv_weight, conv_weight, + conv_relu_pattern); // Filter + GET_IR_NODE_FROM_SUBGRAPH(conv_out, conv_out, conv_relu_pattern); // tmp + GET_IR_NODE_FROM_SUBGRAPH(conv, conv, conv_relu_pattern); // CONV op + GET_IR_NODE_FROM_SUBGRAPH(relu_out, relu_out, conv_relu_pattern); // Out + GET_IR_NODE_FROM_SUBGRAPH(relu, relu, conv_relu_pattern); // ReLU op + + FuseOptions fuse_option = FindFuseOption(*conv, *relu); + if (fuse_option == DO_NOT_FUSE) { + VLOG(3) << "do not perform conv+relu fuse"; + return; + } + + // Transform Conv node into ConvReLU node. + OpDesc* desc = conv->Op(); + desc->SetOutput("Output", std::vector({relu_out->Name()})); + desc->SetAttr("fuse_relu", true); + GraphSafeRemoveNodes(graph, {relu, conv_out}); + + PADDLE_ENFORCE(subgraph.count(conv_input)); + IR_NODE_LINK_TO(conv, relu_out); + + found_conv_relu_count++; + }; + + gpd(graph, handler); + + AddStatis(found_conv_relu_count); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(conv_relu_mkldnn_fuse_pass, + paddle::framework::ir::ConvReLUFusePass); diff --git a/paddle/fluid/framework/ir/mkldnn/conv_relu_mkldnn_fuse_pass.h b/paddle/fluid/framework/ir/mkldnn/conv_relu_mkldnn_fuse_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..2174c22dbf53790015be4c651b6e0c40b8e159fb --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn/conv_relu_mkldnn_fuse_pass.h @@ -0,0 +1,39 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" +#include "paddle/fluid/framework/ir/pass.h" + +namespace paddle { +namespace framework { +namespace ir { + +/* + * Fuse the CONV and ReLU to a ConvReLUOp. + */ +class ConvReLUFusePass : public FusePassBase { + public: + virtual ~ConvReLUFusePass() {} + + protected: + void ApplyImpl(ir::Graph* graph) const override; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/mkldnn/conv_relu_mkldnn_fuse_pass_tester.cc b/paddle/fluid/framework/ir/mkldnn/conv_relu_mkldnn_fuse_pass_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..67a9957059a501f39f20c1de2ae17cafbe51a53a --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn/conv_relu_mkldnn_fuse_pass_tester.cc @@ -0,0 +1,127 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/mkldnn/conv_relu_mkldnn_fuse_pass.h" + +#include +#include "paddle/fluid/framework/op_proto_maker.h" + +namespace paddle { +namespace framework { +namespace ir { + +void SetOp(ProgramDesc* prog, const std::string& type, const std::string& name, + const std::vector& inputs, + const std::vector& outputs, bool use_mkldnn = false) { + auto* op = prog->MutableBlock(0)->AppendOp(); + op->SetType(type); + if (type == "conv2d") { + op->SetAttr("use_mkldnn", use_mkldnn); + op->SetAttr("name", name); + op->SetInput("Input", {inputs[0]}); + op->SetInput("Filter", {inputs[1]}); + op->SetInput("Bias", {inputs[2]}); + } else if (type == "relu") { + op->SetAttr("use_mkldnn", use_mkldnn); + op->SetInput("X", inputs); + } + op->SetOutput("Out", outputs); + op->SetAttr(OpProtoAndCheckerMaker::OpRoleAttrName(), + static_cast(OpRole::kForward)); +} + +// a->OP0->b +// b->OP1->c +// (c, weights, bias)->conv->f +// (f)->relu->g +ProgramDesc BuildProgramDesc() { + ProgramDesc prog; + for (auto& v : + std::vector({"a", "b", "c", "weights", "bias", "f", "g", + "h", "weights2", "bias2", "k", "l"})) { + auto* var = prog.MutableBlock(0)->Var(v); + var->SetType(proto::VarType::SELECTED_ROWS); + if (v == "weights" || v == "bias") { + var->SetPersistable(true); + } + } + + SetOp(&prog, "OP0", "op0", std::vector({"a"}), + std::vector({"b"})); + SetOp(&prog, "OP1", "op1", std::vector({"b"}), + std::vector({"c"})); + // conv+relu, both with MKL-DNN + SetOp(&prog, "conv2d", "conv1", + std::vector({"c", "weights", "bias"}), + std::vector({"f"}), true); + SetOp(&prog, "relu", "relu1", std::vector({"f"}), + std::vector({"g"}), true); + SetOp(&prog, "OP3", "op3", std::vector({"g"}), + std::vector({"h"})); + // conv+relu, only one with MKL-DNN + SetOp(&prog, "conv2d", "conv2", + std::vector({"h", "weights2", "bias2"}), + std::vector({"k"}), true); + SetOp(&prog, "relu", "relu2", std::vector({"k"}), + std::vector({"l"})); + + return prog; +} + +TEST(ConvReLUFusePass, basic) { + auto prog = BuildProgramDesc(); + + std::unique_ptr graph(new ir::Graph(prog)); + + auto pass = PassRegistry::Instance().Get("conv_relu_mkldnn_fuse_pass"); + + int original_nodes_num = graph->Nodes().size(); + + graph.reset(pass->Apply(graph.release())); + + int current_nodes_num = graph->Nodes().size(); + + // Remove 3 Nodes: CONV, RELU, conv_out + // Add 1 Node: ConvReLU + EXPECT_EQ(original_nodes_num - 2, current_nodes_num); + + // Assert conv_relu op in newly generated graph + int conv_relu_count = 0; + + for (auto* node : graph->Nodes()) { + if (node->IsOp() && node->Op()->Type() == "conv2d") { + auto* op = node->Op(); + ASSERT_TRUE(op->HasAttr("use_mkldnn")); + EXPECT_TRUE(boost::get(op->GetAttr("use_mkldnn"))); + // check if only "conv1" convolution is fused + auto op_name = boost::get(op->GetAttr("name")); + if (op_name == "conv1") { + ASSERT_TRUE(op->HasAttr("fuse_relu")); + bool fuse_relu = boost::get(op->GetAttr("fuse_relu")); + if (fuse_relu) { + ++conv_relu_count; + } + } else if (op_name == "conv2") { + ASSERT_FALSE(op->HasAttr("fuse_relu")); + } + } + } + EXPECT_EQ(conv_relu_count, 1); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +USE_PASS(conv_relu_mkldnn_fuse_pass); diff --git a/paddle/fluid/framework/ir/mkldnn/cpu_quantize_pass.cc b/paddle/fluid/framework/ir/mkldnn/cpu_quantize_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..dff98e523ac45ef79f3e8fd020ecd6cd7035cf92 --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn/cpu_quantize_pass.cc @@ -0,0 +1,236 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/mkldnn/cpu_quantize_pass.h" +#include +#include +#include "paddle/fluid/framework/eigen.h" +#include "paddle/fluid/string/pretty_log.h" + +namespace paddle { +namespace framework { +namespace ir { + +namespace { + +void UnlinkNodes(ir::Node* a, ir::Node* b) { + a->outputs.erase(std::remove(a->outputs.begin(), a->outputs.end(), b), + a->outputs.end()); + b->inputs.erase(std::remove(b->inputs.begin(), b->inputs.end(), a), + b->inputs.end()); +} + +} // namespace + +enum { U8_MAX = 255, S8_MAX = 127 }; + +using EigenVectorArrayMap = Eigen::Map>; +using string::PrettyLogDetail; + +void CPUQuantizePass::QuantizeInput(Graph* g, Node* op, Node* input, + std::string input_name, double scale_to_one, + bool is_unsigned, + std::string scale_attr_name) const { + unsigned max = is_unsigned ? U8_MAX : S8_MAX; + float scale = scale_to_one * max; + + // Create quantize output variable + VarDesc quantize_out_desc(patterns::PDNodeName("quantize", "out")); + auto* quantize_out_node = g->CreateVarNode(&quantize_out_desc); + + // create a quantize op node + OpDesc q_desc; + q_desc.SetType("quantize"); + q_desc.SetInput("Input", std::vector({input->Name()})); + q_desc.SetOutput("Output", + std::vector({quantize_out_node->Name()})); + q_desc.SetAttr("Scale", scale); + q_desc.SetAttr("is_negative_input", !is_unsigned); + auto quantize_op = g->CreateOpNode(&q_desc); // OpDesc will be copied. + + // update op's input + op->Op()->SetInput(input_name, + std::vector({quantize_out_node->Name()})); + + // link quantize op + UnlinkNodes(input, op); + IR_NODE_LINK_TO(input, quantize_op); + IR_NODE_LINK_TO(quantize_op, quantize_out_node); + IR_NODE_LINK_TO(quantize_out_node, op); + + if (!scale_attr_name.empty()) op->Op()->SetAttr(scale_attr_name, scale); +} + +void CPUQuantizePass::DequantizeOutput(Graph* g, Node* op, Node* output, + std::string output_name, + double scale_to_one, bool is_unsigned, + std::string scale_attr_name) const { + unsigned max = is_unsigned ? U8_MAX : S8_MAX; + float scale = scale_to_one * max; + + // Create dequantize input variable + VarDesc dequantize_in_desc(patterns::PDNodeName("dequantize", "in")); + auto* dequantize_in_node = g->CreateVarNode(&dequantize_in_desc); + + // create a dequantize op node for output. + OpDesc deq_desc; + deq_desc.SetType("dequantize"); + deq_desc.SetInput("Input", + std::vector({dequantize_in_node->Name()})); + deq_desc.SetOutput("Output", std::vector({output->Name()})); + deq_desc.SetAttr("Scale", scale); + auto dequantize_op = g->CreateOpNode(&deq_desc); // OpDesc will be copied. + + // update op's output + op->Op()->SetOutput(output_name, + std::vector({dequantize_in_node->Name()})); + + // link dequantize op + UnlinkNodes(op, output); + IR_NODE_LINK_TO(op, dequantize_in_node); + IR_NODE_LINK_TO(dequantize_in_node, dequantize_op); + IR_NODE_LINK_TO(dequantize_op, output); + + if (!scale_attr_name.empty()) op->Op()->SetAttr(scale_attr_name, scale); +} + +void CPUQuantizePass::QuantizeConv(Graph* graph, + bool with_residual_data) const { + GraphPatternDetector gpd; + auto pattern = gpd.mutable_pattern(); + patterns::ConvResidual conv_pattern{pattern, name_scope_}; + conv_pattern(with_residual_data); + + int quantize_conv_count = 0; + auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* g) { + VLOG(4) << "Quantize conv2d op"; + GET_IR_NODE_FROM_SUBGRAPH(conv_op, conv_op, conv_pattern); + auto* conv_op_desc = conv_op->Op(); + + // skip if should not be quantized + if (!conv_op_desc->HasAttr("use_quantizer") || + !boost::get(conv_op_desc->GetAttr("use_quantizer"))) + return; + + GET_IR_NODE_FROM_SUBGRAPH(conv_filter, conv_filter, conv_pattern); + GET_IR_NODE_FROM_SUBGRAPH(conv_input, conv_input, conv_pattern); + GET_IR_NODE_FROM_SUBGRAPH(conv_output, conv_output, conv_pattern); + + // get scales calculated after warmup, they scale variables to MAX=1.0 + auto scales = Get("quant_var_scales"); + + auto input_scale = scales[conv_input->Name()].second.data()[0]; + bool is_input_unsigned = scales[conv_input->Name()].first; + QuantizeInput(g, conv_op, conv_input, "Input", input_scale, + is_input_unsigned, "Scale_in"); + + auto filter_scale_tensor = scales[conv_filter->Name()].second; + EigenVectorArrayMap eigen_tensor{filter_scale_tensor.data(), + filter_scale_tensor.numel(), 1}; + eigen_tensor *= static_cast(S8_MAX); + std::vector filter_scale{ + filter_scale_tensor.data(), + filter_scale_tensor.data() + filter_scale_tensor.numel()}; + + conv_op->Op()->SetAttr("Scale_weights", filter_scale); + + if (with_residual_data) { + GET_IR_NODE_FROM_SUBGRAPH(conv_residual_data, conv_residual_data, + conv_pattern); + auto residual_scale = + scales[conv_residual_data->Name()].second.data()[0]; + bool is_residual_unsigned = scales[conv_residual_data->Name()].first; + + QuantizeInput(g, conv_op, conv_residual_data, "ResidualData", + residual_scale, is_residual_unsigned, "Scale_in_eltwise"); + } + + auto output_scale = scales[conv_output->Name()].second.data()[0]; + bool is_output_unsigned = scales[conv_output->Name()].first; + DequantizeOutput(g, conv_op, conv_output, "Output", output_scale, + is_output_unsigned, "Scale_out"); + + ++quantize_conv_count; + }; + + gpd(graph, handler); + AddStatis(quantize_conv_count); + + std::stringstream msg_ss; + msg_ss << "--- quantized " << quantize_conv_count << " conv2d ops"; + if (with_residual_data) msg_ss << " with residual connection"; + PrettyLogDetail(msg_ss.str().c_str()); +} + +void CPUQuantizePass::QuantizePool(Graph* graph) const { + GraphPatternDetector gpd; + auto pattern = gpd.mutable_pattern(); + patterns::Pool pool_pattern{pattern, name_scope_}; + pool_pattern(); + + int quantize_pool_count = 0; + auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* g) { + VLOG(4) << "Quantize pool2d op"; + GET_IR_NODE_FROM_SUBGRAPH(pool_op, pool_op, pool_pattern); + auto* pool_op_desc = pool_op->Op(); + + // skip if should not be quantized + if (!pool_op_desc->HasAttr("use_quantizer") || + !boost::get(pool_op_desc->GetAttr("use_quantizer"))) + return; + + GET_IR_NODE_FROM_SUBGRAPH(pool_input, pool_input, pool_pattern); + GET_IR_NODE_FROM_SUBGRAPH(pool_output, pool_output, pool_pattern); + + // get scales calculated after warmup, they scale variables to MAX=1.0 + auto scales = Get("quant_var_scales"); + + auto input_scale = scales[pool_input->Name()].second.data()[0]; + bool is_input_unsigned = scales[pool_input->Name()].first; + QuantizeInput(g, pool_op, pool_input, "X", input_scale, is_input_unsigned); + + auto output_scale = scales[pool_output->Name()].second.data()[0]; + bool is_output_unsigned = scales[pool_output->Name()].first; + DequantizeOutput(g, pool_op, pool_output, "Out", output_scale, + is_output_unsigned); + + ++quantize_pool_count; + }; + + gpd(graph, handler); + AddStatis(quantize_pool_count); + + PrettyLogDetail("--- quantized %d pool2d ops", quantize_pool_count); +} + +void CPUQuantizePass::ApplyImpl(ir::Graph* graph) const { + VLOG(3) << "Quantizing the graph."; + PADDLE_ENFORCE(graph); + FusePassBase::Init(name_scope_, graph); + + PADDLE_ENFORCE(param_scope()); + + QuantizeConv(graph, false /* with_residual_data */); + QuantizeConv(graph, true /* with_residual_data */); + QuantizePool(graph); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(cpu_quantize_pass, paddle::framework::ir::CPUQuantizePass) + .RequirePassAttr("quant_var_scales"); diff --git a/paddle/fluid/framework/ir/mkldnn/cpu_quantize_pass.h b/paddle/fluid/framework/ir/mkldnn/cpu_quantize_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..a178c4dc363f672fdc7c535954be0c5877a599ac --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn/cpu_quantize_pass.h @@ -0,0 +1,65 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" + +namespace paddle { +namespace framework { +namespace ir { + +/* + * Map variable name to tensor of scaling factors scaling it to MAX=1.0. + * bool denotes whether quantization of the variable should be done to unsigned + * type. + */ +using VarQuantScale = + std::unordered_map>; + +/* + * Quantize all supported operators. + */ +class CPUQuantizePass : public FusePassBase { + public: + virtual ~CPUQuantizePass() {} + + protected: + void ApplyImpl(ir::Graph* graph) const override; + + void QuantizeConv(Graph* graph, bool with_residual_data = false) const; + + void QuantizePool(Graph* graph) const; + + void QuantizeInput(Graph* g, Node* op, Node* input, std::string input_name, + double scale_to_one, bool is_unsigned, + std::string scale_attr_name = "") const; + + void DequantizeOutput(Graph* g, Node* op, Node* output, + std::string output_name, double scale_to_one, + bool is_unsigned, + std::string scale_attr_name = "") const; + + const std::string name_scope_{"quantize"}; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/mkldnn/cpu_quantize_pass_tester.cc b/paddle/fluid/framework/ir/mkldnn/cpu_quantize_pass_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..8716a412e4d5b96161c5b2e2ac06d6aa0b4e74e1 --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn/cpu_quantize_pass_tester.cc @@ -0,0 +1,211 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/mkldnn/cpu_quantize_pass.h" +#include +#include "paddle/fluid/framework/naive_executor.h" +#include "paddle/fluid/platform/place.h" + +namespace paddle { +namespace framework { +namespace ir { + +void SetOp(ProgramDesc* prog, const std::string& type, const std::string& name, + const std::vector& inputs, + const std::vector& outputs, bool use_mkldnn, + bool use_quantizer = false) { + auto* op = prog->MutableBlock(0)->AppendOp(); + op->SetType(type); + op->SetAttr("use_mkldnn", use_mkldnn); + op->SetAttr("name", name); + if (type == "conv2d") { + op->SetInput("Input", {inputs[0]}); + op->SetInput("Filter", {inputs[1]}); + if (inputs.size() > 2) + op->SetInput("Bias", {inputs[2]}); + else + op->SetInput("Bias", {}); + if (inputs.size() > 3) { + op->SetInput("ResidualData", {inputs[3]}); + op->SetAttr("fuse_residual_connection", true); + } else { + op->SetInput("ResidualData", {}); + op->SetAttr("fuse_residual_connection", false); + } + op->SetOutput("Output", {outputs[0]}); + op->SetAttr("use_quantizer", use_quantizer); + op->SetAttr("Scale_in", 1.0f); + op->SetAttr("Scale_out", 1.0f); + op->SetAttr("Scale_weights", std::vector{1.0f}); + } else if (type == "pool2d") { + op->SetInput("X", {inputs[0]}); + op->SetOutput("Out", {outputs[0]}); + op->SetAttr("use_quantizer", use_quantizer); + } else if (type == "dropout") { + op->SetInput("X", {inputs[0]}); + op->SetOutput("Out", {outputs[0]}); + } else if (type == "fc") { + op->SetInput("Input", {inputs[0]}); + if (inputs.size() > 1) op->SetInput("W", {inputs[1]}); + if (inputs.size() > 2) op->SetInput("Bias", {inputs[2]}); + op->SetOutput("Out", {outputs[0]}); + } +} + +static const std::initializer_list variable_names{ + "a", "w1", "c", "d", "w2", "e", "f", "g", + "h", "w3", "b1", "i", "j", "w4", "b2"}; +// (a,w1)->Conv1->c and c->Pool1->d +// +// (d,w2)->Conv2->e and e->Pool2->f +// +// d->Dropout1->g and g->Fc1->h and (h,w3,b1,i)->Conv3->j +// +// (d,w4, b2)->Conv4->i +ProgramDesc BuildProgramDesc(bool use_mkldnn, bool use_quantizer) { + ProgramDesc prog; + for (auto& v : variable_names) { + auto* var = prog.MutableBlock(0)->Var(v); + if (v.find("w") == 0 || v.find("b") == 0) { + var->SetPersistable(true); + } + } + + SetOp(&prog, "conv2d", "Conv1", {"a", "w1"}, {"c"}, use_mkldnn, + use_quantizer); + SetOp(&prog, "pool2d", "Pool1", {"c"}, {"d"}, use_mkldnn, use_quantizer); + + SetOp(&prog, "conv2d", "Conv2", {"d", "w2"}, {"e"}, use_mkldnn, + use_quantizer); + SetOp(&prog, "pool2d", "Pool2", {"e"}, {"f"}, use_mkldnn, use_quantizer); + + SetOp(&prog, "dropout", "Dropout1", {"d"}, {"g"}, use_mkldnn); + SetOp(&prog, "fc", "Fc1", {"g"}, {"h"}, use_mkldnn); + SetOp(&prog, "conv2d", "Conv3", {"h", "w3", "b1", "i"}, {"j"}, use_mkldnn, + use_quantizer); + + SetOp(&prog, "conv2d", "Conv4", {"c", "w4", "b2"}, {"i"}, use_mkldnn, + use_quantizer); + + return prog; +} + +void InitTensorHolder(Scope* scope, const paddle::platform::Place& place, + const char* var_name) { + auto x = scope->Var(var_name); + auto tensor = x->GetMutable(); + tensor->mutable_data(place, proto::VarType::FP32, + ::paddle::memory::Allocator::kDefault, 1); +} + +void MainTest(const ProgramDesc& prog, int conv_count, int pool_count, + int quant_count, int dequant_count, int added_nodes_count, + float scale) { + std::unique_ptr graph(new ir::Graph(prog)); + + // Init scope, as it is used in pass + auto place = paddle::platform::CPUPlace(); + NaiveExecutor exe{place}; + Scope scope; + exe.CreateVariables(prog, 0, true, &scope); + + auto* scales = new VarQuantScale(); + + for (auto& v : variable_names) { + InitTensorHolder(&scope, place, v.c_str()); + LoDTensor tensor; + tensor.Resize({1}); + auto* ptr = tensor.mutable_data(place); + ptr[0] = 2.0; + + (*scales)[v] = std::make_pair(false, std::move(tensor)); + } + + graph->Set(kParamScopeAttr, new framework::Scope*(&scope)); + + auto pass = PassRegistry::Instance().Get("cpu_quantize_pass"); + pass->Set("quant_var_scales", scales); + + int original_nodes_num = graph->Nodes().size(); + + graph.reset(pass->Apply(graph.release())); + + int current_nodes_num = graph->Nodes().size(); + + int quantize_nodes_count = 0; + int dequantize_nodes_count = 0; + int conv2d_nodes_count = 0; + int pool2d_nodes_count = 0; + for (auto* node : graph->Nodes()) { + if (node->IsOp()) { + auto* op = node->Op(); + if (op->Type() == "conv2d") { + conv2d_nodes_count++; + auto op_name = boost::get(op->GetAttr("name")); + EXPECT_EQ(boost::get(op->GetAttr("Scale_in")), scale) + << "Scale_in for node '" + op_name + "'."; + EXPECT_EQ(boost::get(op->GetAttr("Scale_out")), scale) + << "Scale_out for node '" + op_name + "'."; + EXPECT_EQ( + boost::get>(op->GetAttr("Scale_weights"))[0], + scale) + << "Scale_weights for node '" + op_name + "'."; + } else if (op->Type() == "pool2d") { + pool2d_nodes_count++; + } else if (op->Type() == "quantize") { + quantize_nodes_count++; + } else if (op->Type() == "dequantize") { + dequantize_nodes_count++; + } + } + } + EXPECT_EQ(conv2d_nodes_count, conv_count); + EXPECT_EQ(pool2d_nodes_count, pool_count); + EXPECT_EQ(quantize_nodes_count, quant_count); + EXPECT_EQ(dequantize_nodes_count, dequant_count); + EXPECT_EQ(original_nodes_num + added_nodes_count, current_nodes_num); +} + +TEST(CpuQuantizePass, quantize) { + bool use_mkldnn = true; + bool use_quantizer = true; + // (a->QUANT1->IN1,w1)->Conv1->OUT1->DEQUANT1->c and + // c->QUANT2->IN2->Pool1->OUT2->DEQUANT2->d + // + // (d->QUANT3->IN3,w2)->Conv2->OUT3->DEQUANT3->e and + // e->QUANT4->IN4->Pool2->OUT4->DEQUANT4->f + // + // d->Dropout1->g and g->Fc1->h and + // (h->QUANT5->IN5,w3,b1,i->QUANT6->IN6)->Conv3->OUT5->DEQUANT5->j + // + // (d->QUANT7->IN7,w4, b2)->Conv4->DEQUANT6->OUT6->i + // Insert nodes: 7 Quant + 7 IN + 6 OUT + 6 DEQUANT + int added_nodes = 7 + 7 + 6 + 6; + MainTest(BuildProgramDesc(use_mkldnn, use_quantizer), 4, 2, 7, 6, added_nodes, + 2.0f * 127); +} + +TEST(CpuQuantizePass, do_not_quantize) { + bool use_mkldnn = true; + bool use_quantizer = false; + int added_nodes = 0; + MainTest(BuildProgramDesc(use_mkldnn, use_quantizer), 4, 2, 0, 0, added_nodes, + 1.0f); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +USE_PASS(cpu_quantize_pass); diff --git a/paddle/fluid/framework/ir/mkldnn/cpu_quantize_placement_pass.cc b/paddle/fluid/framework/ir/mkldnn/cpu_quantize_placement_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..79a8ac68b82fc79ec91c18ec96a04e1e676c8ba0 --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn/cpu_quantize_placement_pass.cc @@ -0,0 +1,56 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/ir/mkldnn/cpu_quantize_placement_pass.h" +#include +#include + +namespace paddle { +namespace framework { +namespace ir { + +void CPUQuantizePlacementPass::ApplyImpl(ir::Graph* graph) const { + VLOG(3) << "Marks operators which are to be quantized."; + const auto& excluded_ids_list = + Get>("quantize_excluded_op_ids"); + const auto& op_types_list = + Get>("quantize_enabled_op_types"); + for (const Node* n : graph->Nodes()) { + if (n->IsOp()) { + if (std::find(excluded_ids_list.begin(), excluded_ids_list.end(), + n->id()) != excluded_ids_list.end()) + continue; + auto* op = n->Op(); + if (op->HasAttr("use_quantizer") || op->HasProtoAttr("use_quantizer")) { + if (op_types_list.empty()) { + op->SetAttr("use_quantizer", true); + } else if (std::find(op_types_list.begin(), op_types_list.end(), + n->Name()) != op_types_list.end()) { + op->SetAttr("use_quantizer", true); + } + } + } + } +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(cpu_quantize_placement_pass, + paddle::framework::ir::CPUQuantizePlacementPass) + // a vector of operator type names to be quantized ("conv2d" etc.) + .RequirePassAttr("quantize_enabled_op_types") + // a vector of operator ids that are to be excluded from quantization + .RequirePassAttr("quantize_excluded_op_ids"); diff --git a/paddle/fluid/framework/ir/mkldnn/cpu_quantize_placement_pass.h b/paddle/fluid/framework/ir/mkldnn/cpu_quantize_placement_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..008a462dc414c04f53315a8f262de15ab8fb7fb5 --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn/cpu_quantize_placement_pass.h @@ -0,0 +1,33 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include "paddle/fluid/framework/ir/pass.h" + +namespace paddle { +namespace framework { +namespace ir { +/* + * Specifies which operators should be quantized. + */ +class CPUQuantizePlacementPass : public Pass { + protected: + void ApplyImpl(ir::Graph* graph) const override; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/mkldnn/cpu_quantize_placement_pass_tester.cc b/paddle/fluid/framework/ir/mkldnn/cpu_quantize_placement_pass_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..ba4d281f818bb752570e7b500013f5f58001307c --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn/cpu_quantize_placement_pass_tester.cc @@ -0,0 +1,129 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/mkldnn/cpu_quantize_placement_pass.h" + +#include +#include + +namespace paddle { +namespace framework { +namespace ir { + +void SetOp(ProgramDesc* prog, const std::string& type, const std::string& name, + const std::vector& inputs, + const std::vector& outputs, + boost::tribool use_quantizer) { + auto* op = prog->MutableBlock(0)->AppendOp(); + + op->SetType(type); + + if (!boost::indeterminate(use_quantizer)) + op->SetAttr("use_quantizer", use_quantizer); + + if (type == "conv2d") { + op->SetAttr("name", name); + op->SetInput("Input", {inputs[0]}); + op->SetInput("Filter", {inputs[1]}); + op->SetInput("Bias", {inputs[2]}); + } else if (type == "relu") { + op->SetInput("X", inputs); + } else if (type == "concat") { + op->SetAttr("axis", 1); + op->SetInput("X", {inputs[0], inputs[1]}); + } else if (type == "pool2d") { + op->SetInput("X", {inputs[0]}); + } else { + FAIL() << "Unexpected operator type."; + } + op->SetOutput("Out", {outputs[0]}); +} + +// operator use_quantizer +// --------------------------------------- +// (a,b)->concat->c none +// (c,weights,bias)->conv->f false +// f->relu->g none +// g->pool->h false +// (h,weights2,bias2)->conv->k false +// k->pool->l false +ProgramDesc BuildProgramDesc() { + ProgramDesc prog; + + for (auto& v : + std::vector({"a", "b", "c", "weights", "bias", "f", "g", + "h", "weights2", "bias2", "k", "l"})) { + auto* var = prog.MutableBlock(0)->Var(v); + var->SetType(proto::VarType::SELECTED_ROWS); + if (v == "weights" || v == "bias") { + var->SetPersistable(true); + } + } + + SetOp(&prog, "concat", "concat1", {"a", "b"}, {"c"}, boost::indeterminate); + SetOp(&prog, "conv2d", "conv1", {"c", "weights", "bias"}, {"f"}, false); + SetOp(&prog, "relu", "relu1", {"f"}, {"g"}, boost::indeterminate); + SetOp(&prog, "pool2d", "pool1", {"g"}, {"h"}, false); + SetOp(&prog, "conv2d", "conv2", {"h", "weights2", "bias2"}, {"k"}, false); + SetOp(&prog, "pool2d", "pool2", {"k"}, {"l"}, false); + + return prog; +} + +void MainTest(std::initializer_list quantize_enabled_op_types, + std::initializer_list quantize_excluded_op_ids, + unsigned expected_use_quantizer_true_count) { + auto prog = BuildProgramDesc(); + + std::unique_ptr graph(new ir::Graph(prog)); + + auto pass = PassRegistry::Instance().Get("cpu_quantize_placement_pass"); + pass->Set("quantize_enabled_op_types", + new std::unordered_set(quantize_enabled_op_types)); + pass->Set("quantize_excluded_op_ids", + new std::unordered_set(quantize_excluded_op_ids)); + + graph.reset(pass->Apply(graph.release())); + + unsigned use_quantizer_true_count = 0; + + for (auto* node : graph->Nodes()) { + if (node->IsOp()) { + auto* op = node->Op(); + if (op->HasAttr("use_quantizer") && + boost::get(op->GetAttr("use_quantizer"))) { + ++use_quantizer_true_count; + } + } + } + + EXPECT_EQ(use_quantizer_true_count, expected_use_quantizer_true_count); +} + +TEST(QuantizerPlacementPass, enabled_pool) { MainTest({"pool2d"}, {}, 2); } + +TEST(QuantizerPlacementPass, enabled_conv_excluded_one) { + MainTest({"conv2d"}, {4}, 1); +} + +TEST(QuantizerPlacementPass, excluded_none) { + // 2 conv + 2 pool + MainTest({}, {}, 4); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +USE_PASS(cpu_quantize_placement_pass); diff --git a/paddle/fluid/framework/ir/mkldnn/cpu_quantize_squash_pass.cc b/paddle/fluid/framework/ir/mkldnn/cpu_quantize_squash_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..debbbd6440b05c3f8c0db708c8ad5c54e018f725 --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn/cpu_quantize_squash_pass.cc @@ -0,0 +1,143 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file eint8_outcept in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either eint8_outpress or +// implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/mkldnn/cpu_quantize_squash_pass.h" +#include +#include +#include "paddle/fluid/platform/enforce.h" +#include "paddle/fluid/string/pretty_log.h" + +namespace paddle { +namespace framework { +namespace ir { + +using string::PrettyLogDetail; + +void CPUQuantizeSquashPass::FindNodesToKeep( + Graph* graph, + std::unordered_map* nodes_keep_counter) const { + GraphPatternDetector gpd; + patterns::DequantAny deq_any_pattern{gpd.mutable_pattern(), "deqant_any"}; + deq_any_pattern(); + + int found_count = 0; + auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* g) { + GET_IR_NODE_FROM_SUBGRAPH(dequant_out, dequant_out, deq_any_pattern); + + if (nodes_keep_counter->find(dequant_out) == nodes_keep_counter->end()) + (*nodes_keep_counter)[dequant_out] = 1; + else + (*nodes_keep_counter)[dequant_out] += 1; + + found_count++; + }; + gpd(graph, handler); + AddStatis(found_count); +} + +void CPUQuantizeSquashPass::Squash( + Graph* graph, + std::unordered_map* nodes_keep_counter) const { + GraphPatternDetector gpd; + patterns::DequantQuantAny squash_pattern{gpd.mutable_pattern(), "squash"}; + squash_pattern(); + + int found_squash_count = 0; + auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* g) { + VLOG(4) << "squash requantize-quantize ops pair"; + + GET_IR_NODE_FROM_SUBGRAPH(dequant_in, dequant_in, squash_pattern); + GET_IR_NODE_FROM_SUBGRAPH(dequant_op, dequant_op, squash_pattern); + GET_IR_NODE_FROM_SUBGRAPH(dequant_out, dequant_out, squash_pattern); + GET_IR_NODE_FROM_SUBGRAPH(quant_op, quant_op, squash_pattern); + GET_IR_NODE_FROM_SUBGRAPH(quant_out, quant_out, squash_pattern); + GET_IR_NODE_FROM_SUBGRAPH(next_op, next_op, squash_pattern); + + auto* next_op_desc = next_op->Op(); + float dequant_scale = boost::get(dequant_op->Op()->GetAttr("Scale")); + float quant_scale = boost::get(quant_op->Op()->GetAttr("Scale")); + PADDLE_ENFORCE(nodes_keep_counter->find(dequant_out) != + nodes_keep_counter->end()); + + // check if dequantize op should be kept or removed, decrease the counter + bool keep_dequant = (*nodes_keep_counter)[dequant_out]-- > 1; + + if (dequant_scale == quant_scale) { + // squash dequantize-quantize to nothing + auto quant_out_var_name = quant_out->Name(); + auto next_op_inputs = next_op_desc->InputNames(); + for (const auto& name : next_op_inputs) { + auto var_name = next_op_desc->Input(name)[0]; + if (var_name.compare(quant_out_var_name) == 0) { + next_op_desc->SetInput( + name, std::vector({dequant_in->Name()})); + break; + } + } + + if (keep_dequant) + GraphSafeRemoveNodes(graph, {quant_op, quant_out}); + else + GraphSafeRemoveNodes(graph, + {dequant_op, quant_op, dequant_out, quant_out}); + + IR_NODE_LINK_TO(dequant_in, next_op); + + found_squash_count++; + } else { + // squash dequantize-quantize to requantize op + OpDesc desc; + desc.SetType("requantize"); + desc.SetInput("Input", std::vector({dequant_in->Name()})); + desc.SetOutput("Output", std::vector({quant_out->Name()})); + desc.SetAttr("Scale_in", dequant_scale); + desc.SetAttr("Scale_out", quant_scale); + + auto requant_op = g->CreateOpNode(&desc); + + if (keep_dequant) + GraphSafeRemoveNodes(graph, {quant_op}); + else + GraphSafeRemoveNodes(graph, {dequant_op, quant_op, dequant_out}); + + IR_NODE_LINK_TO(dequant_in, requant_op); + IR_NODE_LINK_TO(requant_op, quant_out); + + found_squash_count++; + } + }; + gpd(graph, handler); + AddStatis(found_squash_count); + PrettyLogDetail("--- squashed %d dequantize-quantize pairs", + found_squash_count); +} + +void CPUQuantizeSquashPass::ApplyImpl(ir::Graph* graph) const { + PADDLE_ENFORCE(graph); + FusePassBase::Init("cpu_quantize_squash_pass", graph); + + std::unordered_map nodes_keep_counter; + FindNodesToKeep(graph, &nodes_keep_counter); + Squash(graph, &nodes_keep_counter); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(cpu_quantize_squash_pass, + paddle::framework::ir::CPUQuantizeSquashPass); diff --git a/paddle/fluid/framework/ir/mkldnn/cpu_quantize_squash_pass.h b/paddle/fluid/framework/ir/mkldnn/cpu_quantize_squash_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..e873994c57ea1a6aca4345d96438e8a7c569980b --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn/cpu_quantize_squash_pass.h @@ -0,0 +1,57 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" +#include "paddle/fluid/framework/ir/pass.h" + +namespace paddle { +namespace framework { +namespace ir { + +/* + * Squash dequantize->quantize pair pattern into requantize op + */ +class CPUQuantizeSquashPass : public FusePassBase { + public: + virtual ~CPUQuantizeSquashPass() {} + + protected: + void ApplyImpl(ir::Graph* graph) const override; + + /* + * For each dequantize's output find the number of operators it is an input to + */ + void FindNodesToKeep( + Graph* graph, + std::unordered_map* nodes_keep_counter) const; + + /* + * Squash dequantize-quantize ops pairs into requantize or nothing + */ + void Squash(Graph* graph, + std::unordered_map* nodes_keep_counter) const; + + const std::string name_scope_{"squash"}; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/mkldnn/cpu_quantize_squash_pass_tester.cc b/paddle/fluid/framework/ir/mkldnn/cpu_quantize_squash_pass_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..fda337066f4d43f88d0082b5bcebc587f0c7652b --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn/cpu_quantize_squash_pass_tester.cc @@ -0,0 +1,179 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/mkldnn/cpu_quantize_squash_pass.h" +#include +#include "paddle/fluid/framework/naive_executor.h" +#include "paddle/fluid/platform/place.h" + +namespace paddle { +namespace framework { +namespace ir { + +void SetOp(ProgramDesc* prog, const std::string& type, const std::string& name, + const std::vector& inputs, + const std::vector& outputs, bool use_mkldnn, + float scale = 0) { + auto* op = prog->MutableBlock(0)->AppendOp(); + op->SetType(type); + op->SetAttr("use_mkldnn", use_mkldnn); + op->SetAttr("name", name); + if (type == "conv2d") { + op->SetInput("Input", {inputs[0]}); + if (inputs.size() > 1) op->SetInput("Filter", {inputs[1]}); + if (inputs.size() > 2) op->SetInput("Bias", {inputs[2]}); + op->SetOutput("Output", {outputs[0]}); + } else if (type == "quantize") { + op->SetInput("Input", {inputs[0]}); + op->SetOutput("Output", {outputs[0]}); + op->SetAttr("Scale", scale); + } else if (type == "dequantize") { + op->SetInput("Input", {inputs[0]}); + op->SetOutput("Output", {outputs[0]}); + op->SetAttr("Scale", scale); + } +} + +// (a,w1,b1)->Conv1->d +// d->Dequant->e +// e->Quant->f +// (f,w2,b2)->Conv2->i +ProgramDesc BuildProgramDesc(bool use_mkldnn, float scale1, float scale2) { + ProgramDesc prog; + for (auto& v : std::initializer_list( + {"a", "w1", "b1", "d", "e", "f", "w2", "b2", "i"})) { + auto* var = prog.MutableBlock(0)->Var(v); + if (v.find("w") == 0 || v.find("b") == 0) { + var->SetPersistable(true); + } + } + + SetOp(&prog, "conv2d", "Conv1", {"a", "w1", "b1"}, {"d"}, use_mkldnn); + SetOp(&prog, "dequantize", "Dequant", {"d"}, {"e"}, use_mkldnn, scale1); + SetOp(&prog, "quantize", "Quant", {"e"}, {"f"}, use_mkldnn, scale2); + SetOp(&prog, "conv2d", "Conv2", {"f", "w2", "b2"}, {"i"}, use_mkldnn); + return prog; +} + +static const std::initializer_list variable_names{ + "a", "b", "c", "d", "e", "f", "g", "h"}; +// a->Conv1->b +// b->Dequant->c +// +// c->Quant1->d and d->Conv2->e +// +// c->Conv3->f +// +// c->Quant2->g and g->Conv4->h +// +ProgramDesc BuildProgramDesc2(bool use_mkldnn, float scale1, float scale2, + float scale3) { + ProgramDesc prog; + for (auto& v : variable_names) { + prog.MutableBlock(0)->Var(v); + } + + SetOp(&prog, "conv2d", "Conv1", {"a"}, {"b"}, use_mkldnn); + SetOp(&prog, "dequantize", "Dequant", {"b"}, {"c"}, use_mkldnn, scale1); + + SetOp(&prog, "quantize", "Quant1", {"c"}, {"d"}, use_mkldnn, scale2); + SetOp(&prog, "conv2d", "Conv2", {"d"}, {"e"}, use_mkldnn); + + SetOp(&prog, "conv2d", "Conv3", {"c"}, {"f"}, use_mkldnn); + + SetOp(&prog, "quantize", "Quant2", {"c"}, {"g"}, use_mkldnn, scale3); + SetOp(&prog, "conv2d", "Conv4", {"g"}, {"h"}, use_mkldnn); + + return prog; +} + +void InitTensorHolder(Scope* scope, const paddle::platform::Place& place, + const char* var_name) { + auto x = scope->Var(var_name); + auto tensor = x->GetMutable(); + tensor->mutable_data(place, proto::VarType::FP32, + ::paddle::memory::Allocator::kDefault, 1); +} + +void MainTest(const ProgramDesc& prog, int removed_nodes_num) { + std::unique_ptr graph(new ir::Graph(prog)); + + // Init scope, as it is used in pass + auto place = paddle::platform::CPUPlace(); + NaiveExecutor exe{place}; + Scope scope; + exe.CreateVariables(prog, 0, true, &scope); + + for (auto& v : variable_names) { + InitTensorHolder(&scope, place, v.c_str()); + } + + graph->Set(kParamScopeAttr, new framework::Scope*(&scope)); + + auto pass = PassRegistry::Instance().Get("cpu_quantize_squash_pass"); + + int original_nodes_num = graph->Nodes().size(); + + graph.reset(pass->Apply(graph.release())); + + int current_nodes_num = graph->Nodes().size(); + + EXPECT_EQ(original_nodes_num - removed_nodes_num, current_nodes_num); +} + +TEST(CpuQuantizeSquashPass, equal_scales) { + auto scale = 1.2345f; + auto use_mkldnn = true; + // Remove 4 nodes: Dequant, Quant, e, f + auto remove_nodes = 4; + MainTest(BuildProgramDesc(use_mkldnn, scale, scale), remove_nodes); + + use_mkldnn = !use_mkldnn; + MainTest(BuildProgramDesc(use_mkldnn, scale, scale), remove_nodes); +} + +TEST(CpuQuantizeSquashPass, inequal_scales) { + auto scale1 = 1.2345f; + auto scale2 = 21.0f; + auto use_mkldnn = true; + // Remove 3 nodes: Dequant, Quant, e + // Insert 1 node: requantize + auto remove_nodes = 2; + MainTest(BuildProgramDesc(use_mkldnn, scale1, scale2), remove_nodes); + + use_mkldnn = !use_mkldnn; + MainTest(BuildProgramDesc(use_mkldnn, scale1, scale2), remove_nodes); +} + +TEST(CpuQuantizeSquashPass, branch_to_equal_inequal_and_fp32) { + // Delete both quantize ops, + // bypass dequantize in both branches, + // insert requantize on one branch + auto scale = 1.2345f; + auto scale2 = 21.0f; + auto use_mkldnn = true; + // Remove 3 nodes: Quant1, Quant2, g + // Insert 1 node: requantize + auto remove_nodes = 2; + MainTest(BuildProgramDesc2(use_mkldnn, scale, scale, scale2), remove_nodes); + + use_mkldnn = !use_mkldnn; + MainTest(BuildProgramDesc2(use_mkldnn, scale, scale, scale2), remove_nodes); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +USE_PASS(cpu_quantize_squash_pass); diff --git a/paddle/fluid/framework/ir/mkldnn/depthwise_conv_mkldnn_pass.cc b/paddle/fluid/framework/ir/mkldnn/depthwise_conv_mkldnn_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..e854559ae7a8765da604c2043e8e4e8cedbbcf88 --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn/depthwise_conv_mkldnn_pass.cc @@ -0,0 +1,56 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/ir/mkldnn/depthwise_conv_mkldnn_pass.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" + +namespace paddle { +namespace framework { +namespace ir { + +#define GET_NODE(id, pattern) \ + PADDLE_ENFORCE(subgraph.count(pattern.RetrieveNode(#id)), \ + "pattern has no Node called %s", #id); \ + auto* id = subgraph.at(pattern.RetrieveNode(#id)); \ + PADDLE_ENFORCE_NOT_NULL(id, "subgraph has no node %s", #id); + +void DepthwiseConvMKLDNNPass::ApplyImpl(ir::Graph* graph) const { + PADDLE_ENFORCE(graph); + FusePassBase::Init("depthwise_conv_mkldnn_pass", graph); + GraphPatternDetector gpd; + + auto* pattern = gpd.mutable_pattern(); + pattern->NewNode("depthwise_conv") + ->assert_is_op("depthwise_conv2d") + ->assert_op_attr("use_mkldnn", true); + + int found_depthwise_conv_mkldnn_count = 0; + auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* g) { + VLOG(3) << "handle DepthwiseConvMKLDNN fuse"; + GET_NODE(depthwise_conv, (*pattern)); + depthwise_conv->Op()->SetType("conv2d"); + found_depthwise_conv_mkldnn_count++; + }; + + gpd(graph, handler); + AddStatis(found_depthwise_conv_mkldnn_count); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(depthwise_conv_mkldnn_pass, + paddle::framework::ir::DepthwiseConvMKLDNNPass); diff --git a/paddle/fluid/framework/ir/mkldnn/depthwise_conv_mkldnn_pass.h b/paddle/fluid/framework/ir/mkldnn/depthwise_conv_mkldnn_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..ca314afde57bbc5a339b2016a2540309b31f0598 --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn/depthwise_conv_mkldnn_pass.h @@ -0,0 +1,33 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "paddle/fluid/framework/ir/fuse_pass_base.h" + +namespace paddle { +namespace framework { +namespace ir { + +class DepthwiseConvMKLDNNPass : public FusePassBase { + public: + virtual ~DepthwiseConvMKLDNNPass() {} + + protected: + void ApplyImpl(ir::Graph* graph) const override; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/mkldnn/depthwise_conv_mkldnn_pass_tester.cc b/paddle/fluid/framework/ir/mkldnn/depthwise_conv_mkldnn_pass_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..f2dfbc84a5a5a7feac2514731445eb191bd6f784 --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn/depthwise_conv_mkldnn_pass_tester.cc @@ -0,0 +1,123 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/mkldnn/depthwise_conv_mkldnn_pass.h" + +#include + +namespace paddle { +namespace framework { +namespace ir { + +void SetOp(ProgramDesc* prog, const std::string& type, const std::string& name, + const std::vector& inputs, + const std::vector& outputs, bool use_mkldnn = false) { + auto* op = prog->MutableBlock(0)->AppendOp(); + op->SetType(type); + op->SetAttr("use_mkldnn", use_mkldnn); + op->SetAttr("name", name); + op->SetInput("Input", {inputs[0]}); + op->SetInput("Filter", {inputs[1]}); + op->SetInput("Bias", {inputs[2]}); + op->SetOutput("Out", outputs); +} + +// (a, weights, bias)->depthwise conv mkldnn->b +// (b, weights2, bias2)->depthwise conv no mkldnn->c +// (c, weights3, bias3)->conv mkldnn->d +// (d, weights3, bias3)->conv no mkldnn->e +ProgramDesc BuildProgramDesc() { + ProgramDesc prog; + for (auto& v : std::vector( + {"a", "b", "c", "d", "e", "weights", "bias", "weights2", "bias2", + "weights3", "bias3", "weights4", "bias4"})) { + auto* var = prog.MutableBlock(0)->Var(v); + var->SetType(proto::VarType::SELECTED_ROWS); + if (v == "weights" || v == "bias" || v == "weights2" || v == "bias2" || + v == "weights3" || v == "bias3" || v == "weights4" || v == "bias4") { + var->SetPersistable(true); + } + } + + // depthwise conv with MKL-DNN + SetOp(&prog, "depthwise_conv2d", "conv1", + std::vector({"a", "weights", "bias"}), + std::vector({"b"}), true); + // depthwise conv without MKL-DNN + SetOp(&prog, "depthwise_conv2d", "conv2", + std::vector({"b", "weights2", "bias2"}), + std::vector({"c"}), false); + // conv with MKL-DNN + SetOp(&prog, "conv2d", "conv3", + std::vector({"c", "weights3", "bias3"}), + std::vector({"d"}), true); + // conv without MKL-dNN + SetOp(&prog, "conv2d", "conv4", + std::vector({"d", "weights4", "bias4"}), + std::vector({"e"}), false); + + return prog; +} + +TEST(DepthwiseConvMKLDNNPass, basic) { + auto prog = BuildProgramDesc(); + + std::unique_ptr graph(new ir::Graph(prog)); + + auto pass = PassRegistry::Instance().Get("depthwise_conv_mkldnn_pass"); + + struct counters { + int mkldnn_depthwise_conv_nodes; + int other_depthwise_conv_nodes; + int mkldnn_conv_nodes; + int other_conv_nodes; + }; + + counters before{1, 1, 1, 1}; + + graph.reset(pass->Apply(graph.release())); + + // initialize counters before loop + counters after{0, 0, 0, 0}; + + for (auto* node : graph->Nodes()) { + if (node->IsOp()) { + auto* op = node->Op(); + if (op->Type() == "conv2d") { + if (boost::get(op->GetAttr("use_mkldnn"))) + after.mkldnn_conv_nodes++; + else + after.other_conv_nodes++; + } else if (op->Type() == "depthwise_conv2d") { + if (boost::get(op->GetAttr("use_mkldnn"))) + after.mkldnn_depthwise_conv_nodes++; + else + after.other_depthwise_conv_nodes++; + } + } + } + + EXPECT_EQ(after.other_depthwise_conv_nodes, + before.other_depthwise_conv_nodes); + EXPECT_EQ(after.other_conv_nodes, before.other_conv_nodes); + EXPECT_EQ(after.mkldnn_depthwise_conv_nodes, + before.mkldnn_depthwise_conv_nodes - 1); + EXPECT_EQ(after.mkldnn_conv_nodes, before.mkldnn_conv_nodes + 1); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +USE_PASS(depthwise_conv_mkldnn_pass); diff --git a/paddle/fluid/framework/ir/mkldnn/mkldnn_placement_pass.cc b/paddle/fluid/framework/ir/mkldnn/mkldnn_placement_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..500419e4b7819e576e4e9f2dcc9a01a414519ff8 --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn/mkldnn_placement_pass.cc @@ -0,0 +1,47 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/ir/mkldnn/mkldnn_placement_pass.h" +#include +#include + +namespace paddle { +namespace framework { +namespace ir { + +void MKLDNNPlacementPass::ApplyImpl(ir::Graph* graph) const { + VLOG(3) << "Applies MKL-DNN placement strategy."; + const auto& op_types_list = + Get>("mkldnn_enabled_op_types"); + for (const Node* n : graph->Nodes()) { + if (n->IsOp()) { + auto* op = n->Op(); + if (op->HasAttr("use_mkldnn") || op->HasProtoAttr("use_mkldnn")) { + if (op_types_list.empty()) { + op->SetAttr("use_mkldnn", true); + } else if (std::find(op_types_list.begin(), op_types_list.end(), + n->Name()) != op_types_list.end()) { + op->SetAttr("use_mkldnn", true); + } + } + } + } +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(mkldnn_placement_pass, paddle::framework::ir::MKLDNNPlacementPass) + .RequirePassAttr("mkldnn_enabled_op_types"); diff --git a/paddle/fluid/framework/ir/mkldnn/mkldnn_placement_pass.h b/paddle/fluid/framework/ir/mkldnn/mkldnn_placement_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..ffa62273ece084c6c60855f628b7a921a004ac3e --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn/mkldnn_placement_pass.h @@ -0,0 +1,34 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include "paddle/fluid/framework/ir/pass.h" + +namespace paddle { +namespace framework { +namespace ir { + +/* + * Specifies which operators should use MKLDNN. + */ +class MKLDNNPlacementPass : public Pass { + protected: + void ApplyImpl(ir::Graph* graph) const override; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/mkldnn/mkldnn_placement_pass_tester.cc b/paddle/fluid/framework/ir/mkldnn/mkldnn_placement_pass_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..5885f327e610a5c3d931a00b36066194dac8994a --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn/mkldnn_placement_pass_tester.cc @@ -0,0 +1,136 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/mkldnn/mkldnn_placement_pass.h" + +#include +#include + +namespace paddle { +namespace framework { +namespace ir { + +void SetOp(ProgramDesc* prog, const std::string& type, const std::string& name, + const std::vector& inputs, + const std::vector& outputs, boost::tribool use_mkldnn) { + auto* op = prog->MutableBlock(0)->AppendOp(); + + op->SetType(type); + + if (!boost::indeterminate(use_mkldnn)) op->SetAttr("use_mkldnn", use_mkldnn); + + if (type == "conv2d") { + op->SetAttr("name", name); + op->SetInput("Input", {inputs[0]}); + op->SetInput("Filter", {inputs[1]}); + op->SetInput("Bias", {inputs[2]}); + } else if (type == "relu") { + op->SetInput("X", inputs); + } else if (type == "concat") { + op->SetAttr("axis", 1); + op->SetInput("X", {inputs[0], inputs[1]}); + } else if (type == "pool2d") { + op->SetInput("X", {inputs[0]}); + } else { + FAIL() << "Unexpected operator type."; + } + op->SetOutput("Out", {outputs[0]}); +} + +// operator use_mkldnn +// --------------------------------------- +// (a,b)->concat->c none +// (c,weights,bias)->conv->f none +// f->relu->g false +// g->pool->h false +// (h,weights2,bias2)->conv->k true +// k->relu->l true +ProgramDesc BuildProgramDesc() { + ProgramDesc prog; + + for (auto& v : + std::vector({"a", "b", "c", "weights", "bias", "f", "g", + "h", "weights2", "bias2", "k", "l"})) { + auto* var = prog.MutableBlock(0)->Var(v); + var->SetType(proto::VarType::SELECTED_ROWS); + if (v == "weights" || v == "bias") { + var->SetPersistable(true); + } + } + + SetOp(&prog, "concat", "concat1", std::vector({"a", "b"}), + std::vector({"c"}), boost::indeterminate); + SetOp(&prog, "conv2d", "conv1", + std::vector({"c", "weights", "bias"}), + std::vector({"f"}), boost::indeterminate); + SetOp(&prog, "relu", "relu1", std::vector({"f"}), + std::vector({"g"}), false); + SetOp(&prog, "pool2d", "pool1", std::vector({"g"}), + std::vector({"h"}), false); + SetOp(&prog, "conv2d", "conv2", + std::vector({"h", "weights2", "bias2"}), + std::vector({"k"}), true); + SetOp(&prog, "relu", "relu2", std::vector({"k"}), + std::vector({"l"}), true); + + return prog; +} + +void MainTest(std::initializer_list mkldnn_enabled_op_types, + unsigned expected_use_mkldnn_true_count) { + auto prog = BuildProgramDesc(); + + std::unique_ptr graph(new ir::Graph(prog)); + + auto pass = PassRegistry::Instance().Get("mkldnn_placement_pass"); + pass->Set("mkldnn_enabled_op_types", + new std::unordered_set(mkldnn_enabled_op_types)); + + graph.reset(pass->Apply(graph.release())); + + unsigned use_mkldnn_true_count = 0; + + for (auto* node : graph->Nodes()) { + if (node->IsOp()) { + auto* op = node->Op(); + if (op->HasAttr("use_mkldnn") && + boost::get(op->GetAttr("use_mkldnn"))) { + ++use_mkldnn_true_count; + } + } + } + + EXPECT_EQ(use_mkldnn_true_count, expected_use_mkldnn_true_count); +} + +TEST(MKLDNNPlacementPass, enable_conv_relu) { + // 1 conv (1 conv is always true) + 2 relu (1 relu is always true) + 0 pool + MainTest({"conv2d", "relu"}, 3); +} + +TEST(MKLDNNPlacementPass, enable_relu_pool) { + // 1 conv (1 conv is always true) + 2 relu (1 relu is always true) + 1 pool + MainTest({"relu", "pool2d"}, 4); +} + +TEST(MKLDNNPlacementPass, enable_all) { + // 1 conv (1 conv is always true) + 2 relu (1 relu is always true) + 1 pool + MainTest({}, 4); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +USE_PASS(mkldnn_placement_pass); diff --git a/paddle/fluid/framework/ir/multi_batch_merge_pass.cc b/paddle/fluid/framework/ir/multi_batch_merge_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..a8720ff4bfb5c7fa7aee6d23949b030c328b90e6 --- /dev/null +++ b/paddle/fluid/framework/ir/multi_batch_merge_pass.cc @@ -0,0 +1,336 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/multi_batch_merge_pass.h" + +#include +#include +#include +#include +#include +#include "paddle/fluid/framework/ir/graph_helper.h" +#include "paddle/fluid/framework/op_proto_maker.h" + +namespace paddle { +namespace framework { +namespace ir { + +static const char kNumRepeats[] = "num_repeats"; +typedef std::unordered_map> SSAVarList; + +ir::Node* SameNameVar(std::unordered_set all, ir::Node* target) { + for (auto n : all) { + if (target->IsVar() && target->Name() == n->Name()) { + return n; + } + } + return nullptr; +} + +VarDesc CopyVarDesc(VarDesc* var_desc) { + VarDesc repeated_var(var_desc->Name()); + // copy other variable attributes + if (var_desc->GetType() != proto::VarType::READER) { + repeated_var.SetType(var_desc->GetType()); + repeated_var.SetShape(var_desc->GetShape()); + repeated_var.SetDataType(var_desc->GetDataType()); + repeated_var.SetLoDLevel(var_desc->GetLoDLevel()); + repeated_var.SetPersistable(var_desc->Persistable()); + } else { + // TODO(typhoonzero): copy reader var + } + return repeated_var; +} + +VarDesc UpdateGradVarDesc( + VarDesc* var_desc, int repeat, + const std::unordered_set& grad_names, + const std::unordered_set& bn_vars_need_rename) { + if (grad_names.find(var_desc->Name()) != grad_names.end() || + bn_vars_need_rename.find(var_desc->Name()) != bn_vars_need_rename.end()) { + std::string new_gname = + string::Sprintf("%s.repeat.%d", var_desc->Name(), repeat); + VarDesc repeated_var = CopyVarDesc(var_desc); + repeated_var.SetName(new_gname); + VLOG(3) << "update " << var_desc->Name() << " to repeat " << repeat; + return repeated_var; + } + return *var_desc; +} + +void BatchMergePass::ApplyImpl(ir::Graph* graph) const { + int num_repeats = Get(kNumRepeats); + std::vector forward_backward_ops; + std::vector optimize_ops; + std::vector lr_ops; // ops other than forward/backward/optimize + std::unordered_set grad_names; + std::unordered_map gradname2paramname; + + std::vector nodes = TopologySortOperations(*graph); + auto origin_nodes = graph->ReleaseNodes(); + VLOG(3) << "origin nodes count: " << origin_nodes.size(); + ir::Graph& result = *graph; + + // 1. record op nodes of different roles + for (auto node : nodes) { + if (!node->IsOp()) continue; + PADDLE_ENFORCE(node->Op(), "must find opdesc"); + int op_role = boost::get(node->Op()->GetAttr( + framework::OpProtoAndCheckerMaker::OpRoleAttrName())); + if ((op_role == static_cast(framework::OpRole::kForward)) || + (op_role & static_cast(framework::OpRole::kBackward)) || + (op_role & static_cast(framework::OpRole::kLoss))) { + forward_backward_ops.push_back(node); + } else if ((op_role & static_cast(framework::OpRole::kOptimize)) || + (op_role & static_cast(framework::OpRole::kDist)) || + (op_role & static_cast(framework::OpRole::kRPC))) { + optimize_ops.push_back(node); + auto op_role_var = node->Op()->GetNullableAttr( + OpProtoAndCheckerMaker::OpRoleVarAttrName()); + auto op_role_vars = boost::get>(op_role_var); + for (size_t i = 0; i < op_role_vars.size(); i += 2) { + grad_names.insert(op_role_vars[i + 1]); + gradname2paramname[op_role_vars[i + 1]] = op_role_vars[i]; + } + } else if (op_role & static_cast(framework::OpRole::kLRSched)) { + lr_ops.push_back(node); + } else { // NOLINT + PADDLE_THROW("Invalid op_role: %d", static_cast(op_role)); + } + } + + // 2. copy forward backward + ir::Node* prev_repeat_last_op_node = nullptr; + // record origin_grad -> repeated_grad_list map. + std::map> grad_repeated_map; + std::map> created; + std::unordered_set bn_vars_need_rename; + for (int i = 0; i < num_repeats; ++i) { + std::unordered_set copied; + for (size_t node_idx = 0; node_idx < forward_backward_ops.size(); + ++node_idx) { + auto node = forward_backward_ops[node_idx]; + OpDesc repeated_op(*(node->Op()), node->Op()->Block()); + // 3. rename grad outputs to current repeat. + for (auto outname : repeated_op.OutputArgumentNames()) { + if (grad_names.find(outname) != grad_names.end()) { + std::string new_gname = string::Sprintf("%s.repeat.%d", outname, i); + repeated_op.RenameOutput(outname, new_gname); + // remove op_role_var for backward ops that outputs grad for a + // parameter. + repeated_op.SetAttr(OpProtoAndCheckerMaker::OpRoleVarAttrName(), + std::vector()); + } + } + // 3.5 let batch_norm ops use independent vars, note batch_norm_grad do + // not need this update, because only moving mean and variance should be + // differ, trainable parameter scale and bias is the same as other + // parameters. + if (node->Name() == "batch_norm") { + // NOTE: assume bn op created by layers use save var as output mean and + // variance + std::string new_mean_name = + string::Sprintf("%s.repeat.%d", repeated_op.Input("Mean")[0], i); + std::string new_var_name = string::Sprintf( + "%s.repeat.%d", repeated_op.Input("Variance")[0], i); + bn_vars_need_rename.insert(repeated_op.Input("Mean")[0]); + bn_vars_need_rename.insert(repeated_op.Input("Variance")[0]); + VLOG(3) << "renaming " << repeated_op.Input("Mean")[0] << " to " + << new_mean_name; + repeated_op.RenameInput(repeated_op.Input("Mean")[0], new_mean_name); + repeated_op.RenameInput(repeated_op.Input("Variance")[0], new_var_name); + repeated_op.RenameOutput(repeated_op.Output("MeanOut")[0], + new_mean_name); + repeated_op.RenameOutput(repeated_op.Output("VarianceOut")[0], + new_var_name); + } + + // 3.9 do copy + auto repeated_node = result.CreateOpNode(&repeated_op); + copied.insert(node); + + // 4. add deps between repeats + if (node_idx == forward_backward_ops.size() - 1) { + prev_repeat_last_op_node = repeated_node; + } + if (node_idx == 0 && prev_repeat_last_op_node) { + auto* depvar = result.CreateControlDepVar(); + prev_repeat_last_op_node->outputs.push_back(depvar); + depvar->inputs.push_back(prev_repeat_last_op_node); + repeated_node->inputs.push_back(depvar); + depvar->outputs.push_back(repeated_node); + } + + for (auto in_node : node->inputs) { + if (in_node->IsCtrlVar()) { + continue; + } + ir::Node* var = nullptr; + auto updated_var = UpdateGradVarDesc(in_node->Var(), i, grad_names, + bn_vars_need_rename); + // should be initialized by startup, how to initilize tensor in the + // scope? + if (node->Name() == "batch_norm" && + bn_vars_need_rename.find(in_node->Name()) != + bn_vars_need_rename.end()) { + // Create bn mean/variance for each repeat + var = result.CreateVarNode(&updated_var); + created[updated_var.Name()].push_back(var); + copied.insert(in_node); + repeated_node->inputs.push_back(var); + var->outputs.push_back(repeated_node); + continue; + } + + // for other ops + if (in_node->inputs.empty() && i > 0) { + // do not copy head vars (inputs, params) in repeats > 0 + var = created.at(in_node->Name()).back(); + } else { + if (copied.find(in_node) == copied.end()) { + var = result.CreateVarNode(&updated_var); + if (grad_names.find(in_node->Var()->Name()) != grad_names.end()) { + grad_repeated_map[in_node].push_back(var); + } + copied.insert(in_node); + created[updated_var.Name()].push_back(var); + } else { + var = created.at(updated_var.Name()).back(); + } + } + repeated_node->inputs.push_back(var); + var->outputs.push_back(repeated_node); + } + for (auto out_node : node->outputs) { + if (out_node->IsCtrlVar()) { + continue; + } + ir::Node* var = nullptr; + auto updated_var = UpdateGradVarDesc(out_node->Var(), i, grad_names, + bn_vars_need_rename); + if (copied.find(out_node) == copied.end()) { + var = result.CreateVarNode(&updated_var); + if (grad_names.find(out_node->Var()->Name()) != grad_names.end()) { + grad_repeated_map[out_node].push_back(var); + } + copied.insert(out_node); + created[updated_var.Name()].push_back(var); + } else { + var = created.at(updated_var.Name()).back(); + } + repeated_node->outputs.push_back(var); + var->inputs.push_back(repeated_node); + } + } + } // end copy forward backward + + // 5. create GRAD merge op node: sum(repeat.0...repeat.n) -> + // scale(1/num_repeats) + for (auto kv : grad_repeated_map) { + OpDesc sum_op; + sum_op.SetType("sum"); + std::vector repeated_grad_names; + std::vector param_grad_op_role_var; + for (auto r : kv.second) { + repeated_grad_names.push_back(r->Var()->Name()); + } + // NOTE: use op_role_var to control allreduce op appending in + // multi_devices_graph_pass, we want to append op_role_var + // only once for the merged gradient, so break after first call. + param_grad_op_role_var.push_back( + gradname2paramname.at(kv.first->Var()->Name())); // param + param_grad_op_role_var.push_back(kv.first->Var()->Name()); // grad + + sum_op.SetInput("X", repeated_grad_names); + sum_op.SetOutput("Out", {kv.first->Var()->Name()}); + sum_op.SetAttr(OpProtoAndCheckerMaker::OpRoleAttrName(), + static_cast(OpRole::kBackward)); + auto sum_op_node = result.CreateOpNode(&sum_op); + for (auto r : kv.second) { + sum_op_node->inputs.push_back(r); + r->outputs.push_back(sum_op_node); + } + auto sum_out_var_node = result.CreateVarNode(kv.first->Var()); + sum_op_node->outputs.push_back(sum_out_var_node); + sum_out_var_node->inputs.push_back(sum_op_node); + created[sum_out_var_node->Name()].push_back(sum_out_var_node); + + OpDesc scale_op; + scale_op.SetType("scale"); + scale_op.SetInput("X", {sum_out_var_node->Var()->Name()}); + // NOTE: inplace scale. + scale_op.SetOutput("Out", {sum_out_var_node->Var()->Name()}); + scale_op.SetAttr("scale", static_cast(1.0f / num_repeats)); + scale_op.SetAttr(OpProtoAndCheckerMaker::OpRoleAttrName(), + static_cast(OpRole::kBackward)); + + scale_op.SetAttr(OpProtoAndCheckerMaker::OpRoleVarAttrName(), + param_grad_op_role_var); + + auto scale_op_node = result.CreateOpNode(&scale_op); + scale_op_node->inputs.push_back(sum_out_var_node); + sum_out_var_node->outputs.push_back(scale_op_node); + auto scale_out_var_node = result.CreateVarNode(sum_out_var_node->Var()); + scale_op_node->outputs.push_back(scale_out_var_node); + scale_out_var_node->inputs.push_back(scale_op_node); + created[scale_out_var_node->Name()].push_back(scale_out_var_node); + } + // 6. add optimize ops + { + auto copy_node = [&result, &created](ir::Node* node) { + auto op_node = result.CreateOpNode(node->Op()); + // copy op ins/outs + // NOTE: for send/recv ops, the OpDesc uses ctrldepvar to describe + // dependencies, so create those depvars if OpDesc have in/outs. + for (auto in_node : node->inputs) { + if (in_node->IsCtrlVar() && !in_node->Var()) { + continue; + } + ir::Node* var = nullptr; + if (created.find(in_node->Name()) == created.end()) { + var = result.CreateVarNode(in_node->Var()); + created[in_node->Name()].push_back(var); + } else { + var = created.at(in_node->Name()).back(); + } + op_node->inputs.push_back(var); + var->outputs.push_back(op_node); + } + for (auto out_node : node->outputs) { + if (out_node->IsCtrlVar() && !out_node->Var()) { + continue; + } + auto var = result.CreateVarNode(out_node->Var()); + created[out_node->Name()].push_back(var); + op_node->outputs.push_back(var); + var->inputs.push_back(op_node); + } + }; + for (auto node : lr_ops) { + copy_node(node); + } + for (auto node : optimize_ops) { + copy_node(node); + } + } + + result.ResolveHazard(created); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(multi_batch_merge_pass, paddle::framework::ir::BatchMergePass) + .RequirePassAttr(paddle::framework::ir::kNumRepeats); diff --git a/paddle/fluid/framework/ir/multi_batch_merge_pass.h b/paddle/fluid/framework/ir/multi_batch_merge_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..a89616683d9c625111272fd8c1de237a5c9dbe8f --- /dev/null +++ b/paddle/fluid/framework/ir/multi_batch_merge_pass.h @@ -0,0 +1,44 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/pass.h" + +namespace paddle { +namespace framework { +namespace ir { + +// BatchMergePass is used to copy forward and backward ops for several +// times to run several batches to simulate large batch size training +// as if we have more than 1 GPUs. +// User can define how many batches to run, gradients will be merged +// through those repeats, and then do optimization using merged gradients. +// This pass is extremely useful when doing large batch-size distributed +// sync training, we can simulate even large batch size as if we have more +// GPUs. + +class BatchMergePass : public Pass { + public: + virtual ~BatchMergePass() {} + + protected: + void ApplyImpl(Graph* graph) const override; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/node.cc b/paddle/fluid/framework/ir/node.cc index 9277abe8c1b79c5f76f4610d0554bf337f329518..45d81b937392244f678fbd01395b3ffffd07f710 100644 --- a/paddle/fluid/framework/ir/node.cc +++ b/paddle/fluid/framework/ir/node.cc @@ -13,17 +13,31 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/framework/ir/node.h" +#include "paddle/fluid/framework/op_info.h" namespace paddle { namespace framework { namespace ir { +// msvc15 don't support constexpr in correct way. +#if !defined(_WIN32) constexpr char Node::kControlDepVarName[]; -int Node::count_ = 0; +#else +const char Node::kControlDepVarName[] = "__control_var"; +#endif -std::unique_ptr CreateNodeForTest(const std::string& name, +std::unique_ptr CreateNodeForTest(const std::string &name, Node::Type type) { return std::unique_ptr(new Node(name, type)); } + +std::unique_ptr CreateNodeForTest(VarDesc *var_desc) { + return std::unique_ptr(new Node(var_desc)); +} + +std::unique_ptr CreateNodeForTest(OpDesc *op_desc) { + return std::unique_ptr(new Node(op_desc)); +} + } // namespace ir } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/ir/node.h b/paddle/fluid/framework/ir/node.h index 5d6da9f1d76a3c0fc64b7ff35264e385cf19a14b..72fb876d98dc84164398583baf22c49014af483a 100644 --- a/paddle/fluid/framework/ir/node.h +++ b/paddle/fluid/framework/ir/node.h @@ -14,7 +14,10 @@ limitations under the License. */ #pragma once +#include #include +#include +#include #include #include "paddle/fluid/framework/op_desc.h" #include "paddle/fluid/framework/var_desc.h" @@ -24,17 +27,45 @@ namespace paddle { namespace framework { namespace ir { -// Node should normally created by Graph::CreateXXXNode(). +// Node should only created by Graph::CreateXXXNode(). +// 1. Every Node should be part of a graph. No dangling Node exists. +// 2. Node only contains members necessary for building graph structure. +// It doesn't contain other unrelated members, such as device, etc. +// +// Sometimes, for specific usages, Node needs to have additional members, +// such as device_placement, version in order to be executed. It is suggested +// to use composition pattern. +// +// class RunnableOp { +// RunnableOp(ir::Node* n) : n_(n) { n_.WrappedBy(this); } +// +// int any_thing_; +// } +// +// RunnableOp is owned by the ir::Node that composes it. In other words. +// ir::Node will be responsible for deleting RunnableOp, say, when ir::Node +// is deleted from the graph. class Node { public: + virtual ~Node() { + if (!wrapper_.empty()) { + VLOG(4) << "ir::Node deleting a wrapper node " << Name(); + wrapper_deleter_(); + } + } + enum class Type { kOperation, kVariable }; +#if !defined(_WIN32) // msvc not support constexpr correctly. static constexpr char kControlDepVarName[] = "__control_var"; +#else + static const char kControlDepVarName[]; +#endif Type NodeType() const { return type_; } std::string Name() const { return name_; } - VarDesc* Var() { + VarDesc* Var() const { PADDLE_ENFORCE(IsVar()); return var_desc_.get(); } @@ -44,6 +75,30 @@ class Node { return op_desc_.get(); } + // Set the `wrapper` that wraps the Node. `wrapper` is owned by Node. + template + void WrappedBy(T* wrapper) { + if (!wrapper_.empty()) { + wrapper_deleter_(); + } + wrapper_ = wrapper; + wrapper_deleter_ = [wrapper]() { delete wrapper; }; + wrapper_type_ = std::type_index(typeid(T)); + } + + // Return a reference to the `wrapper`. + template + T& Wrapper() { + return *boost::any_cast(wrapper_); + } + + // Test if the Node is wrapped by type T. + template + bool IsWrappedBy() { + return std::type_index(typeid(T)) == wrapper_type_; + } + + // Please don't use this API! int id() const { return id_; } bool IsOp() const { return type_ == Type::kOperation; } @@ -64,41 +119,44 @@ class Node { int id_; private: + // ID can only set by a Graph. + void SetId(int id) { id_ = id; } + friend class Graph; friend std::unique_ptr CreateNodeForTest(const std::string& name, Node::Type type); + friend std::unique_ptr CreateNodeForTest(VarDesc* var_desc); + friend std::unique_ptr CreateNodeForTest(OpDesc* op_desc); explicit Node(const std::string& name, Type type) - : name_(name), - var_desc_(nullptr), - op_desc_(nullptr), - type_(type), - id_(count_++) {} + : name_(name), var_desc_(nullptr), op_desc_(nullptr), type_(type) {} explicit Node(VarDesc* var_desc) : name_(var_desc->Name()), var_desc_(new VarDesc(*var_desc)), op_desc_(nullptr), - type_(Type::kVariable), - id_(count_++) {} + type_(Type::kVariable) {} explicit Node(OpDesc* op_desc) : name_(op_desc->Type()), var_desc_(nullptr), op_desc_(new OpDesc(*op_desc, op_desc->Block())), - type_(Type::kOperation), - id_(count_++) {} + type_(Type::kOperation) {} Node() = delete; - static int count_; - static void ResetId() { count_ = 0; } + boost::any wrapper_; + std::function wrapper_deleter_; + std::type_index wrapper_type_ = std::type_index(typeid(void)); + DISABLE_COPY_AND_ASSIGN(Node); }; std::unique_ptr CreateNodeForTest(const std::string& name, Node::Type type); +std::unique_ptr CreateNodeForTest(VarDesc* var_desc); +std::unique_ptr CreateNodeForTest(OpDesc* op_desc); } // namespace ir } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/ir/node_test.cc b/paddle/fluid/framework/ir/node_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..694efadda078169c993457181c00f7b357a09e87 --- /dev/null +++ b/paddle/fluid/framework/ir/node_test.cc @@ -0,0 +1,80 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "gtest/gtest.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/pass.h" + +namespace paddle { +namespace framework { +namespace ir { + +class RunnableOp { + public: + RunnableOp(Node* node, bool* alive) : node_(node), alive_(alive) { + node_->WrappedBy(this); + } + + virtual ~RunnableOp() { *alive_ = false; } + + private: + Node* node_; + bool* alive_; +}; + +class RunnableOp2 { + public: + RunnableOp2(Node* node, bool* alive) : node_(node), alive_(alive) { + node_->WrappedBy(this); + } + + virtual ~RunnableOp2() { *alive_ = false; } + + private: + Node* node_; + bool* alive_; +}; + +TEST(NodeTest, Basic) { + bool alive1 = true; + bool alive2 = true; + std::unique_ptr n1(CreateNodeForTest("n1", Node::Type::kVariable)); + std::unique_ptr n2(CreateNodeForTest("n2", Node::Type::kVariable)); + + EXPECT_FALSE(n1->IsWrappedBy()); + EXPECT_FALSE(n1->IsWrappedBy()); + EXPECT_FALSE(n2->IsWrappedBy()); + EXPECT_FALSE(n2->IsWrappedBy()); + + new RunnableOp(n1.get(), &alive1); + new RunnableOp2(n2.get(), &alive2); + + EXPECT_TRUE(n1->IsWrappedBy()); + EXPECT_FALSE(n1->IsWrappedBy()); + EXPECT_FALSE(n2->IsWrappedBy()); + EXPECT_TRUE(n2->IsWrappedBy()); + + EXPECT_TRUE(alive1); + EXPECT_TRUE(alive2); + + n1.reset(nullptr); + n2.reset(nullptr); + EXPECT_FALSE(alive1); + EXPECT_FALSE(alive2); +} + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/pass.cc b/paddle/fluid/framework/ir/pass.cc index 6cf405efe63d2bc284c4650771a747b27bb4a9f6..4a29bde0917d3cce97d69ff3b896d09a2aae82ba 100644 --- a/paddle/fluid/framework/ir/pass.cc +++ b/paddle/fluid/framework/ir/pass.cc @@ -13,13 +13,18 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/framework/ir/pass.h" + +#include +#include + #include "paddle/fluid/framework/ir/graph_helper.h" namespace paddle { namespace framework { namespace ir { -std::unique_ptr Pass::Apply(std::unique_ptr graph) const { - PADDLE_ENFORCE(graph.get(), "graph passed to Pass::Apply() cannot be empty."); + +Graph* Pass::Apply(Graph* graph) const { + PADDLE_ENFORCE(graph, "graph passed to Pass::Apply() cannot be empty."); for (const std::string& attr : required_pass_attrs_) { PADDLE_ENFORCE(attrs_.find(attr) != attrs_.end(), "Required pass atrribute %s not set.", attr); @@ -28,12 +33,16 @@ std::unique_ptr Pass::Apply(std::unique_ptr graph) const { PADDLE_ENFORCE(graph->Has(attr), "Required graph atrribute %s not set.", attr); } - auto applied_graph = ApplyImpl(std::move(graph)); + auto* native_graph = graph; + ApplyImpl(graph); // TODO(panyx0718): Add more verifications. - PADDLE_ENFORCE(!HasCircle(*applied_graph), + PADDLE_ENFORCE(!HasCircle(*graph), "Illegal Pass. Generated graph shouldn't has cycle."); + PADDLE_ENFORCE(graph == native_graph, + "Pass::Apply() cannot delete the passed graph and shouldn't " + "return a new graph.(For the need of pybind11)"); applied_ = true; - return applied_graph; + return graph; } PassRegistry& PassRegistry::Instance() { diff --git a/paddle/fluid/framework/ir/pass.h b/paddle/fluid/framework/ir/pass.h index 9570c59cff2a6afeb1c607f7219b7b455974d6ce..6cbe9a8212775512431860591526b52665ec4037 100644 --- a/paddle/fluid/framework/ir/pass.h +++ b/paddle/fluid/framework/ir/pass.h @@ -16,8 +16,10 @@ limitations under the License. */ #include #include +#include #include - +#include +#include #include "paddle/fluid/framework/ir/graph.h" #include "paddle/fluid/framework/ir/node.h" #include "paddle/fluid/framework/program_desc.h" @@ -44,18 +46,25 @@ class Pass { std::string Type() const { return type_; } - std::unique_ptr Apply(std::unique_ptr graph) const; + Graph *Apply(Graph *graph) const; // Get a reference to the attributed previously set. template AttrType &Get(const std::string &attr_name) const { PADDLE_ENFORCE(attrs_.find(attr_name) != attrs_.end(), "%s attr not registered for pass.", attr_name); - return *boost::any_cast(attrs_.at(attr_name)); + try { + return *boost::any_cast(attrs_.at(attr_name)); + } catch (boost::bad_any_cast &) { + PADDLE_THROW( + "Invalid attribute type of %s error, expected: %s, actual: %s", + attr_name, typeid(AttrType *).name(), + attrs_.at(attr_name).type().name()); + } } bool Has(const std::string &attr_name) const { - return attrs_.find(attr_name) != attrs_.end(); + return attrs_.count(attr_name) > 0; } void Erase(const std::string &attr_name) { @@ -91,7 +100,7 @@ class Pass { } protected: - virtual std::unique_ptr ApplyImpl(std::unique_ptr graph) const { + virtual void ApplyImpl(Graph *graph) const { LOG(FATAL) << "Calling virtual Pass not implemented."; } @@ -196,26 +205,26 @@ struct PassRegistrar : public Registrar { msg) // Register a new pass that can be applied on the IR. -#define REGISTER_PASS(pass_type, pass_class) \ - STATIC_ASSERT_PASS_GLOBAL_NAMESPACE( \ - __reg_pass__##pass_type, \ - "REGISTER_PASS must be called in global namespace"); \ - static ::paddle::framework::ir::PassRegistrar \ - __pass_registrar_##pass_type##__(#pass_type); \ - int TouchPassRegistrar_##pass_type() { \ - __pass_registrar_##pass_type##__.Touch(); \ - return 0; \ - } \ - static ::paddle::framework::ir::PassRegistrar \ - &__pass_tmp_registrar_##pass_type##__ __attribute__((unused)) = \ +#define REGISTER_PASS(pass_type, pass_class) \ + STATIC_ASSERT_PASS_GLOBAL_NAMESPACE( \ + __reg_pass__##pass_type, \ + "REGISTER_PASS must be called in global namespace"); \ + static ::paddle::framework::ir::PassRegistrar \ + __pass_registrar_##pass_type##__(#pass_type); \ + int TouchPassRegistrar_##pass_type() { \ + __pass_registrar_##pass_type##__.Touch(); \ + return 0; \ + } \ + static ::paddle::framework::ir::PassRegistrar \ + &__pass_tmp_registrar_##pass_type##__ UNUSED = \ __pass_registrar_##pass_type##__ -#define USE_PASS(pass_type) \ - STATIC_ASSERT_PASS_GLOBAL_NAMESPACE( \ - __use_pass_itself_##pass_type, \ - "USE_PASS must be called in global namespace"); \ - extern int TouchPassRegistrar_##pass_type(); \ - static int use_pass_itself_##pass_type##_ __attribute__((unused)) = \ +#define USE_PASS(pass_type) \ + STATIC_ASSERT_PASS_GLOBAL_NAMESPACE( \ + __use_pass_itself_##pass_type, \ + "USE_PASS must be called in global namespace"); \ + extern int TouchPassRegistrar_##pass_type(); \ + static int use_pass_itself_##pass_type##_ UNUSED = \ TouchPassRegistrar_##pass_type() } // namespace ir diff --git a/paddle/fluid/framework/ir/pass_test.cc b/paddle/fluid/framework/ir/pass_test.cc index 6ad7d1df8bdd016b617c820c022ef55f23ba21cd..87e3c96416926cb07550b1eb4d1fd3ec6131c8ec 100644 --- a/paddle/fluid/framework/ir/pass_test.cc +++ b/paddle/fluid/framework/ir/pass_test.cc @@ -13,7 +13,9 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/framework/ir/pass.h" +#include #include +#include #include "gtest/gtest.h" #include "paddle/fluid/framework/ir/graph.h" @@ -39,7 +41,7 @@ void BuildCircleGraph(Graph* g) { class TestPass : public Pass { protected: - std::unique_ptr ApplyImpl(std::unique_ptr graph) const { + void ApplyImpl(ir::Graph* graph) const { graph->Set("copy_test_pass_attr", new int); graph->Set("copy_test_graph_attr", new int); @@ -48,7 +50,6 @@ class TestPass : public Pass { int test_graph_attr = graph->Get("test_graph_attr"); graph->Get("copy_test_graph_attr") = test_graph_attr + 1; - return graph; } }; @@ -58,7 +59,7 @@ TEST(PassTest, TestPassAttrCheck) { std::unique_ptr graph(new Graph(prog)); std::string exception; try { - graph = pass->Apply(std::move(graph)); + graph.reset(pass->Apply(graph.release())); } catch (paddle::platform::EnforceNotMet e) { exception = std::string(e.what()); } @@ -69,7 +70,7 @@ TEST(PassTest, TestPassAttrCheck) { pass->SetNotOwned("test_pass_attr", &val); try { - graph = pass->Apply(std::move(graph)); + graph.reset(pass->Apply(graph.release())); } catch (paddle::platform::EnforceNotMet e) { exception = std::string(e.what()); } @@ -78,14 +79,14 @@ TEST(PassTest, TestPassAttrCheck) { graph.reset(new Graph(prog)); graph->Set("test_graph_attr", new int); graph->Get("test_graph_attr") = 1; - graph = pass->Apply(std::move(graph)); + graph.reset(pass->Apply(graph.release())); ASSERT_EQ(graph->Get("copy_test_pass_attr"), 2); ASSERT_EQ(graph->Get("copy_test_graph_attr"), 2); // Allow apply more than once. graph.reset(new Graph(prog)); graph->Set("test_graph_attr", new int); - graph = pass->Apply(std::move(graph)); + graph.reset(pass->Apply(graph.release())); pass = PassRegistry::Instance().Get("test_pass"); pass->SetNotOwned("test_pass_attr", &val); @@ -94,7 +95,7 @@ TEST(PassTest, TestPassAttrCheck) { graph->Set("test_graph_attr", new int); graph->Get("test_graph_attr") = 2; try { - auto tmp = pass->Apply(std::move(graph)); + pass->Apply(graph.release()); } catch (paddle::platform::EnforceNotMet e) { exception = std::string(e.what()); } diff --git a/paddle/fluid/framework/ir/quant_conv2d_dequant_fuse_pass.cc b/paddle/fluid/framework/ir/quant_conv2d_dequant_fuse_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..7cab9c353d35cb6d725d787986e992b6853d42ce --- /dev/null +++ b/paddle/fluid/framework/ir/quant_conv2d_dequant_fuse_pass.cc @@ -0,0 +1,173 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include +#include +#include + +#include "paddle/fluid/framework/ir/graph_viz_pass.h" +#include "paddle/fluid/framework/ir/quant_conv2d_dequant_fuse_pass.h" + +namespace paddle { +namespace framework { +namespace ir { + +void RunQuantDequant(ir::Graph* graph, Scope* scope, int times, + std::string op_type) { + const std::string pattern_name = "quant_dequant_fuse"; + // FusePassBase::Init(pattern_name, graph); + const int kNumFields = 5; + const int kQuantizedWeightOffset = 0; + const int kQuantizedOpOffset = 1; + const int kQuantizedOpOutOffset = 2; + const int kDequantOpOffset = 3; + const int kDequantOpOutOffset = 4; + + GraphPatternDetector gpd; + auto* x = gpd.mutable_pattern() + ->NewNode("x") + ->assert_is_op_input("fake_quantize_range_abs_max", "X") + ->AsInput(); + + std::string quantized_op_type = ""; + std::string weight_name = ""; + if (op_type == "conv2d") { + quantized_op_type = "conv2d"; + weight_name = "Filter"; + } else if (op_type == "conv2d_fusion") { + quantized_op_type = "conv2d_fusion"; + weight_name = "Filter"; + } else if (op_type == "mul") { + quantized_op_type = "mul"; + weight_name = "Y"; + } else if (op_type == "fc") { + quantized_op_type = "fc"; + weight_name = "W"; + } else { + PADDLE_ENFORCE( + "QuantDequantFuse: We only support conv2d, conv2d_fusion, fc, mul for " + "now."); + } + + patterns::QuantDequantOpFuse pattern(gpd.mutable_pattern(), pattern_name); + pattern(x, quantized_op_type, weight_name, times); + + auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* g) { + PADDLE_ENFORCE(subgraph.count(x)); + auto* input_node = subgraph.at(x); + Node* quant_op_in_scale = + subgraph.at(pattern.GetPDNode("quant_op_in_scale")); + Node* quant_op = subgraph.at(pattern.GetPDNode("quant_op")); + Node* quant_op_out_scale = + subgraph.at(pattern.GetPDNode("quant_op_out_scale")); + Node* quant_op_out = subgraph.at(pattern.GetPDNode("quant_op_out")); + + std::vector nodes; + for (int i = 0; i < times; i++) { + nodes.push_back(subgraph.at( + pattern.GetPDNode("quantized_op_weight" + std::to_string(i)))); + nodes.push_back( + subgraph.at(pattern.GetPDNode("quantized_op" + std::to_string(i)))); + nodes.push_back(subgraph.at( + pattern.GetPDNode("quantized_op_out" + std::to_string(i)))); + nodes.push_back( + subgraph.at(pattern.GetPDNode("dequant_op" + std::to_string(i)))); + nodes.push_back( + subgraph.at(pattern.GetPDNode("dequant_op_out" + std::to_string(i)))); + } + + int bit_length = boost::get(quant_op->Op()->GetAttr("bit_length")); + int range = ((1 << (bit_length - 1)) - 1); + // Prepare input scale + std::string input_scale_var_name = quant_op->Op()->Input("InScale").front(); + PADDLE_ENFORCE(scope); + const LoDTensor& input_scale_tensor = + scope->FindVar(input_scale_var_name)->Get(); + + PADDLE_ENFORCE(paddle::platform::is_cpu_place(input_scale_tensor.place())); + const float* input_scale_data = input_scale_tensor.data(); + float input_scale = input_scale_data[0]; + std::unordered_set delete_nodes; + + for (int i = 0; i < times; i++) { + // max_range = (range * range) / weight_scale + float max_range = boost::get( + nodes[i * kNumFields + kDequantOpOffset]->Op()->GetAttr("max_range")); + float weight_scale = (range * range) / max_range; + + auto base_op_desc = + *nodes[i * kNumFields + kQuantizedOpOffset]->Op()->Proto(); + std::string new_input = input_node->Name(); + std::string new_output = + nodes[i * kNumFields + kDequantOpOutOffset]->Name(); + + framework::OpDesc new_op_desc(base_op_desc, nullptr); + new_op_desc.SetType(quantized_op_type); + + if (quantized_op_type == "conv2d" || + quantized_op_type == "conv2d_fusion") { + new_op_desc.SetInput("Input", {new_input}); + new_op_desc.SetOutput("Output", {new_output}); + } else if (quantized_op_type == "fc") { + new_op_desc.SetInput("Input", {new_input}); + new_op_desc.SetOutput("Out", {new_output}); + } else if (quantized_op_type == "mul") { + new_op_desc.SetInput("X", {new_input}); + new_op_desc.SetOutput("Out", {new_output}); + } + + new_op_desc.SetAttr("enable_int8", true); + new_op_desc.SetAttr("input_scale", input_scale); + new_op_desc.SetAttr("weight_scale", weight_scale); + new_op_desc.Flush(); + auto* new_op = graph->CreateOpNode(&new_op_desc); + IR_NODE_LINK_TO(input_node, new_op); + IR_NODE_LINK_TO(nodes[i * kNumFields + kQuantizedWeightOffset], new_op); + IR_NODE_LINK_TO(new_op, nodes[i * kNumFields + kDequantOpOutOffset]); + delete_nodes.insert(nodes[i * kNumFields + kQuantizedOpOffset]); + delete_nodes.insert(nodes[i * kNumFields + kQuantizedOpOutOffset]); + delete_nodes.insert(nodes[i * kNumFields + kDequantOpOffset]); + } + + delete_nodes.insert(quant_op_in_scale); + delete_nodes.insert(quant_op); + delete_nodes.insert(quant_op_out); + delete_nodes.insert(quant_op_out_scale); + // Delete the unneeded nodes. + GraphSafeRemoveNodes(graph, delete_nodes); + }; + gpd(graph, handler); +} + +void QuantDequantFusePass::ApplyImpl(ir::Graph* graph) const { + const std::string pattern_name = "quant_dequant_fuse"; + FusePassBase::Init(pattern_name, graph); + + std::unordered_set quantized_op_types = {"conv2d", "mul"}; + auto* scope = param_scope(); + for (auto& op_type : quantized_op_types) { + for (int i = 1; i <= 6; i++) { + RunQuantDequant(graph, scope, i, op_type); + } + } +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(quant_conv2d_dequant_fuse_pass, + paddle::framework::ir::QuantDequantFusePass); diff --git a/paddle/fluid/framework/ir/quant_conv2d_dequant_fuse_pass.h b/paddle/fluid/framework/ir/quant_conv2d_dequant_fuse_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..a61b34563acc4cbcee778509a097587222579295 --- /dev/null +++ b/paddle/fluid/framework/ir/quant_conv2d_dequant_fuse_pass.h @@ -0,0 +1,35 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include + +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" + +namespace paddle { +namespace framework { +namespace ir { + +class QuantDequantFusePass : public FusePassBase { + public: + virtual ~QuantDequantFusePass() {} + + protected: + void ApplyImpl(ir::Graph* graph) const override; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/repeated_fc_relu_fuse_pass.cc b/paddle/fluid/framework/ir/repeated_fc_relu_fuse_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..00263b8a34851b6d4cf2aac1456b3b4514356acd --- /dev/null +++ b/paddle/fluid/framework/ir/repeated_fc_relu_fuse_pass.cc @@ -0,0 +1,384 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. */ + +#include "paddle/fluid/framework/ir/repeated_fc_relu_fuse_pass.h" +#include // for max +#include +#include +#include +#include "paddle/fluid/framework/lod_tensor.h" + +#define MAX_NUM_FC 10 + +namespace paddle { +namespace framework { +namespace ir { + +PDNode* BuildRepeatedFCReluPattern(PDPattern* pattern, + const std::string& name_scope, int num_fc) { + auto var_next_is_fc_act = [=](Node* x, const std::string& act_type = "relu", + bool check_in_has_only_one_out = true, + int fc_idx = 0) -> bool { + bool next_is_fc = x && x->IsVar() && VarLinksToOp(x, "fc"); + if (check_in_has_only_one_out) { + next_is_fc = next_is_fc && x->outputs.size() == 1; + } + if (!next_is_fc) { + return false; + } + auto* fc_op = x->outputs[fc_idx]; + bool next_is_act = fc_op && fc_op->IsOp() && fc_op->outputs.size() == 1 && + fc_op->outputs[0] && fc_op->outputs[0]->IsVar() && + VarLinksToOp(fc_op->outputs[0], act_type) && + fc_op->outputs[0]->outputs.size() == 1; + if (!next_is_act) { + return false; + } + auto* act_op = fc_op->outputs[0]->outputs[0]; + return act_op && act_op->IsOp() && act_op->outputs.size() == 1; + }; + + auto find_fc_idx = [=](Node* x, const std::string& act_type = "relu") -> int { + bool next_is_fc = x && x->IsVar() && VarLinksToOp(x, "fc"); + if (!next_is_fc) { + return 0; + } + for (size_t k = 0; k < x->outputs.size(); ++k) { + auto* fc_op = x->outputs[k]; + bool next_is_act = fc_op && fc_op->IsOp() && fc_op->outputs.size() == 1 && + fc_op->outputs[0] && fc_op->outputs[0]->IsVar() && + VarLinksToOp(fc_op->outputs[0], act_type) && + fc_op->outputs[0]->outputs.size() == 1; + if (!next_is_act) { + continue; + } + auto* act_op = fc_op->outputs[0]->outputs[0]; + if (act_op && act_op->IsOp() && act_op->outputs.size() == 1) { + return k; + } + } + return 0; + }; + + auto next_var_of_part = [=](Node* x, int fc_idx = 0) -> Node* { + return x->outputs[fc_idx]->outputs[0]->outputs[0]->outputs[0]; + }; + auto var_next_is_fc_act_repeated_n_times = [=]( + Node* x, int repeated_times, const std::string& act_type = "relu", + bool check_in_has_only_one_out = true) -> bool { + for (int i = 0; i < repeated_times; ++i) { + if (!var_next_is_fc_act(x, act_type, + i == 0 && check_in_has_only_one_out)) { + return false; + } + x = next_var_of_part(x); + } + return true; + }; + + auto var_before_is_fc_act = [=](Node* x, const std::string& act_type = "relu", + bool at_top = false) -> bool { + bool before_is_act = + x && x->IsVar() && x->inputs.size() == 1 && VarLinksFromOp(x, "relu"); + if (!before_is_act) { + return false; + } + auto* relu_op = x->inputs[0]; + bool before_is_fc = relu_op->IsOp() && relu_op->inputs.size() == 1 && + relu_op->inputs[0]->IsVar() && + VarLinksFromOp(relu_op->inputs[0], "fc") && + relu_op->inputs[0]->inputs.size() == 1; + + if (!before_is_fc) { + return false; + } + auto* fc_op = relu_op->inputs[0]->inputs[0]; + bool is_fc = fc_op->IsOp() && fc_op->inputs.size() == 3; + if (!is_fc) { + return false; + } + for (auto* fc_i : fc_op->inputs) { + if (!fc_i->inputs.empty()) { + if (at_top) { + return true; + } else { + return VarLinksFromOp(fc_i, "relu"); + } + } + } + return false; + }; + + auto before_var_of_part = [=](Node* x) -> Node* { + auto* fc_op = x->inputs[0]->inputs[0]; + for (auto* fc_i : fc_op->inputs) { + if (!fc_i->inputs.empty()) { + return fc_i->inputs[0]; + } + } + return nullptr; + }; + + auto var_before_is_fc_act_repeated_n_times = [=]( + Node* x, int repeated_times, + const std::string& act_type = "relu") -> bool { + for (int i = 0; i < repeated_times; ++i) { + if (!var_before_is_fc_act(x, act_type, i == repeated_times - 1)) { + return false; + } + x = before_var_of_part(x); + } + return true; + }; + + std::vector fc_input_var(num_fc); + std::vector fc_output_var(num_fc); + std::vector fc_weight_var(num_fc); + std::vector fc_bias_var(num_fc); + std::vector fc_ops(num_fc); + std::vector relu_ops(num_fc); + + for (int i = 0; i < num_fc; ++i) { + fc_input_var[i] = pattern->NewNode( + [=](Node* x) { + if (i == 0 && x->outputs.size() > 0) { + bool ok = x->inputs.size() > 0; + if (!ok) { + return false; + } + int idx = find_fc_idx(x); + if (idx == 0) { + return var_next_is_fc_act_repeated_n_times(x, num_fc - i, "relu"); + } else { + x = next_var_of_part(x, idx); + return var_next_is_fc_act_repeated_n_times( + x, std::max(1, num_fc - i - 1), "relu"); + } + } else { + return var_next_is_fc_act_repeated_n_times(x, num_fc - i, "relu") && + x->inputs.size() > 0 && + var_before_is_fc_act_repeated_n_times(x, i, "relu"); + } + }, + name_scope + "/fc_in_" + std::to_string(i)); + + fc_weight_var[i] = pattern->NewNode( + [=](Node* x) { + return var_next_is_fc_act_repeated_n_times(x, num_fc - i, "relu") && + x->inputs.empty() && + var_before_is_fc_act_repeated_n_times(x->outputs[0]->inputs[0], + i, "relu") && + x->Name() == x->outputs[0]->Op()->Input("W")[0]; + }, + name_scope + "/fc_weight_" + std::to_string(i)); + + fc_bias_var[i] = pattern->NewNode( + [=](Node* x) { + return var_next_is_fc_act_repeated_n_times(x, num_fc - i, "relu") && + x->inputs.empty() && + var_before_is_fc_act_repeated_n_times(x->outputs[0]->inputs[0], + i, "relu") && + x->Name() == x->outputs[0]->Op()->Input("Bias")[0]; + }, + name_scope + "/fc_bias_" + std::to_string(i)); + + fc_output_var[i] = pattern->NewNode( + [=](Node* x) { + bool basic = x && x->IsVar() && VarLinksFromOp(x, "fc") && + VarLinksToOp(x, "relu") && x->inputs.size() == 1 && + x->inputs[0]->inputs.size() == 3; + if (!basic) { + return false; + } + x = x->inputs[0]->inputs[0]; + if (i == 0 && x->outputs.size() > 0) { + bool ok = x->inputs.size() > 0; + if (!ok) { + return false; + } + int idx = find_fc_idx(x); + if (idx == 0) { + return var_next_is_fc_act_repeated_n_times(x, num_fc - i, "relu"); + } else { + x = next_var_of_part(x, idx); + return var_next_is_fc_act_repeated_n_times( + x, std::max(1, num_fc - i - 1), "relu"); + } + } else { + return var_next_is_fc_act_repeated_n_times(x, num_fc - i, "relu") && + x->inputs.size() > 0 && + var_before_is_fc_act_repeated_n_times(x, i, "relu"); + } + }, + name_scope + "/fc_out_" + std::to_string(i)); + + fc_ops[i] = pattern->NewNode( + [=](Node* x) { + bool basic = x && x->IsOp() && x->Op()->Type() == "fc" && + x->inputs.size() == 3 && x->outputs.size() == 1; + if (!basic) { + return false; + } + auto* fc_out_var = x->outputs[0]; + return fc_out_var && fc_out_var->IsVar() && + fc_out_var->outputs.size() == 1 && + VarLinksToOp(fc_out_var, "relu") && + fc_out_var->outputs[0]->outputs.size() == 1 && + var_next_is_fc_act_repeated_n_times( + fc_out_var->outputs[0]->outputs[0], num_fc - i - 1, + "relu") && + var_before_is_fc_act_repeated_n_times( + fc_out_var->outputs[0]->outputs[0], i + 1, "relu"); + }, + name_scope + "/fc_op_" + std::to_string(i)); + + relu_ops[i] = pattern->NewNode( + [=](Node* x) { + return x && x->IsOp() && x->Op()->Type() == "relu" && + x->inputs.size() == 1 && x->outputs.size() == 1 && + x->inputs[0]->IsVar() && VarLinksFromOp(x->inputs[0], "fc") && + x->outputs[0]->IsVar() && + var_next_is_fc_act_repeated_n_times(x->outputs[0], + num_fc - i - 1, "relu") && + var_before_is_fc_act_repeated_n_times(x->outputs[0], i + 1, + "relu"); + }, + name_scope + "/act_op_" + std::to_string(i)); + + fc_ops[i] + ->LinksFrom({fc_input_var[i], fc_weight_var[i], fc_bias_var[i]}) + .LinksTo({fc_output_var[i]}); + relu_ops[i]->LinksFrom({fc_output_var[i]}); + } + + auto* last_out_var = pattern->NewNode( + [=](Node* x) { + return var_before_is_fc_act_repeated_n_times(x, num_fc, "relu"); + }, + name_scope + "/act_out"); + for (int i = 0; i < num_fc - 1; ++i) { + relu_ops[i]->LinksTo({fc_input_var[i + 1]}); + } + relu_ops[num_fc - 1]->LinksTo({last_out_var}); + return last_out_var; +} + +static int BuildFusion(Graph* graph, const std::string& name_scope, + int num_fc) { + GraphPatternDetector gpd; + auto* pattern = gpd.mutable_pattern(); + BuildRepeatedFCReluPattern(pattern, name_scope, num_fc); + + auto retrieve_node = [](const std::string& name, + const GraphPatternDetector::subgraph_t& subgraph, + const PDPattern& pat) -> Node* { + PADDLE_ENFORCE(subgraph.count(pat.RetrieveNode(name)), + "pattern has no Node called %s", name.c_str()); + Node* p = subgraph.at(pat.RetrieveNode(name)); + PADDLE_ENFORCE_NOT_NULL(p, "subgraph has no node %s", name.c_str()); + return p; + }; + + int fusion_count{0}; + auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* g) { + LOG(INFO) << "handle Repeated FC Act fuse"; + std::vector weights_vars(num_fc); + std::vector bias_vars(num_fc); + std::vector relu_vars(num_fc - 1); + + std::vector weight_names(num_fc); + std::vector bias_names(num_fc); + std::vector relu_names(num_fc - 1); + + auto& fused_pattern = gpd.pattern(); + for (int i = 0; i < num_fc; ++i) { + if (i >= 1) { + relu_vars[i - 1] = + retrieve_node(name_scope + "/fc_in_" + std::to_string(i), subgraph, + fused_pattern); + relu_names[i - 1] = relu_vars[i - 1]->Name(); + } + + weights_vars[i] = + retrieve_node(name_scope + "/fc_weight_" + std::to_string(i), + subgraph, fused_pattern); + weight_names[i] = weights_vars[i]->Name(); + + bias_vars[i] = retrieve_node(name_scope + "/fc_bias_" + std::to_string(i), + subgraph, fused_pattern); + bias_names[i] = bias_vars[i]->Name(); + } + + auto* input_var = + retrieve_node(name_scope + "/fc_in_0", subgraph, fused_pattern); + auto* last_out_var = + retrieve_node(name_scope + "/act_out", subgraph, fused_pattern); + + // Create New OpDesc + OpDesc op_desc; + op_desc.SetType("fusion_repeated_fc_relu"); + op_desc.SetInput("X", {input_var->Name()}); + op_desc.SetInput("W", weight_names); + op_desc.SetInput("Bias", bias_names); + op_desc.SetOutput("ReluOut", relu_names); + op_desc.SetOutput("Out", {last_out_var->Name()}); + auto* op = graph->CreateOpNode(&op_desc); + IR_NODE_LINK_TO(input_var, op); + for (size_t i = 0; i < weights_vars.size(); ++i) { + IR_NODE_LINK_TO(weights_vars[i], op); + IR_NODE_LINK_TO(bias_vars[i], op); + } + for (size_t i = 0; i < relu_vars.size(); ++i) { + IR_NODE_LINK_TO(op, relu_vars[i]); + } + IR_NODE_LINK_TO(op, last_out_var); + + std::unordered_set marked_nodes; + for (auto& item : subgraph) { + marked_nodes.insert(item.second); + } + for (size_t i = 0; i < weights_vars.size(); ++i) { + marked_nodes.erase(weights_vars[i]); + marked_nodes.erase(bias_vars[i]); + } + for (size_t i = 0; i < relu_vars.size(); ++i) { + marked_nodes.erase(relu_vars[i]); + } + marked_nodes.erase(input_var); + marked_nodes.erase(last_out_var); + GraphSafeRemoveNodes(graph, marked_nodes); + ++fusion_count; + }; + + gpd(graph, handler); + return fusion_count; +} + +void RepeatedFCReluFusePass::ApplyImpl(ir::Graph* graph) const { + FusePassBase::Init(name_scope_, graph); + int fusion_count = 0; + for (int i = MAX_NUM_FC; i > 1; --i) { + fusion_count += + BuildFusion(graph, name_scope_ + "/" + std::to_string(i), i); + } + AddStatis(fusion_count); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(repeated_fc_relu_fuse_pass, + paddle::framework::ir::RepeatedFCReluFusePass); diff --git a/paddle/fluid/framework/ir/repeated_fc_relu_fuse_pass.h b/paddle/fluid/framework/ir/repeated_fc_relu_fuse_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..ae777bccebec9f99b4752fe495f96d3da38aac23 --- /dev/null +++ b/paddle/fluid/framework/ir/repeated_fc_relu_fuse_pass.h @@ -0,0 +1,41 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. */ + +#pragma once + +#include +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" + +namespace paddle { +namespace framework { +namespace ir { + +/** + * Fuse Repeated FC Relu + */ +class RepeatedFCReluFusePass : public FusePassBase { + public: + virtual ~RepeatedFCReluFusePass() {} + + protected: + void ApplyImpl(ir::Graph* graph) const override; + + const std::string name_scope_{"repeated_fc_relu_fuse"}; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/runtime_context_cache_pass.cc b/paddle/fluid/framework/ir/runtime_context_cache_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..566b654f237cbd71e1983c971374ee13d7b36805 --- /dev/null +++ b/paddle/fluid/framework/ir/runtime_context_cache_pass.cc @@ -0,0 +1,37 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/ir/runtime_context_cache_pass.h" +#include +#include "paddle/fluid/framework/operator.h" + +namespace paddle { +namespace framework { +namespace ir { + +void RuntimeContextCachePass::ApplyImpl(ir::Graph* graph) const { + VLOG(3) << "Applies Runtime Context Cache strategy."; + for (const Node* n : graph->Nodes()) { + if (n->IsOp() && n->Op()) { + n->Op()->SetAttr(kEnableCacheRuntimeContext, true); + } + } +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(runtime_context_cache_pass, + paddle::framework::ir::RuntimeContextCachePass); diff --git a/paddle/fluid/framework/ir/runtime_context_cache_pass.h b/paddle/fluid/framework/ir/runtime_context_cache_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..e4783166e0cbde0be9037df5afe3e903a40a2065 --- /dev/null +++ b/paddle/fluid/framework/ir/runtime_context_cache_pass.h @@ -0,0 +1,31 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include "paddle/fluid/framework/ir/pass.h" + +namespace paddle { +namespace framework { +namespace ir { + +class RuntimeContextCachePass : public Pass { + protected: + void ApplyImpl(ir::Graph* graph) const override; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/seq_concat_fc_fuse_pass.cc b/paddle/fluid/framework/ir/seq_concat_fc_fuse_pass.cc index a7d5161c35db804703415066990f34da8109fbd9..b230c50167136d2616068078ce619e8362c38fde 100644 --- a/paddle/fluid/framework/ir/seq_concat_fc_fuse_pass.cc +++ b/paddle/fluid/framework/ir/seq_concat_fc_fuse_pass.cc @@ -13,6 +13,9 @@ // limitations under the License. #include "paddle/fluid/framework/ir/seq_concat_fc_fuse_pass.h" +#include +#include +#include #include "paddle/fluid/framework/ir/fuse_pass_base.h" #include "paddle/fluid/framework/ir/graph_pattern_detector.h" #include "paddle/fluid/framework/ir/graph_viz_pass.h" @@ -159,10 +162,7 @@ PDNode* BuildFCPattern(PDPattern* pattern, PDNode* fc_x) { std::set acts({"sigmoid", "tanh", "relu", "identity"}); PDNode* act = pattern->NewNode( - [=](Node* x) { - return x && x->IsOp() && acts.count(x->Op()->Type()); - - }, + [=](Node* x) { return x && x->IsOp() && acts.count(x->Op()->Type()); }, "act"); PDNode* fc_out = pattern->NewNode( @@ -178,9 +178,8 @@ PDNode* BuildFCPattern(PDPattern* pattern, PDNode* fc_x) { return fc_out; } -std::unique_ptr SeqConcatFcFusePass::ApplyImpl( - std::unique_ptr graph) const { - FusePassBase::Init("seq_concat_fc_fuse", graph.get()); +void SeqConcatFcFusePass::ApplyImpl(ir::Graph* graph) const { + FusePassBase::Init("seq_concat_fc_fuse", graph); GraphPatternDetector detector; auto* pattern = detector.mutable_pattern(); auto* concat_out = BuildSeqExpandConcatPattern(pattern); @@ -194,8 +193,8 @@ std::unique_ptr SeqConcatFcFusePass::ApplyImpl( int fuse_count{0}; - detector(graph.get(), [&](const GraphPatternDetector::subgraph_t& subgraph, - Graph* graph) { + detector(graph, [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* graph) { VLOG(4) << "get one concat pattern"; // fc GET_NODE(fc_w, detector.pattern()); @@ -246,8 +245,6 @@ std::unique_ptr SeqConcatFcFusePass::ApplyImpl( }); AddStatis(fuse_count); - - return graph; } } // namespace ir diff --git a/paddle/fluid/framework/ir/seq_concat_fc_fuse_pass.h b/paddle/fluid/framework/ir/seq_concat_fc_fuse_pass.h index 9f5fd1a29adf918806d8f30097d8c7f002f48f3e..d68840a554777e64082f7f9e467221bc0948d9dd 100644 --- a/paddle/fluid/framework/ir/seq_concat_fc_fuse_pass.h +++ b/paddle/fluid/framework/ir/seq_concat_fc_fuse_pass.h @@ -12,6 +12,8 @@ // See the License for the specific language governing permissions and // limitations under the License. +#pragma once + #include "paddle/fluid/framework/ir/fuse_pass_base.h" #include "paddle/fluid/framework/ir/graph.h" #include "paddle/fluid/framework/ir/pass.h" @@ -25,7 +27,7 @@ class SeqConcatFcFusePass : public FusePassBase { virtual ~SeqConcatFcFusePass() {} protected: - std::unique_ptr ApplyImpl(std::unique_ptr graph) const; + void ApplyImpl(ir::Graph* graph) const override; }; } // namespace ir diff --git a/paddle/fluid/framework/ir/seqconv_eltadd_relu_fuse_pass.cc b/paddle/fluid/framework/ir/seqconv_eltadd_relu_fuse_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..3fd368741fb09d41351a97c5e9cf1a5436f350d0 --- /dev/null +++ b/paddle/fluid/framework/ir/seqconv_eltadd_relu_fuse_pass.cc @@ -0,0 +1,99 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/seqconv_eltadd_relu_fuse_pass.h" +#include +#include +#include "paddle/fluid/framework/lod_tensor.h" + +namespace paddle { +namespace framework { +namespace ir { + +int BuildFusion(Graph* graph, const std::string& name_scope, Scope* scope) { + GraphPatternDetector gpd; + auto* pattern = gpd.mutable_pattern(); + + PDNode* x = pattern->NewNode(patterns::PDNodeName(name_scope, "X")) + ->assert_is_op_input("sequence_conv") + ->assert_var_not_persistable(); + patterns::SeqConvEltAddRelu fuse_pattern(pattern, name_scope); + fuse_pattern(x); + + // Create New OpDesc + auto fuse_creator = [&](Node* seqconv, Node* input, Node* seqconv_weight, + Node* eltadd_bias, Node* relu_out) { + OpDesc op_desc; + op_desc.SetType("fusion_seqconv_eltadd_relu"); + op_desc.SetInput("X", {input->Name()}); + op_desc.SetInput("Filter", {seqconv_weight->Name()}); + op_desc.SetInput("Bias", {eltadd_bias->Name()}); + op_desc.SetAttr("contextLength", seqconv->Op()->GetAttr("contextLength")); + op_desc.SetAttr("contextStart", seqconv->Op()->GetAttr("contextStart")); + op_desc.SetAttr("contextStride", seqconv->Op()->GetAttr("contextStride")); + PADDLE_ENFORCE(graph->Has(kParamScopeAttr)); + auto* scope = graph->Get(kParamScopeAttr); + const std::string ColMat = patterns::UniqueKey("SeqConvColMat"); + op_desc.SetOutput("ColMat", {ColMat}); + op_desc.SetOutput("Out", {relu_out->Name()}); + scope->Var(ColMat)->GetMutable(); + + auto* op = graph->CreateOpNode(&op_desc); + IR_NODE_LINK_TO(input, op); + IR_NODE_LINK_TO(seqconv_weight, op); + IR_NODE_LINK_TO(eltadd_bias, op); + IR_NODE_LINK_TO(op, relu_out); + return op; + }; + + int fusion_count{0}; + + auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* g) { + VLOG(4) << "handle SeqConv EltAdd Relu fuse"; + GET_IR_NODE_FROM_SUBGRAPH(seqconv, seqconv, fuse_pattern); + GET_IR_NODE_FROM_SUBGRAPH(seqconv_weight, seqconv_weight, fuse_pattern); + GET_IR_NODE_FROM_SUBGRAPH(seqconv_out, seqconv_out, fuse_pattern); + GET_IR_NODE_FROM_SUBGRAPH(eltadd, eltadd, fuse_pattern); + GET_IR_NODE_FROM_SUBGRAPH(eltadd_bias, eltadd_bias, fuse_pattern); + GET_IR_NODE_FROM_SUBGRAPH(eltadd_out, eltadd_out, fuse_pattern); + GET_IR_NODE_FROM_SUBGRAPH(relu, relu, fuse_pattern); + GET_IR_NODE_FROM_SUBGRAPH(relu_out, relu_out, fuse_pattern); + + fuse_creator(seqconv, subgraph.at(x), seqconv_weight, eltadd_bias, + relu_out); + std::unordered_set marked_nodes( + {seqconv, seqconv_out, eltadd, eltadd_out, relu}); + GraphSafeRemoveNodes(graph, marked_nodes); + ++fusion_count; + }; + + gpd(graph, handler); + + return fusion_count; +} + +void SeqConvEltAddReluFusePass::ApplyImpl(ir::Graph* graph) const { + FusePassBase::Init(name_scope_, graph); + + int fusion_count = BuildFusion(graph, name_scope_, param_scope()); + AddStatis(fusion_count); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(seqconv_eltadd_relu_fuse_pass, + paddle::framework::ir::SeqConvEltAddReluFusePass); diff --git a/paddle/fluid/framework/ir/seqconv_eltadd_relu_fuse_pass.h b/paddle/fluid/framework/ir/seqconv_eltadd_relu_fuse_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..fde9b586c85712b14d285cec49f9e09efad78fc7 --- /dev/null +++ b/paddle/fluid/framework/ir/seqconv_eltadd_relu_fuse_pass.h @@ -0,0 +1,38 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" + +namespace paddle { +namespace framework { +namespace ir { + +class SeqConvEltAddReluFusePass : public FusePassBase { + public: + virtual ~SeqConvEltAddReluFusePass() {} + + protected: + void ApplyImpl(ir::Graph* graph) const override; + + const std::string name_scope_{"seqconv_eltadd_relu_fuse"}; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/seqpool_concat_fuse_pass.cc b/paddle/fluid/framework/ir/seqpool_concat_fuse_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..4ac379eb0471ea1a8a72c393dad405be90b2fa33 --- /dev/null +++ b/paddle/fluid/framework/ir/seqpool_concat_fuse_pass.cc @@ -0,0 +1,213 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. */ + +#include "paddle/fluid/framework/ir/seqpool_concat_fuse_pass.h" +#include +#include +#include +#include "paddle/fluid/framework/lod_tensor.h" + +#define MAX_CONCAT_INPUTS 200 + +namespace paddle { +namespace framework { +namespace ir { + +PDNode* BuildSeqPoolConcatPattern(PDPattern* pattern, + const std::string& name_scope, + int num_inputs) { + auto is_concat_op_with_inputs = [](Node* x, int num) -> bool { + return x && x->IsOp() && x->Op()->Type() == "concat" && + x->Op()->Input("X").size() == static_cast(num); + }; + + auto is_nth_input_var_of_concat = [=](Node* x, int idx) -> bool { + return x && x->IsVar() && VarLinksToOp(x, "concat") && + x->outputs.size() == 1 && IsNthInput(x, x->outputs[0], "X", idx) && + is_concat_op_with_inputs(x->outputs[0], num_inputs); + }; + + auto is_seqpool_op_with_pootype_of_nth_input_of_concat = [=]( + Node* x, const std::string& type, int idx) -> bool { + bool this_is_seqpool_op = + x && x->IsOp() && x->Op()->Type() == "sequence_pool" && + x->Op()->HasAttr("pooltype") && + boost::get(x->Op()->GetAttr("pooltype")) == type && + x->outputs.size() == 2; // seqpool should only have 2 outputs + bool satisfied_all = this_is_seqpool_op; + if (this_is_seqpool_op) { + // Only one output of seqpool_op is nth_input_var of concat, + // the other one should be unused empty var. + if (is_nth_input_var_of_concat(x->outputs[0], idx)) { + satisfied_all = satisfied_all && x->outputs[1]->IsVar() && + x->outputs[1]->outputs.empty(); + } else { + satisfied_all = + satisfied_all && is_nth_input_var_of_concat(x->outputs[1], idx) && + x->outputs[0]->IsVar() && x->outputs[0]->outputs.size() == 0; + } + } + return satisfied_all; + }; + + auto* concat_op = pattern->NewNode( + [=](Node* x) { return is_concat_op_with_inputs(x, num_inputs); }, + name_scope + "/concat_op"); + concat_op->assert_op_attr("axis", 1); + + auto* concat_out_var = pattern->NewNode( + [=](Node* x) { + return x && x->IsVar() && VarLinksFromOp(x, "concat") && + x->inputs.size() == 1 && + is_concat_op_with_inputs(x->inputs[0], num_inputs); + }, + name_scope + "/concat_out_var"); + concat_out_var->assert_is_only_output_of_op("concat"); + + std::vector seqpool_ops_input_var(num_inputs); + std::vector seqpool_ops_output_var(num_inputs); + std::vector seqpool_ops_output_unused_var(num_inputs); + std::vector seqpool_ops(num_inputs); + + for (int i = 0; i < num_inputs; ++i) { + seqpool_ops_output_var[i] = pattern->NewNode( + [=](Node* x) { + return x && x->IsVar() && is_nth_input_var_of_concat(x, i) && + x->inputs.size() == 1 && + is_seqpool_op_with_pootype_of_nth_input_of_concat(x->inputs[0], + "SUM", i); + }, + name_scope + "/sequence_pool_out_" + std::to_string(i)); + + seqpool_ops_output_unused_var[i] = pattern->NewNode( + [=](Node* x) { + return x && x->IsVar() && x->inputs.size() == 1 && + x->outputs.size() == 0 && + is_seqpool_op_with_pootype_of_nth_input_of_concat(x->inputs[0], + "SUM", i); + }, + name_scope + "/sequence_pool_unused_out_" + std::to_string(i)); + + seqpool_ops[i] = pattern->NewNode( + [=](Node* x) { + return x && x->IsOp() && + is_seqpool_op_with_pootype_of_nth_input_of_concat(x, "SUM", i); + }, + name_scope + "/sequence_pool_op_" + std::to_string(i)); + + seqpool_ops_input_var[i] = pattern->NewNode( + [=](Node* x) { + bool basic = x && x->IsVar() && x->outputs.size() >= 1; + bool next_is_fine = false; + for (auto* o : x->outputs) { + if (is_seqpool_op_with_pootype_of_nth_input_of_concat(o, "SUM", + i)) { + next_is_fine = true; + break; + } + } + return basic && next_is_fine; + }, + name_scope + "/sequence_pool_in_" + std::to_string(i)); + + // Links + seqpool_ops[i] + ->LinksFrom({seqpool_ops_input_var[i]}) + .LinksTo({seqpool_ops_output_var[i], seqpool_ops_output_unused_var[i]}); + } + concat_op->LinksFrom(seqpool_ops_output_var).LinksTo({concat_out_var}); + return concat_out_var; +} + +static int BuildFusion(Graph* graph, const std::string& name_scope, + int num_inputs) { + GraphPatternDetector gpd; + auto* pattern = gpd.mutable_pattern(); + BuildSeqPoolConcatPattern(pattern, name_scope, num_inputs); + + auto retrieve_node = [](const std::string& name, + const GraphPatternDetector::subgraph_t& subgraph, + const PDPattern& pat) -> Node* { + PADDLE_ENFORCE(subgraph.count(pat.RetrieveNode(name)), + "pattern has no Node called %s", name.c_str()); + Node* p = subgraph.at(pat.RetrieveNode(name)); + PADDLE_ENFORCE_NOT_NULL(p, "subgraph has no node %s", name.c_str()); + return p; + }; + + int fusion_count{0}; + auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* g) { + VLOG(4) << "handle SeqPool Concat fuse"; + std::vector input_names(num_inputs); + std::vector input_vars(num_inputs); + auto& fused_pattern = gpd.pattern(); + for (int i = 0; i < num_inputs; ++i) { + input_vars[i] = + retrieve_node(name_scope + "/sequence_pool_in_" + std::to_string(i), + subgraph, fused_pattern); + input_names[i] = input_vars[i]->Name(); + } + auto* concat_op = + retrieve_node(name_scope + "/concat_op", subgraph, fused_pattern); + auto* concat_out_var = + retrieve_node(name_scope + "/concat_out_var", subgraph, fused_pattern); + auto* seqpool_op0 = retrieve_node(name_scope + "/sequence_pool_op_0", + subgraph, fused_pattern); + + // Create New OpDesc + OpDesc op_desc; + op_desc.SetType("fusion_seqpool_concat"); + op_desc.SetInput("X", input_names); + op_desc.SetAttr("pooltype", seqpool_op0->Op()->GetAttr("pooltype")); + op_desc.SetAttr("axis", concat_op->Op()->GetAttr("axis")); + op_desc.SetOutput("Out", {concat_out_var->Name()}); + auto* op = graph->CreateOpNode(&op_desc); + for (size_t i = 0; i < input_vars.size(); ++i) { + IR_NODE_LINK_TO(input_vars[i], op); + } + IR_NODE_LINK_TO(op, concat_out_var); + + std::unordered_set marked_nodes; + for (auto& item : subgraph) { + marked_nodes.insert(item.second); + } + for (size_t i = 0; i < input_vars.size(); ++i) { + marked_nodes.erase(input_vars[i]); + } + marked_nodes.erase(concat_out_var); + GraphSafeRemoveNodes(graph, marked_nodes); + ++fusion_count; + }; + + gpd(graph, handler); + return fusion_count; +} + +void SeqPoolConcatFusePass::ApplyImpl(ir::Graph* graph) const { + FusePassBase::Init(name_scope_, graph); + int fusion_count = 0; + for (int i = MAX_CONCAT_INPUTS; i > 0; --i) { + fusion_count += + BuildFusion(graph, name_scope_ + "/" + std::to_string(i), i); + } + AddStatis(fusion_count); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(seqpool_concat_fuse_pass, + paddle::framework::ir::SeqPoolConcatFusePass); diff --git a/paddle/fluid/framework/ir/seqpool_concat_fuse_pass.h b/paddle/fluid/framework/ir/seqpool_concat_fuse_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..40a9edc5e642320996f5bd3451479fe347f24081 --- /dev/null +++ b/paddle/fluid/framework/ir/seqpool_concat_fuse_pass.h @@ -0,0 +1,52 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. */ + +#pragma once + +#include +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" + +namespace paddle { +namespace framework { +namespace ir { + +/** + * Fuse SequencePool(with sum pooltype yet) and Concat; + * + * Before fuse: + * | | | + * seq_pool, seq_pool, ... seq_pool + * \ | ... / + * concat + * | + * After fuse: + * \ | / + * FusionSeqPoolConcat + * | + */ +class SeqPoolConcatFusePass : public FusePassBase { + public: + virtual ~SeqPoolConcatFusePass() {} + + protected: + void ApplyImpl(ir::Graph* graph) const override; + + const std::string name_scope_{"seqpool_concat_fuse"}; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/seqpool_concat_fuse_pass_tester.cc b/paddle/fluid/framework/ir/seqpool_concat_fuse_pass_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..d3668038518429ee04b6abba5b1f7f09eea1c9f3 --- /dev/null +++ b/paddle/fluid/framework/ir/seqpool_concat_fuse_pass_tester.cc @@ -0,0 +1,198 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/seqpool_concat_fuse_pass.h" +#include +#include "paddle/fluid/framework/op_proto_maker.h" + +namespace paddle { +namespace framework { +namespace ir { + +void SetOp(ProgramDesc* prog, const std::string& type, + const std::vector& inputs, + const std::vector& outputs) { + auto* op = prog->MutableBlock(0)->AppendOp(); + op->SetType(type); + if (type == "sequence_pool") { + op->SetInput("X", {inputs[0]}); + std::string pooltype = "SUM"; + op->SetAttr("pooltype", pooltype); + op->SetOutput("MaxIndex", {outputs[0]}); + op->SetOutput("Out", {outputs[1]}); + } else if (type == "concat") { + op->SetInput("X", inputs); + op->SetAttr("axis", 1); + op->SetOutput("Out", {outputs[0]}); + } else { + op->SetInput("X", inputs); + op->SetOutput("Out", outputs); + } + op->SetAttr(OpProtoAndCheckerMaker::OpRoleAttrName(), + static_cast(OpRole::kForward)); +} + +int CountOpType(const ir::Graph* graph, + const std::string& op_type = "fusion_seqpool_concat") { + int count = 0; + for (auto* node : graph->Nodes()) { + if (node->IsOp() && node->Op()->Type() == op_type) { + ++count; + } + } + return count; +} + +std::unique_ptr GetNumNodesOfBeforeAfter( + std::unique_ptr graph, int* before, int* after, + const std::string& pass_type = "seqpool_concat_fuse_pass") { + auto pass = PassRegistry::Instance().Get(pass_type); + *before = graph->Nodes().size(); + graph.reset(pass->Apply(graph.release())); + *after = graph->Nodes().size(); + return graph; +} + +/* + * Before fuse: + * a b c + * | | | + * op1 op2 op3 + * / \ / \ / \ + * d e f g h i + * \ | / + * concat + * | + * j + * Type of op1, op2 and op3 are sequence_pool, with "SUM" pooltype attr + * + * After fuse: + * a b c + * \ | / + * fusion_seqpool_concat + * | + * j + */ +TEST(SeqPoolConcatFusePass, basic) { + ProgramDesc prog; + for (auto& v : std::vector( + {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"})) { + auto* var = prog.MutableBlock(0)->Var(v); + var->SetType(proto::VarType::LOD_TENSOR); + } + + SetOp(&prog, "sequence_pool", std::vector({"a"}), + std::vector({"d", "e"})); + SetOp(&prog, "sequence_pool", std::vector({"b"}), + std::vector({"f", "g"})); + SetOp(&prog, "sequence_pool", std::vector({"c"}), + std::vector({"h", "i"})); + SetOp(&prog, "concat", std::vector({"e", "g", "i"}), + std::vector({"j"})); + + std::unique_ptr graph(new ir::Graph(prog)); + int before, after; + graph = GetNumNodesOfBeforeAfter(std::move(graph), &before, &after); + // Remove 10 Nodes: op1, op2, op3, d, e, f, g, h, i, concat_op + // Add 1 Node: fusion_seqpool_concat + EXPECT_EQ(after, before - 9); + EXPECT_EQ(CountOpType(graph.get()), 1); +} + +/* + * Before fuse: + * a b + * | / \ + * op1 op2 op3 + * / \ / \ \ + * c d e f g + * \ / + * concat + * | + * h + * Type of op1 and op2 are sequence_pool, with "SUM" pooltype attr + * + * After fuse: + * a b + * \ / \ + * fusion_seqpool_concat op3 + * | | + * h g + */ +TEST(SeqPoolConcatFusePass, advanced) { + ProgramDesc prog; + for (auto& v : + std::vector({"a", "b", "c", "d", "e", "f", "g", "h"})) { + auto* var = prog.MutableBlock(0)->Var(v); + var->SetType(proto::VarType::LOD_TENSOR); + } + + SetOp(&prog, "sequence_pool", std::vector({"a"}), + std::vector({"c", "d"})); + SetOp(&prog, "sequence_pool", std::vector({"b"}), + std::vector({"e", "f"})); + SetOp(&prog, "op3", std::vector({"b"}), + std::vector({"g"})); + SetOp(&prog, "concat", std::vector({"d", "f"}), + std::vector({"h"})); + + std::unique_ptr graph(new ir::Graph(prog)); + int before, after; + graph = GetNumNodesOfBeforeAfter(std::move(graph), &before, &after); + // Remove 7 Nodes: op1, op2, c, d, e, f concat_op + // Add 1 Node: fusion_seqpool_concat + EXPECT_EQ(after, before - 6); + EXPECT_EQ(CountOpType(graph.get()), 1); +} + +ProgramDesc BuildProgramDesc(int num_inputs_of_concat) { + ProgramDesc prog; + auto new_var = [&](const std::string& name) { + auto* var = prog.MutableBlock(0)->Var(name); + var->SetType(proto::VarType::LOD_TENSOR); + }; + std::vector concat_inputs; + for (int i = 0; i < num_inputs_of_concat; ++i) { + std::string prefix = "seqpool_op_" + std::to_string(i); + new_var(prefix + "in"); + new_var(prefix + "out"); + new_var(prefix + "out_unused"); + SetOp(&prog, "sequence_pool", std::vector({prefix + "in"}), + std::vector({prefix + "out", prefix + "out_unused"})); + concat_inputs.push_back(prefix + "out"); + } + SetOp(&prog, "concat", concat_inputs, + std::vector({"concat_out"})); + return prog; +} + +// test more inputs of concat +TEST(SeqPoolConcatFusePass, more_inputs) { + for (int num : {1, 2, 10}) { + ProgramDesc prog = BuildProgramDesc(num); + std::unique_ptr graph(new ir::Graph(prog)); + int before, after; + graph = GetNumNodesOfBeforeAfter(std::move(graph), &before, &after); + // Remove Nodes: n * (seqpool_op, out, out_unused), and concat_op + // Add Node: fusion_seqpool_concat op + EXPECT_EQ(after, before - num * 3); + EXPECT_EQ(CountOpType(graph.get()), 1); + } +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +USE_PASS(seqpool_concat_fuse_pass); diff --git a/paddle/fluid/framework/ir/simplify_anakin_priorbox_detection_out_pass.cc b/paddle/fluid/framework/ir/simplify_anakin_priorbox_detection_out_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..b3606e4d922cc8f59dca90904466a889f83f6094 --- /dev/null +++ b/paddle/fluid/framework/ir/simplify_anakin_priorbox_detection_out_pass.cc @@ -0,0 +1,233 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include + +#include "paddle/fluid/framework/ir/graph_viz_pass.h" +#include "paddle/fluid/framework/ir/node.h" +#include "paddle/fluid/framework/ir/simplify_anakin_priorbox_detection_out_pass.h" + +namespace paddle { +namespace framework { +namespace ir { + +void RunSimplifyAnakinDetection(ir::Graph *graph, int times, bool is_density, + bool is_reshape) { + const std::string pattern_name = + "simplify_anakin_detection_pattern_pass" + std::to_string(times); + std::string priorbox_type = is_density ? "density_prior_box" : "prior_box"; + + GraphPatternDetector gpd; + std::vector input_nodes; + for (int i = 0; i < times; i++) { + input_nodes.push_back(gpd.mutable_pattern() + ->NewNode("x" + std::to_string(i)) + ->assert_is_op_input(priorbox_type, "Input") + ->AsInput()); + } + input_nodes.push_back(gpd.mutable_pattern() + ->NewNode("x" + std::to_string(times)) + ->assert_is_op_input("box_coder", "TargetBox") + ->AsInput()); + + input_nodes.push_back(gpd.mutable_pattern() + ->NewNode("x" + std::to_string(times + 1)) + ->assert_is_op_input("transpose2") + ->AsInput()); + + patterns::AnakinDetectionPattern pattern(gpd.mutable_pattern(), pattern_name); + pattern(input_nodes, times, priorbox_type, is_reshape); + + auto handler = [&](const GraphPatternDetector::subgraph_t &subgraph, + Graph *g) { + const int kNumFields = 7; + const int kPriorBoxLocOffset = 1; + const int kReshape1Offset = 2; + const int kReshape1OutOffset = 3; + const int kPriorBoxVarOffset = 4; + const int kReshape2Offset = 5; + const int kReshape2OutOffset = 6; + std::vector nodes; + + for (int i = 0; i < times; i++) { + PADDLE_ENFORCE( + subgraph.at(pattern.GetPDNode("prior_box" + std::to_string(i)))); + PADDLE_ENFORCE( + subgraph.at(pattern.GetPDNode("box_out" + std::to_string(i)))); + PADDLE_ENFORCE( + subgraph.at(pattern.GetPDNode("reshape1" + std::to_string(i)))); + PADDLE_ENFORCE( + subgraph.at(pattern.GetPDNode("reshape1_out" + std::to_string(i)))); + PADDLE_ENFORCE( + subgraph.at(pattern.GetPDNode("reshape2" + std::to_string(i)))); + PADDLE_ENFORCE( + subgraph.at(pattern.GetPDNode("reshape2_out" + std::to_string(i)))); + + PADDLE_ENFORCE( + subgraph.at(pattern.GetPDNode("box_var_out" + std::to_string(i)))); + + nodes.push_back( + subgraph.at(pattern.GetPDNode("prior_box" + std::to_string(i)))); + nodes.push_back( + subgraph.at(pattern.GetPDNode("box_out" + std::to_string(i)))); + nodes.push_back( + subgraph.at(pattern.GetPDNode("reshape1" + std::to_string(i)))); + nodes.push_back( + subgraph.at(pattern.GetPDNode("reshape1_out" + std::to_string(i)))); + nodes.push_back( + subgraph.at(pattern.GetPDNode("box_var_out" + std::to_string(i)))); + nodes.push_back( + subgraph.at(pattern.GetPDNode("reshape2" + std::to_string(i)))); + nodes.push_back( + subgraph.at(pattern.GetPDNode("reshape2_out" + std::to_string(i)))); + } + + Node *concat_op1 = subgraph.at(pattern.GetPDNode("concat1")); + Node *concat_out1 = subgraph.at(pattern.GetPDNode("concat1_out")); + + Node *concat_op2 = subgraph.at(pattern.GetPDNode("concat2")); + Node *concat_out2 = subgraph.at(pattern.GetPDNode("concat2_out")); + + Node *box_coder_third_input = subgraph.at(input_nodes[times]); + Node *box_coder_op = subgraph.at(pattern.GetPDNode("box_coder")); + Node *box_coder_out = subgraph.at(pattern.GetPDNode("box_coder_out")); + + Node *multiclass_nms_second_input = subgraph.at(input_nodes[times + 1]); + Node *transpose_before_nms = + subgraph.at(pattern.GetPDNode("transpose_before_nms")); + Node *transpose_before_nms_out = + subgraph.at(pattern.GetPDNode("transpose_before_nms_out")); + + Node *multiclass_nms = subgraph.at(pattern.GetPDNode("multiclass_nms")); + Node *multiclass_nms_out = + subgraph.at(pattern.GetPDNode("multiclass_nms_out")); + + std::string code_type = + boost::get(box_coder_op->Op()->GetAttr("code_type")); + bool box_normalized = + boost::get(box_coder_op->Op()->GetAttr("box_normalized")); + + int background_label = + boost::get(multiclass_nms->Op()->GetAttr("background_label")); + float score_threshold = + boost::get(multiclass_nms->Op()->GetAttr("score_threshold")); + int nms_top_k = boost::get(multiclass_nms->Op()->GetAttr("nms_top_k")); + float nms_threshold = + boost::get(multiclass_nms->Op()->GetAttr("nms_threshold")); + float nms_eta = boost::get(multiclass_nms->Op()->GetAttr("nms_eta")); + int keep_top_k = + boost::get(multiclass_nms->Op()->GetAttr("keep_top_k")); + + std::vector concat1_input_names; + for (int i = 0; i < times; i++) { + concat1_input_names.push_back( + nodes[i * kNumFields + kPriorBoxLocOffset]->Name()); + } + + framework::OpDesc concat1_desc; + concat1_desc.SetType("concat"); + concat1_desc.SetInput("X", concat1_input_names); + concat1_desc.SetAttr("axis", 2); + concat1_desc.SetOutput("Out", {concat_out1->Name()}); + + auto *new_add_concat_op = graph->CreateOpNode(&concat1_desc); + + for (int i = 0; i < times; i++) { + nodes[i * kNumFields + kPriorBoxLocOffset]->outputs.push_back( + new_add_concat_op); + new_add_concat_op->inputs.push_back( + nodes[i * kNumFields + kPriorBoxLocOffset]); + } + + framework::OpDesc new_op_desc; + new_op_desc.SetType("detection_out"); + new_op_desc.SetInput("PriorBox", {concat_out1->Name()}); + new_op_desc.SetInput("TargetBox", {box_coder_third_input->Name()}); + new_op_desc.SetInput("Scores", {multiclass_nms_second_input->Name()}); + new_op_desc.SetAttr("code_type", code_type); + new_op_desc.SetAttr("box_normalized", box_normalized); + new_op_desc.SetAttr("background_label", background_label); + new_op_desc.SetAttr("score_threshold", score_threshold); + new_op_desc.SetAttr("nms_top_k", nms_top_k); + new_op_desc.SetAttr("nms_threshold", nms_threshold); + new_op_desc.SetAttr("nms_eta", nms_eta); + new_op_desc.SetAttr("keep_top_k", keep_top_k); + new_op_desc.SetOutput("Out", {multiclass_nms_out->Name()}); + new_op_desc.Flush(); + + // Create a new node for the fused op. + auto *detection_out_op = graph->CreateOpNode(&new_op_desc); + + std::unordered_set delete_nodes; + + for (int i = 0; i < times; i++) { + nodes[i * kNumFields + kPriorBoxLocOffset]->outputs.push_back(concat_op1); + delete_nodes.insert(nodes[i * kNumFields + kReshape1Offset]); + delete_nodes.insert(nodes[i * kNumFields + kReshape1OutOffset]); + delete_nodes.insert(nodes[i * kNumFields + kPriorBoxVarOffset]); + delete_nodes.insert(nodes[i * kNumFields + kReshape2Offset]); + delete_nodes.insert(nodes[i * kNumFields + kReshape2OutOffset]); + } + + delete_nodes.insert(concat_op1); + delete_nodes.insert(concat_op2); + delete_nodes.insert(concat_out2); + delete_nodes.insert(box_coder_op); + delete_nodes.insert(box_coder_out); + delete_nodes.insert(transpose_before_nms); + delete_nodes.insert(transpose_before_nms_out); + delete_nodes.insert(multiclass_nms); + + new_add_concat_op->outputs.push_back(concat_out1); + concat_out1->inputs.push_back(new_add_concat_op); + + detection_out_op->inputs.push_back(concat_out1); + detection_out_op->inputs.push_back(box_coder_third_input); + detection_out_op->inputs.push_back(multiclass_nms_second_input); + detection_out_op->outputs.push_back(multiclass_nms_out); + + concat_out1->outputs.push_back(detection_out_op); + box_coder_third_input->outputs.push_back(detection_out_op); + multiclass_nms_second_input->outputs.push_back(detection_out_op); + multiclass_nms_out->inputs.push_back(detection_out_op); + + // Delete the unneeded nodes. + GraphSafeRemoveNodes(graph, delete_nodes); + }; + + gpd(graph, handler); +} + +void SimplifyAnakinDetectionPatternPass::ApplyImpl(ir::Graph *graph) const { + const int pattern_nums = 6; + const std::string pattern_name = "simplify_anakin_detection_pattern_pass"; + FusePassBase::Init(pattern_name, graph); + std::vector options = {true, false}; + for (const auto &is_density : options) { + for (const auto &is_reshape : options) { + for (int i = 1; i <= pattern_nums; i++) { + RunSimplifyAnakinDetection(graph, i, is_density, is_reshape); + } + } + } +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +typedef paddle::framework::ir::SimplifyAnakinDetectionPatternPass + priorbox_pattern; +REGISTER_PASS(simplify_anakin_priorbox_detection_out_pass, priorbox_pattern); diff --git a/paddle/fluid/framework/ir/simplify_anakin_priorbox_detection_out_pass.h b/paddle/fluid/framework/ir/simplify_anakin_priorbox_detection_out_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..e882b9dc252e61a2e9e4e3666de49b7eee6d714a --- /dev/null +++ b/paddle/fluid/framework/ir/simplify_anakin_priorbox_detection_out_pass.h @@ -0,0 +1,39 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" + +namespace paddle { +namespace framework { +namespace ir { + +// There may be many transpose-flatten structures in a model, and the output of +// these structures will be used as inputs to the concat Op. This pattern will +// be detected by our pass. The times here represents the repeat times of this +// structure. +class SimplifyAnakinDetectionPatternPass : public FusePassBase { + public: + virtual ~SimplifyAnakinDetectionPatternPass() {} + + protected: + void ApplyImpl(ir::Graph* graph) const override; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/squared_mat_sub_fuse_pass.cc b/paddle/fluid/framework/ir/squared_mat_sub_fuse_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..42f4a91a6f421c28826d62bf30cbd4b2cb73805a --- /dev/null +++ b/paddle/fluid/framework/ir/squared_mat_sub_fuse_pass.cc @@ -0,0 +1,377 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. */ + +#include "paddle/fluid/framework/ir/squared_mat_sub_fuse_pass.h" +#include +#include +#include +#include "paddle/fluid/framework/lod_tensor.h" + +namespace paddle { +namespace framework { +namespace ir { + +PDNode* BuildSquaredMatSubPattern(PDPattern* pattern, + const std::string& name_scope) { + auto var_is_op_input = [=](Node* x, const std::string& op_type, + const std::string& arg_name = "") -> bool { + if (!(x && x->IsVar())) { + return false; + } + for (auto* op : x->outputs) { + if (op && op->IsOp() && op->Op() && op->Op()->Type() == op_type) { + if (arg_name.empty()) { + return true; + } + for (auto& name : op->Op()->Input(arg_name)) { + if (name == x->Name()) { + return true; + } + } + } + } + return false; + }; + + auto var_is_op_only_output = [](Node* x, const std::string& op_type) -> bool { + return x && x->IsVar() && x->inputs.size() == 1 && x->inputs[0] && + x->inputs[0]->IsOp() && x->inputs[0]->Op()->Type() == op_type && + x->inputs[0]->outputs.size() == 1; + }; + + auto next_op = [=](Node* x, const std::string& op_type) -> Node* { + if (!(x && x->IsVar())) { + return nullptr; + } + for (auto* op : x->outputs) { + if (op && op->IsOp() && op->Op() && op->Op()->Type() == op_type) { + return op; + } + } + return nullptr; + }; + + auto get_op_input_var = [=](Node* x, const std::string& arg_name) -> Node* { + if (!(x && x->IsOp())) { + return nullptr; + } + for (auto* var : x->inputs) { + for (auto name : x->Op()->Input(arg_name)) { + if (var->Name() == name) { + return var; + } + } + } + return nullptr; + }; + + auto is_fusion_input_var = [=](Node* x, const std::string& arg_name) { + bool basic = var_is_op_input(x, "matmul", arg_name) && + var_is_op_input(x, "square", "X"); + if (!basic) { + return false; + } + auto* squared_x_op = next_op(x, "square"); + if (!(squared_x_op && squared_x_op->outputs.size() == 1)) { + return false; + } + auto* squared_x = squared_x_op->outputs[0]; + bool next_is_matmul_from_arg = + var_is_op_input(squared_x, "matmul", arg_name) && + squared_x->outputs.size() == 1 && + squared_x->outputs[0]->outputs.size() == 1; + if (!next_is_matmul_from_arg) { + return false; + } + auto* sub_y_in = squared_x->outputs[0]->outputs[0]; + return var_is_op_input(sub_y_in, "elementwise_sub", "Y") && + sub_y_in->outputs[0]->outputs.size() == 1 && + var_is_op_input(sub_y_in->outputs[0]->outputs[0], "elementwise_mul"); + }; + + auto is_fusion_first_mul_out = [=](Node* x) -> bool { + bool input_is_matmul_op = x && x->inputs.size() == 1 && + x->inputs[0]->IsOp() && + x->inputs[0]->Op()->Type() == "matmul"; + if (!input_is_matmul_op) { + return false; + } + auto* mat_x = get_op_input_var(x->inputs[0], "X"); + auto* mat_y = get_op_input_var(x->inputs[0], "Y"); + bool input_mul_is_valid = mat_x && is_fusion_input_var(mat_x, "X") && + mat_y && is_fusion_input_var(mat_y, "Y"); + if (!input_mul_is_valid) { + return false; + } + + bool next_is_square = var_is_op_input(x, "square", "X") && + x->outputs.size() == 1 && + x->outputs[0]->outputs.size() == 1; + if (!next_is_square) { + return false; + } + auto* sub_x_in = x->outputs[0]->outputs[0]; + return var_is_op_input(sub_x_in, "elementwise_sub", "X") && + sub_x_in->outputs[0]->outputs.size() == 1 && + var_is_op_input(sub_x_in->outputs[0]->outputs[0], "elementwise_mul"); + }; + + auto* x = pattern->NewNode( + [=](Node* x) { return is_fusion_input_var(x, "X"); }, name_scope + "/x"); + + auto* y = pattern->NewNode( + [=](Node* x) { return is_fusion_input_var(x, "Y"); }, name_scope + "/y"); + + auto* square_x_op = pattern->NewNode( + [=](Node* x) { + return x && x->IsOp() && x->Op()->Type() == "square" && + is_fusion_input_var(x->inputs[0], "X"); + }, + name_scope + "/squared_x_op"); + + auto* square_y_op = pattern->NewNode( + [=](Node* x) { + return x && x->IsOp() && x->Op()->Type() == "square" && + is_fusion_input_var(x->inputs[0], "Y"); + }, + name_scope + "/squared_y_op"); + + auto* squared_x = pattern->NewNode( + [=](Node* x) { + return x && x->inputs.size() == 1 && x->inputs[0]->inputs.size() == 1 && + is_fusion_input_var(x->inputs[0]->inputs[0], "X"); + }, + name_scope + "/squared_x"); + + auto* squared_y = pattern->NewNode( + [=](Node* x) { + return x && x->inputs.size() == 1 && x->inputs[0]->inputs.size() == 1 && + is_fusion_input_var(x->inputs[0]->inputs[0], "Y"); + }, + name_scope + "/squared_y"); + + auto* matmuled_xy = + pattern->NewNode([=](Node* x) { return is_fusion_first_mul_out(x); }, + name_scope + "/matmuled_xy"); + + auto* matmul_xy_op = pattern->NewNode( + [=](Node* x) { + return x && x->IsOp() && x->Op()->Type() == "matmul" && + is_fusion_first_mul_out(x->outputs[0]); + }, + name_scope + "/matmul_xy_op"); + + auto* square_matmuled_xy_op = pattern->NewNode( + [=](Node* x) { + return x && x->IsOp() && x->Op()->Type() == "square" && + is_fusion_first_mul_out(x->inputs[0]); + }, + name_scope + "/square_matmuled_xy_op"); + + auto* squared_xmuly = pattern->NewNode( + [=](Node* x) { + return x && x->IsVar() && x->inputs.size() == 1 && + x->inputs[0]->IsOp() && x->inputs[0]->Op()->Type() == "square" && + is_fusion_first_mul_out(x->inputs[0]->inputs[0]); + }, + name_scope + "/squared_xmuly"); + + auto is_fusion_mat_squared_x_y_op_out = [=](Node* x) -> bool { + bool basic = x && x->IsVar() && x->inputs.size() == 1 && + x->inputs[0]->IsOp() && x->inputs[0]->Op()->Type() == "matmul"; + if (!basic) { + return false; + } + auto* sqx = get_op_input_var(x->inputs[0], "X"); + auto* sqy = get_op_input_var(x->inputs[0], "Y"); + + return var_is_op_only_output(sqx, "square") && + var_is_op_only_output(sqy, "square") && sqx->inputs[0] && + sqx->inputs[0]->inputs.size() == 1 && + is_fusion_input_var(sqx->inputs[0]->inputs[0], "X") && + sqy->inputs[0] && sqy->inputs[0]->inputs.size() == 1 && + is_fusion_input_var(sqy->inputs[0]->inputs[0], "Y"); + }; + + auto* matmul_squared_x_y_op = pattern->NewNode( + [=](Node* x) { + return x && x->IsOp() && x->Op()->Type() == "matmul" && + is_fusion_mat_squared_x_y_op_out(x->outputs[0]); + }, + name_scope + "/matmul_squared_x_y_op"); + + auto* mat_squared_x_y_op_out = pattern->NewNode( + [=](Node* x) { return is_fusion_mat_squared_x_y_op_out(x); }, + name_scope + "/mat_squared_x_y_op_out"); + + auto is_fusion_sub_op = [=](Node* x) -> bool { + bool is_sub_op = x && x->IsOp() && x->Op()->Type() == "elementwise_sub"; + if (!is_sub_op) { + return false; + } + auto* matmul_sqx_sqy_var = get_op_input_var(x, "Y"); + return is_fusion_mat_squared_x_y_op_out(matmul_sqx_sqy_var); + }; + + auto* sub_op = pattern->NewNode([=](Node* x) { return is_fusion_sub_op(x); }, + name_scope + "/sub_op"); + + auto* sub_op_out = pattern->NewNode( + [=](Node* x) { + return x && x->IsVar() && x->inputs.size() == 1 && + is_fusion_sub_op(x->inputs[0]); + }, + name_scope + "/sub_op_out"); + + auto is_fusion_element_op = [=](Node* x) -> bool { + bool is_elemul_op = x && x->IsOp() && x->Op()->Type() == "elementwise_mul"; + if (!is_elemul_op) { + return false; + } + for (auto* in : x->inputs) { + if (in && in->inputs[0] && is_fusion_sub_op(in->inputs[0])) { + return true; + } + } + return false; + }; + + auto* elementmul_op = + pattern->NewNode([=](Node* x) { return is_fusion_element_op(x); }, + name_scope + "/elementmul_op"); + + auto* constant_op = pattern->NewNode( + [=](Node* x) { + return x && x->IsOp() && x->Op()->Type() == "fill_constant" && + x->outputs.size() == 1 && + is_fusion_element_op(x->outputs[0]->outputs[0]); + }, + name_scope + "/fill_constant_op"); + + auto* constant_op_out = pattern->NewNode( + [=](Node* x) { + return x && x->IsVar() && var_is_op_input(x, "elementwise_mul") && + x->inputs[0] && x->inputs[0]->IsOp() && + x->inputs[0]->Op()->Type() == "fill_constant" && x->outputs[0] && + is_fusion_element_op(x->outputs[0]); + }, + name_scope + "/constant_op_out"); + + auto* last_out_var = pattern->NewNode( + [=](Node* x) { + return var_is_op_only_output(x, "elementwise_mul") && + is_fusion_element_op(x->inputs[0]); + }, + name_scope + "/out"); + + square_x_op->LinksFrom({x}).LinksTo({squared_x}); + square_y_op->LinksFrom({y}).LinksTo({squared_y}); + matmul_xy_op->LinksFrom({x, y}).LinksTo({matmuled_xy}); + matmul_squared_x_y_op->LinksFrom({squared_x, squared_y}) + .LinksTo({mat_squared_x_y_op_out}); + square_matmuled_xy_op->LinksFrom({matmuled_xy}).LinksTo({squared_xmuly}); + sub_op->LinksFrom({squared_xmuly, mat_squared_x_y_op_out}) + .LinksTo({sub_op_out}); + constant_op->LinksFrom({}).LinksTo({constant_op_out}); + elementmul_op->LinksFrom({constant_op_out, sub_op_out}) + .LinksTo({last_out_var}); + + return last_out_var; +} + +static int BuildFusion(Graph* graph, const std::string& name_scope) { + GraphPatternDetector gpd; + auto* pattern = gpd.mutable_pattern(); + + BuildSquaredMatSubPattern(pattern, name_scope); + + auto retrieve_node = [](const std::string& name, + const GraphPatternDetector::subgraph_t& subgraph, + const PDPattern& pat) -> Node* { + PADDLE_ENFORCE(subgraph.count(pat.RetrieveNode(name)), + "pattern has no Node called %s", name.c_str()); + Node* p = subgraph.at(pat.RetrieveNode(name)); + PADDLE_ENFORCE_NOT_NULL(p, "subgraph has no node %s", name.c_str()); + return p; + }; + + int fusion_count{0}; + auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* g) { + LOG(INFO) << "handle sqaure mat sub fuse"; + auto& fused_pattern = gpd.pattern(); + + auto* matx = retrieve_node(name_scope + "/x", subgraph, fused_pattern); + auto* maty = retrieve_node(name_scope + "/y", subgraph, fused_pattern); + auto* squaredx = + retrieve_node(name_scope + "/squared_x", subgraph, fused_pattern); + auto* squaredy = + retrieve_node(name_scope + "/squared_y", subgraph, fused_pattern); + auto* squaredxy = + retrieve_node(name_scope + "/squared_xmuly", subgraph, fused_pattern); + auto* last_out_var = + retrieve_node(name_scope + "/out", subgraph, fused_pattern); + auto* fill_constant_op = retrieve_node(name_scope + "/fill_constant_op", + subgraph, fused_pattern); + + // Create New OpDesc + OpDesc op_desc; + op_desc.SetType("fusion_squared_mat_sub"); + op_desc.SetInput("X", {matx->Name()}); + op_desc.SetInput("Y", {maty->Name()}); + op_desc.SetOutput("SquaredX", {squaredx->Name()}); + op_desc.SetOutput("SquaredY", {squaredy->Name()}); + op_desc.SetOutput("SquaredXY", {squaredxy->Name()}); + op_desc.SetOutput("Out", {last_out_var->Name()}); + op_desc.SetAttr("scalar", fill_constant_op->Op()->GetAttr("value")); + + auto* op = graph->CreateOpNode(&op_desc); + IR_NODE_LINK_TO(matx, op); + IR_NODE_LINK_TO(maty, op); + IR_NODE_LINK_TO(op, squaredx); + IR_NODE_LINK_TO(op, squaredy); + IR_NODE_LINK_TO(op, squaredxy); + IR_NODE_LINK_TO(op, last_out_var); + + std::unordered_set marked_nodes; + for (auto& item : subgraph) { + marked_nodes.insert(item.second); + } + + marked_nodes.erase(matx); + marked_nodes.erase(maty); + marked_nodes.erase(squaredx); + marked_nodes.erase(squaredy); + marked_nodes.erase(squaredxy); + marked_nodes.erase(last_out_var); + GraphSafeRemoveNodes(graph, marked_nodes); + ++fusion_count; + }; + + gpd(graph, handler); + return fusion_count; +} + +void SquaredMatSubFusePass::ApplyImpl(ir::Graph* graph) const { + FusePassBase::Init(name_scope_, graph); + int fusion_count = BuildFusion(graph, name_scope_); + AddStatis(fusion_count); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(squared_mat_sub_fuse_pass, + paddle::framework::ir::SquaredMatSubFusePass); diff --git a/paddle/fluid/framework/ir/squared_mat_sub_fuse_pass.h b/paddle/fluid/framework/ir/squared_mat_sub_fuse_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..b6165a512acdb9b6e3bdbf49196692ef83edb58f --- /dev/null +++ b/paddle/fluid/framework/ir/squared_mat_sub_fuse_pass.h @@ -0,0 +1,41 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. */ + +#pragma once + +#include +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" + +namespace paddle { +namespace framework { +namespace ir { + +/** + * Fuse ( (A.^2 * B.^2) - (A * B).^2 ) .* scalar + */ +class SquaredMatSubFusePass : public FusePassBase { + public: + virtual ~SquaredMatSubFusePass() {} + + protected: + void ApplyImpl(ir::Graph* graph) const override; + + const std::string name_scope_{"squared_mat_sub_fuse"}; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/sync_batch_norm_pass.cc b/paddle/fluid/framework/ir/sync_batch_norm_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..f4f924a604a231d1a25e169c4dd13f51eb90f266 --- /dev/null +++ b/paddle/fluid/framework/ir/sync_batch_norm_pass.cc @@ -0,0 +1,43 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/ir/sync_batch_norm_pass.h" +#include +#include +#include + +namespace paddle { +namespace framework { +namespace ir { + +void SyncBatchNormPass::ApplyImpl(ir::Graph* graph) const { + VLOG(3) << "Use synchronous batch norm"; + for (const Node* n : graph->Nodes()) { + if (n->IsOp()) { + auto* op = n->Op(); + if (op->Type() == "batch_norm") { + op->SetType("sync_batch_norm"); + } + if (op->Type() == "batch_norm_grad") { + op->SetType("sync_batch_norm_grad"); + } + } + } +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(sync_batch_norm_pass, paddle::framework::ir::SyncBatchNormPass); diff --git a/paddle/fluid/framework/ir/sync_batch_norm_pass.h b/paddle/fluid/framework/ir/sync_batch_norm_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..694fae74943060880ef199298064d20c5a526d18 --- /dev/null +++ b/paddle/fluid/framework/ir/sync_batch_norm_pass.h @@ -0,0 +1,31 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include "paddle/fluid/framework/ir/pass.h" + +namespace paddle { +namespace framework { +namespace ir { + +class SyncBatchNormPass : public Pass { + protected: + void ApplyImpl(ir::Graph* graph) const override; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/sync_batch_norm_pass_tester.cc b/paddle/fluid/framework/ir/sync_batch_norm_pass_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..894f96050edd607e1ea7df1c319cfeb3570662e5 --- /dev/null +++ b/paddle/fluid/framework/ir/sync_batch_norm_pass_tester.cc @@ -0,0 +1,80 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/sync_batch_norm_pass.h" +#include + +namespace paddle { +namespace framework { +namespace ir { + +void SetOp(ProgramDesc* prog, const std::string& type, const std::string& name, + const std::vector& inputs, + const std::vector& outputs) { + auto* op = prog->MutableBlock(0)->AppendOp(); + op->SetType(type); + op->SetAttr("name", name); + op->SetInput("X", inputs); + op->SetOutput("Out", outputs); +} + +// (a, conv_w)->conv2d->b +// (b, bn_scale, bn_bias, mean, var)->batch_norm +// ->(c, mean, var, save_mean, save_inv_var) +ProgramDesc BuildProgramDesc() { + ProgramDesc prog; + for (auto& v : std::vector({"a", "conv_w", "b", "bn_scale", + "bn_bias", "mean", "var", "c", + "save_mean", "save_inv_var"})) { + auto* var = prog.MutableBlock(0)->Var(v); + if (v == "conv_w" || v == "bn_scale" || v == "bn_bias" || v == "mean" || + v == "var") { + var->SetPersistable(true); + } + } + + SetOp(&prog, "conv2d", "conv", std::vector({"a", "conv_w"}), + std::vector({"b"})); + SetOp(&prog, "batch_norm", "bn", + std::vector({"b", "bn_scale", "bn_bias", "mean", "var"}), + std::vector( + {"c", "mean", "var", "save_mean", "save_inv_var"})); + return prog; +} + +TEST(IsTestPass, basic) { + auto prog = BuildProgramDesc(); + + std::unique_ptr graph(new ir::Graph(prog)); + + auto pass = PassRegistry::Instance().Get("sync_batch_norm_pass"); + + graph.reset(pass->Apply(graph.release())); + + for (auto* node : graph->Nodes()) { + if (node->IsOp()) { + auto* op = node->Op(); + auto op_name = boost::get(op->GetAttr("name")); + if (op_name == "bn") { + ASSERT_EQ(op->Type(), "sync_batch_norm"); + } + } + } +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +USE_PASS(sync_batch_norm_pass); diff --git a/paddle/fluid/framework/ir/transpose_flatten_concat_fuse_pass.cc b/paddle/fluid/framework/ir/transpose_flatten_concat_fuse_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..a984a4942b374c3e2c5f148f8147c55d0f5deb24 --- /dev/null +++ b/paddle/fluid/framework/ir/transpose_flatten_concat_fuse_pass.cc @@ -0,0 +1,137 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include +#include +#include + +#include "paddle/fluid/framework/ir/graph_viz_pass.h" +#include "paddle/fluid/framework/ir/node.h" +#include "paddle/fluid/framework/ir/transpose_flatten_concat_fuse_pass.h" + +namespace paddle { +namespace framework { +namespace ir { + +void RunTransposeFlattenConcatFuse(ir::Graph *graph, int times) { + const std::string pattern_name = + "transpose_flatten" + std::to_string(times) + "_concat_fuse"; + + GraphPatternDetector gpd; + std::vector input_nodes; + for (int i = 0; i < times; i++) { + input_nodes.push_back(gpd.mutable_pattern() + ->NewNode("x" + std::to_string(i)) + ->assert_is_op_input("transpose2", "X") + ->AsInput()); + } + + patterns::TransposeFlattenConcat pattern(gpd.mutable_pattern(), pattern_name); + pattern(input_nodes, times); + + auto handler = [&](const GraphPatternDetector::subgraph_t &subgraph, + Graph *g) { + const int kNumFields = 5; + const int kTransOffset = 1; + const int kTransOutOffset = 2; + const int kFlattenOffset = 3; + const int kFlattenOutOffset = 4; + std::vector nodes; + + for (int i = 0; i < times; i++) { + PADDLE_ENFORCE( + subgraph.at(pattern.GetPDNode("transpose" + std::to_string(i)))); + PADDLE_ENFORCE( + subgraph.at(pattern.GetPDNode("transpose_out" + std::to_string(i)))); + PADDLE_ENFORCE( + subgraph.at(pattern.GetPDNode("flatten" + std::to_string(i)))); + PADDLE_ENFORCE( + subgraph.at(pattern.GetPDNode("flatten_out" + std::to_string(i)))); + PADDLE_ENFORCE(subgraph.at(input_nodes[i])); + + nodes.push_back(subgraph.at(input_nodes[i])); + nodes.push_back( + subgraph.at(pattern.GetPDNode("transpose" + std::to_string(i)))); + nodes.push_back( + subgraph.at(pattern.GetPDNode("transpose_out" + std::to_string(i)))); + nodes.push_back( + subgraph.at(pattern.GetPDNode("flatten" + std::to_string(i)))); + nodes.push_back( + subgraph.at(pattern.GetPDNode("flatten_out" + std::to_string(i)))); + } + + Node *concat_op = subgraph.at(pattern.GetPDNode("concat")); + Node *concat_out = subgraph.at(pattern.GetPDNode("concat_out")); + std::vector input_names; + std::vector trans_axis = boost::get>( + nodes[kTransOffset]->Op()->GetAttr("axis")); + int flatten_axis = + boost::get(nodes[kFlattenOffset]->Op()->GetAttr("axis")); + int concat_axis = boost::get(concat_op->Op()->GetAttr("axis")); + std::string output_name = concat_out->Name(); + + for (int i = 0; i < times; i++) { + input_names.push_back(nodes[i * kNumFields]->Name()); + } + + framework::OpDesc new_op_desc; + new_op_desc.SetType("fusion_transpose_flatten_concat"); + new_op_desc.SetInput("X", input_names); + new_op_desc.SetAttr("trans_axis", trans_axis); + new_op_desc.SetAttr("flatten_axis", flatten_axis); + new_op_desc.SetAttr("concat_axis", concat_axis); + new_op_desc.SetOutput("Out", {output_name}); + new_op_desc.Flush(); + + // Create a new node for the fused op. + auto *new_conv_op = graph->CreateOpNode(&new_op_desc); + + std::unordered_set delete_nodes; + + for (int i = 0; i < times; i++) { + nodes[i * kNumFields]->outputs.push_back(new_conv_op); + new_conv_op->inputs.push_back(nodes[i * kNumFields]); + delete_nodes.insert(nodes[i * kNumFields + kTransOffset]); + delete_nodes.insert(nodes[i * kNumFields + kTransOutOffset]); + delete_nodes.insert(nodes[i * kNumFields + kFlattenOffset]); + delete_nodes.insert(nodes[i * kNumFields + kFlattenOutOffset]); + } + delete_nodes.insert(concat_op); + + new_conv_op->outputs.push_back(concat_out); + concat_out->inputs.push_back(new_conv_op); + + // Delete the unneeded nodes. + GraphSafeRemoveNodes(graph, delete_nodes); + }; + + gpd(graph, handler); +} + +void TransposeFlattenConcatFusePass::ApplyImpl(ir::Graph *graph) const { + const int pattern_nums = 6; + const std::string pattern_name = "transpose_flatten_concat_fuse"; + FusePassBase::Init(pattern_name, graph); + for (int i = 1; i <= pattern_nums; i++) { + RunTransposeFlattenConcatFuse(graph, i); + } +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(transpose_flatten_concat_fuse_pass, + paddle::framework::ir::TransposeFlattenConcatFusePass); diff --git a/paddle/fluid/framework/ir/transpose_flatten_concat_fuse_pass.h b/paddle/fluid/framework/ir/transpose_flatten_concat_fuse_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..939a8c31e5501e23968f9b44b4fe09e78280fd07 --- /dev/null +++ b/paddle/fluid/framework/ir/transpose_flatten_concat_fuse_pass.h @@ -0,0 +1,39 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include + +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" + +namespace paddle { +namespace framework { +namespace ir { + +// There may be many transpose-flatten structures in a model, and the output of +// these structures will be used as inputs to the concat Op. This pattern will +// be detected by our pass. The times here represents the repeat times of this +// structure. +class TransposeFlattenConcatFusePass : public FusePassBase { + public: + virtual ~TransposeFlattenConcatFusePass() {} + + protected: + void ApplyImpl(ir::Graph* graph) const override; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/lod_tensor.cc b/paddle/fluid/framework/lod_tensor.cc index 1e7da9a69c7cbf8c13306656599a759515802b76..f46bdf96ba1e9e1e137c690057051d9a127d45c9 100644 --- a/paddle/fluid/framework/lod_tensor.cc +++ b/paddle/fluid/framework/lod_tensor.cc @@ -26,10 +26,8 @@ limitations under the License. */ #include "paddle/fluid/memory/memcpy.h" #include "paddle/fluid/memory/memory.h" -#if !defined(_WIN32) #include "paddle/fluid/recordio/scanner.h" #include "paddle/fluid/recordio/writer.h" -#endif // _WIN32 namespace paddle { namespace framework { @@ -56,13 +54,14 @@ std::ostream &operator<<(std::ostream &os, const LoD &lod) { std::ostream &operator<<(std::ostream &os, const LoDTensor &t) { if (!platform::is_cpu_place(t.place())) { - LoDTensor tt; - framework::TensorCopy(t, platform::CPUPlace(), &tt); + LoDTensor cpu_tensor; + cpu_tensor.set_lod(t.lod()); + framework::TensorCopy(t, platform::CPUPlace(), &cpu_tensor); platform::DeviceContextPool &pool = platform::DeviceContextPool::Instance(); auto &dev_ctx = *pool.Get(t.place()); dev_ctx.Wait(); - os << tt; + os << cpu_tensor; return os; } @@ -72,9 +71,9 @@ std::ostream &operator<<(std::ostream &os, const LoDTensor &t) { // only print first ten elements int64_t size = t.numel() < 10 ? t.numel() : 10; for (int64_t i = 0; i < size; ++i) { - if (IsType(t.type())) { + if (t.type() == proto::VarType::FP32) { os << t.data()[i] << " "; - } else if (IsType(t.type())) { + } else if (t.type() == proto::VarType::INT64) { os << t.data()[i] << " "; } else { PADDLE_THROW("LoDTensor data type not in [float, int64_t]"); @@ -159,13 +158,8 @@ bool CheckLoD(const LoD &in, int tensor_height) { if (level.size() < 2) return false; // check: the first offset(the begin offset) of each level should be 0. if (level.front() != 0) return false; - // check: all the offsets in a level should be ascending(no same items - // allows). - if (!std::is_sorted(level.begin(), level.begin(), [](size_t a, size_t b) { - if (a < b) return true; - return false; - })) { - LOG(INFO) << "ascending error"; + // check: all the offsets in a level should be ascending(allow same items) + if (!std::is_sorted(level.begin(), level.end())) { return false; } } @@ -305,7 +299,6 @@ void DeserializeFromStream(std::istream &is, LoDTensor *tensor, TensorFromStream(is, static_cast(tensor), dev_ctx); } -#if !defined(_WIN32) void WriteToRecordIO(recordio::Writer *writer, const std::vector &tensor, const platform::DeviceContext &dev_ctx) { @@ -335,19 +328,7 @@ bool ReadFromRecordIO(recordio::Scanner *scanner, return true; } -#else -class Writer {}; -class Scanner {}; -void WriteToRecordIO(recordio::Writer *writer, - const std::vector &tensor, - const platform::DeviceContext &dev_ctx) {} -bool ReadFromRecordIO(recordio::Scanner *scanner, - const platform::DeviceContext &dev_ctx, - std::vector *result_ptr) { - PADDLE_ENFORCE("windows didn't supported recordio!."); - return true; -} -#endif // _WIN32 + std::vector LoDTensor::SplitLoDTensor( const std::vector places) const { check_memory_size(); @@ -402,7 +383,7 @@ void LoDTensor::MergeLoDTensor( PADDLE_ENFORCE(!lod_tensors.empty()); framework::DDim new_dim = lod_tensors[0]->dims(); - std::type_index new_type = lod_tensors[0]->type(); + auto new_type = lod_tensors[0]->type(); framework::DataLayout new_layout = lod_tensors[0]->layout(); LoD new_lod = lod_tensors[0]->lod(); for (size_t i = 1; i < lod_tensors.size(); ++i) { @@ -418,7 +399,7 @@ void LoDTensor::MergeLoDTensor( PADDLE_ENFORCE_EQ(new_lod.size(), lod.size()); for (size_t j = 0; j < lod.size(); ++j) { auto &sub_lod = new_lod[j]; - auto &offset = sub_lod.back(); + size_t offset = sub_lod.back(); for (size_t k = 1; k < lod[j].size(); ++k) { sub_lod.push_back(lod[j][k] + offset); } diff --git a/paddle/fluid/framework/lod_tensor.h b/paddle/fluid/framework/lod_tensor.h index e9b473d547252e80ed26ec61e1a33fbe1742dbe0..fb6e781fd07b9033bea547118b8338ad8b705c5e 100644 --- a/paddle/fluid/framework/lod_tensor.h +++ b/paddle/fluid/framework/lod_tensor.h @@ -111,9 +111,6 @@ class LoDTensor : public Tensor { public: LoDTensor() : Tensor() {} - /* Constructor with place should only be used in pybind */ - explicit LoDTensor(const platform::Place& place) : Tensor(place) {} - explicit LoDTensor(const LoD& lod) : lod_(lod) {} void set_lod(const LoD& lod) { lod_ = lod; } diff --git a/paddle/fluid/framework/lod_tensor_array.h b/paddle/fluid/framework/lod_tensor_array.h index 6d7b6a4ada8729e3698dab5d2b1861aac632be79..36a5c3c5d601390beedaf37ceb98ee2c63ecf5a6 100644 --- a/paddle/fluid/framework/lod_tensor_array.h +++ b/paddle/fluid/framework/lod_tensor_array.h @@ -18,6 +18,8 @@ limitations under the License. */ namespace paddle { namespace framework { + using LoDTensorArray = std::vector; -} + +} // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/lod_tensor_test.cc b/paddle/fluid/framework/lod_tensor_test.cc index cbf5fd04d73007d303d0fd96064e3a2d7f21cfb8..15928c18d38b8a513b00f993b57faab43978bf53 100644 --- a/paddle/fluid/framework/lod_tensor_test.cc +++ b/paddle/fluid/framework/lod_tensor_test.cc @@ -217,6 +217,11 @@ TEST(LoD, CheckLoD) { // check with underlying tensor storage. ASSERT_TRUE(CheckLoD(relative_lod, 5)); ASSERT_FALSE(CheckLoD(relative_lod, 9)); + + // check whether lod is ascending-sorted (allow same items) + ASSERT_TRUE(CheckLoD({{0, 1, 2, 3, 4, 5}}, 5)); + ASSERT_TRUE(CheckLoD({{0, 1, 3, 3, 4, 5}}, 5)); + ASSERT_FALSE(CheckLoD({{0, 1, 3, 2, 5}}, 5)); } TEST(LoD, CheckAbsLoD) { @@ -274,7 +279,6 @@ TEST(LoD, ConvertToOffsetBasedLoD) { EXPECT_EQ(offset_lod, expected); } -#if !defined(_WIN32) template static void TestRecordIO() { LoDTensor tensor; @@ -321,7 +325,6 @@ TEST(LoDTensor, RecordIO) { TestRecordIO(); TestRecordIO(); } -#endif // !defined(_WIN32) } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/mixed_vector.h b/paddle/fluid/framework/mixed_vector.h index 77386f4f069489b6ff7b927a281bdc286ff816e0..5d854cb8d7856a631faf01741d29d3cecfd9a627 100644 --- a/paddle/fluid/framework/mixed_vector.h +++ b/paddle/fluid/framework/mixed_vector.h @@ -1,16 +1,16 @@ /* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at - http://www.apache.org/licenses/LICENSE-2.0 + http://www.apache.org/licenses/LICENSE-2.0 - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. */ +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ #pragma once @@ -23,6 +23,7 @@ #include "paddle/fluid/framework/details/cow_ptr.h" #include "paddle/fluid/framework/tensor.h" #include "paddle/fluid/framework/tensor_util.h" +#include "paddle/fluid/memory/malloc.h" #include "paddle/fluid/memory/memcpy.h" #include "glog/logging.h" @@ -31,46 +32,6 @@ namespace paddle { namespace framework { #if defined(PADDLE_WITH_CUDA) -namespace details { -struct CUDABuffer { - void *data_{nullptr}; - size_t size_{0}; - platform::CUDAPlace place_; - - CUDABuffer() {} - CUDABuffer(platform::Place place, size_t size) - : size_(size), place_(boost::get(place)) { - data_ = memory::Alloc(place_, size); - } - - ~CUDABuffer() { ClearMemory(); } - - CUDABuffer(const CUDABuffer &o) = delete; - CUDABuffer &operator=(const CUDABuffer &o) = delete; - - void Resize(platform::Place place, size_t size) { - ClearMemory(); - place_ = boost::get(place); - data_ = memory::Alloc(place_, size); - PADDLE_ENFORCE_NOT_NULL(data_); - size_ = size; - } - - void Swap(CUDABuffer &o) { - std::swap(data_, o.data_); - std::swap(place_, o.place_); - std::swap(size_, o.size_); - } - - private: - void ClearMemory() const { - if (data_ != nullptr) { - memory::Free(place_, data_); - } - } -}; -} // namespace details - // Vector implements the std::vector interface, and can get Data or // MutableData from any place. The data will be synced implicitly inside. template @@ -103,8 +64,6 @@ class Vector { o.ImmutableCPU(); cpu_ = o.cpu_; flag_ = kDataInCPU; - details::CUDABuffer null; - gpu_.Swap(null); return *this; } @@ -199,7 +158,7 @@ class Vector { PADDLE_ENFORCE(platform::is_gpu_place(place), "CUDA Data must on CUDA place"); ImmutableCUDA(place); - return reinterpret_cast(gpu_.data_); + return reinterpret_cast(gpu_->ptr()); } // get cuda ptr. mutable @@ -234,13 +193,11 @@ class Vector { std::mutex &Mutex() const { return mtx_; } - std::unique_ptr CUDAPlace() const { - if (gpu_.data_ == nullptr) { - return nullptr; - } else { - return std::unique_ptr( - new platform::CUDAPlace(gpu_.place_)); - } + boost::optional CUDAPlace() const { + return gpu_ == nullptr + ? boost::none + : boost::optional( + boost::get(gpu_->place())); } private: @@ -254,13 +211,12 @@ class Vector { void CopyToCPU() const { // COPY GPU Data To CPU auto *dev_ctx = static_cast( - platform::DeviceContextPool::Instance().Get( - platform::Place(gpu_.place_))); + platform::DeviceContextPool::Instance().Get(gpu_->place())); auto stream = dev_ctx->stream(); - void *src = gpu_.data_; + void *src = gpu_->ptr(); void *dst = cpu_.data(); - memory::Copy(platform::CPUPlace(), dst, gpu_.place_, src, gpu_.size_, - stream); + paddle::memory::Copy(platform::CPUPlace(), dst, CUDAPlace().get(), src, + gpu_->size(), stream); dev_ctx->Wait(); } @@ -277,8 +233,7 @@ class Vector { CopyCPUDataToCUDA(place); UnsetFlag(kDirty); SetFlag(kDataInCUDA); - } else if (IsInCUDA() && - !(boost::get(place) == gpu_.place_)) { + } else if (IsInCUDA() && !(place == gpu_->place())) { PADDLE_THROW("This situation should not happen"); // Still dirty } else { @@ -290,7 +245,7 @@ class Vector { // Even data is not dirty. However, data is not in CUDA. Copy data. CopyCPUDataToCUDA(place); SetFlag(kDataInCUDA); - } else if (!(boost::get(place) == gpu_.place_)) { + } else if (!(place == gpu_->place())) { PADDLE_THROW("This situation should not happen."); } else { // Not Dirty && DataInCUDA && Device is same @@ -301,13 +256,13 @@ class Vector { void CopyCPUDataToCUDA(const platform::Place &place) const { void *src = cpu_.data(); - gpu_.Resize(place, cpu_.size() * sizeof(T)); - void *dst = gpu_.data_; + gpu_ = memory::Alloc(place, cpu_.size() * sizeof(T)); + void *dst = gpu_->ptr(); auto *dev_ctx = static_cast( platform::DeviceContextPool::Instance().Get(place)); auto stream = dev_ctx->stream(); - memory::Copy(gpu_.place_, dst, platform::CPUPlace(), src, gpu_.size_, - stream); + paddle::memory::Copy(CUDAPlace().get(), dst, platform::CPUPlace(), src, + gpu_->size(), stream); } void ImmutableCPU() const { @@ -329,7 +284,7 @@ class Vector { bool IsInCPU() const { return flag_ & kDataInCPU; } mutable std::vector cpu_; - mutable details::CUDABuffer gpu_; + mutable paddle::memory::AllocationPtr gpu_; mutable int flag_; mutable std::mutex mtx_; @@ -428,8 +383,8 @@ class Vector { auto &mtx = m_.Data().Mutex(); std::lock_guard guard(mtx); auto cuda_place = m_.Data().CUDAPlace(); - if (cuda_place == nullptr || - *cuda_place == boost::get(place)) { + if (cuda_place == boost::none || + cuda_place == boost::get(place)) { return m_.Data().CUDAData(place); } } @@ -444,8 +399,8 @@ class Vector { auto &mtx = m_.Data().Mutex(); std::lock_guard guard(mtx); auto cuda_place = m_.Data().CUDAPlace(); - if (cuda_place == nullptr || - *cuda_place == boost::get(place)) { + if (cuda_place == boost::none || + cuda_place == boost::get(place)) { return m_.MutableData()->CUDAMutableData(place); } } @@ -542,6 +497,33 @@ class CPUVector : public std::vector> { this->reserve(this->size() + size_t(end - begin)); this->insert(this->end(), begin, end); } + + const T *CUDAData(platform::Place place) const { + PADDLE_THROW( + "Vector::CUDAData() method is not supported in CPU-only version"); + } + + T *CUDAMutableData(platform::Place place) { + PADDLE_THROW( + "Vector::CUDAMutableData() method is not supported in CPU-only " + "version"); + } + + const T *Data(platform::Place place) const { + PADDLE_ENFORCE( + platform::is_cpu_place(place), + "Vector::Data() method is not supported when not in CPUPlace"); + return this->data(); + } + + T *MutableData(platform::Place place) { + PADDLE_ENFORCE( + platform::is_cpu_place(place), + "Vector::MutableData() method is not supported when not in CPUPlace"); + return this->data(); + } + + const void *Handle() const { return static_cast(this); } }; template diff --git a/paddle/fluid/framework/multi_trainer.cc b/paddle/fluid/framework/multi_trainer.cc new file mode 100644 index 0000000000000000000000000000000000000000..3a266e4bda91d5962ce09b241cc5e5671d67a142 --- /dev/null +++ b/paddle/fluid/framework/multi_trainer.cc @@ -0,0 +1,83 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include "paddle/fluid/framework/data_feed_factory.h" +#include "paddle/fluid/framework/device_worker_factory.h" +#include "paddle/fluid/framework/trainer.h" + +namespace paddle { +namespace framework { + +void MultiTrainer::Initialize(const TrainerDesc& trainer_desc, + Dataset* dataset) { + thread_num_ = trainer_desc.thread_num(); + SetDataset(dataset); + // get filelist from trainer_desc here + dataset->CreateReaders(); + VLOG(3) << "readers created"; + const std::vector> readers = + dataset->GetReaders(); + VLOG(3) << "readers num: " << readers.size(); + // change thread num to readers num + thread_num_ = readers.size(); + VLOG(3) << "worker thread num: " << thread_num_; + workers_.resize(thread_num_); + for (int i = 0; i < thread_num_; ++i) { + workers_[i] = DeviceWorkerFactory::CreateDeviceWorker( + trainer_desc.device_worker_name()); + workers_[i]->Initialize(trainer_desc); + workers_[i]->SetDeviceIndex(i); + workers_[i]->SetDataFeed(readers[i]); + } + + // set debug here + SetDebug(trainer_desc.debug()); +} + +// call only after all resources are set in current trainer +void MultiTrainer::InitTrainerEnv(const ProgramDesc& main_program, + const platform::Place& place) { + for (int i = 0; i < thread_num_; ++i) { + workers_[i]->SetPlace(place); + workers_[i]->SetRootScope(root_scope_); + workers_[i]->CreateDeviceResource(main_program); // Program + workers_[i]->BindingDataFeedMemory(); + } +} + +void MultiTrainer::Run() { + VLOG(3) << "Going to run"; + for (int thidx = 0; thidx < thread_num_; ++thidx) { + if (!debug_) { + threads_.push_back( + std::thread(&DeviceWorker::TrainFiles, workers_[thidx].get())); + } else { + threads_.push_back(std::thread(&DeviceWorker::TrainFilesWithProfiler, + workers_[thidx].get())); + } + } +} + +void MultiTrainer::Finalize() { + for (auto& th : threads_) { + th.join(); + } + dataset_ptr_->DestroyReaders(); + root_scope_->DropKids(); +} + +} // end namespace framework +} // end namespace paddle diff --git a/paddle/fluid/framework/naive_executor.cc b/paddle/fluid/framework/naive_executor.cc index 2840d503f1454271afb309efdd435225ab077dc0..a37bb6f4da1fc3baffad36c13c690c6410ac4270 100644 --- a/paddle/fluid/framework/naive_executor.cc +++ b/paddle/fluid/framework/naive_executor.cc @@ -21,97 +21,78 @@ #include "paddle/fluid/framework/naive_executor.h" #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/framework/reader.h" +#include "paddle/fluid/framework/variable_helper.h" #include "paddle/fluid/string/pretty_log.h" namespace paddle { namespace framework { - -// These code can be shared with Executor. -static void InitializeVariable(Variable *var, proto::VarType::Type var_type) { - if (var_type == proto::VarType::LOD_TENSOR) { - var->GetMutable(); - } else if (var_type == proto::VarType::SELECTED_ROWS) { - var->GetMutable(); - } else if (var_type == proto::VarType::FEED_MINIBATCH) { - var->GetMutable(); - } else if (var_type == proto::VarType::FETCH_LIST) { - var->GetMutable(); - } else if (var_type == proto::VarType::STEP_SCOPES) { - var->GetMutable>(); - } else if (var_type == proto::VarType::LOD_RANK_TABLE) { - var->GetMutable(); - } else if (var_type == proto::VarType::LOD_TENSOR_ARRAY) { - var->GetMutable(); - } else if (var_type == proto::VarType::PLACE_LIST) { - var->GetMutable(); - } else if (var_type == proto::VarType::READER) { - var->GetMutable(); - } else if (var_type == proto::VarType::RAW) { - // GetMutable will be called in operator - } else { - PADDLE_THROW( - "Variable type %d is not in " - "[LOD_TENSOR, SELECTED_ROWS, FEED_MINIBATCH, FETCH_LIST, " - "LOD_RANK_TABLE, PLACE_LIST, READER, CHANNEL, RAW]", - var_type); - } -} - -void NaiveExecutor::Prepare(Scope *parent_scope, - const ProgramDesc &program_desc, int block_id, - bool with_feed_fetch_ops) { - if (!parent_scope) { +void NaiveExecutor::Prepare(Scope *scope, const ProgramDesc &program_desc, + int block_id, bool with_feed_fetch_ops) { + if (!scope) { scope_ = new framework::Scope; } else { - scope_ = &parent_scope->NewScope(); + scope_ = scope; } - CreateVariables(program_desc, scope_, block_id); + + VLOG(3) << "NaiveExecutor init with scope " << scope; CreateOps(program_desc, block_id, with_feed_fetch_ops); } void NaiveExecutor::Run() { +#ifndef PADDLE_ON_INFERENCE + LOG_FIRST_N(WARNING, 5) << "The NaiveExecutor can not work properly if the " + "cmake flag ON_INFER is not set."; + LOG_FIRST_N(WARNING, 5) << "Unlike the training phase, all the scopes and " + "variables will be reused to save the allocation " + "overhead."; + LOG_FIRST_N(WARNING, 5) << "Please re-compile the inference library by " + "setting the cmake flag ON_INFER=ON if you are " + "running Paddle Inference"; +#endif // PADDLE_ON_INFERENCE for (auto &op : ops_) { - VLOG(4) << "run " << op->Type(); + VLOG(4) << std::this_thread::get_id() << " run " + << op->DebugStringEx(scope_) << " on scope " << scope_; + op->SetIsCalledByExecutor(false); op->Run(*scope_, place_); } } -void NaiveExecutor::CreateVariables(const ProgramDesc &desc, Scope *scope, - int block_id) { - PADDLE_ENFORCE(scope); +void NaiveExecutor::CreateVariables(const ProgramDesc &desc, int block_id, + bool persistable, Scope *scope) { + PADDLE_ENFORCE_NOT_NULL(scope); + auto &global_block = desc.Block(block_id); - const Scope *ancestor_scope = scope; - while (ancestor_scope->parent()) { - ancestor_scope = ancestor_scope->parent(); + const auto *anc = scope; + PADDLE_ENFORCE(anc->parent() != anc); + while (anc->parent()) { + anc = anc->parent(); } - if (ancestor_scope != scope) { - for (auto &var : global_block.AllVars()) { - if (var->Name() == framework::kEmptyVarName) { - continue; - } - // Create persistable vars in ancestor scope. - if (var->Persistable()) { - auto *ptr = const_cast(ancestor_scope)->Var(var->Name()); - InitializeVariable(ptr, var->GetType()); - VLOG(3) << "Create Variable " << var->Name() - << " global, which pointer is " << ptr; - } else { // Create temporary variables in local scope. - auto *ptr = scope->Var(var->Name()); + int num_vars = 0; + for (auto &var : global_block.AllVars()) { + if (var->Name() == framework::kEmptyVarName) { + continue; + } + num_vars++; + + if (persistable == var->Persistable()) { + if (persistable) { + if (!anc->FindVar(var->Name())) { + auto *ptr = const_cast(anc)->Var(var->Name()); + VLOG(3) << scope << " Create persistable variable " << var->Name() + << ", which pointer is " << ptr; + InitializeVariable(ptr, var->GetType()); + } + } else { + auto *ptr = const_cast(scope)->Var(var->Name()); + VLOG(3) << scope << " Create variable " << var->Name() + << ", which pointer is " << ptr; InitializeVariable(ptr, var->GetType()); - VLOG(3) << "Create Variable " << var->Name() - << " locally, which pointer is " << ptr; } } - } else { - for (auto &var : global_block.AllVars()) { - auto *ptr = scope->Var(var->Name()); - InitializeVariable(ptr, var->GetType()); - VLOG(3) << "Create variable " << var->Name() << ", which pointer is " - << ptr; - } } + VLOG(4) << "naive executor create " << num_vars << " vars"; } void NaiveExecutor::CreateOps(const ProgramDesc &desc, int block_id, @@ -146,22 +127,5 @@ void NaiveExecutor::CleanFeedFetchOps() { ops_.swap(ops); } -void NaiveExecutor::EnableMKLDNN(const ProgramDesc &program) { -#ifdef PADDLE_WITH_MKLDNN - VLOG(3) << "use_mkldnn=True"; - for (size_t block_id = 0; block_id < program.Size(); ++block_id) { - auto *block = const_cast(program).MutableBlock(block_id); - for (auto *op : block->AllOps()) { - if (op->HasAttr("use_mkldnn")) { - op->SetAttr("use_mkldnn", true); - } - } - } -#else - LOG(WARNING) - << "'MKLDNN' is not supported, Please re-compile with WITH_MKLDNN option"; -#endif -} - } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/naive_executor.h b/paddle/fluid/framework/naive_executor.h index 9374f3f4a35cc0f90e5b2d6e8b397784b8eae123..5e673f68574c4ddaa4c9260367d09e9f62f6b751 100644 --- a/paddle/fluid/framework/naive_executor.h +++ b/paddle/fluid/framework/naive_executor.h @@ -35,8 +35,14 @@ class NaiveExecutor { // Create child scope. // Create variables. // @with_feed_fetch_ops: whether to work with the feed and fetch operators. - void Prepare(Scope* parent_scope, const ProgramDesc& program_desc, - int block_id, bool with_feed_fetch_ops); + void Prepare(Scope* scope, const ProgramDesc& program_desc, int block_id, + bool with_feed_fetch_ops); + + // Create variables before head. + // Create parameters if persistable is ture, or create the temporary variables + // instead. + void CreateVariables(const ProgramDesc& desc, int block_id, bool persistable, + Scope* scope); // Run all the operators. void Run(); @@ -48,11 +54,7 @@ class NaiveExecutor { void CleanFeedFetchOps(); - void EnableMKLDNN(const ProgramDesc& program); - protected: - void CreateVariables(const ProgramDesc& desc, Scope* scope, int block_id); - void CreateOps(const ProgramDesc& desc, int block_id, bool with_feed_fetch_ops); diff --git a/paddle/fluid/framework/naive_executor_test.cc b/paddle/fluid/framework/naive_executor_test.cc index 6b9f79b9d398bc5a0ee6ba66587924daad0dbbc5..c917630666b082ab7148550707f9f1f720aa25d3 100644 --- a/paddle/fluid/framework/naive_executor_test.cc +++ b/paddle/fluid/framework/naive_executor_test.cc @@ -39,7 +39,7 @@ TEST(NaiveExecutor, Basic) { auto place = platform::CPUPlace(); NaiveExecutor exe(place); - exe.Prepare(nullptr, program, 0, false /*with feed fetch ops*/); + exe.Prepare(nullptr, program, 0, false); auto* a_tensor = exe.FindTensor("a"); auto* b_tensor = exe.FindTensor("b"); auto* c_tensor = exe.FindTensor("c"); diff --git a/paddle/fluid/framework/no_need_buffer_vars_inference.h b/paddle/fluid/framework/no_need_buffer_vars_inference.h new file mode 100644 index 0000000000000000000000000000000000000000..2c933659840d02e65c3b222144a31e558e8e8ae8 --- /dev/null +++ b/paddle/fluid/framework/no_need_buffer_vars_inference.h @@ -0,0 +1,60 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include "paddle/fluid/framework/op_desc.h" + +namespace paddle { +namespace framework { + +class NoNeedBufferVarsInference { + public: + NoNeedBufferVarsInference(const VariableNameMap &inputs, + const VariableNameMap &outputs, + const AttributeMap &attrs) + : inputs_(inputs), outputs_(outputs), attrs_(attrs) {} + + virtual ~NoNeedBufferVarsInference() = default; + + const VariableNameMap &Inputs() const { return inputs_; } + + const VariableNameMap &Outputs() const { return outputs_; } + + const AttributeMap &Attrs() const { return attrs_; } + + virtual std::unordered_set operator()() const = 0; + + private: + const VariableNameMap &inputs_; + const VariableNameMap &outputs_; + const AttributeMap &attrs_; +}; + +#define DECLARE_NO_NEED_BUFFER_VARS_INFERENCE(class_type, ...) \ + class class_type : public ::paddle::framework::NoNeedBufferVarsInference { \ + public: \ + using ::paddle::framework::NoNeedBufferVarsInference:: \ + NoNeedBufferVarsInference; \ + \ + std::unordered_set operator()() const override { \ + return {__VA_ARGS__}; \ + } \ + } + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/op_desc.cc b/paddle/fluid/framework/op_desc.cc index 121e00b1a3d761f570138092a3e76ae2b722d28a..e6f5b15af8cd440a9304235acfe62787c5f1b134 100644 --- a/paddle/fluid/framework/op_desc.cc +++ b/paddle/fluid/framework/op_desc.cc @@ -24,6 +24,7 @@ limitations under the License. */ #include "paddle/fluid/framework/operator.h" #include "paddle/fluid/framework/program_desc.h" #include "paddle/fluid/framework/shape_inference.h" +#include "paddle/fluid/framework/var_type_inference.h" namespace paddle { namespace framework { @@ -81,29 +82,154 @@ class CompileTimeInferShapeContext : public InferShapeContext { "The %s[%d] is @EMPTY@", out, j); auto *in_var = block_.FindVarRecursive(Inputs(in)[i]); auto *out_var = block_.FindVarRecursive(Outputs(out)[j]); - if (in_var->GetType() != proto::VarType::LOD_TENSOR) { - VLOG(3) << "input " << in << " is not LodTensor"; + if (in_var->GetType() != proto::VarType::LOD_TENSOR && + in_var->GetType() != proto::VarType::LOD_TENSOR_ARRAY) { + VLOG(3) << "input " << in << " is not LodTensor or LodTensorArray."; return; } out_var->SetLoDLevel(in_var->GetLoDLevel()); } + void DecreaseLoDLevel(const std::string &in, const std::string &out, + size_t i = 0, size_t j = 0) const override { + PADDLE_ENFORCE_LT(i, Inputs(in).size()); + PADDLE_ENFORCE_LT(j, Outputs(out).size()); + PADDLE_ENFORCE(Inputs(in)[i] != framework::kEmptyVarName, + "The %s[%d] is @EMPTY@", in, i); + PADDLE_ENFORCE(Outputs(out)[j] != framework::kEmptyVarName, + "The %s[%d] is @EMPTY@", out, j); + auto *in_var = block_.FindVarRecursive(Inputs(in)[i]); + auto *out_var = block_.FindVarRecursive(Outputs(out)[j]); + PADDLE_ENFORCE(out_var->GetType() == proto::VarType::LOD_TENSOR_ARRAY || + out_var->GetType() == proto::VarType::LOD_TENSOR, + "The input %s should be LodTensorArray or LodTensor.", + out_var->Name()); + PADDLE_ENFORCE(in_var->GetType() == proto::VarType::LOD_TENSOR, + "The input %s should be LodTensor.", in_var->Name()); + if (in_var->GetLoDLevel() > 0) { + out_var->SetLoDLevel(in_var->GetLoDLevel() - 1); + } + } + + std::vector GetInputVarPtrs( + const std::string &name) override { + const std::vector arg_names = Inputs(name); + std::vector res; + res.reserve(arg_names.size()); + std::transform(arg_names.begin(), arg_names.end(), std::back_inserter(res), + [this](const std::string &name) { + return block_.FindVarRecursive(name); + }); + return res; + } + + std::vector GetOutputVarPtrs( + const std::string &name) override { + const std::vector arg_names = Outputs(name); + std::vector res; + res.reserve(arg_names.size()); + std::transform(arg_names.begin(), arg_names.end(), std::back_inserter(res), + [this](const std::string &name) { + return block_.FindVarRecursive(name); + }); + return res; + } + + DDim GetInputDim(const std::string &name) const override { + const std::vector &arg_names = Inputs(name); + PADDLE_ENFORCE_EQ(arg_names.size(), 1UL, + "Input(%s) should hold one element, but now it holds %d", + name, arg_names.size()); + return this->GetDim(arg_names[0]); + } + + std::vector GetInputsDim(const std::string &name) const override { + const std::vector &arg_names = Inputs(name); + return GetDims(arg_names); + } + bool IsRuntime() const override; + std::vector GetInputsVarType( + const std::string &name) const override { + return GetVarTypes(Inputs(name)); + } + + std::vector GetOutputsVarType( + const std::string &name) const override { + return GetVarTypes(Outputs(name)); + } + + void SetOutputDim(const std::string &name, const DDim &dim) override { + auto &arg_names = Outputs(name); + PADDLE_ENFORCE_EQ(arg_names.size(), 1UL, + "Output(%s) should hold one element, but now it holds %d", + name, arg_names.size()); + SetDim(arg_names[0], dim); + } + + void SetOutputsDim(const std::string &name, + const std::vector &dims) override { + auto &names = Outputs(name); + SetDims(names, dims); + } + protected: - proto::VarType::Type GetVarType(const std::string &name) const override; + std::vector GetVarTypes( + const std::vector &names) const { + std::vector retv; + retv.resize(names.size()); + std::transform( + names.begin(), names.end(), retv.begin(), + std::bind(std::mem_fn(&CompileTimeInferShapeContext::GetVarType), this, + std::placeholders::_1)); + return retv; + } + + proto::VarType::Type GetVarType(const std::string &name) const; + + DDim GetDim(const std::string &name) const { + auto var = block_.FindVarRecursive(name); + PADDLE_ENFORCE(var != nullptr, "Cannot find variable %s", name); + DDim res; + try { + auto shape = var->GetShape(); + res = shape.empty() ? make_ddim({0UL}) : make_ddim(shape); + } catch (...) { + VLOG(5) << "GetDim of variable " << name << " error"; + std::rethrow_exception(std::current_exception()); + } + return res; + } - DDim GetDim(const std::string &name) const override; + std::vector GetDims(const std::vector &names) const { + std::vector ret; + ret.reserve(names.size()); + std::transform( + names.begin(), names.end(), std::back_inserter(ret), + [this](const std::string &name) { return this->GetDim(name); }); + return ret; + } - void SetDim(const std::string &name, const DDim &dim) override; + void SetDim(const std::string &name, const DDim &dim); + + void SetDims(const std::vector &names, + const std::vector &dims) { + size_t length = names.size(); + PADDLE_ENFORCE_EQ(length, dims.size()); + for (size_t i = 0; i < length; ++i) { + if (names[i] == framework::kEmptyVarName) { + continue; + } + SetDim(names[i], dims[i]); + } + } std::vector GetRepeatedDims(const std::string &name) const override; void SetRepeatedDims(const std::string &name, const std::vector &dims) override; - InferShapeVarPtr GetVarPtr(const std::string &name) override; - const OpDesc &op_; const BlockDesc &block_; }; @@ -215,6 +341,23 @@ void OpDesc::SetOutput(const std::string ¶m_name, this->outputs_[param_name] = args; } +bool OpDesc::HasProtoAttr(const std::string &name) const { + auto &op_info = OpInfoMap::Instance(); + if (op_info.Has(desc_.type())) { + auto op_info_ptr = op_info.Get(desc_.type()); + if (op_info_ptr.HasOpProtoAndChecker()) { + const proto::OpProto &proto = op_info_ptr.Proto(); + for (int i = 0; i != proto.attrs_size(); ++i) { + const proto::OpProto::Attr &attr = proto.attrs(i); + if (attr.name() == name) { + return true; + } + } + } + } + return false; +} + proto::AttrType OpDesc::GetAttrType(const std::string &name) const { auto it = attrs_.find(name); PADDLE_ENFORCE(it != attrs_.end(), "Attribute %s is not found", name); @@ -230,6 +373,11 @@ std::vector OpDesc::AttrNames() const { return retv; } +void OpDesc::RemoveAttr(const std::string &name) { + attrs_.erase(name); + need_update_ = true; +} + void OpDesc::SetAttr(const std::string &name, const Attribute &v) { // NOTICE(minqiyang): pybind11 will take the empty list in python as // the std::vector type in C++; so we have to change the attr's type @@ -252,6 +400,12 @@ void OpDesc::SetAttr(const std::string &name, const Attribute &v) { this->attrs_[name] = std::vector(); break; } + case proto::AttrType::LONGS: { + VLOG(11) << "SetAttr: " << Type() << ", " << name + << " from LONGS to LONGS"; + this->attrs_[name] = std::vector(); + break; + } case proto::AttrType::FLOATS: { VLOG(11) << "SetAttr: " << Type() << ", " << name << " from INTS to FLOATS"; @@ -419,8 +573,15 @@ struct SetAttrDescVisitor : public boost::static_visitor { } VectorToRepeated(blocks_idx, attr_->mutable_blocks_idx()); } + void operator()(BlockDesc *desc) const { attr_->set_block_idx(desc->ID()); } + void operator()(int64_t v) const { attr_->set_l(v); } + + void operator()(const std::vector &v) const { + VectorToRepeated(v, attr_->mutable_longs()); + } + void operator()(boost::blank) const { PADDLE_THROW("Unexpected branch"); } }; @@ -456,6 +617,25 @@ void OpDesc::Flush() { static std::once_flag init_infer_shape_funcs; +/** + * NOTE(paddle-dev): Very tricky code here. Maybe we should find a + * better way to register compile-time infershape method gentlely. + * + * Normally, we can register a class derived from InferShapeBase, so that + * we can set the field of `infer_shape_` inside OpInfo when registering op. + * + * However, there is another way we can set the field of `infer_shape_` inside + * OpInfo. Usually, we overload InferShape method of OperatorWithKernel. After + * running the following method InitInferShapeFuncs, `infer_shape_` would be set + * to be the InferShape method of OperatorWithKernel. That is to say, we borrow + * the run-time InferShape method of OperatorWithKernel to be the compile-time + * InferShape method. + * + * However, during compiling time, we may not know inputs, outputs and attrs of + * run-time OperatorWithKernel. So the following code creates a fake + * OperatorWithKernel object. That is why the field info_ of OperatorBase + * would be null. + */ static void InitInferShapeFuncs() { std::call_once(init_infer_shape_funcs, [] { auto &map = OpInfoMap::Instance(); @@ -467,11 +647,16 @@ static void InitInferShapeFuncs() { PADDLE_ENFORCE(it != info_map.end(), "%s has not been registered", op_type); auto &op_info = it->second; - auto op = static_cast(op_info.Creator()( - "", VariableNameMap{}, VariableNameMap{}, AttributeMap{})); if (op_info.infer_shape_) { // infer_shape has been registered. continue; } + + auto op = dynamic_cast(op_info.Creator()( + "", VariableNameMap{}, VariableNameMap{}, AttributeMap{})); + + PADDLE_ENFORCE_NOT_NULL( + op, "InferShapeBase is not registered to Operator %s", op_type); + op_info.infer_shape_ = [op](InferShapeContext *ctx) { op->InferShape(ctx); }; @@ -488,7 +673,8 @@ void OpDesc::CheckAttrs() { // not by users. return; } - checker->Check(attrs_); + VLOG(10) << "begin to check attribute of " << Type(); + checker->Check(&attrs_); } void OpDesc::InferShape(const BlockDesc &block) const { @@ -515,20 +701,15 @@ void OpDesc::InferShape(const BlockDesc &block) const { } void OpDesc::InferVarType(BlockDesc *block) const { + // There are a few places that var type can be set. + // When VarDesc is created, default set to LOD_TENSOR. + // When output variable is created, default is defaut set to LOD_TENSOR. + // We limit here to be the only place that operator defines its customized + // var type inference. Hence, we don't do any "default" setting here. auto &info = OpInfoMap::Instance().Get(this->Type()); if (info.infer_var_type_) { - info.infer_var_type_(*this, block); - } else { - // all output type is LoDTensor by default - VLOG(10) << this->Type() - << " has not registered InferVarType. Set output variables to " - "LOD_TENSOR"; - for (auto &out_pair : this->outputs_) { - for (auto &out_var_name : out_pair.second) { - block->FindRecursiveOrCreateVar(out_var_name) - .SetType(proto::VarType::LOD_TENSOR); - } - } + InferVarTypeContext context(this, block); + info.infer_var_type_(&context); } } @@ -598,20 +779,6 @@ const std::vector &CompileTimeInferShapeContext::Outputs( return op_.Output(name); } -DDim CompileTimeInferShapeContext::GetDim(const std::string &name) const { - auto var = block_.FindVarRecursive(name); - PADDLE_ENFORCE(var != nullptr, "Cannot find variable %s", name); - DDim res; - try { - auto shape = var->GetShape(); - res = shape.empty() ? make_ddim({0UL}) : make_ddim(shape); - } catch (...) { - VLOG(5) << "GetDim of variable " << name << " error"; - std::rethrow_exception(std::current_exception()); - } - return res; -} - std::vector CompileTimeInferShapeContext::GetRepeatedDims( const std::string &name) const { auto var = block_.FindVarRecursive(name); @@ -650,10 +817,5 @@ proto::VarType::Type CompileTimeInferShapeContext::GetVarType( return block_.FindVarRecursive(name)->GetType(); } -InferShapeVarPtr CompileTimeInferShapeContext::GetVarPtr( - const std::string &name) { - return block_.FindVarRecursive(name); -} - } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/op_desc.h b/paddle/fluid/framework/op_desc.h index 440e0509be727ec2b84abc76fca44edda11f8a0a..dedaf24364703877a4cacb23a27550b54dad53f8 100644 --- a/paddle/fluid/framework/op_desc.h +++ b/paddle/fluid/framework/op_desc.h @@ -65,11 +65,14 @@ class OpDesc { return attrs_.find(name) != attrs_.end(); } + bool HasProtoAttr(const std::string &name) const; + proto::AttrType GetAttrType(const std::string &name) const; std::vector AttrNames() const; void SetAttr(const std::string &name, const Attribute &v); + void RemoveAttr(const std::string &name); void SetBlockAttr(const std::string &name, BlockDesc *block); @@ -121,9 +124,7 @@ class OpDesc { BlockDesc *Block() { return this->block_; } - const BlockDesc &BlockRef() const { return *this->block_; } - - void SetBlock(BlockDesc *block) { this->block_ = block; } + const BlockDesc *Block() const { return this->block_; } private: template diff --git a/paddle/fluid/framework/op_info.cc b/paddle/fluid/framework/op_info.cc index af75baa5c4b98f7d092834c05eb57e9c7e131b29..c815e194d43e149f9efe0daec820c42e87f81d0c 100644 --- a/paddle/fluid/framework/op_info.cc +++ b/paddle/fluid/framework/op_info.cc @@ -13,6 +13,9 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/framework/op_info.h" +#include +#include +#include namespace paddle { namespace framework { @@ -24,5 +27,17 @@ OpInfoMap& OpInfoMap::Instance() { static OpInfoMap g_op_info_map; return g_op_info_map; } + +std::vector OpInfoMap::GetUseDefaultGradOpDescMakerOps() const { + // Use set to sort op names + std::set result_ops; + for (auto& pair : map_) { + if (pair.second.use_default_grad_op_desc_maker_) { + result_ops.insert(pair.first); + } + } + return std::vector(result_ops.begin(), result_ops.end()); +} + } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/op_info.h b/paddle/fluid/framework/op_info.h index 19e5c2c73eac74dee030a4f7820531800f737e4e..daa72769c4957ff5ad0e7b3141bbf97bd348b408 100644 --- a/paddle/fluid/framework/op_info.h +++ b/paddle/fluid/framework/op_info.h @@ -17,8 +17,10 @@ limitations under the License. */ #include #include #include +#include #include "paddle/fluid/framework/attribute.h" +#include "paddle/fluid/framework/no_need_buffer_vars_inference.h" #include "paddle/fluid/framework/type_defs.h" #include "paddle/fluid/platform/macros.h" @@ -38,6 +40,12 @@ struct OpInfo { OpAttrChecker* checker_{nullptr}; InferVarTypeFN infer_var_type_; InferShapeFN infer_shape_; + InferInplaceOpFN infer_inplace_; + InferNoNeedBufferVarsFN infer_no_need_buffer_vars_; + + // NOTE(zjl): this flag is added to check whether + // the grad maker is the default one. + bool use_default_grad_op_desc_maker_{false}; bool HasOpProtoAndChecker() const { return proto_ != nullptr && checker_ != nullptr; @@ -63,6 +71,10 @@ struct OpInfo { } const OpAttrChecker* Checker() const { return checker_; } + + const InferNoNeedBufferVarsFN& NoNeedBufferVarsInferer() const { + return infer_no_need_buffer_vars_; + } }; class OpInfoMap { @@ -98,6 +110,8 @@ class OpInfoMap { std::unordered_map* mutable_map() { return &map_; } + std::vector GetUseDefaultGradOpDescMakerOps() const; + private: OpInfoMap() = default; std::unordered_map map_; diff --git a/paddle/fluid/framework/op_kernel_type.cc b/paddle/fluid/framework/op_kernel_type.cc new file mode 100644 index 0000000000000000000000000000000000000000..6d4801e4a0eed7083e671e1d49b8628dfb280cf9 --- /dev/null +++ b/paddle/fluid/framework/op_kernel_type.cc @@ -0,0 +1,54 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/op_kernel_type.h" + +namespace paddle { +namespace framework { + +size_t OpKernelType::Hash::operator()(const OpKernelType& key) const { + int cur_loc = 0; + + int place = key.place_.which(); + cur_loc += OpKernelType::kPlaceBits; + + int data_type = static_cast(key.data_type_) << cur_loc; + cur_loc += OpKernelType::kPrimaryDTypeBits; + + int data_layout = static_cast(key.data_layout_) << cur_loc; + cur_loc += OpKernelType::kLayoutBits; + + int library_type = static_cast(key.library_type_) << cur_loc; + cur_loc += OpKernelType::kLibBits; + + int customized_value = key.customized_type_value_; + PADDLE_ENFORCE(customized_value < (1 << OpKernelType::kCustomizeBits)); + customized_value = customized_value << cur_loc; + cur_loc += OpKernelType::kCustomizeBits; + PADDLE_ENFORCE(cur_loc < 64); + + std::hash hasher; + return hasher(place + data_type + data_layout + library_type + + customized_value); +} + +bool OpKernelType::operator==(const OpKernelType& o) const { + return platform::places_are_same_class(place_, o.place_) && + data_type_ == o.data_type_ && data_layout_ == o.data_layout_ && + library_type_ == o.library_type_ && + customized_type_value_ == o.customized_type_value_; +} + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/op_kernel_type.h b/paddle/fluid/framework/op_kernel_type.h index c59b232191c49ccb47bb9f51dcaf2fd9280fae19..9edc1a3e150027b5a3dbd8483dc8b58d1d4ab918 100644 --- a/paddle/fluid/framework/op_kernel_type.h +++ b/paddle/fluid/framework/op_kernel_type.h @@ -24,52 +24,55 @@ limitations under the License. */ namespace paddle { namespace framework { -struct OpKernelType { - struct Hash { - size_t operator()(const OpKernelType& key) const { - int place = key.place_.which(); - int data_type = static_cast(key.data_type_) << LEFT_SHIFT; - int data_layout = static_cast(key.data_layout_) << (LEFT_SHIFT * 2); - int library_type = static_cast(key.library_type_) - << (LEFT_SHIFT * 3); - - std::hash hasher; - return hasher(place + data_type + data_layout + library_type); - } - }; +class OpKernelType { + public: + constexpr static int kDefaultCustomizedTypeValue = 0; - // place, data_type, library_type kinds less than 2^8 - constexpr static int LEFT_SHIFT = 8; - - proto::VarType::Type data_type_; - DataLayout data_layout_; - platform::Place place_; - LibraryType library_type_; + // In total should be smaller than 64. + constexpr static int kPlaceBits = 4; + constexpr static int kPrimaryDTypeBits = 8; + constexpr static int kLayoutBits = 4; + constexpr static int kLibBits = 4; + constexpr static int kCustomizeBits = 4; OpKernelType(proto::VarType::Type data_type, platform::Place place, DataLayout data_layout = DataLayout::kAnyLayout, - LibraryType library_type = LibraryType::kPlain) + LibraryType library_type = LibraryType::kPlain, + int customized_type_value = kDefaultCustomizedTypeValue) : data_type_(data_type), data_layout_(data_layout), place_(place), - library_type_(library_type) {} + library_type_(library_type), + customized_type_value_(customized_type_value) {} OpKernelType(proto::VarType::Type data_type, const platform::DeviceContext& dev_ctx, DataLayout data_layout = DataLayout::kAnyLayout, - LibraryType library_type = LibraryType::kPlain) + LibraryType library_type = LibraryType::kPlain, + int customized_type_value = kDefaultCustomizedTypeValue) : data_type_(data_type), data_layout_(data_layout), place_(dev_ctx.GetPlace()), - library_type_(library_type) {} + library_type_(library_type), + customized_type_value_(customized_type_value) {} - bool operator==(const OpKernelType& o) const { - return platform::places_are_same_class(place_, o.place_) && - data_type_ == o.data_type_ && data_layout_ == o.data_layout_ && - library_type_ == o.library_type_; - } + virtual ~OpKernelType() {} + + struct Hash { + size_t operator()(const OpKernelType& key) const; + }; + + size_t hash_key() const { return Hash()(*this); } + + bool operator==(const OpKernelType& o) const; bool operator!=(const OpKernelType& o) const { return !(*this == o); } + + proto::VarType::Type data_type_; + DataLayout data_layout_; + platform::Place place_; + LibraryType library_type_; + int customized_type_value_; }; inline std::ostream& operator<<(std::ostream& os, diff --git a/paddle/fluid/framework/op_kernel_type_test.cc b/paddle/fluid/framework/op_kernel_type_test.cc index 3e17a512ce154de88ac890f3b29f03385595d95c..40db85400d2c8776b82ce0fa2fb4deed993b0255 100644 --- a/paddle/fluid/framework/op_kernel_type_test.cc +++ b/paddle/fluid/framework/op_kernel_type_test.cc @@ -34,7 +34,8 @@ TEST(OpKernelType, ToString) { OpKernelType op_kernel_type2(DataType::FP16, CUDAPlace(0), DataLayout::kNCHW, LibraryType::kCUDNN); ASSERT_EQ(paddle::framework::KernelTypeToString(op_kernel_type2), - "data_type[float16]:data_layout[NCHW]:place[CUDAPlace(0)]:library_" + "data_type[::paddle::platform::float16]:data_layout[NCHW]:place[" + "CUDAPlace(0)]:library_" "type[CUDNN]"); } diff --git a/paddle/fluid/framework/op_proto_maker.cc b/paddle/fluid/framework/op_proto_maker.cc index df2a7a27ca4a6011b214202ac9bf4f30dc482ece..2311614c335a56501ac777d787f6653659294765 100644 --- a/paddle/fluid/framework/op_proto_maker.cc +++ b/paddle/fluid/framework/op_proto_maker.cc @@ -21,7 +21,6 @@ namespace framework { void OpProtoAndCheckerMaker::Validate() { validated_ = true; CheckNoDuplicatedInOutAttrs(); - CheckReuseVars(); } OpProtoAndCheckerMaker::VariableBuilder OpProtoAndCheckerMaker::AddInput( @@ -40,40 +39,6 @@ OpProtoAndCheckerMaker::VariableBuilder OpProtoAndCheckerMaker::AddOutput( return OpProtoAndCheckerMaker::VariableBuilder{output}; } -void OpProtoAndCheckerMaker::Reuse(const std::string& name, - const std::string& reused_name) { - bool found = false; - proto::OpProto::Var* var; - - for (auto& var : proto_->inputs()) { - if (var.name() == reused_name) { - found = true; - break; - } - } - PADDLE_ENFORCE(found == true, - "Input/Output name: %s reused_name: %s, one of them is not " - "exists or not matched.", - name, reused_name); - - found = false; - for (int i = 0; i < proto_->outputs().size(); ++i) { - var = proto_->mutable_outputs()->Mutable(i); - if (var->name() == name) { - PADDLE_ENFORCE(!var->has_reuse(), - "Output(%s) has been set reused var of %s", name, - var->reuse()); - found = true; - var->set_reuse(reused_name); - break; - } - } - PADDLE_ENFORCE(found == true, - "Input/Output name: %s reused_name: %s, one of them is not " - "exists or not matched.", - name, reused_name); -} - void OpProtoAndCheckerMaker::CheckNoDuplicatedInOutAttrs() { std::unordered_set names; auto checker = [&](const std::string& name) { @@ -91,24 +56,6 @@ void OpProtoAndCheckerMaker::CheckNoDuplicatedInOutAttrs() { } } -void OpProtoAndCheckerMaker::CheckReuseVars() { - std::unordered_set names; - for (auto& input : proto_->inputs()) { - names.insert(input.name()); - } - auto checker = [&](const std::string& name, const std::string& reused) { - PADDLE_ENFORCE( - names.count(reused), - "Output [%s] reuse Input [%s], but the input is not registered.", name, - reused); - }; - for (auto& output : proto_->outputs()) { - if (output.has_reuse()) { - checker(output.name(), output.reuse()); - } - } -} - void OpProtoAndCheckerMaker::operator()(proto::OpProto* proto, OpAttrChecker* attr_checker) { proto_ = proto; @@ -124,6 +71,8 @@ void OpProtoAndCheckerMaker::operator()(proto::OpProto* proto, static_cast(OpRole::kLoss) | static_cast(OpRole::kForward), static_cast(OpRole::kLoss) | static_cast(OpRole::kBackward), + static_cast(OpRole::kOptimize) | + static_cast(OpRole::kLRSched), static_cast(OpRole::kNotSpecified)}) .SetDefault(static_cast(OpRole::kNotSpecified)); AddAttr>(OpRoleVarAttrName(), @@ -133,6 +82,10 @@ void OpProtoAndCheckerMaker::operator()(proto::OpProto* proto, AddAttr(OpNamescopeAttrName(), "Operator name with namesope.") .SetDefault(""); + AddAttr>(OpCreationCallstackAttrName(), + "Callstack for Op Creatation.") + .SetDefault({}); + Validate(); } diff --git a/paddle/fluid/framework/op_proto_maker.h b/paddle/fluid/framework/op_proto_maker.h index 4ed3cc45d66849267ef4945a03da1db76b53e4ea..5f3ce60e1d94ea4078cf0b709df362bad317f621 100644 --- a/paddle/fluid/framework/op_proto_maker.h +++ b/paddle/fluid/framework/op_proto_maker.h @@ -14,25 +14,26 @@ limitations under the License. */ #pragma once #include -#include - #include "glog/logging.h" #include "paddle/fluid/framework/attribute.h" #include "paddle/fluid/framework/framework.pb.h" namespace paddle { namespace framework { +////////////////////////// +// Don't add more roles to make this too complicated! +////////////////////////// enum class OpRole { kForward = 0x0000, kBackward = 0x0001, kOptimize = 0x0002, - // RPC role is for send/recv releated op - kRPC = 0x0003, + // RPC role is for send/recv related op + kRPC = 0x0004, // Dist role is for split_byref/split_selected_rows/concat // used for distributed training. - kDist = 0x0004, + kDist = 0x0008, // Tag all learning rate scheduler operators. - kLRSched = 0x0005, + kLRSched = 0x0010, kLoss = 0x0100, // The default value of op's role. This should be only used for unittests and @@ -46,6 +47,7 @@ class OpProtoAndCheckerMaker { static const char *OpRoleAttrName() { return "op_role"; } static const char *OpRoleVarAttrName() { return "op_role_var"; } static const char *OpNamescopeAttrName() { return "op_namescope"; } + static const char *OpCreationCallstackAttrName() { return "op_callstack"; } void operator()(proto::OpProto *proto, OpAttrChecker *attr_checker); @@ -73,11 +75,6 @@ class OpProtoAndCheckerMaker { var_->set_dispensable(true); return *this; } - - VariableBuilder &Reuse(const std::string &name) { - var_->set_reuse(name); - return *this; - } }; VariableBuilder AddInput(const std::string &name, const std::string &comment); @@ -85,8 +82,6 @@ class OpProtoAndCheckerMaker { VariableBuilder AddOutput(const std::string &name, const std::string &comment); - void Reuse(const std::string &name, const std::string &reused_name); - template TypedAttrChecker &AddAttr(const std::string &name, const std::string &comment, @@ -105,8 +100,6 @@ class OpProtoAndCheckerMaker { void CheckNoDuplicatedInOutAttrs(); void Validate(); - void CheckReuseVars(); - proto::OpProto *proto_; OpAttrChecker *op_checker_; bool validated_{false}; diff --git a/paddle/fluid/framework/op_proto_maker_test.cc b/paddle/fluid/framework/op_proto_maker_test.cc index b71c7b646857e11f291748c4c7c2af92b6d53231..a8030d377fdb4d4aef74b315e21792dad10fac96 100644 --- a/paddle/fluid/framework/op_proto_maker_test.cc +++ b/paddle/fluid/framework/op_proto_maker_test.cc @@ -47,120 +47,3 @@ TEST(ProtoMaker, DuplicatedInOut) { ASSERT_THROW(proto_maker(&op_proto, &op_checker), paddle::platform::EnforceNotMet); } - -class TestInplaceProtoMaker : public paddle::framework::OpProtoAndCheckerMaker { - public: - void Make() { - AddInput("X", "input of test op"); - AddOutput("XOut", "output of test op").Reuse("X"); - } -}; - -class TestInplaceProtoMaker2 - : public paddle::framework::OpProtoAndCheckerMaker { - public: - void Make() { - AddInput("X", "input of test op"); - AddOutput("XOut", "output of test op").Reuse("X"); - AddOutput("NoOut", "output of test op").Reuse("NotExists"); - } -}; - -TEST(ProtoMaker, InplaceOutput) { - paddle::framework::proto::OpProto op_proto, op_proto2; - paddle::framework::OpAttrChecker op_checker; - TestInplaceProtoMaker proto_maker; - TestInplaceProtoMaker2 proto_maker2; - - proto_maker(&op_proto, &op_checker); - - ASSERT_THROW(proto_maker2(&op_proto2, &op_checker), - paddle::platform::EnforceNotMet); -} - -// normal reuse -class TestReuseProtoMaker : public paddle::framework::OpProtoAndCheckerMaker { - public: - void Make() { - AddInput("X", "input of test op"); - AddInput("Y", "input of test op"); - AddOutput("Out", "output of test op"); - AddOutput("XOut", "output of test op"); - // avoid destructor exception. - // Validate(); - TestReuse(); - } - - virtual void TestReuse() {} -}; - -// test duplicate reuse error -class TestReuseProtoMaker2 : public TestReuseProtoMaker { - public: - void TestReuse() { - Reuse("Out", "X"); - Reuse("Out", "Y"); - } -}; - -// NotExists Input -class TestReuseProtoMaker3 : public TestReuseProtoMaker { - public: - void TestReuse() { - Reuse("Out", "NotExists"); - Reuse("XOut", "X"); - } -}; - -// NotExists Output -class TestReuseProtoMaker4 : public TestReuseProtoMaker { - public: - void TestReuse() { Reuse("NotExists", "X"); } -}; - -TEST(ProtoMaker, Reuse) { - paddle::framework::proto::OpProto op_proto; - paddle::framework::OpAttrChecker op_checker; - TestReuseProtoMaker proto_maker; - proto_maker(&op_proto, &op_checker); -} - -// NOTE(dzhwinter): -// There is a Fatal CHECK on base class destructor, which will call abort inside -// instead of -// throw an exception. If we throw an exception in Make(), we will trigger the -// CHECK and terminate the tests. -// -// I had tried to replace the default CHECK with a exception, however, it's -// still not supported by glog. -// the details: -// https://github.com/google/glog/issues/249 -// https://github.com/facebookresearch/TensorComprehensions/issues/351 -/* -TEST(ProtoMaker, ReuseWithException) { - paddle::framework::proto::OpProto op_proto2, op_proto3, op_proto4; - paddle::framework::OpAttrChecker op_checker; - TestReuseProtoMaker2 proto_maker2; - TestReuseProtoMaker3 proto_maker3; - TestReuseProtoMaker4 proto_maker4; - EXPECT_THROW(proto_maker2(&op_proto2, &op_checker), - paddle::platform::EnforceNotMet); - - EXPECT_THROW(proto_maker3(&op_proto3, &op_checker), - paddle::platform::EnforceNotMet); - - EXPECT_THROW(proto_maker4(&op_proto4, &op_checker), - paddle::platform::EnforceNotMet); -} - -void FailureFunction() { - throw std::runtime_error("Check failed in destructor."); - // return 0; -} - -int main(int argc, char** argv) { - testing::InitGoogleTest(&argc, argv); - google::InstallFailureFunction(&FailureFunction); - return RUN_ALL_TESTS(); -} -*/ diff --git a/paddle/fluid/framework/op_registry.cc b/paddle/fluid/framework/op_registry.cc index bfc411ca2c4a483e344b368da089392d8e4a87c1..346d14d408ea1ed2cfbdbed5f48e56902e6e95b2 100644 --- a/paddle/fluid/framework/op_registry.cc +++ b/paddle/fluid/framework/op_registry.cc @@ -24,7 +24,7 @@ std::unique_ptr OpRegistry::CreateOp( const VariableNameMap& outputs, AttributeMap attrs) { auto& info = OpInfoMap::Instance().Get(type); if (info.Checker() != nullptr) { - info.Checker()->Check(attrs); + info.Checker()->Check(&attrs); } auto op = info.Creator()(type, inputs, outputs, attrs); return std::unique_ptr(op); diff --git a/paddle/fluid/framework/op_registry.h b/paddle/fluid/framework/op_registry.h index ef2eb334a4e7f3f482ba6d62d3f325f109c69302..a53a81c270aeec1b6ee4ed30e77526f4ea2e7977 100644 --- a/paddle/fluid/framework/op_registry.h +++ b/paddle/fluid/framework/op_registry.h @@ -23,12 +23,8 @@ limitations under the License. */ #include #include -#if defined(_WIN32) #define GLOG_NO_ABBREVIATED_SEVERITIES // msvc conflict logging with windows.h -#define GOOGLE_GLOG_DLL_DECL -#endif - -#include "glog/logging.h" // For VLOG() +#include "glog/logging.h" // For VLOG() #include "paddle/fluid/framework/attribute.h" #include "paddle/fluid/framework/details/op_registry.h" #include "paddle/fluid/framework/framework.pb.h" @@ -40,6 +36,7 @@ limitations under the License. */ namespace paddle { namespace framework { + class Registrar { public: // In our design, various kinds of classes, e.g., operators and kernels, @@ -83,7 +80,7 @@ struct OpKernelRegistrarFunctor; template inline void RegisterKernelClass(const char* op_type, const char* library_type, - Func func) { + int customized_type_value, Func func) { std::string library(library_type); std::string data_layout = "ANYLAYOUT"; if (library == "MKLDNN") { @@ -91,7 +88,7 @@ inline void RegisterKernelClass(const char* op_type, const char* library_type, } OpKernelType key(ToDataType(std::type_index(typeid(T))), PlaceType(), StringToDataLayout(data_layout), - StringToLibraryType(library_type)); + StringToLibraryType(library_type), customized_type_value); OperatorWithKernel::AllOpKernels()[op_type][key] = func; } @@ -100,22 +97,26 @@ struct OpKernelRegistrarFunctor { using KERNEL_TYPE = typename std::tuple_element>::type; - void operator()(const char* op_type, const char* library_type) const { + void operator()(const char* op_type, const char* library_type, + int customized_type_value) const { using T = typename KERNEL_TYPE::ELEMENT_TYPE; RegisterKernelClass( - op_type, library_type, [](const framework::ExecutionContext& ctx) { + op_type, library_type, customized_type_value, + + [](const framework::ExecutionContext& ctx) { KERNEL_TYPE().Compute(ctx); }); constexpr auto size = std::tuple_size>::value; OpKernelRegistrarFunctor func; - func(op_type, library_type); + func(op_type, library_type, customized_type_value); } }; template struct OpKernelRegistrarFunctor { - void operator()(const char* op_type, const char* library_type) const {} + void operator()(const char* op_type, const char* library_type, + int customized_type_value) const {} }; // User can register many kernel in one place. The data type could be @@ -123,9 +124,10 @@ struct OpKernelRegistrarFunctor { template class OpKernelRegistrar : public Registrar { public: - explicit OpKernelRegistrar(const char* op_type, const char* library_type) { + explicit OpKernelRegistrar(const char* op_type, const char* library_type, + int customized_type_value) { OpKernelRegistrarFunctor func; - func(op_type, library_type); + func(op_type, library_type, customized_type_value); } }; @@ -135,17 +137,19 @@ struct OpKernelRegistrarFunctorEx; template class OpKernelRegistrarEx : public Registrar { public: - explicit OpKernelRegistrarEx(const char* op_type, const char* library_type) { + explicit OpKernelRegistrarEx(const char* op_type, const char* library_type, + int customized_type_value) { OpKernelRegistrarFunctorEx func; - func(op_type, library_type); + func(op_type, library_type, customized_type_value); } }; template struct OpKernelRegistrarFunctorEx { - void operator()(const char* op_type, const char* library_type) const {} + void operator()(const char* op_type, const char* library_type, + int customized_type_value) const {} }; template @@ -158,18 +162,21 @@ struct OpKernelRegistrarFunctorEx>::type; - void operator()(const char* op_type, const char* library_type) const { - RegisterKernelClass(op_type, library_type, Functor()); + void operator()(const char* op_type, const char* library_type, + int customized_type_value) const { + RegisterKernelClass(op_type, library_type, + customized_type_value, Functor()); constexpr auto size = std::tuple_size>::value; OpKernelRegistrarFunctorEx= size, I + 2, DataTypeAndKernelType...> func; - func(op_type, library_type); + func(op_type, library_type, customized_type_value); } }; +// clang-format off /** * check if MACRO is used in GLOBAL NAMESPACE. */ @@ -204,42 +211,64 @@ struct OpKernelRegistrarFunctorEx \ - __op_kernel_registrar_##op_type##_##library_type##__(#op_type, \ - #library_type); \ - int TouchOpKernelRegistrar_##op_type##_##library_type() { \ - __op_kernel_registrar_##op_type##_##library_type##__.Touch(); \ - return 0; \ +#define REGISTER_OP_KERNEL_WITH_CUSTOM_TYPE(op_type, library_type, \ + place_class, customized_name, \ + customized_type_value, ...) \ + STATIC_ASSERT_GLOBAL_NAMESPACE( \ + __reg_op_kernel_##op_type##_##library_type##_##customized_name##__, \ + "REGISTER_OP_KERNEL must be called in " \ + "global namespace"); \ + static ::paddle::framework::OpKernelRegistrar \ + __op_kernel_registrar_##op_type##_##library_type##_##customized_name##__(\ + #op_type, #library_type, customized_type_value); \ + int TouchOpKernelRegistrar_##op_type##_##library_type##_##customized_name() {\ + __op_kernel_registrar_##op_type##_##library_type##_##customized_name##__ \ + .Touch(); \ + return 0; \ } +#define REGISTER_OP_KERNEL(op_type, library_type, place_class, ...) \ + REGISTER_OP_KERNEL_WITH_CUSTOM_TYPE( \ + op_type, library_type, place_class, DEFAULT_TYPE, \ + ::paddle::framework::OpKernelType::kDefaultCustomizedTypeValue, \ + __VA_ARGS__) + #define REGISTER_OP_CUDA_KERNEL(op_type, ...) \ REGISTER_OP_KERNEL(op_type, CUDA, ::paddle::platform::CUDAPlace, __VA_ARGS__) #define REGISTER_OP_CPU_KERNEL(op_type, ...) \ REGISTER_OP_KERNEL(op_type, CPU, ::paddle::platform::CPUPlace, __VA_ARGS__) -#define REGISTER_OP_KERNEL_EX(op_type, library_type, place_class, ...) \ - STATIC_ASSERT_GLOBAL_NAMESPACE( \ - __reg_op_kernel_##op_type##_##library_type##__, \ - "REGISTER_OP_KERNEL_EX must be called in global namespace"); \ - static ::paddle::framework::OpKernelRegistrarEx \ - __op_kernel_registrar_##op_type##_##library_type##__(#op_type, \ - #library_type); \ - int TouchOpKernelRegistrar_##op_type##_##library_type() { \ - __op_kernel_registrar_##op_type##_##library_type##__.Touch(); \ - return 0; \ +#define REGISTER_OP_KERNEL_EX(op_type, library_type, place_class, \ + customized_name, \ + customized_type_value, \ + ...) \ + STATIC_ASSERT_GLOBAL_NAMESPACE( \ + __reg_op_kernel_##op_type##_##library_type##_##customized_name##__, \ + "REGISTER_OP_KERNEL_EX must be called in " \ + "global namespace"); \ + static ::paddle::framework::OpKernelRegistrarEx \ + __op_kernel_registrar_##op_type##_##library_type##_##customized_name##__(\ + #op_type, #library_type, customized_type_value); \ + int TouchOpKernelRegistrar_##op_type##_##library_type##_##customized_name() {\ + __op_kernel_registrar_##op_type##_##library_type##_##customized_name##__ \ + .Touch(); \ + return 0; \ } #define REGISTER_OP_CUDA_KERNEL_FUNCTOR(op_type, ...) \ - REGISTER_OP_KERNEL_EX(op_type, CUDA, ::paddle::platform::CUDAPlace, \ - __VA_ARGS__) + REGISTER_OP_KERNEL_EX( \ + op_type, CUDA, ::paddle::platform::CUDAPlace, DEFAULT_TYPE, \ + ::paddle::framework::OpKernelType::kDefaultCustomizedTypeValue, \ + __VA_ARGS__) -#define REGISTER_OP_CPU_KERNEL_FUNCTOR(op_type, ...) \ - REGISTER_OP_KERNEL_EX(op_type, CPU, ::paddle::platform::CPUPlace, __VA_ARGS__) +#define REGISTER_OP_CPU_KERNEL_FUNCTOR(op_type, ...) \ + REGISTER_OP_KERNEL_EX( \ + op_type, CPU, ::paddle::platform::CPUPlace, DEFAULT_TYPE, \ + ::paddle::framework::OpKernelType::kDefaultCustomizedTypeValue, \ + __VA_ARGS__) /** * Macro to mark what Operator and Kernel @@ -253,13 +282,19 @@ struct OpKernelRegistrarFunctorEx #include #include - +#include +#include +#include +#include #include "paddle/fluid/framework/data_transform.h" #include "paddle/fluid/framework/executor.h" #include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/op_proto_maker.h" #include "paddle/fluid/framework/operator.h" #include "paddle/fluid/framework/shape_inference.h" +#include "paddle/fluid/framework/transfer_scope_cache.h" #include "paddle/fluid/framework/var_type.h" #include "paddle/fluid/platform/profiler.h" @@ -31,6 +34,7 @@ DECLARE_bool(benchmark); DEFINE_bool(check_nan_inf, false, "Checking whether operator produce NAN/INF or not. It will be " "extremely slow so please use this flag wisely."); +DEFINE_int32(inner_op_parallelism, 0, "number of threads for inner op"); namespace paddle { namespace framework { @@ -44,17 +48,16 @@ std::vector> kKernelPriority = { proto::VarType::Type GetDataTypeOfVar(const Variable* var) { if (var->IsType()) { - return framework::ToDataType(var->Get().type()); + return var->Get().type(); } else if (var->IsType()) { - return framework::ToDataType( - var->Get().value().type()); + return var->Get().value().type(); } else { PADDLE_THROW("Var should be LoDTensor or SelectedRows"); } } -static DDim GetDims(const Scope& scope, const std::string& name, - bool get_actual_dim = false) { +static DDim GetDimsDebug(const Scope& scope, const std::string& name, + bool get_actual_dim = false) { Variable* var = scope.FindVar(name); if (var == nullptr) { return DDim({-1}); @@ -94,13 +97,13 @@ static std::string GetDtype(const Scope& scope, const std::string& name) { if (UNLIKELY(!tensor.IsInitialized())) { return ""; } - return DataTypeToString(ToDataType(tensor.type())); + return DataTypeToString(tensor.type()); } else if (var->IsType()) { auto tensor = var->Get().value(); if (UNLIKELY(!tensor.IsInitialized())) { return "uninited"; } else { - return DataTypeToString(ToDataType(tensor.type())); + return DataTypeToString(tensor.type()); } } else { return ""; @@ -120,7 +123,7 @@ static int GetRowSize(const Scope& scope, const std::string& name) { return -1; } -static LoD GetLoD(const Scope& scope, const std::string& name) { +static LoD GetLoDDebug(const Scope& scope, const std::string& name) { Variable* var = scope.FindVar(name); auto default_lod = LoD({{}}); @@ -139,36 +142,77 @@ static LoD GetLoD(const Scope& scope, const std::string& name) { } } +RuntimeContext::RuntimeContext(const VariableNameMap& innames, + const VariableNameMap& outnames, + const Scope& scope) { + for (auto& var_name_item : innames) { + std::vector& input_vars = inputs[var_name_item.first]; + input_vars.reserve(var_name_item.second.size()); + for (auto& var_name : var_name_item.second) { + input_vars.push_back(scope.FindVar(var_name)); + } + } + for (auto& var_name_item : outnames) { + std::vector& output_vars = outputs[var_name_item.first]; + output_vars.reserve(var_name_item.second.size()); + for (auto& var_name : var_name_item.second) { + output_vars.push_back(scope.FindVar(var_name)); + } + } +} + void OperatorBase::Run(const Scope& scope, const platform::Place& place) { - VLOG(4) << place << " " << DebugStringEx(&scope); - if (platform::is_gpu_place(place)) { + try { + VLOG(4) << place << " " << DebugStringEx(&scope); + if (platform::is_gpu_place(place)) { #ifndef PADDLE_WITH_CUDA - PADDLE_THROW("Cannot run operator on place %s", place); + PADDLE_THROW("Cannot run operator on place %s", place); #else - auto dev_id = boost::get(place).device; - platform::SetDeviceId(dev_id); + auto dev_id = boost::get(place).device; + platform::SetDeviceId(dev_id); #endif - } + } - // The profile has a process-wide mutex, results in serious performance issue - // in concurrency scenerio. Here use an `if` to fix this issue. - // Please not remove the `if`, ask @Superjomn if there are any concern. - if (platform::IsProfileEnabled()) { - platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance(); - platform::RecordEvent record_event(Type(), pool.Get(place)); - RunImpl(scope, place); - } else { - RunImpl(scope, place); + // The profile has a process-wide mutex, results in serious performance + // issue + // in concurrency scenerio. Here use an `if` to fix this issue. + // Please not remove the `if`, ask @Superjomn if there are any concern. + if (platform::IsProfileEnabled()) { + platform::RecordEvent record_event(Type()); + RunImpl(scope, place); + } else { + RunImpl(scope, place); + } + + VLOG(3) << place << " " << DebugStringEx(&scope); + } catch (platform::EnforceNotMet exception) { + if (Attrs().count("sub_block") != 0) { + throw std::move(exception); + } + + auto& callstack = Attr>( + OpProtoAndCheckerMaker::OpCreationCallstackAttrName()); + + if (callstack.empty()) { + throw std::move(exception); + } + std::ostringstream sout; + sout << "Invoke operator " << Type() << " error.\n"; + sout << "Python Callstacks: \n"; + for (auto& line : callstack) { + sout << line; + } + sout << "C++ Callstacks: \n"; + sout << exception.err_str_; + exception.err_str_ = sout.str(); + throw std::move(exception); + } catch (...) { + std::rethrow_exception(std::current_exception()); } - VLOG(3) << place << " " << DebugStringEx(&scope); } bool OperatorBase::HasInputs(const std::string& name) const { - if (inputs_.find(name) != inputs_.end()) { - return true; - } else { - return false; - } + return inputs_.find(name) != inputs_.end(); } std::string OperatorBase::Input(const std::string& name) const { @@ -230,8 +274,8 @@ std::string OperatorBase::DebugStringEx(const Scope* scope) const { } std::string dtype = GetDtype(*scope, var_name); ss << ":" << dtype; - ss << "[" << GetDims(*scope, var_name, true) << "]"; - ss << "(" << GetLoD(*scope, var_name) << ")"; + ss << "[" << GetDimsDebug(*scope, var_name, true) << "]"; + ss << "(" << GetLoDDebug(*scope, var_name) << ")"; } } if (i != input.second.size() - 1) { @@ -259,8 +303,10 @@ std::string OperatorBase::DebugStringEx(const Scope* scope) const { if (row_size >= 0) { ss << "[row_size=" << row_size << "]"; } - ss << "[" << GetDims(*scope, var_name, true) << "]"; - ss << "(" << GetLoD(*scope, var_name) << ")"; + std::string dtype = GetDtype(*scope, output.second[i]); + ss << ":" << dtype; + ss << "[" << GetDimsDebug(*scope, var_name, true) << "]"; + ss << "(" << GetLoDDebug(*scope, var_name) << ")"; } } if (i != output.second.size() - 1) { @@ -281,7 +327,12 @@ OperatorBase::OperatorBase(const std::string& type, const VariableNameMap& inputs, const VariableNameMap& outputs, const AttributeMap& attrs) - : type_(type), inputs_(inputs), outputs_(outputs), attrs_(attrs) { + : type_(type), + inputs_(inputs), + outputs_(outputs), + attrs_(attrs), + // NOTE(zjl): why op_info may be nullptr? + info_(OpInfoMap::Instance().GetNullable(type)) { GenerateTemporaryNames(); CheckAllInputOutputSet(); } @@ -305,7 +356,7 @@ std::vector OperatorBase::OutputVars(bool has_intermediate) const { } return ret_val; } - auto& info = OpInfoMap::Instance().Get(Type()); + auto& info = Info(); // get all OpProto::Var for outputs for (auto& o : info.Proto().outputs()) { @@ -321,18 +372,16 @@ std::vector OperatorBase::OutputVars(bool has_intermediate) const { } void OperatorBase::CheckAllInputOutputSet() const { - auto& info_map = OpInfoMap::Instance(); - auto* op_info = info_map.GetNullable(Type()); - if (op_info == nullptr || op_info->proto_ == nullptr) return; + if (info_ == nullptr || info_->proto_ == nullptr) return; - for (auto& in : op_info->Proto().inputs()) { + for (auto& in : info_->Proto().inputs()) { if (!in.dispensable()) { PADDLE_ENFORCE(inputs_.find(in.name()) != inputs_.end(), "Operator %s's input, %s, is not set", Type(), in.name()); } } - for (auto& out : op_info->Proto().outputs()) { + for (auto& out : info_->Proto().outputs()) { if (!out.dispensable()) { PADDLE_ENFORCE(outputs_.find(out.name()) != outputs_.end(), "Operator %s's output, %s, is not set", Type(), @@ -354,29 +403,29 @@ void OperatorBase::GenerateTemporaryNames() { } } -static bool VarIsTensor(const Variable* var) { - return var->IsType() || var->IsType(); +static bool VarIsTensor(const Variable& var) { + return var.IsType() || var.IsType(); } -static const Tensor* GetTensorFromVar(Variable* var) { - if (var->IsType()) { - return var->GetMutable(); - } else if (var->IsType()) { - return var->GetMutable()->mutable_value(); +const Tensor* GetLoDTensorOrSelectedRowsValueFromVar(const Variable& var) { + if (var.IsType()) { + return static_cast(&(var.Get())); + } else if (var.IsType()) { + return &(var.Get().value()); } else { PADDLE_THROW("Variable type_id %s, expect LoDTensor/SelectedRows.", - var->Type().name()); + ToTypeName(var.Type())); } } -static Tensor* GetMutableTensorFromVar(Variable* var) { +Tensor* GetMutableLoDTensorOrSelectedRowsValueFromVar(Variable* var) { if (var->IsType()) { return var->GetMutable(); } else if (var->IsType()) { return var->GetMutable()->mutable_value(); } else { PADDLE_THROW("Variable type_id %s, expect LoDTensor/SelectedRows.", - var->Type().name()); + ToTypeName(var->Type())); } } @@ -412,44 +461,72 @@ bool ExecutionContext::HasOutput(const std::string& name) const { return var != nullptr; } +const Variable* ExecutionContext::InputVar(const std::string& name) const { + auto it = ctx_.inputs.find(name); + if (it == ctx_.inputs.end()) return nullptr; + + PADDLE_ENFORCE_LE(it->second.size(), 1UL, + "Operator %s's input %s should contain only one variable.", + op_.Type(), name); + return it->second.empty() ? nullptr : it->second[0]; +} + +Variable* ExecutionContext::OutputVar(const std::string& name) const { + auto it = ctx_.outputs.find(name); + if (it == ctx_.outputs.end()) return nullptr; + + PADDLE_ENFORCE_LE(it->second.size(), 1UL, + "Operator %s's output %s should contain only one variable.", + op_.Type(), name); + return it->second.empty() ? nullptr : it->second[0]; +} + template <> const Tensor* ExecutionContext::Input(const std::string& name) const { - auto* var = InputVar(name); - return var == nullptr ? nullptr - : GetTensorFromVar(const_cast(var)); + return Input(name); } template <> const std::vector ExecutionContext::MultiInput( const std::string& name) const { - auto names = op().Inputs(name); + auto it = ctx_.inputs.find(name); + if (it == ctx_.inputs.end()) { + return {}; + } + const std::vector& vars = it->second; std::vector res; - res.reserve(names.size()); - std::transform(names.begin(), names.end(), std::back_inserter(res), - [&](const std::string& sub_name) { - auto var = scope_.FindVar(sub_name); - return var == nullptr ? nullptr : GetTensorFromVar(var); + res.reserve(vars.size()); + std::transform(vars.begin(), vars.end(), std::back_inserter(res), + [&](Variable* var) -> const Tensor* { + if (var == nullptr) return nullptr; + PADDLE_ENFORCE( + var->IsType(), + "should be LoDTensor, but the received type is %s", + ToTypeName(var->Type())); + return &(var->Get()); }); return res; } template <> Tensor* ExecutionContext::Output(const std::string& name) const { - auto var = OutputVar(name); - return var == nullptr ? nullptr : GetMutableTensorFromVar(var); + return Output(name); } template <> std::vector ExecutionContext::MultiOutput( const std::string& name) const { - auto names = op().Outputs(name); + auto it = ctx_.outputs.find(name); + if (it == ctx_.outputs.end()) { + return {}; + } + const std::vector& vars = it->second; std::vector res; - res.reserve(names.size()); - std::transform(names.begin(), names.end(), std::back_inserter(res), - [&](const std::string& sub_name) { - auto var = scope_.FindVar(sub_name); + res.reserve(vars.size()); + std::transform(vars.begin(), vars.end(), std::back_inserter(res), + [&](Variable* var) -> Tensor* { return var == nullptr ? nullptr - : GetMutableTensorFromVar(var); + : var->GetMutable(); }); return res; } @@ -471,51 +548,48 @@ bool OpSupportGPU(const std::string& op_type) { class RuntimeInferShapeContext : public InferShapeContext { public: - RuntimeInferShapeContext(const OperatorBase& op, const Scope& scope) - : op_(op), scope_(scope) {} + RuntimeInferShapeContext(const OperatorBase& op, const Scope& scope, + const RuntimeContext& ctx) + : op_(op), ctx_(ctx) {} bool HasInput(const std::string& name) const override { // has only one input - const auto& ins = op_.Inputs(); + const auto& ins = ctx_.inputs; auto it = ins.find(name); if (it == ins.end()) { return false; } const auto& in = it->second; - if (in.size() == 0 || in[0] == kEmptyVarName) { - return false; - } + if (in.size() == 0) return false; PADDLE_ENFORCE_EQ(in.size(), 1UL, "Input %s should not have more than one inputs", name); - return scope_.FindVar(in[0]) != nullptr; + return in[0] != nullptr; } bool HasOutput(const std::string& name) const override { // has only one output - const auto& outs = op_.Outputs(); + const auto& outs = ctx_.outputs; auto it = outs.find(name); if (it == outs.end()) { return false; } const auto& out = it->second; - if (out.size() == 0 || out[0] == kEmptyVarName) { + if (out.size() == 0) { return false; } PADDLE_ENFORCE_EQ(out.size(), 1UL, "Output %s should not have more than one outputs", name); - return scope_.FindVar(out[0]) != nullptr; + return out[0] != nullptr; } bool HasInputs(const std::string& name) const override { - if (!op_.HasInputs(name)) { - return false; - } - auto inputs = op_.Inputs(name); - if (inputs.empty()) { + const auto& ins = ctx_.inputs; + auto it = ins.find(name); + if (it == ins.end() || it->second.empty()) { return false; } - for (auto& input : inputs) { - if (scope_.FindVar(input) == nullptr) { + for (auto& input : it->second) { + if (input == nullptr) { return false; } } @@ -523,15 +597,13 @@ class RuntimeInferShapeContext : public InferShapeContext { } bool HasOutputs(const std::string& name) const override { - if (!op_.HasOutputs(name)) { - return false; - } - auto outputs = op_.Outputs(name); - if (outputs.empty()) { + const auto& outs = ctx_.outputs; + auto it = outs.find(name); + if (it == outs.end() || it->second.empty()) { return false; } - for (auto& output : outputs) { - if (scope_.FindVar(output) == nullptr) { + for (auto& output : it->second) { + if (output == nullptr) { return false; } } @@ -552,16 +624,18 @@ class RuntimeInferShapeContext : public InferShapeContext { void ShareDim(const std::string& in, const std::string& out, size_t i = 0, size_t j = 0) override { - PADDLE_ENFORCE_LT(i, Inputs(in).size()); - PADDLE_ENFORCE_LT(j, Outputs(out).size()); - const std::string& input_n = Inputs(in)[i]; - const std::string& output_n = Outputs(out)[j]; + auto in_it = ctx_.inputs.find(in); + auto out_it = ctx_.outputs.find(out); + PADDLE_ENFORCE(in_it != ctx_.inputs.end() && in_it->second.size() > i, + "Inputs %s should have %llu argument", in, i); + PADDLE_ENFORCE(out_it != ctx_.outputs.end() && out_it->second.size() > j, + "Outputs %s should have %llu argument", out, j); + + Variable* in_var = in_it->second[i]; + Variable* out_var = out_it->second[j]; - Variable* in_var = scope_.FindVar(input_n); - Variable* out_var = scope_.FindVar(output_n); PADDLE_ENFORCE(in_var->Type() == out_var->Type(), - "The type of %s and %s is not the same.", output_n, - GetDim(input_n)); + "The type of %s and %s is not the same.", in, out); if (in_var->IsType()) { auto& in_sele_rows = in_var->Get(); @@ -582,13 +656,16 @@ class RuntimeInferShapeContext : public InferShapeContext { void ShareLoD(const std::string& in, const std::string& out, size_t i = 0, size_t j = 0) const override { - const std::vector& inputs = Inputs(in); - const std::vector& outputs = Outputs(out); - PADDLE_ENFORCE_LT(i, inputs.size()); - PADDLE_ENFORCE_LT(j, outputs.size()); - Variable* in_var = scope_.FindVar(inputs.at(i)); + auto in_it = ctx_.inputs.find(in); + auto out_it = ctx_.outputs.find(out); + PADDLE_ENFORCE(in_it != ctx_.inputs.end() && in_it->second.size() > i, + "Inputs %s should have %llu argument", in, i); + PADDLE_ENFORCE(out_it != ctx_.outputs.end() && out_it->second.size() > j, + "Outputs %s should have %llu argument", out, j); + + Variable* in_var = in_it->second.at(i); if (!in_var->IsType()) return; - Variable* out_var = scope_.FindVar(outputs.at(j)); + Variable* out_var = out_it->second.at(j); PADDLE_ENFORCE(out_var->IsType(), "The %d-th output of Output(%s) must be LoDTensor.", j, out); auto in_tensor = in_var->Get(); @@ -616,11 +693,71 @@ class RuntimeInferShapeContext : public InferShapeContext { out_tensor->set_layout(in_tensor.layout()); } + void DecreaseLoDLevel(const std::string& in, const std::string& out, + size_t i = 0, size_t j = 0) const override { + PADDLE_THROW("DecreaseLoDLevel is only used in compile time."); + } + bool IsRuntime() const override { return true; } + // TODO(paddle-dev): Can this be template? + std::vector GetInputVarPtrs( + const std::string& name) override { + const std::vector& vars = InputVars(name); + std::vector res; + res.reserve(vars.size()); + res.insert(res.begin(), vars.begin(), vars.end()); + return res; + } + + std::vector GetOutputVarPtrs( + const std::string& name) override { + const std::vector& vars = OutputVars(name); + std::vector res; + res.reserve(vars.size()); + res.insert(res.begin(), vars.begin(), vars.end()); + return res; + } + + DDim GetInputDim(const std::string& name) const override { + const std::vector& vars = InputVars(name); + PADDLE_ENFORCE_EQ(vars.size(), 1UL, + "Input(%s) should hold one element, but now it holds %d", + name, vars.size()); + return this->GetDim(vars[0]); + } + + std::vector GetInputsDim(const std::string& name) const override { + const std::vector& vars = InputVars(name); + return GetDims(vars); + } + + std::vector GetInputsVarType( + const std::string& name) const override { + return GetVarTypes(InputVars(name)); + } + + std::vector GetOutputsVarType( + const std::string& name) const override { + return GetVarTypes(OutputVars(name)); + } + + void SetOutputDim(const std::string& name, const DDim& dim) override { + auto& vars = OutputVars(name); + PADDLE_ENFORCE_EQ(vars.size(), 1UL, + "Output(%s) should hold one element, but now it holds %d", + name, vars.size()); + SetDim(vars[0], dim); + } + + void SetOutputsDim(const std::string& name, + const std::vector& dims) override { + auto& vars = OutputVars(name); + SetDims(vars, dims); + } + protected: - DDim GetDim(const std::string& name) const override { - Variable* var = scope_.FindVar(name); + DDim GetDim(Variable* var) const { PADDLE_ENFORCE_NOT_NULL(var); if (var->IsType()) { return var->Get().dims(); @@ -628,25 +765,44 @@ class RuntimeInferShapeContext : public InferShapeContext { return var->Get().GetCompleteDims(); } else { PADDLE_THROW( - "Only LoDTensor/SelectedRows support 'GetDim', but Variable %s's " + "Only LoDTensor/SelectedRows support 'GetDim', but Variables " "type_id is %s.", - name, var->Type().name()); + ToTypeName(var->Type())); } } + std::vector GetDims(const std::vector& vars) const { + std::vector ret; + ret.reserve(vars.size()); + std::transform(vars.begin(), vars.end(), std::back_inserter(ret), + [this](Variable* var) { return this->GetDim(var); }); + return ret; + } + std::vector GetRepeatedDims(const std::string& name) const override { PADDLE_THROW("Only compile time support this method"); } - void SetDim(const std::string& name, const DDim& dim) override { - Variable* var = scope_.FindVar(name); + void SetDim(Variable* var, const DDim& dim) { if (var->IsType()) { var->GetMutable()->Resize(dim); } else if (var->IsType()) { var->GetMutable()->set_height(dim[0]); } else { - PADDLE_THROW("Variable %s type_id %s, expect LoDTensor/SelectedRows.", - name, var->Type().name()); + PADDLE_THROW("Variable type_id %s, expect LoDTensor/SelectedRows.", + ToTypeName(var->Type())); + } + } + + void SetDims(const std::vector& vars, + const std::vector& dims) { + size_t length = vars.size(); + PADDLE_ENFORCE_EQ(length, dims.size()); + for (size_t i = 0; i < length; ++i) { + if (vars[i] == nullptr) { + continue; + } + SetDim(vars[i], dims[i]); } } @@ -655,92 +811,130 @@ class RuntimeInferShapeContext : public InferShapeContext { PADDLE_THROW("Only compile time support this method"); } - proto::VarType::Type GetVarType(const std::string& name) const override { - auto* var = scope_.FindVar(name); - return ToVarType(var->Type()); + std::vector GetVarTypes( + const std::vector& vars) const { + std::vector retv; + retv.resize(vars.size()); + std::transform(vars.begin(), vars.end(), retv.begin(), + std::bind(std::mem_fn(&RuntimeInferShapeContext::GetVarType), + this, std::placeholders::_1)); + return retv; } - InferShapeVarPtr GetVarPtr(const std::string& name) override { - return scope_.FindVar(name); + proto::VarType::Type GetVarType(Variable* var) const { + return ToVarType(var->Type()); } private: + const std::vector& InputVars(const std::string& name) const { + auto it = ctx_.inputs.find(name); + PADDLE_ENFORCE(it != ctx_.inputs.end(), + "Operator %s does not have the input %s.", op_.Type(), name); + return it->second; + } + + const std::vector& OutputVars(const std::string& name) const { + auto it = ctx_.outputs.find(name); + PADDLE_ENFORCE(it != ctx_.outputs.end(), + "Operator %s does not have the outputs %s.", op_.Type(), + name); + return it->second; + } + const OperatorBase& op_; - const Scope& scope_; + const RuntimeContext& ctx_; }; -static void CheckTensorNANOrInf(const std::string& name, +static void CheckTensorNANOrInf(const std::string& op_type, + const std::string& name, const framework::Tensor& tensor) { if (tensor.memory_size() == 0) { return; } - if (!IsType(tensor.type()) && !IsType(tensor.type())) { + if (tensor.type() != proto::VarType::FP32 && + tensor.type() != proto::VarType::FP64) { return; } PADDLE_ENFORCE(!framework::TensorContainsInf(tensor), - "Tensor %s contains Inf", name); + "Operator %s output Tensor %s contains Inf", op_type, name); PADDLE_ENFORCE(!framework::TensorContainsNAN(tensor), - "Tensor %s contains NAN", name); + "Operator %s output Tensor %s contains NAN", op_type, name); } -void OperatorWithKernel::RunImpl(const Scope& scope, - const platform::Place& place) const { - RuntimeInferShapeContext infer_shape_ctx(*this, scope); +void OperatorWithKernel::RuntimeInferShape(const Scope& scope, + const platform::Place& place, + const RuntimeContext& ctx) const { + RuntimeInferShapeContext infer_shape_ctx(*this, scope, ctx); this->InferShape(&infer_shape_ctx); - platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance(); - auto* dev_ctx = pool.Get(place); +} - // check if op[type] has kernel registered. - auto& all_op_kernels = AllOpKernels(); - auto kernels_iter = all_op_kernels.find(type_); - if (kernels_iter == all_op_kernels.end()) { - PADDLE_THROW( - "There are no kernels which are registered in the %s operator.", type_); +std::vector* OperatorWithKernel::GetKernelConfig( + const OpKernelType& key) const { + auto config_iter = kernel_configs_map_.find(key); + std::vector* kernel_configs = nullptr; + if (config_iter != kernel_configs_map_.end()) { + kernel_configs = &(config_iter->second); } + return kernel_configs; +} - OpKernelMap& kernels = kernels_iter->second; - - // TODO(dzhwinter) : kernel fallback mechanism will be added when all the - // transform functions are ready. - - // for (auto& candidate : kKernelPriority) { - // Do selection - // } +void OperatorWithKernel::RunImpl(const Scope& scope, + const platform::Place& place) const { + // To reduce the elapsed time of HasAttr, we use bool variable to record the + // result of HasAttr. + if (!enable_cache_runtime_context && HasAttr(kEnableCacheRuntimeContext)) + enable_cache_runtime_context = true; + if (!enable_cache_expected_kernel && HasAttr(kEnableCacheExpectedKernel)) + enable_cache_expected_kernel = true; + if (!all_kernels_must_compute_runtime_shape && + HasAttr(kAllKernelsMustComputeRuntimeShape)) + all_kernels_must_compute_runtime_shape = true; + if (!enable_cache_runtime_context) { + RuntimeContext ctx(Inputs(), Outputs(), scope); + RunImpl(scope, place, &ctx); + } else { + const Scope* cur_scope = &scope; + if (!runtime_ctx_ || pre_scope_ != cur_scope) { + runtime_ctx_.reset(new RuntimeContext(Inputs(), Outputs(), scope)); + pre_scope_ = cur_scope; + } + RunImpl(scope, place, runtime_ctx_.get()); + } +} - auto expected_kernel_key = - this->GetExpectedKernelType(ExecutionContext(*this, scope, *dev_ctx)); - VLOG(3) << "expected_kernel_key:" << expected_kernel_key; +void OperatorWithKernel::RunImpl(const Scope& scope, + const platform::Place& place, + RuntimeContext* runtime_ctx) const { + platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance(); + auto* dev_ctx = pool.Get(place); - auto kernel_iter = kernels.find(expected_kernel_key); -#ifdef PADDLE_WITH_MKLDNN - // workaround for missing MKLDNN kernel when FLAGS_use_mkldnn env var is set - if (kernel_iter == kernels.end() && - expected_kernel_key.library_type_ == LibraryType::kMKLDNN) { - VLOG(3) << "missing MKLDNN kernel: fallbacking to PLAIN one"; - expected_kernel_key.library_type_ = LibraryType::kPlain; - expected_kernel_key.data_layout_ = DataLayout::kAnyLayout; - kernel_iter = kernels.find(expected_kernel_key); - } -#endif - if (kernel_iter == kernels.end()) { - PADDLE_THROW("op %s does not have kernel for %s", type_, - KernelTypeToString(expected_kernel_key)); + if (!enable_cache_expected_kernel || !kernel_type_) { + ChooseKernel(*runtime_ctx, scope, place); } + std::vector* kernel_configs = GetKernelConfig(*kernel_type_); + // do data transformScope &transfer_scope; std::vector transfered_inplace_vars; auto* transfer_scope = - TryTransferData(scope, expected_kernel_key, &transfered_inplace_vars); + PrepareData(scope, *kernel_type_, &transfered_inplace_vars, runtime_ctx); // exec scope is the scope that kernel actually executed on. const Scope& exec_scope = (transfer_scope == nullptr ? scope : *transfer_scope); - if (!(expected_kernel_key.place_ == dev_ctx->GetPlace())) { - dev_ctx = pool.Get(expected_kernel_key.place_); + if (!(kernel_type_->place_ == dev_ctx->GetPlace())) { + dev_ctx = pool.Get(kernel_type_->place_); } - kernel_iter->second(ExecutionContext(*this, exec_scope, *dev_ctx)); + if (!all_kernels_must_compute_runtime_shape) { + RuntimeInferShapeContext infer_shape_ctx(*this, exec_scope, *runtime_ctx); + this->InferShape(&infer_shape_ctx); + } + // TODO(panyx0718): ExecutionContext should only depend on RuntimeContext + // not Scope. Imperative mode only pass inputs and get outputs. + (*kernel_func_)(ExecutionContext(*this, exec_scope, *dev_ctx, *runtime_ctx, + kernel_configs)); if (!transfered_inplace_vars.empty()) { // there is inplace variable has been transfered. @@ -757,38 +951,111 @@ void OperatorWithKernel::RunImpl(const Scope& scope, auto* var = exec_scope.FindVar(vname); if (var == nullptr) continue; if (var->IsType()) { - CheckTensorNANOrInf(vname, var->Get()); + CheckTensorNANOrInf(type_, vname, var->Get()); } else if (var->IsType()) { - CheckTensorNANOrInf(vname, var->Get().value()); + CheckTensorNANOrInf(type_, vname, + var->Get().value()); } } } } + +void OperatorWithKernel::ChooseKernel(const RuntimeContext& ctx, + const Scope& scope, + const platform::Place& place) const { + platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance(); + auto* dev_ctx = pool.Get(place); + + // check if op[type] has kernel registered. + auto& all_op_kernels = AllOpKernels(); + auto kernels_iter = all_op_kernels.find(type_); + if (kernels_iter == all_op_kernels.end()) { + PADDLE_THROW( + "There are no kernels which are registered in the %s operator.", type_); + } + + OpKernelMap& kernels = kernels_iter->second; + + auto expected_kernel_key = this->GetExpectedKernelType( + ExecutionContext(*this, scope, *dev_ctx, ctx, nullptr)); + VLOG(3) << "expected_kernel_key:" << expected_kernel_key; + + auto kernel_iter = kernels.find(expected_kernel_key); +#ifdef PADDLE_WITH_MKLDNN + // workaround for missing MKLDNN kernel when FLAGS_use_mkldnn env var is set + if (kernel_iter == kernels.end() && + expected_kernel_key.library_type_ == LibraryType::kMKLDNN) { + VLOG(3) << "missing MKLDNN kernel: fallbacking to PLAIN one"; + expected_kernel_key.library_type_ = LibraryType::kPlain; + expected_kernel_key.data_layout_ = DataLayout::kAnyLayout; + kernel_iter = kernels.find(expected_kernel_key); + } +#endif + if (kernel_iter == kernels.end()) { + PADDLE_THROW("op %s does not have kernel for %s", type_, + KernelTypeToString(expected_kernel_key)); + } + + kernel_type_.reset(new OpKernelType(expected_kernel_key)); + kernel_func_.reset(new OpKernelFunc(kernel_iter->second)); +} + void OperatorWithKernel::TransferInplaceVarsBack( const Scope& scope, const std::vector& inplace_vars, const Scope& transfer_scope) const { for (auto& var_name : inplace_vars) { VLOG(3) << "share inplace var " + var_name + " back to it's original scope"; - auto* original_tensor = GetMutableTensorFromVar(scope.FindVar(var_name)); - auto* transformed_tensor = - GetTensorFromVar(transfer_scope.FindVar(var_name)); + auto* origin_var = scope.FindVar(var_name); + PADDLE_ENFORCE_NOT_NULL(origin_var, "The var[%s] should not be nullptr.", + var_name); + auto* original_tensor = + GetMutableLoDTensorOrSelectedRowsValueFromVar(origin_var); + auto* var = transfer_scope.FindVar(var_name); + PADDLE_ENFORCE_NOT_NULL(var, "The var[%s] should not be nullptr.", + var_name); + auto* transformed_tensor = GetLoDTensorOrSelectedRowsValueFromVar(*var); original_tensor->ShareDataWith(*transformed_tensor); } } -Scope* OperatorWithKernel::TryTransferData( +Scope* OperatorWithKernel::PrepareData( const Scope& scope, const OpKernelType& expected_kernel_key, - std::vector* transfered_inplace_vars) const { + std::vector* transfered_inplace_vars, + RuntimeContext* ctx) const { Scope* new_scope = nullptr; + + std::unordered_set no_buffer_ins; + if (info_) { + auto& no_buffer_inferer = info_->NoNeedBufferVarsInferer(); + // Some op may not register NoNeedBufferVarsInferer + if (no_buffer_inferer) { + no_buffer_ins = no_buffer_inferer(Inputs(), Outputs(), Attrs()); + } + } + for (auto& var_name_item : Inputs()) { - for (auto& var_name : var_name_item.second) { - auto* var = scope.FindVar(var_name); + // NOTE(zjl): STL does not guarantee fast std::unordered_set::count when set + // is empty. At least STL implemented on my mac does calculate hash code + // of search key even though the set is empty. + if (!no_buffer_ins.empty() && + no_buffer_ins.count(var_name_item.first) > 0) { + VLOG(7) << "Skip scanning input " << var_name_item.first + << " in Operator " << type_; + continue; + } + + std::vector& input_vars = ctx->inputs[var_name_item.first]; + + for (size_t i = 0; i < var_name_item.second.size(); ++i) { + auto& var_name = var_name_item.second[i]; + auto* var = input_vars[i]; + // Only tensor can be tranfer to another device. - if (var == nullptr || !VarIsTensor(var)) { + if (var == nullptr || !VarIsTensor(*var)) { continue; } - auto* tensor_in = GetTensorFromVar(var); + auto* tensor_in = GetLoDTensorOrSelectedRowsValueFromVar(*var); if (!tensor_in->IsInitialized()) { continue; } @@ -809,11 +1076,29 @@ Scope* OperatorWithKernel::TryTransferData( VLOG(3) << "Transform Variable " << var_name << " from " << kernel_type_for_var << " to " << expected_kernel_key; - if (new_scope == nullptr) { + // In the inference scenerio, the scopes will be reused across the + // batches, so the `new_scope` here will result in GPU memroy explosion + // over the running of operators. + // We use a thread_local cache to fix that issue, the key in the cache is + // the combination of the `scope` argument, from_kernel_type, + // target_kernel_type. + // Have a discussion with @Superjomn or the inference developers if some + // changes on this logic for this macro might not tested on the other + // scenerios. + // If this op is not called by an Executor or ParallelExecutor, it should + // called by a NaiveExecutor, the NaiveExecutor will cache the scopes and + // variables, that behavior a lot different. + if (!run_by_executor_) { + new_scope = TryCreateTransferScope(kernel_type_for_var, + expected_kernel_key, &scope); + } + if (!new_scope) { new_scope = &scope.NewScope(); } auto* trans_var = new_scope->Var(var_name); + input_vars[i] = trans_var; + Tensor out; TransformData(expected_kernel_key, kernel_type_for_var, *tensor_in, &out); SetTensorToVariable(*var, out, trans_var); @@ -825,12 +1110,13 @@ Scope* OperatorWithKernel::TryTransferData( proto::VarType::Type OperatorWithKernel::IndicateDataType( const ExecutionContext& ctx) const { - auto& scope = ctx.scope(); - int data_type = -1; - std::string last_input_name; + proto::VarType::Type dafault_data_type = + static_cast(-1); + proto::VarType::Type data_type = dafault_data_type; for (auto& input : this->inputs_) { - for (auto& ipt_name : input.second) { - auto* var = scope.FindVar(ipt_name); + const std::vector vars = ctx.MultiInputVar(input.first); + for (size_t i = 0; i < vars.size(); ++i) { + const Variable* var = vars[i]; if (var != nullptr) { const Tensor* t = nullptr; if (var->IsType()) { @@ -841,19 +1127,22 @@ proto::VarType::Type OperatorWithKernel::IndicateDataType( t = &(var->Get().value()); } if (t != nullptr) { - int tmp = static_cast(ToDataType(t->type())); + PADDLE_ENFORCE(t->IsInitialized(), "Input %s(%lu)is not initialized", + input.first, i); + proto::VarType::Type tmp = t->type(); PADDLE_ENFORCE( - tmp == data_type || data_type == -1, - "DataType of Paddle Op %s must be the same. Get %s(%d) != %s(%d)", - Type(), last_input_name, data_type, ipt_name, tmp); + tmp == data_type || data_type == dafault_data_type, + "DataType of Paddle Op %s %s must be the same. Get (%d) != (%d)", + Type(), input.first, DataTypeToString(data_type), + DataTypeToString(tmp)); data_type = tmp; - last_input_name = ipt_name; } } } } - PADDLE_ENFORCE(data_type != -1, "DataType should be indicated by input"); - return static_cast(data_type); + PADDLE_ENFORCE(data_type != dafault_data_type, + "DataType should be indicated by input"); + return data_type; } OpKernelType OperatorWithKernel::GetExpectedKernelType( diff --git a/paddle/fluid/framework/operator.h b/paddle/fluid/framework/operator.h index 626b50edfd39424473be33e9f8baec5970471477..489b66099658d522fe1f1adaad763b66bdd22c91 100644 --- a/paddle/fluid/framework/operator.h +++ b/paddle/fluid/framework/operator.h @@ -16,12 +16,12 @@ limitations under the License. */ #include #include +#include #include #include #include +#include #include -#define GLOG_NO_ABBREVIATED_SEVERITIES -#define GOOGLE_GLOG_DLL_DECL #include "glog/logging.h" // For VLOG #include "paddle/fluid/framework/attribute.h" @@ -30,12 +30,15 @@ limitations under the License. */ #include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/framework/op_info.h" #include "paddle/fluid/framework/op_kernel_type.h" +#include "paddle/fluid/framework/operator_kernel_configs.h" #include "paddle/fluid/framework/scope.h" #include "paddle/fluid/framework/selected_rows.h" #include "paddle/fluid/framework/tensor.h" #include "paddle/fluid/platform/device_context.h" #include "paddle/fluid/platform/variant.h" +DECLARE_int32(inner_op_parallelism); + namespace paddle { namespace framework { @@ -51,24 +54,80 @@ constexpr char kTempVarName[] = "@TEMP@"; /// e.g. Variable "x@GRAD" is the gradient of varibale "x". constexpr char kGradVarSuffix[] = "@GRAD"; +constexpr size_t kGradVarSuffixSize = 5U; + /// Variables with this suffix are supposed to be filled up with zeros. constexpr char kZeroVarSuffix[] = "@ZERO"; +/// Variables with this suffix are the new Gradient. +constexpr char kNewGradSuffix[] = "@NEWGRAD@"; + +/// RuntimeContext is used to relate input/output names of Operator with +/// the corresponding variables in name scope. +/// If an Op has attribute kEnableCacheRuntimeContext, it means that in a same +/// name scope, since the input/output names of this Op do not change in the +/// execution, RuntimeContext could be created only at the first iteration of +/// this Op's execution to save the elapsed time. +constexpr char kEnableCacheRuntimeContext[] = "@ENABLE_CACHE_RUNTIME_CONTEXT@"; + +/// If an Op has attribtue kEnableCacheExpectedKernel, it means that in a same +/// name scope and same place, since the expected kerenl of this Op does not +/// change in the execution, it could be recorded only at the first iteration of +/// this Op's execution to save the elapsed time. +constexpr char kEnableCacheExpectedKernel[] = "@ENABLE_CACHE_EXPECTED_KERNEL@"; + +/// If an Op has this attribute, all its kernels should calculate output +/// variable's shape in the corresponding Compute() function. And +/// OperatorWithKernel::RunImpl() would skip call this Op's InferShape() +/// function in its runtime for speedup. +/// TODO(luotao): Note that this temporal attribute would be deleted after all +/// ops contain it. +constexpr char kAllKernelsMustComputeRuntimeShape[] = + "@ALL_KERNELS_MUST_COMPUTE_RUNTIME_SHAPE@"; + // define some kernel priority /* Define multiple kernel type fallback order*/ extern std::vector> kKernelPriority; inline std::string GradVarName(const std::string& var_name) { - return var_name + kGradVarSuffix; + std::string result; + result.reserve(var_name.size() + kGradVarSuffixSize); + result += var_name; + result += kGradVarSuffix; + return result; +} + +inline std::string GradOriginalVarName(const std::string& grad_var_name) { + std::size_t pos = grad_var_name.rfind(kGradVarSuffix); + if (pos == std::string::npos) { + return grad_var_name; + } else { + return grad_var_name.substr(0, pos); + } } proto::VarType::Type GetDataTypeOfVar(const Variable* var); +const Tensor* GetLoDTensorOrSelectedRowsValueFromVar(const Variable& var); +Tensor* GetMutableLoDTensorOrSelectedRowsValueFromVar(Variable* var); class OperatorBase; class ExecutionContext; +class RuntimeContext { + public: + RuntimeContext(const VariableNameMap& innames, + const VariableNameMap& outnames, const Scope& scope); + + RuntimeContext(const VariableValueMap& invars, + const VariableValueMap& outvars) + : inputs(invars), outputs(outvars) {} + + VariableValueMap inputs; + VariableValueMap outputs; +}; + /** - * OperatorBase has the basic element that Net will call to do computation. + * OperatorBase has the basic elements that Net will call to do computation. * Only CreateOperator from OpRegistry will new Operator directly. User * should always construct a proto message OpDesc and call * OpRegistry::CreateOp(op_desc) to get an Operator instance. @@ -95,10 +154,11 @@ class OperatorBase { const std::string& Type() const { return type_; } + bool HasAttr(const std::string& name) const { return attrs_.count(name); } template inline const T& Attr(const std::string& name) const { - PADDLE_ENFORCE(attrs_.count(name) != 0, "%s should be in AttributeMap", - name); + PADDLE_ENFORCE(attrs_.find(name) != attrs_.end(), + "%s should be in AttributeMap", name); return boost::get(attrs_.at(name)); } const AttributeMap& Attrs() const { return attrs_; } @@ -106,6 +166,11 @@ class OperatorBase { const VariableNameMap& Inputs() const { return inputs_; } const VariableNameMap& Outputs() const { return outputs_; } + const OpInfo& Info() const { + PADDLE_ENFORCE_NOT_NULL(info_, "OpInfo of %s is not found", type_); + return *info_; + } + bool HasInputs(const std::string& name) const; //! Get a input with argument's name described in `op_proto` std::string Input(const std::string& name) const; @@ -123,6 +188,11 @@ class OperatorBase { //! Get all outputs variable names virtual std::vector OutputVars(bool has_intermediate) const; + void SetIsCalledByExecutor(bool x) { run_by_executor_ = x; } + virtual void RuntimeInferShape(const Scope& scope, + const platform::Place& place, + const RuntimeContext& ctx) const {} + protected: std::string type_; // NOTE: in case of OpGrad, inputs_ contains: @@ -136,6 +206,12 @@ class OperatorBase { VariableNameMap outputs_; AttributeMap attrs_; + // OpInfo + const OpInfo* info_; + + // Whether this operator executes in an Executor. + bool run_by_executor_{true}; + private: void GenerateTemporaryNames(); void CheckAllInputOutputSet() const; @@ -143,11 +219,30 @@ class OperatorBase { const platform::Place& place) const = 0; }; +#ifdef PADDLE_WITH_CUDA +using KernelConfig = boost::variant< + std::shared_ptr>, + std::shared_ptr>, + std::shared_ptr>>; +#else +using KernelConfig = boost::variant; +#endif + +using OpKernelConfigsMap = + std::unordered_map, + OpKernelType::Hash>; + class ExecutionContext { public: ExecutionContext(const OperatorBase& op, const Scope& scope, - const platform::DeviceContext& device_context) - : op_(op), scope_(scope), device_context_(device_context) {} + const platform::DeviceContext& device_context, + const RuntimeContext& ctx, + std::vector* configs) + : op_(op), + scope_(scope), + device_context_(device_context), + ctx_(ctx), + kernel_configs_(configs) {} const OperatorBase& op() const { return op_; } @@ -170,39 +265,26 @@ class ExecutionContext { return op_.Outputs(name).size(); } - const Variable* InputVar(const std::string& name) const { - auto ipt = op_.Input(name); - return ipt == kEmptyVarName ? nullptr : scope_.FindVar(ipt); - } + const Variable* InputVar(const std::string& name) const; - Variable* OutputVar(const std::string& name) const { - auto opt = op_.Output(name); - return opt == kEmptyVarName ? nullptr : scope_.FindVar(opt); - } + Variable* OutputVar(const std::string& name) const; const std::vector MultiInputVar( const std::string& name) const { - auto names = op_.Inputs(name); - std::vector res; - res.reserve(names.size()); - std::transform(names.begin(), names.end(), std::back_inserter(res), - [this](const std::string& name) { - return name == kEmptyVarName ? nullptr - : scope_.FindVar(name); - }); - return res; + auto it = ctx_.inputs.find(name); + if (it == ctx_.inputs.end()) { + return {}; + } + return {it->second.begin(), it->second.end()}; } std::vector MultiOutputVar(const std::string& name) const { auto names = op_.Outputs(name); - std::vector res; - res.reserve(names.size()); - std::transform(names.begin(), names.end(), std::back_inserter(res), - [this](const std::string& name) { - return name == kEmptyVarName ? nullptr - : scope_.FindVar(name); - }); - return res; + auto it = ctx_.outputs.find(name); + if (it == ctx_.outputs.end()) { + return {}; + } + return it->second; } template @@ -219,12 +301,15 @@ class ExecutionContext { template const std::vector MultiInput(const std::string& name) const { - auto names = op_.Inputs(name); + auto it = ctx_.inputs.find(name); + if (it == ctx_.inputs.end()) { + return {}; + } + const std::vector& vars = it->second; std::vector res; - res.reserve(names.size()); - std::transform(names.begin(), names.end(), std::back_inserter(res), - [&](const std::string& sub_name) { - auto var = scope_.FindVar(sub_name); + res.reserve(vars.size()); + std::transform(vars.begin(), vars.end(), std::back_inserter(res), + [&](Variable* var) -> const T* { return var == nullptr ? nullptr : &var->Get(); }); return res; @@ -232,12 +317,15 @@ class ExecutionContext { template std::vector MultiOutput(const std::string& name) const { - auto names = op_.Outputs(name); + auto it = ctx_.outputs.find(name); + if (it == ctx_.outputs.end()) { + return {}; + } + const std::vector& vars = it->second; std::vector res; - res.reserve(names.size()); - std::transform(names.begin(), names.end(), std::back_inserter(res), - [&](const std::string& sub_name) { - auto var = scope_.FindVar(sub_name); + res.reserve(vars.size()); + std::transform(vars.begin(), vars.end(), std::back_inserter(res), + [&](Variable* var) -> T* { return var == nullptr ? nullptr : var->GetMutable(); }); return res; @@ -272,10 +360,44 @@ class ExecutionContext { return op_.Outputs(name); } + template + Tensor AllocateTmpTensor(const framework::DDim& dim, + const DevContext& dev_ctx) const { + auto tmp_allocation_ptr = platform::DeviceTemporaryAllocator::Instance() + .Get(dev_ctx) + .Allocate(product(dim) * sizeof(T)); + auto& deleter = tmp_allocation_ptr.get_deleter(); + auto* allocation_ptr = tmp_allocation_ptr.release(); + auto shared_allocation = std::shared_ptr( + allocation_ptr, deleter); + + PADDLE_ENFORCE( + dynamic_cast(allocation_ptr) != nullptr, + "The AllocationPtr must be TemporaryAllocation."); + PADDLE_ENFORCE_GE(allocation_ptr->size(), + framework::product(dim) * sizeof(T)); + + paddle::framework::Tensor temp_tensor( + framework::ToDataType(std::type_index(typeid(T)))); + temp_tensor.Resize(dim); + temp_tensor.ResetHolder(std::move(shared_allocation)); + return temp_tensor; + } + + template + T& GetKernelConfig(int idx) const { + PADDLE_ENFORCE(kernel_configs_ && kernel_configs_->size() > idx, + "%s selected kernel doesn't have kernel config %lu <= %d", + op_.Type().c_str(), kernel_configs_->size(), idx); + return *boost::get>(kernel_configs_->at(idx)); + } + private: const OperatorBase& op_; const Scope& scope_; const platform::DeviceContext& device_context_; + const RuntimeContext& ctx_; + mutable std::vector* kernel_configs_; }; template <> @@ -337,11 +459,17 @@ class OperatorWithKernel : public OperatorBase { } virtual void InferShape(InferShapeContext* ctx) const { - OpInfoMap::Instance().Get(Type()).infer_shape_(ctx); + Info().infer_shape_(ctx); } - protected: + void RuntimeInferShape(const Scope& scope, const platform::Place& place, + const RuntimeContext& ctx) const override; + virtual OpKernelType GetExpectedKernelType(const ExecutionContext& ctx) const; + + std::vector* GetKernelConfig(const OpKernelType& key) const; + + protected: virtual OpKernelType GetKernelTypeForVar( const std::string& var_name, const Tensor& tensor, const OpKernelType& expected_kernel_type) const; @@ -351,6 +479,8 @@ class OperatorWithKernel : public OperatorBase { // same. proto::VarType::Type IndicateDataType(const ExecutionContext& ctx) const; void RunImpl(const Scope& scope, const platform::Place& place) const final; + void RunImpl(const Scope& scope, const platform::Place& place, + RuntimeContext* runtime_ctx) const; /** * Transfer data from scope to a transfered scope. If there is no data need to @@ -358,13 +488,27 @@ class OperatorWithKernel : public OperatorBase { * * * transfered_inplace_vars is a output vector. */ - Scope* TryTransferData( - const Scope& scope, const OpKernelType& expected_kernel_key, - std::vector* transfered_inplace_vars) const; + Scope* PrepareData(const Scope& scope, + const OpKernelType& expected_kernel_key, + std::vector* transfered_inplace_vars, + RuntimeContext* ctx) const; void TransferInplaceVarsBack(const Scope& scope, const std::vector& inplace_vars, const Scope& exec_scope) const; + + void ChooseKernel(const RuntimeContext& ctx, const Scope& scope, + const platform::Place& place) const; + + protected: + mutable OpKernelConfigsMap kernel_configs_map_; + mutable std::unique_ptr kernel_type_; + mutable std::unique_ptr kernel_func_; + mutable std::unique_ptr runtime_ctx_; + mutable const Scope* pre_scope_ = nullptr; + mutable bool enable_cache_runtime_context = false; + mutable bool enable_cache_expected_kernel = false; + mutable bool all_kernels_must_compute_runtime_shape = false; }; extern bool OpSupportGPU(const std::string& op_type); diff --git a/paddle/fluid/framework/operator_kernel_configs.h b/paddle/fluid/framework/operator_kernel_configs.h new file mode 100644 index 0000000000000000000000000000000000000000..c520c222350ceeef246dae756a7157872ae087fa --- /dev/null +++ b/paddle/fluid/framework/operator_kernel_configs.h @@ -0,0 +1,118 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include + +namespace paddle { +namespace framework { + +// Not thread-safe. Should be owned per-kernel. +template +class AlgorithmsCache { + public: + AlgorithmsCache() : search_times_(0) { hash_.clear(); } + // Caches the best algorithm for a given + // combination of tensor dimensions & compute data type. + TAlgorithm GetAlgorithm( + const std::vector& dims1, const std::vector& dims2, + const std::vector& strides, const std::vector& paddings, + const std::vector& dilations, + int algorithmFlags, // can set for different data type + std::function gen_func); + + TAlgorithm GetAlgorithm(int64_t area, int search_times, int algorithmFlags, + std::function gen_func); + + private: + std::unordered_map hash_; + int search_times_; +}; + +template +TAlgorithm framework::AlgorithmsCache::GetAlgorithm( + const std::vector& dims1, const std::vector& dims2, + const std::vector& strides, const std::vector& paddings, + const std::vector& dilations, int algorithmFlags, + std::function gen_func) { + int64_t seed = 0; + // Hash all of the inputs, use to try and look up a previously + // discovered algorithm, or fall back to generating a new one. + std::hash hashFn; + // do hash like boost + // https://stackoverflow.com/questions/2590677/how-do-i-combine-hash-values-in-c0x + for (const auto num : dims1) { + seed ^= hashFn(num) + 0x9e3779b9 + (seed << 6) + (seed >> 2); + } + + for (const auto num : dims2) { + seed ^= hashFn(num) + 0x9e3779b9 + (seed << 6) + (seed >> 2) + 1; + } + + for (const auto num : strides) { + seed ^= hashFn(static_cast(num)) + 0x9e3779b9 + (seed << 6) + + (seed >> 2) + 2; + } + + for (const auto num : paddings) { + seed ^= hashFn(static_cast(num)) + 0x9e3779b9 + (seed << 6) + + (seed >> 2) + 3; + } + + for (const auto num : dilations) { + seed ^= hashFn(static_cast(num)) + 0x9e3779b9 + (seed << 6) + + (seed >> 2) + 4; + } + + seed ^= hashFn(static_cast(algorithmFlags)) + 0x9e3779b9 + + (seed << 6) + (seed >> 2) + 5; + + if (seed == 0) return gen_func(); + + if (hash_.find(seed) == hash_.end()) { + TAlgorithm value = gen_func(); + hash_[seed] = value; + } + return hash_[seed]; +} + +template +TAlgorithm AlgorithmsCache::GetAlgorithm( + int64_t area, int search_times, int algorithmFlags, + std::function gen_func) { + if (hash_.find(area) != hash_.end()) { + return hash_[area]; + } + if (search_times_ < search_times) { + auto algo = gen_func(); + hash_[area] = algo; + ++search_times_; + return algo; + } + TAlgorithm algo; + int64_t min = static_cast(INT_MAX); + for (const auto& m : hash_) { + if (m.first < min) { + min = m.first; + algo = m.second; + } + } + return algo; +} + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/operator_test.cc b/paddle/fluid/framework/operator_test.cc index ac9dd8245ad4e0e8842f219b23d3866b03fdaedb..fe4804ac253925c112cf7b508efc42c45868a2fa 100644 --- a/paddle/fluid/framework/operator_test.cc +++ b/paddle/fluid/framework/operator_test.cc @@ -50,6 +50,8 @@ class OpWithoutKernelCheckerMaker : public OpProtoAndCheckerMaker { AddInput("input", "input of test op"); AddOutput("output", "output of test op"); AddAttr("scale", "scale of cosine op"); + AddAttr("kernel_sub_type", "kernels with different implementations.") + .SetDefault(0); AddComment("This is test op"); } }; @@ -95,6 +97,8 @@ TEST(OperatorBase, all) { namespace paddle { namespace framework { +static int special_type_value = 1; + class OpKernelTestProtoAndCheckerMaker : public OpProtoAndCheckerMaker { public: void Make() { @@ -103,11 +107,14 @@ class OpKernelTestProtoAndCheckerMaker : public OpProtoAndCheckerMaker { AddAttr("scale", "scale of cosine op") .SetDefault(1.0) .GreaterThan(0.0); + AddAttr("kernel_sub_type", "kernels with different implementations.") + .SetDefault(0); AddComment("This is test op"); } }; static int cpu_kernel_run_num = 0; +static int cpu_kernel2_run_num = 0; class OpWithKernelTest : public OperatorWithKernel { public: @@ -117,7 +124,10 @@ class OpWithKernelTest : public OperatorWithKernel { void InferShape(framework::InferShapeContext* ctx) const override {} OpKernelType GetExpectedKernelType( const ExecutionContext& ctx) const override { - return OpKernelType(proto::VarType::FP32, ctx.GetPlace()); + int sub_type = ctx.Attr("kernel_sub_type"); + return OpKernelType(proto::VarType::FP32, ctx.GetPlace(), + framework::DataLayout::kAnyLayout, + framework::LibraryType::kPlain, sub_type); } }; @@ -132,6 +142,17 @@ class CPUKernelTest : public OpKernel { } }; +template +class CPUKernel2Test : public OpKernel { + public: + void Compute(const ExecutionContext& ctx) const { + std::cout << ctx.op().DebugString() << std::endl; + cpu_kernel2_run_num++; + ASSERT_EQ(ctx.op().Input("x"), "IN1"); + ASSERT_EQ(ctx.op().Output("y"), "OUT1"); + } +}; + class OpKernelTestMultiInputsProtoAndCheckerMaker : public OpProtoAndCheckerMaker { public: @@ -142,6 +163,8 @@ class OpKernelTestMultiInputsProtoAndCheckerMaker AddAttr("scale", "scale of cosine op") .SetDefault(1.0) .GreaterThan(0.0); + AddAttr("kernel_sub_type", "kernels with different implementations.") + .SetDefault(0); AddComment("This is test op"); } }; @@ -189,9 +212,15 @@ class CPUKernalMultiInputsTest : public OpKernel { REGISTER_OP_WITHOUT_GRADIENT( op_with_kernel, paddle::framework::OpWithKernelTest, paddle::framework::OpKernelTestProtoAndCheckerMaker); + REGISTER_OP_CPU_KERNEL(op_with_kernel, paddle::framework::CPUKernelTest); +REGISTER_OP_KERNEL_WITH_CUSTOM_TYPE( + op_with_kernel, CPU, paddle::platform::CPUPlace, MY_SPECIAL_NAME, + paddle::framework::special_type_value, + paddle::framework::CPUKernel2Test); + // test with single input TEST(OpKernel, all) { paddle::framework::InitDevices(true); @@ -211,7 +240,19 @@ TEST(OpKernel, all) { auto op = paddle::framework::OpRegistry::CreateOp(op_desc); ASSERT_EQ(paddle::framework::cpu_kernel_run_num, 0); op->Run(scope, cpu_place); + // kerne_sub_type = 0, hence cpu_kernel is called, cpu_kernel2 is not called. ASSERT_EQ(paddle::framework::cpu_kernel_run_num, 1); + ASSERT_EQ(paddle::framework::cpu_kernel2_run_num, 0); + + attr = op_desc.mutable_attrs()->Add(); + attr->set_name("kernel_sub_type"); + attr->set_type(paddle::framework::proto::AttrType::INT); + attr->set_i(1); + auto op2 = paddle::framework::OpRegistry::CreateOp(op_desc); + op2->Run(scope, cpu_place); + // kerne_sub_type = 1, hence cpu_kernel2 is called, cpu_kernel is not called. + ASSERT_EQ(paddle::framework::cpu_kernel_run_num, 1); + ASSERT_EQ(paddle::framework::cpu_kernel2_run_num, 1); } REGISTER_OP_WITHOUT_GRADIENT( @@ -247,3 +288,30 @@ TEST(OpKernel, multi_inputs) { auto op = paddle::framework::OpRegistry::CreateOp(op_desc); op->Run(scope, cpu_place); } + +TEST(VarNameTest, all) { + std::string var_name("X"); + std::string grad_var_name = paddle::framework::GradVarName(var_name); + ASSERT_EQ(grad_var_name, "X@GRAD"); + std::string original_var_name = + paddle::framework::GradOriginalVarName(grad_var_name); + ASSERT_EQ(original_var_name, "X"); + original_var_name = paddle::framework::GradOriginalVarName(original_var_name); + ASSERT_EQ(original_var_name, "X"); + + std::string var_name_2("XYZ"); + grad_var_name = paddle::framework::GradVarName(var_name_2); + ASSERT_EQ(grad_var_name, "XYZ@GRAD"); + original_var_name = paddle::framework::GradOriginalVarName(grad_var_name); + ASSERT_EQ(original_var_name, "XYZ"); + original_var_name = paddle::framework::GradOriginalVarName(original_var_name); + ASSERT_EQ(original_var_name, "XYZ"); + + std::string var_name_3(""); + grad_var_name = paddle::framework::GradVarName(var_name_3); + ASSERT_EQ(grad_var_name, "@GRAD"); + original_var_name = paddle::framework::GradOriginalVarName(grad_var_name); + ASSERT_EQ(original_var_name, ""); + original_var_name = paddle::framework::GradOriginalVarName(original_var_name); + ASSERT_EQ(original_var_name, ""); +} diff --git a/paddle/fluid/framework/parallel_executor.cc b/paddle/fluid/framework/parallel_executor.cc index e8adabd26540754d5b9206294eeeed79757220bf..c4bf2b7e8c017b22f917c9f9bd40e75b8cde08b2 100644 --- a/paddle/fluid/framework/parallel_executor.cc +++ b/paddle/fluid/framework/parallel_executor.cc @@ -13,69 +13,194 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/framework/parallel_executor.h" +#include +#include #include #include +#include #include -#include "paddle/fluid/framework/ir/graph_helper.h" - -#include "paddle/fluid/framework/ir/graph.h" - -#ifdef PADDLE_WITH_CUDA -#include "paddle/fluid/platform/nccl_helper.h" -#endif - +#include "paddle/fluid/framework/details/async_ssa_graph_executor.h" #include "paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.h" #include "paddle/fluid/framework/details/multi_devices_helper.h" +#include "paddle/fluid/framework/details/parallel_ssa_graph_executor.h" +#include "paddle/fluid/framework/details/reference_count_pass_helper.h" #include "paddle/fluid/framework/details/scope_buffered_ssa_graph_executor.h" #include "paddle/fluid/framework/details/threaded_ssa_graph_executor.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_helper.h" #include "paddle/fluid/platform/profiler.h" +#ifdef WITH_GPERFTOOLS +#include "gperftools/profiler.h" +#endif +DEFINE_string(pe_profile_fname, "", + "Profiler filename for PE, which generated by gperftools." + "Only valid when compiled `WITH_PRIFILER=ON`. Empty if disable."); +DEFINE_bool(enable_parallel_graph, false, + "Force disable parallel graph execution mode if set false."); + namespace paddle { namespace framework { +static std::once_flag gProfileOnce; +#ifdef WITH_GPERFTOOLS +static bool gProfileStarted = false; +#endif class ParallelExecutorPrivate { public: explicit ParallelExecutorPrivate(const std::vector &places) - : places_(places) {} + : places_(places) { + if (!FLAGS_pe_profile_fname.empty()) { + std::call_once(gProfileOnce, [] { +#ifdef WITH_GPERFTOOLS + ProfilerStart(FLAGS_pe_profile_fname.c_str()); + gProfileStarted = true; +#else + LOG(WARNING) << "Paddle is not compiled with gperftools. " + "FLAGS_pe_profile_fname will be ignored"; +#endif + }); + } + } + + ~ParallelExecutorPrivate() { + if (own_local_scope_) { + for (size_t i = 1; i < local_scopes_.size(); ++i) { + // Skip the first scope, since it is the global scope. + Scope *local_scope = local_scopes_[i]; + if (global_scope_->HasKid(local_scope)) { + global_scope_->DeleteScope(local_scope); + } + } + } + } + + ir::Graph *PrepareGCAndRefCnts(ir::Graph *graph, size_t max_memory_size); + + inline bool HasGarbageCollectors() const { return !gcs_.empty(); } + + void ResetRuntimeReferenceCount(const std::vector &fetch_tensors, + const std::string &fetched_var_name) { + for (size_t i = 0; i < runtime_ref_cnts_.size(); ++i) { + for (auto &pair : global_ref_cnts_[i]) { + runtime_ref_cnts_[i][pair.first] = pair.second; + } + + for (auto &fetch_name : fetch_tensors) { + runtime_ref_cnts_[i].erase(fetch_name); + } + runtime_ref_cnts_[i].erase(fetched_var_name); + } + } + BuildStrategy build_strategy_; std::vector places_; std::vector local_scopes_; - Scope *global_scope_; + Scope *global_scope_; // not owned std::unique_ptr executor_; -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) std::unique_ptr nccl_ctxs_; #endif bool own_local_scope_; bool use_cuda_; bool use_all_reduce_; + size_t nranks_; + + // global_ref_cnts_ is only initialized when ParallelExecutor constructs, and + // then keeps unchanged + // Before each iteration, runtime_ref_cnts_ is reset to global_ref_cnts_ + std::vector global_ref_cnts_; + std::vector runtime_ref_cnts_; + details::GarbageCollectorMap gcs_; }; +ir::Graph *ParallelExecutorPrivate::PrepareGCAndRefCnts( + ir::Graph *graph, size_t max_memory_size) { + for (size_t i = 0; i < places_.size(); ++i) { + auto &place = places_[i]; + if (gcs_.count(place) > 0) { + continue; + } + std::unique_ptr gc; +#ifdef PADDLE_WITH_CUDA + if (platform::is_gpu_place(place)) { + if (IsFastEagerDeletionModeEnabled()) { + gc.reset(new UnsafeFastGPUGarbageCollector( + boost::get(place), max_memory_size)); + } else { + gc.reset(new StreamGarbageCollector( + boost::get(place), max_memory_size)); + } + VLOG(10) << "Created " << i << "-th GarbageCollector at " << place; + } else { +#endif + if (platform::is_cpu_place(place)) { + gc.reset(new CPUGarbageCollector(boost::get(place), + max_memory_size)); + VLOG(10) << "Created GarbageCollector at " << place; + } else { + PADDLE_THROW("Unsupported place for garbage collection"); + } +#ifdef PADDLE_WITH_CUDA + } +#endif + + gcs_.emplace(place, std::move(gc)); + } + + if (!gcs_.empty()) { + std::vector last_live_ops_of_vars; + + auto ref_cnt_pass = + ir::PassRegistry::Instance().Get("reference_count_pass"); + ref_cnt_pass->SetNotOwned(details::kGlobalReferenceCount, + &global_ref_cnts_); + ref_cnt_pass->SetNotOwned(details::kLastLiveOpsOfVars, + &last_live_ops_of_vars); + graph = ref_cnt_pass->Apply(graph); + VLOG(10) << "ReferenceCountPass Applied"; + + auto eager_deletion_pass = + ir::PassRegistry::Instance().Get("eager_deletion_pass"); + eager_deletion_pass->SetNotOwned(details::kRuntimeReferenceCount, + &runtime_ref_cnts_); + eager_deletion_pass->SetNotOwned(details::kGarbageCollector, &gcs_); + eager_deletion_pass->SetNotOwned(details::kLastLiveOpsOfVars, + &last_live_ops_of_vars); + eager_deletion_pass->SetNotOwned(details::kAllPlaces, &places_); + graph = eager_deletion_pass->Apply(graph); + VLOG(10) << "EagerDeletionPass Applied"; + } + return graph; +} + std::vector &ParallelExecutor::GetLocalScopes() { return member_->local_scopes_; } -ParallelExecutor::ParallelExecutor( - const std::vector &places, - const std::unordered_set ¶ms, - const std::unordered_set &bcast_vars, - const ProgramDesc &main_program, const std::string &loss_var_name, - Scope *scope, const std::vector &local_scopes, - const ExecutionStrategy &exec_strategy, const BuildStrategy &build_strategy, - size_t num_trainers, size_t trainer_id) +ParallelExecutor::ParallelExecutor(const std::vector &places, + const std::vector &bcast_vars, + const std::string &loss_var_name, + Scope *scope, + const std::vector &local_scopes, + const ExecutionStrategy &exec_strategy, + const BuildStrategy &build_strategy, + ir::Graph *graph) : member_(new ParallelExecutorPrivate(places)) { member_->global_scope_ = scope; member_->use_cuda_ = exec_strategy.use_cuda_; + member_->build_strategy_ = build_strategy; member_->use_all_reduce_ = build_strategy.reduce_ == BuildStrategy::ReduceStrategy::kAllReduce; - + member_->nranks_ = build_strategy.num_trainers_ * places.size(); if (!member_->use_all_reduce_) { PADDLE_ENFORCE(places.size() > 1, "If you set build_strategy.reduce with 'Reduce'," "the number of places must be greater than 1."); } - // Step 1. Bcast the params to devs. + // Step 1. Bcast the bcast_vars to devs. // Create local scopes if (local_scopes.empty()) { member_->own_local_scope_ = true; @@ -91,94 +216,211 @@ ParallelExecutor::ParallelExecutor( } } + std::vector graphs; + if (build_strategy.async_mode_) { + PADDLE_ENFORCE(!member_->use_cuda_, + "gpu mode does not support async_mode_ now!"); + graphs.push_back(graph); + for (size_t i = 1; i < places.size(); ++i) { + auto *tmp_graph = new ir::Graph(graph->OriginProgram()); + async_graphs_.emplace_back(tmp_graph); + graphs.push_back(tmp_graph); + } + } + + // FIXME(Yancey1989): parallel graph mode get better performance + // in GPU allreduce distributed training. Need an elegant way to + // choice the execution strategy. + build_strategy.enable_parallel_graph_ = + EnableParallelGraphExecution(*graph, exec_strategy, build_strategy); + if (build_strategy.enable_parallel_graph_) + VLOG(0) << "The Executor would execute the graph by ParallelGraph " + "Execution which can get better performance," + << "you can force it off by env FLAGS_enable_parallel_graph=0"; + if (member_->use_cuda_) { // Bcast Parameters to all GPUs -#ifdef PADDLE_WITH_CUDA - auto *nccl_id_var = scope->FindVar(NCCL_ID_VARNAME); +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) ncclUniqueId *nccl_id = nullptr; + // gen_nccl_id operator can broadcast the ncclUniqueId for nccl2 collective + // distributed training + auto *nccl_id_var = scope->FindVar(NCCL_ID_VARNAME); if (nccl_id_var != nullptr) { nccl_id = nccl_id_var->GetMutable(); } + if (build_strategy.enable_parallel_graph_ && member_->nranks_ > 1UL) { + if (nccl_id == nullptr) { + local_nccl_id_.reset(new ncclUniqueId()); + platform::dynload::ncclGetUniqueId(local_nccl_id_.get()); + nccl_id = local_nccl_id_.get(); + } + } + member_->nccl_ctxs_.reset(new platform::NCCLContextMap( - member_->places_, nccl_id, num_trainers, trainer_id)); + member_->places_, nccl_id, build_strategy.num_trainers_, + build_strategy.trainer_id_)); + + // Initialize device context's nccl comm, will be used by normal + // Operators like sync_batch_norm, and collective ops. + // NOTE: more than one ParallelExecutor with same place, the nccl comm will + // be rewrite and there will be some problem. + // NOTE: NCCL group-calls and non-group-calls can not use the same + // NCCL communicator, so for ParallelGraph and Multi-Process mode, re-use + // same communicators. + std::unique_ptr dev_nccl_ctxs; + if (nccl_id == nullptr) { + dev_nccl_ctxs.reset(new platform::NCCLContextMap(member_->places_)); + } + for (size_t dev_id = 0; dev_id < member_->places_.size(); ++dev_id) { + platform::DeviceContextPool &pool = + platform::DeviceContextPool::Instance(); + auto *dev_ctx = static_cast( + pool.Get(member_->places_[dev_id])); + if (nccl_id != nullptr) { + auto &nccl_ctx = member_->nccl_ctxs_->at(member_->places_[dev_id]); + dev_ctx->set_nccl_comm(nccl_ctx.comm()); + } else { + auto &nccl_ctx = dev_nccl_ctxs->at(member_->places_[dev_id]); + dev_ctx->set_nccl_comm(nccl_ctx.comm()); + } + } #else PADDLE_THROW("Not compiled with CUDA"); #endif } + // broadcast parameters from the 0th device to others: + auto need_broadcast = [&]() -> bool { + if (build_strategy.num_trainers_ > 1) { + // 1. num_tariners would be grater than 1 for nccl distributed training. + return true; + } else if (member_->local_scopes_.size() != 1 && local_scopes.empty()) { + // 2. Only one trainer process, but ParallelExecutor hold multiple + // devices. + return true; + } + return false; + }; - if (member_->local_scopes_.size() != 1 && local_scopes.empty()) { - BCastParamsToDevices(bcast_vars); + if (need_broadcast()) { + BCastParamsToDevices(bcast_vars, build_strategy.trainer_id_); } // Startup Program has been run. All local scopes has correct parameters. - // Step 2. Create vars in each scope; - std::vector var_infos; - for (auto *var : main_program.Block(0).AllVars()) { - var_infos.emplace_back(); - var_infos.back().name_ = var->Name(); - var_infos.back().type_ = var->GetType(); - var_infos.back().persistable_ = var->Persistable(); + // Step 2. Convert main_program to SSA form and dependency graph. Also, insert + // ncclOp + std::vector async_graphs(places.size()); +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + if (build_strategy.async_mode_) { + VLOG(3) << "use local async mode"; + graph = build_strategy.Apply(graph, {member_->places_[0]}, loss_var_name, + {member_->local_scopes_[0]}, 1, + member_->use_cuda_, member_->nccl_ctxs_.get()); + for (size_t i = 1; i < member_->places_.size(); ++i) { + graphs[i] = + build_strategy.Apply(graphs[i], {member_->places_[i]}, loss_var_name, + {member_->local_scopes_[i]}, 1, + member_->use_cuda_, member_->nccl_ctxs_.get()); + async_graphs[i] = graphs[i]; + } + } else { + graph = build_strategy.Apply(graph, member_->places_, loss_var_name, + member_->local_scopes_, member_->nranks_, + member_->use_cuda_, member_->nccl_ctxs_.get()); + } +#else + if (build_strategy.async_mode_) { + VLOG(3) << "use local async mode"; + graph = build_strategy.Apply(graph, {member_->places_[0]}, loss_var_name, + {member_->local_scopes_[0]}, 1, + member_->use_cuda_); + for (int i = 1; i < member_->places_.size(); ++i) { + graphs[i] = build_strategy.Apply( + graphs[i], {member_->places_[i]}, loss_var_name, + {member_->local_scopes_[i]}, 1, member_->use_cuda_); + async_graphs[i] = graphs[i]; + } + } else { + graph = build_strategy.Apply(graph, member_->places_, loss_var_name, + member_->local_scopes_, member_->nranks_, + member_->use_cuda_); } -// Step 3. Convert main_program to SSA form and dependency graph. Also, insert -// ncclOp -#ifdef PADDLE_WITH_CUDA - std::unique_ptr graph = build_strategy.Apply( - main_program, member_->places_, loss_var_name, params, - member_->local_scopes_, member_->use_cuda_, member_->nccl_ctxs_.get()); - +#endif auto max_memory_size = GetEagerDeletionThreshold(); + VLOG(10) << "Eager Deletion Threshold " + << static_cast(max_memory_size) / (1 << 30); if (max_memory_size >= 0) { - for (auto &place : member_->places_) { - if (!platform::is_gpu_place(place)) continue; - auto gpu_place = boost::get(place); - if (gcs_[gpu_place.device] == nullptr) { - ref_cnts_[gpu_place.device].reset(new details::ReferenceCountMap()); - cur_ref_cnts_[gpu_place.device].reset( - new details::AtomicReferenceCountMap()); - gcs_[gpu_place.device].reset( - new StreamGarbageCollector(gpu_place, max_memory_size)); - } - } - if (!gcs_.empty()) { - auto ref_cnt_pass = - ir::PassRegistry::Instance().Get("reference_count_pass"); - ref_cnt_pass->SetNotOwned(details::kGlobalReferenceCount, &ref_cnts_); - ref_cnt_pass->SetNotOwned(details::kCurReferenceCount, &cur_ref_cnts_); - ref_cnt_pass->SetNotOwned(details::kGarbageCollector, &gcs_); - graph = ref_cnt_pass->Apply(std::move(graph)); - graph->SetNotOwned("garbage_collector", &gcs_); + graph = member_->PrepareGCAndRefCnts(graph, + static_cast(max_memory_size)); + } + + async_graphs[0] = graph; + + // Step 3. Create vars in each scope. Passes may also create new vars. + // skip control vars and empty vars + std::vector var_infos; + for (auto &node : graph->Nodes()) { + if (node->IsVar() && !node->IsCtrlVar() && node->Var()) { + var_infos.emplace_back(); + var_infos.back().name_ = node->Var()->Name(); + var_infos.back().type_ = node->Var()->GetType(); + var_infos.back().persistable_ = node->Var()->Persistable(); } } -#else - std::unique_ptr graph = - build_strategy.Apply(main_program, member_->places_, loss_var_name, - params, member_->local_scopes_, member_->use_cuda_); -#endif - if (VLOG_IS_ON(5)) { - // If the loss_var_name is given, the number of graph should be only one. - if (loss_var_name.size()) { - PADDLE_ENFORCE_EQ(ir::GraphNum(*graph), 1, - "The number of graph should be only one"); + // If the loss_var_name is given, the number of graph should be only one. + if (loss_var_name.size()) { + size_t graph_num = ir::GraphNum(*graph); + if (graph_num > 1) { + LOG(WARNING) + << "The number of graph should be only one, " + "but the current graph has " + << ir::GraphNum(*graph) + << " sub_graphs. If you want to see the nodes of the " + "sub_graphs, you should use 'FLAGS_print_sub_graph_dir' " + "to specify the output dir. NOTES: if you not do training, " + "please don't pass loss_var_name."; } } - if (exec_strategy.type_ == ExecutionStrategy::kDefault) { - member_->executor_.reset(new details::ThreadedSSAGraphExecutor( - exec_strategy, member_->local_scopes_, places, std::move(graph))); + if (build_strategy.async_mode_) { + VLOG(3) << "use AsyncSSAGraphExecutor"; + member_->executor_.reset(new details::AsyncSSAGraphExecutor( + exec_strategy, member_->local_scopes_, member_->places_, async_graphs)); + } else if (build_strategy.enable_parallel_graph_) { + VLOG(3) << "use ParallelSSAGraphExecutor"; +#ifdef PADDLE_WITH_CUDA + // TODO(Yancey1989): Remove passing in the main_program when + // allreduce_seq_pass doesn't need it as the attr. + member_->executor_.reset(new details::ParallelSSAGraphExecutor( + exec_strategy, member_->local_scopes_, member_->places_, graph)); +#else + PADDLE_THROW( + "Paddle should be compiled with CUDA for ParallelGraph Execution."); +#endif } else { - member_->executor_.reset(new details::FastThreadedSSAGraphExecutor( - exec_strategy, member_->local_scopes_, places, std::move(graph))); + if (exec_strategy.type_ == ExecutionStrategy::kDefault) { + VLOG(3) << "use ThreadedSSAGraphExecutor"; + member_->executor_.reset(new details::ThreadedSSAGraphExecutor( + exec_strategy, member_->local_scopes_, member_->places_, graph)); + } else { + VLOG(3) << "use FastThreadedSSAGraphExecutor"; + member_->executor_.reset(new details::FastThreadedSSAGraphExecutor( + exec_strategy, member_->local_scopes_, member_->places_, graph)); + } } - member_->executor_.reset(new details::ScopeBufferedSSAGraphExecutor( - exec_strategy, member_->local_scopes_, std::move(var_infos), - member_->places_, std::move(member_->executor_))); + VLOG(3) << "use ScopeBufferedSSAGraphExecutor"; + if (!build_strategy.async_mode_) { + member_->executor_.reset(new details::ScopeBufferedSSAGraphExecutor( + exec_strategy, member_->local_scopes_, std::move(var_infos), + member_->places_, std::move(member_->executor_))); + } } void ParallelExecutor::BCastParamsToDevices( - const std::unordered_set &vars) const { + const std::vector &vars, int trainer_id) const { + VLOG(3) << "BCastParamsToDevices"; // the initializing bcast, all vars would be bcast from device(0). for (auto &var : vars) { framework::Variable *main_var = member_->local_scopes_[0]->FindVar(var); @@ -187,17 +429,22 @@ void ParallelExecutor::BCastParamsToDevices( } auto &main_tensor = main_var->Get(); + if (!main_tensor.IsInitialized()) { + VLOG(3) << "one in var not inited, return!"; + continue; + } auto &dims = main_tensor.dims(); if (paddle::platform::is_gpu_place(main_tensor.place())) { -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) std::vector buffers; + buffers.reserve(member_->places_.size()); size_t numel = main_tensor.numel(); ncclDataType_t data_type = platform::ToNCCLDataType(main_tensor.type()); for (size_t i = 0; i < member_->places_.size(); ++i) { auto place = member_->places_[i]; void *buffer; - if (i == 0) { + if (i == 0 && trainer_id == 0) { buffer = const_cast(main_tensor.data()); } else { auto local_scope = member_->local_scopes_[i]; @@ -224,20 +471,26 @@ void ParallelExecutor::BCastParamsToDevices( #endif } else { platform::CPUPlace cpu; - for (size_t i = 0; i < member_->places_.size(); ++i) { - if (i == 0) continue; - + for (size_t i = 1; i < member_->places_.size(); ++i) { auto local_scope = member_->local_scopes_[i]; auto *t = local_scope->Var(var)->GetMutable(); - // FIXME(zcd): LR_DECAY_COUNTER should not be shared. This is a hot fix. - if (member_->use_all_reduce_ || member_->use_cuda_ || - var == "@LR_DECAY_COUNTER@") { + auto copy_memory = [&] { t->Resize(dims); t->mutable_data(cpu, main_tensor.type()); paddle::framework::TensorCopy(main_tensor, cpu, t); + }; + + auto share_memory = [&] { t->ShareDataWith(main_tensor); }; + + // FIXME(zcd): LR_DECAY_COUNTER should not be shared. This is a hot fix. + if (member_->build_strategy_.async_mode_) { + share_memory(); + } else if (member_->use_all_reduce_ || member_->use_cuda_ || + var == "@LR_DECAY_COUNTER@") { + copy_memory(); } else { - t->ShareDataWith(main_tensor); + share_memory(); } } } @@ -246,19 +499,16 @@ void ParallelExecutor::BCastParamsToDevices( void ParallelExecutor::Run(const std::vector &fetch_tensors, const std::string &fetched_var_name) { - platform::RecordBlock b(0); -#ifdef PADDLE_WITH_CUDA - if (!gcs_.empty()) { - ResetReferenceCount(); - for (auto &pair : cur_ref_cnts_) { - auto &name_map = *(pair.second); - for (auto &fetch_name : fetch_tensors) { - name_map.erase(fetch_name); - } - name_map.erase(fetched_var_name); - } +#ifdef WITH_GPERFTOOLS + if (gProfileStarted) { + ProfilerFlush(); } #endif + + platform::RecordBlock b(0); + if (member_->HasGarbageCollectors()) { + member_->ResetRuntimeReferenceCount(fetch_tensors, fetched_var_name); + } auto fetch_data = member_->executor_->Run(fetch_tensors); *member_->global_scope_->Var(fetched_var_name)->GetMutable() = fetch_data; @@ -299,22 +549,45 @@ void ParallelExecutor::FeedAndSplitTensorIntoLocalScopes( } ParallelExecutor::~ParallelExecutor() { - if (member_->own_local_scope_) { - for (size_t i = 1; i < member_->local_scopes_.size(); ++i) { - Scope *local_scope = member_->local_scopes_[i]; - if (member_->global_scope_->HasKid(local_scope)) { - member_->global_scope_->DeleteScope(local_scope); + for (auto &p : member_->places_) { + platform::DeviceContextPool::Instance().Get(p)->Wait(); + } + delete member_; +} + +bool ParallelExecutor::EnableParallelGraphExecution( + const ir::Graph &graph, const ExecutionStrategy &exec_strategy, + const BuildStrategy &build_strategy) const { + if (!FLAGS_enable_parallel_graph) return false; + + bool enable_parallel_graph = true; + + for (ir::Node *node : graph.Nodes()) { + if (node->IsVar() && node->Var()) { + // TODO(Yancey1989): support sparse update in ParallelGraph mode. + if (node->Var()->GetType() == proto::VarType::SELECTED_ROWS) { + enable_parallel_graph = false; + break; + } + } else if (node->IsOp() && node->Op()) { + // TODO(Yancey1989): support pserver mode + if (node->Op()->Type() == "send" || node->Op()->Type() == "recv") { + enable_parallel_graph = false; + break; } } } - // member_ must be destructed before gcs_ since the destructor of - // ReferenceCountOpHandle use raw pointers of gcs_ inside. - member_.reset(); + if (!member_->use_all_reduce_ || !member_->use_cuda_) + + if (build_strategy.enable_sequential_execution_ || + exec_strategy.type_ == ExecutionStrategy::ExecutorType::kExperimental) + enable_parallel_graph = false; + return enable_parallel_graph; } } // namespace framework } // namespace paddle -#ifdef PADDLE_WITH_CUDA + USE_PASS(reference_count_pass); -#endif +USE_PASS(eager_deletion_pass); diff --git a/paddle/fluid/framework/parallel_executor.h b/paddle/fluid/framework/parallel_executor.h index ef09b98b2aa91a9d729b94d15dbb676dde4092b6..5756627fbd8583428014e24e5aa3f626c908ce1c 100644 --- a/paddle/fluid/framework/parallel_executor.h +++ b/paddle/fluid/framework/parallel_executor.h @@ -14,10 +14,11 @@ limitations under the License. */ #pragma once -#include +#include #include #include #include +#include #include #include "paddle/fluid/framework/details/build_strategy.h" @@ -29,8 +30,8 @@ limitations under the License. */ #include "paddle/fluid/framework/tensor.h" #include "paddle/fluid/platform/device_context.h" -#ifdef PADDLE_WITH_CUDA -#include "paddle/fluid/framework/details/reference_count_pass.h" +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) +#include "paddle/fluid/platform/nccl_helper.h" #endif namespace paddle { @@ -46,14 +47,12 @@ class ParallelExecutor { public: explicit ParallelExecutor(const std::vector &places, - const std::unordered_set ¶ms, - const std::unordered_set &bcast_vars, - const ProgramDesc &main_program, + const std::vector &bcast_vars, const std::string &loss_var_name, Scope *scope, const std::vector &local_scopes, const ExecutionStrategy &exec_strategy, const BuildStrategy &build_strategy, - size_t num_trainers = 1, size_t trainer_id = 0); + ir::Graph *graph); ~ParallelExecutor(); @@ -73,25 +72,18 @@ class ParallelExecutor { const std::string &fetched_var_name); private: - void BCastParamsToDevices(const std::unordered_set &vars) const; - - std::unique_ptr member_; - -#ifdef PADDLE_WITH_CUDA - // ref_cnts_ is only initialized when ParallelExecutor constructs, and then - // keeps unchanged - // Before each iteration, cur_ref_cnts_ is reset to ref_cnts_ - details::DeviceReferenceCountMap ref_cnts_; - details::AtomicDeviceReferenceCountMap cur_ref_cnts_; - details::DeviceGarbageCollectorMap gcs_; - - void ResetReferenceCount() { - for (auto &pair1 : ref_cnts_) { - for (auto &pair2 : *(pair1.second)) { - (*(cur_ref_cnts_[pair1.first]))[pair2.first] = pair2.second; - } - } - } + // broadcast the parameters from the 0th device. + // trainer_id the trainer index in nccl distributed training. + void BCastParamsToDevices(const std::vector &vars, + int trainer_id = 0) const; + bool EnableParallelGraphExecution(const ir::Graph &graph, + const ExecutionStrategy &exec_strategy, + const BuildStrategy &build_strategy) const; + + ParallelExecutorPrivate *member_; + std::vector> async_graphs_; +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + std::unique_ptr local_nccl_id_; #endif }; diff --git a/paddle/fluid/framework/program_desc_test.cc b/paddle/fluid/framework/program_desc_test.cc index 7e689a37da8a16bd9b1ac6650b9322d2eb5a2c85..48bde2785e6a51afc0d2905ac31fe20a3c3019b6 100644 --- a/paddle/fluid/framework/program_desc_test.cc +++ b/paddle/fluid/framework/program_desc_test.cc @@ -103,7 +103,7 @@ TEST(ProgramDesc, copy_ctor) { ASSERT_EQ(1, op->GetBlockAttrId("sub_block")); found_sub_block = true; - ASSERT_EQ(2, op->GetBlocksAttrIds("sub_blocks").size()); + ASSERT_EQ(2UL, op->GetBlocksAttrIds("sub_blocks").size()); found_sub_blocks = true; } } diff --git a/paddle/fluid/framework/pull_dense_worker.cc b/paddle/fluid/framework/pull_dense_worker.cc new file mode 100644 index 0000000000000000000000000000000000000000..c48c7872ec23f6cfaac650b4940752ac9b8fd36c --- /dev/null +++ b/paddle/fluid/framework/pull_dense_worker.cc @@ -0,0 +1,136 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ +#include +#include "paddle/fluid/framework/device_worker.h" + +namespace paddle { +namespace framework { + +std::shared_ptr PullDenseWorker::s_instance_ = NULL; +std::mutex PullDenseWorker::mutex_for_version_; +std::map PullDenseWorker::last_versions_; +std::map PullDenseWorker::current_version_; +std::map> PullDenseWorker::training_versions_; +std::map> + PullDenseWorker::dense_value_names_; + +void PullDenseWorker::Initialize(const TrainerDesc& param) { + running_ = false; + param_ = param.pull_dense_param(); + dwp_param_ = param.downpour_param(); + threshold_ = param_.threshold(); + thread_num_ = param_.device_num(); + sleep_time_ms_ = param_.sleep_time_ms(); + for (size_t i = 0; + i < dwp_param_.program_config(0).pull_dense_table_id_size(); ++i) { + uint64_t tid = static_cast( + dwp_param_.program_config(0).pull_dense_table_id(i)); + TableParameter table; + for (auto i : param_.dense_table()) { + if (i.table_id() == tid) { + table = i; + break; + } + } + // setup dense variables for each table + int var_num = table.dense_value_name_size(); + dense_value_names_[tid].resize(var_num); + for (int j = 0; j < var_num; ++j) { + dense_value_names_[tid][j] = table.dense_value_name(j); + } + // setup training version for each table + training_versions_[tid].resize(thread_num_, 0); + last_versions_[tid] = 0; + current_version_[tid] = 0; + } + fleet_ptr_ = FleetWrapper::GetInstance(); +} + +void PullDenseWorker::Wait(std::vector<::std::future>* status_vec) { + for (auto& t : *status_vec) { + t.wait(); + auto status = t.get(); + if (status != 0) { + LOG(WARNING) << "Current Pull Dense Thread Failed Times" + << ++pull_dense_fail_times_; + } + } + + int MAX_FAIL_NUM = 20; + if (pull_dense_fail_times_ > MAX_FAIL_NUM) { + LOG(FATAL) << "Pull Dense Failed Times More Than " << MAX_FAIL_NUM + << " Times"; + exit(-1); + } + status_vec->resize(0); +} + +void PullDenseWorker::Stop() { + if (running_) { + running_ = false; + t_.join(); + } +} + +int PullDenseWorker::Start() { + running_ = true; + t_ = std::thread(&PullDenseWorker::Run, this); + return 0; +} + +void PullDenseWorker::Run() { + while (running_) { + pull_dense_status_.resize(0); + for (size_t i = 0; + i < dwp_param_.program_config(0).pull_dense_table_id_size(); ++i) { + uint64_t tid = static_cast( + dwp_param_.program_config(0).pull_dense_table_id(i)); + if (CheckUpdateParam(tid)) { + fleet_ptr_->PullDenseVarsAsync( + *root_scope_, tid, dense_value_names_[tid], &pull_dense_status_); + ResetThreadVersion(tid); + } + } + if (pull_dense_status_.size() != 0) { + Wait(&pull_dense_status_); + } +#ifndef _WIN32 + usleep(sleep_time_ms_ * 1000); +#endif + } +} + +void PullDenseWorker::IncreaseThreadVersion(int thread_id, uint64_t table_id) { + std::lock_guard lock(mutex_for_version_); + training_versions_[table_id][thread_id]++; +} + +bool PullDenseWorker::CheckUpdateParam(uint64_t table_id) { + std::lock_guard lock(mutex_for_version_); + auto& version = training_versions_[table_id]; + current_version_[table_id] = + *(std::min_element(version.begin(), version.end())); + if (current_version_[table_id] - last_versions_[table_id] < threshold_) { + return false; + } + return true; +} + +void PullDenseWorker::ResetThreadVersion(uint64_t table_id) { + std::lock_guard lock(mutex_for_version_); + last_versions_[table_id] = current_version_[table_id]; +} + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/python_headers.h b/paddle/fluid/framework/python_headers.h new file mode 100644 index 0000000000000000000000000000000000000000..8f9e3fad57f7bb87e78e334e741be23751417a78 --- /dev/null +++ b/paddle/fluid/framework/python_headers.h @@ -0,0 +1,34 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +// workaround for Python 2 issue: https://bugs.python.org/issue17120 +#pragma push_macro("_XOPEN_SOURCE") +#pragma push_macro("_POSIX_C_SOURCE") +#undef _XOPEN_SOURCE +#undef _POSIX_C_SOURCE + +#include "pybind11/pybind11.h" + +#pragma pop_macro("_XOPEN_SOURCE") +#pragma pop_macro("_POSIX_C_SOURCE") + +#if !defined(PYBIND11_HIDDEN) +#ifdef _WIN32 +#define PYBIND11_HIDDEN __declspec(dllexport) +#else +#define PYBIND11_HIDDEN __attribute__((visibility("hidden"))) +#endif +#endif diff --git a/paddle/fluid/framework/reader.cc b/paddle/fluid/framework/reader.cc index 40eafda9bf294f7e8ddd067e9014447f4de1cc0e..d3513fb7dbed0413e61796d8a843c38fbbcf93dc 100644 --- a/paddle/fluid/framework/reader.cc +++ b/paddle/fluid/framework/reader.cc @@ -69,6 +69,9 @@ void ReaderBase::Start() { ReaderBase::~ReaderBase() {} -DecoratedReader::~DecoratedReader() { reader_->Shutdown(); } +DecoratedReader::~DecoratedReader() { + VLOG(1) << "~DecoratedReader"; + reader_->Shutdown(); +} } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/reader.h b/paddle/fluid/framework/reader.h index 82562bf883d88787858912f7039cf8fef003eccf..4b400e72a4cacd3848b57ac3ba2b3ef5f9a9a9c4 100644 --- a/paddle/fluid/framework/reader.h +++ b/paddle/fluid/framework/reader.h @@ -16,6 +16,7 @@ #include #include +#include #include #include "paddle/fluid/framework/ddim.h" @@ -77,7 +78,10 @@ class DecoratedReader : public ReaderBase, ~DecoratedReader(); protected: - void ShutdownImpl() override { reader_->Shutdown(); } + void ShutdownImpl() override { + VLOG(1) << "ShutdownImpl"; + reader_->Shutdown(); + } void StartImpl() override { reader_->Start(); } @@ -98,6 +102,8 @@ class ReaderHolder { reader_ = reader_base; } + ~ReaderHolder() { VLOG(1) << "~ReaderHolder"; } + const std::shared_ptr& Get() const { return reader_; } void ReadNext(std::vector* out) { @@ -106,6 +112,7 @@ class ReaderHolder { } void ResetAll() { + VLOG(1) << "ResetAll"; auto end_readers = reader_->GetEndPoints(); for (auto* reader : end_readers) { reader->Shutdown(); @@ -116,11 +123,13 @@ class ReaderHolder { } void Shutdown() { + VLOG(1) << "Shutdown"; PADDLE_ENFORCE_NOT_NULL(reader_); reader_->Shutdown(); } void Start() { + VLOG(1) << "start"; PADDLE_ENFORCE_NOT_NULL(reader_); reader_->Start(); } diff --git a/paddle/fluid/framework/reader_test.cc b/paddle/fluid/framework/reader_test.cc index 50aca4b5a4ba7a93a1584a03cc16fe5d712a32b5..d812417a38200bcfdbdeac78800190647510a144 100644 --- a/paddle/fluid/framework/reader_test.cc +++ b/paddle/fluid/framework/reader_test.cc @@ -40,7 +40,7 @@ TEST(READER, decorate_chain) { auto endpoints = root->GetEndPoints(); ASSERT_EQ(endpoints.size(), 2U); ASSERT_NE(endpoints.count(end_point1.get()), 0UL); - ASSERT_NE(endpoints.count(end_point2.get()), 0); + ASSERT_NE(endpoints.count(end_point2.get()), 0UL); } { diff --git a/paddle/fluid/framework/rw_lock.h b/paddle/fluid/framework/rw_lock.h index dbf00f3a79f7d1dcf97b346fccfdb68f119d4aa3..f8aa87519a2fc1a14765887e95c96883d7b4589f 100644 --- a/paddle/fluid/framework/rw_lock.h +++ b/paddle/fluid/framework/rw_lock.h @@ -16,7 +16,9 @@ limitations under the License. */ #if !defined(_WIN32) #include -#endif // !_WIN32 +#else +#include // NOLINT +#endif // !_WIN32 #include "paddle/fluid/platform/enforce.h" @@ -29,17 +31,17 @@ struct RWLock { ~RWLock() { pthread_rwlock_destroy(&lock_); } - void RDLock() { + inline void RDLock() { PADDLE_ENFORCE_EQ(pthread_rwlock_rdlock(&lock_), 0, "acquire read lock failed"); } - void WRLock() { + inline void WRLock() { PADDLE_ENFORCE_EQ(pthread_rwlock_wrlock(&lock_), 0, "acquire write lock failed"); } - void UNLock() { + inline void UNLock() { PADDLE_ENFORCE_EQ(pthread_rwlock_unlock(&lock_), 0, "unlock failed"); } @@ -51,81 +53,46 @@ struct RWLock { // https://stackoverflow.com/questions/7125250/making-pthread-rwlock-wrlock-recursive // In windows, rw_lock seems like a hack. Use empty object and do nothing. struct RWLock { - void RDLock() {} - void WRLock() {} - void UNLock() {} + // FIXME(minqiyang): use mutex here to do fake lock + inline void RDLock() { mutex_.lock(); } + + inline void WRLock() { mutex_.lock(); } + + inline void UNLock() { mutex_.unlock(); } + + private: + std::mutex mutex_; }; #endif -class RWLockGuard { +class AutoWRLock { public: - enum Status { kUnLock, kWRLock, kRDLock }; - - RWLockGuard(RWLock* rw_lock, Status init_status) - : lock_(rw_lock), status_(Status::kUnLock) { - switch (init_status) { - case Status::kRDLock: { - RDLock(); - break; - } - case Status::kWRLock: { - WRLock(); - break; - } - case Status::kUnLock: { - break; - } - } - } + explicit AutoWRLock(RWLock* rw_lock) : lock_(rw_lock) { Lock(); } - void WRLock() { - switch (status_) { - case Status::kUnLock: { - lock_->WRLock(); - status_ = Status::kWRLock; - break; - } - case Status::kWRLock: { - break; - } - case Status::kRDLock: { - PADDLE_THROW( - "Please unlock read lock first before invoking write lock."); - break; - } - } - } + ~AutoWRLock() { UnLock(); } - void RDLock() { - switch (status_) { - case Status::kUnLock: { - lock_->RDLock(); - status_ = Status::kRDLock; - break; - } - case Status::kRDLock: { - break; - } - case Status::kWRLock: { - PADDLE_THROW( - "Please unlock write lock first before invoking read lock."); - break; - } - } - } + private: + inline void Lock() { lock_->WRLock(); } - void UnLock() { - if (status_ != Status::kUnLock) { - lock_->UNLock(); - status_ = Status::kUnLock; - } - } + inline void UnLock() { lock_->UNLock(); } + + private: + RWLock* lock_; +}; + +class AutoRDLock { + public: + explicit AutoRDLock(RWLock* rw_lock) : lock_(rw_lock) { Lock(); } + + ~AutoRDLock() { UnLock(); } + + private: + inline void Lock() { lock_->RDLock(); } - ~RWLockGuard() { UnLock(); } + inline void UnLock() { lock_->UNLock(); } private: RWLock* lock_; - Status status_; }; } // namespace framework diff --git a/paddle/fluid/framework/scope.cc b/paddle/fluid/framework/scope.cc index a4abd1b1283f08fb8431fbeea0cea17c8439fdd7..49e22a5ad3093c2d61d0ef513974c9938e287729 100644 --- a/paddle/fluid/framework/scope.cc +++ b/paddle/fluid/framework/scope.cc @@ -15,53 +15,63 @@ limitations under the License. */ #include "paddle/fluid/framework/scope.h" #include // for unique_ptr +#include #include +#include #include "glog/logging.h" #include "paddle/fluid/framework/threadpool.h" #include "paddle/fluid/string/printf.h" -DEFINE_bool(benchmark, false, - "Doing memory benchmark. It will make deleting scope synchronized, " - "and add some memory usage logs." - "Default cuda is asynchronous device, set to True will" - "force op run in synchronous mode."); +DECLARE_bool(benchmark); DEFINE_bool( eager_delete_scope, true, "Delete local scope eagerly. It will reduce GPU memory usage but " "slow down the destruction of variables.(around 1% performance harm)"); -DEFINE_double( - eager_delete_tensor_gb, -1.0, - "Memory size threshold (GB) when the garbage collector clear tensors." - "Disabled when this value is less than 0"); +// When in inference scenario, the scopes will not be written by two threads in +// a mean time, but a scope may be read by multiple threads concurrently, and +// the mutex will cause serious performance issue. +// So the mutex is disabled when `ON_INFER`. +#ifdef PADDLE_ON_INFERENCE +#define SCOPE_KIDS_READER_LOCK +#define SCOPE_KIDS_WRITER_LOCK +#define SCOPE_VARS_READER_LOCK +#define SCOPE_VARS_WRITER_LOCK +#else +#define SCOPE_KIDS_READER_LOCK AutoRDLock auto_lock(&kids_lock_); +#define SCOPE_KIDS_WRITER_LOCK AutoWRLock auto_lock(&kids_lock_); +#define SCOPE_VARS_READER_LOCK AutoRDLock auto_lock(&vars_lock_); +#define SCOPE_VARS_WRITER_LOCK AutoWRLock auto_lock(&vars_lock_); +#endif namespace paddle { namespace framework { -int64_t GetEagerDeletionThreshold() { - return FLAGS_eager_delete_tensor_gb < 0 - ? -1 - : static_cast(FLAGS_eager_delete_tensor_gb * - (static_cast(1) << 30)); -} - Scope::~Scope() { DropKids(); } Scope& Scope::NewScope() const { - std::lock_guard lock(mutex_); - kids_.push_back(new Scope(this)); - return *kids_.back(); + Scope* child = new Scope(this); + { + SCOPE_KIDS_WRITER_LOCK + kids_.push_back(child); + } + return *child; +} + +std::unique_ptr Scope::NewTmpScope() const { + return std::unique_ptr(new Scope(this)); } Variable* Scope::Var(const std::string& name) { - std::lock_guard lock(mutex_); + SCOPE_VARS_WRITER_LOCK return VarInternal(name); } Variable* Scope::Var(std::string* name) { - std::lock_guard lock(mutex_); - auto new_name = string::Sprintf("%p.%d", this, vars_.size()); + SCOPE_VARS_WRITER_LOCK + auto new_name = std::to_string(reinterpret_cast(this)) + "." + + std::to_string(vars_.size()); if (name != nullptr) { *name = new_name; } @@ -69,46 +79,49 @@ Variable* Scope::Var(std::string* name) { } Variable* Scope::FindVar(const std::string& name) const { - std::lock_guard lock(mutex_); + SCOPE_VARS_READER_LOCK return FindVarInternal(name); } Variable* Scope::FindLocalVar(const std::string& name) const { - std::lock_guard lock(mutex_); + SCOPE_VARS_READER_LOCK return FindVarLocally(name); } const Scope* Scope::FindScope(const Variable* var) const { - std::lock_guard lock(mutex_); + SCOPE_VARS_READER_LOCK return FindScopeInternal(var); } void Scope::DropKids() { - std::lock_guard lock(mutex_); + SCOPE_KIDS_WRITER_LOCK for (Scope* s : kids_) delete s; kids_.clear(); } bool Scope::HasKid(const Scope* scope) const { - std::lock_guard lock(mutex_); + SCOPE_KIDS_READER_LOCK auto it = std::find(this->kids_.begin(), this->kids_.end(), scope); return it != this->kids_.end(); } std::vector Scope::LocalVarNames() const { - std::lock_guard lock(mutex_); std::vector known_vars; - known_vars.reserve(this->vars_.size()); - for (auto& p : vars_) { - known_vars.emplace_back(p.first); + { + SCOPE_VARS_READER_LOCK + known_vars.reserve(this->vars_.size()); + for (auto& p : vars_) { + known_vars.emplace_back(p.first); + } } return known_vars; } void Scope::DeleteScope(Scope* scope) const { - std::lock_guard lock(mutex_); + SCOPE_KIDS_WRITER_LOCK auto it = std::find(this->kids_.begin(), this->kids_.end(), scope); - PADDLE_ENFORCE(it != this->kids_.end(), "Cannot find %p as kid scope", scope); + PADDLE_ENFORCE(it != this->kids_.end(), "%p Cannot find %p as kid scope", + this, scope); this->kids_.erase(it); // When making memory benchmark on Fluid, we have to delete scope sync. if (FLAGS_benchmark || FLAGS_eager_delete_scope) { @@ -119,8 +132,8 @@ void Scope::DeleteScope(Scope* scope) const { } void Scope::EraseVars(const std::vector& var_names) { - std::lock_guard lock(mutex_); std::set var_set(var_names.begin(), var_names.end()); + SCOPE_VARS_WRITER_LOCK for (auto it = vars_.begin(); it != vars_.end();) { if (var_set.find(it->first) != var_set.end()) { it = vars_.erase(it); @@ -132,12 +145,12 @@ void Scope::EraseVars(const std::vector& var_names) { void Scope::Rename(const std::string& origin_name, const std::string& new_name) const { - std::lock_guard lock(mutex_); + SCOPE_VARS_WRITER_LOCK RenameInternal(origin_name, new_name); } std::string Scope::Rename(const std::string& origin_name) const { - std::lock_guard lock(mutex_); + SCOPE_VARS_WRITER_LOCK auto new_name = string::Sprintf("%p.%d", this, vars_.size()); RenameInternal(origin_name, new_name); return new_name; @@ -146,11 +159,9 @@ std::string Scope::Rename(const std::string& origin_name) const { Variable* Scope::VarInternal(const std::string& name) { auto* v = FindVarLocally(name); if (v != nullptr) return v; - v = new Variable(); - vars_[name].reset(v); + vars_.emplace(name, std::unique_ptr(v)); VLOG(3) << "Create variable " << name; - v->name_ = &(vars_.find(name)->first); return v; } @@ -189,5 +200,46 @@ Variable* Scope::FindVarLocally(const std::string& name) const { return nullptr; } +std::string GenScopeTreeDebugInfo(Scope* root) { + std::stringstream os; + + if (!root) return ""; + + // level traversal + std::queue queue; + queue.push(root); + + std::vector scopes; + + while (!queue.empty()) { + auto* end = queue.back(); + Scope* q = nullptr; + while (q != end) { + q = queue.front(); + queue.pop(); + os << q << " "; + scopes.push_back(q); + + for (auto* c : q->kids()) { + queue.push(c); + } + } + // end of a level + os << "\n------------------------------------------\n"; + } + + os << "\nDetails:\n\n"; + + for (Scope* q : scopes) { + os << "====\n"; + os << q << ":\n"; + for (auto& var : q->LocalVarNames()) { + os << " - " << var << "\n"; + } + } + + return os.str(); +} + } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/scope.h b/paddle/fluid/framework/scope.h index 14f9f36812d690fc4a7440f2e7e6a85e9993a535..5f3d106e091ace05cfbdbbde2d79d48fe01b4a38 100644 --- a/paddle/fluid/framework/scope.h +++ b/paddle/fluid/framework/scope.h @@ -14,20 +14,24 @@ limitations under the License. */ #pragma once +extern "C" { +#include +} + #include -#include // NOLINT +#include #include #include +#include #include +#include "paddle/fluid/framework/rw_lock.h" #include "paddle/fluid/framework/variable.h" #include "paddle/fluid/platform/macros.h" namespace paddle { namespace framework { -int64_t GetEagerDeletionThreshold(); - class Scope; /** @@ -48,6 +52,10 @@ class Scope { /// Mark it to const because that new kid scope cannot change parent scope. Scope& NewScope() const; + /// Create a sub-scope for current scope but do not record it in the kids to + /// avoid performance problems. + std::unique_ptr NewTmpScope() const; + /// Create a variable with given name if it doesn't exist. /// Caller doesn't own the returned Variable. Variable* Var(const std::string& name); @@ -81,6 +89,8 @@ class Scope { /// Find if a scope exists in the kid scopes bool HasKid(const Scope* scope) const; + const std::list& kids() const { return kids_; } + // enumerate all the variables current contains. std::vector LocalVarNames() const; @@ -92,7 +102,14 @@ class Scope { std::string Rename(const std::string& origin_name) const; protected: - mutable std::unordered_map> vars_; + struct KeyHasher { + std::size_t operator()(const std::string& key) const { + return XXH32(key.c_str(), key.size(), 1); + } + }; + + mutable std::unordered_map, KeyHasher> + vars_; private: // Call Scope::NewScope for a sub-scope. @@ -116,12 +133,18 @@ class Scope { // Scope in `kids_` are owned by this class. mutable std::list kids_; - Scope const* parent_{nullptr}; + const Scope* parent_{nullptr}; DISABLE_COPY_AND_ASSIGN(Scope); private: - mutable std::mutex mutex_; + mutable RWLock kids_lock_; + mutable RWLock vars_lock_; }; + +// Generate some debug string about the inherience structure of scope, quite +// naive. +std::string GenScopeTreeDebugInfo(Scope*); + } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/scope_pool.cc b/paddle/fluid/framework/scope_pool.cc new file mode 100644 index 0000000000000000000000000000000000000000..5cb241a7a341d793d8450f0c9cde3929acef8965 --- /dev/null +++ b/paddle/fluid/framework/scope_pool.cc @@ -0,0 +1,54 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/scope_pool.h" +#include "paddle/fluid/framework/threadpool.h" + +namespace paddle { +namespace framework { + +ScopePool &ScopePool::Instance() { // NOLINT + static ScopePool pool; + return pool; +} + +void ScopePool::DeleteScope(Scope *scope) { delete scope; } + +void ScopePool::Insert(std::unique_ptr &&s) { + std::lock_guard guard(mtx_); + scopes_.insert(s.release()); +} + +void ScopePool::Remove(Scope *s) { + size_t has_scope; + { + std::lock_guard guard(mtx_); + has_scope = scopes_.erase(s); + } + PADDLE_ENFORCE(has_scope > 0, "Delete non-existing global scope"); + DeleteScope(s); +} + +ScopePool::~ScopePool() { Clear(); } + +void ScopePool::Clear() { + std::lock_guard guard(mtx_); + for (auto *s : scopes_) { + DeleteScope(s); + } + scopes_.clear(); +} + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/scope_pool.h b/paddle/fluid/framework/scope_pool.h new file mode 100644 index 0000000000000000000000000000000000000000..a8b468699abe148d44a395cf888158aefab4380b --- /dev/null +++ b/paddle/fluid/framework/scope_pool.h @@ -0,0 +1,46 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include // NOLINT +#include +#include "paddle/fluid/framework/scope.h" + +namespace paddle { +namespace framework { + +class ScopePool { + public: + static ScopePool &Instance(); // NOLINT + + void Insert(std::unique_ptr &&s); + + void Remove(Scope *s); + + void Clear(); + + ~ScopePool(); + + private: + ScopePool() = default; + + static void DeleteScope(Scope *scope); + + std::unordered_set scopes_; + std::mutex mtx_; +}; + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/selected_rows.cc b/paddle/fluid/framework/selected_rows.cc index 8c290bb095d554a973e66a3a19606a06759fd668..54a818250b45e593de4110f56e42a04a9ea65e00 100644 --- a/paddle/fluid/framework/selected_rows.cc +++ b/paddle/fluid/framework/selected_rows.cc @@ -63,6 +63,26 @@ struct TensorCopyVisitor { int64_t size_; }; +struct TensorFillVisitor { + TensorFillVisitor(framework::Tensor* dst, int64_t dst_offset, int64_t size, + float value) + : dst_(dst), dst_offset_(dst_offset), size_(size) {} + + template + void apply() const { + // TODO(qiao): support other place + platform::CPUPlace cpu; + auto* tensor_data = dst_->mutable_data(cpu); + auto* start = tensor_data + dst_offset_; + auto* end = start + size_; + std::fill(start, end, static_cast(0.0)); + } + + framework::Tensor* dst_; + int64_t dst_offset_; + int64_t size_; +}; + void SerializeToStream(std::ostream& os, const SelectedRows& selected_rows, const platform::DeviceContext& dev_ctx) { { // the 1st field, uint32_t version @@ -120,7 +140,17 @@ bool SelectedRows::HasKey(int64_t key) const { : true; } -int64_t SelectedRows::AutoGrownIndex(int64_t key, bool auto_grown) { +int64_t SelectedRows::AutoGrownIndex(int64_t key, bool auto_grown, + bool is_test) { + if (is_test) { + auto iter = id_to_index_.find(key); + if (iter == id_to_index_.end()) { + return -1; + } else { + return iter->second; + } + } + rwlock_->RDLock(); auto iter = id_to_index_.find(key); if (iter == id_to_index_.end()) { @@ -172,7 +202,7 @@ void SelectedRows::SyncIndex() { } void SelectedRows::Get(const framework::Tensor& ids, framework::Tensor* value, - bool auto_grown) { + bool auto_grown, bool is_test) { PADDLE_ENFORCE(value->IsInitialized(), "The value tensor should be initialized."); if (ids.numel() == 0) { @@ -183,11 +213,19 @@ void SelectedRows::Get(const framework::Tensor& ids, framework::Tensor* value, "output tensor should have the same shape with table " "except the dims[0]."); for (int i = 0; i < ids.numel(); ++i) { - int64_t index = AutoGrownIndex(ids.data()[i], auto_grown); - framework::VisitDataType( - framework::ToDataType(value_->type()), - TensorCopyVisitor(value, i * value_width, *value_.get(), - index * value_width, value_width)); + auto id = ids.data()[i]; + int64_t index = AutoGrownIndex(id, auto_grown, is_test); + if (index < 0) { + VLOG(5) << "id " << id << " not in the table, return 0"; + framework::VisitDataType( + value_->type(), + TensorFillVisitor(value, i * value_width, value_width, 0.0)); + } else { + framework::VisitDataType( + value_->type(), + TensorCopyVisitor(value, i * value_width, *value_.get(), + index * value_width, value_width)); + } } } } diff --git a/paddle/fluid/framework/selected_rows.h b/paddle/fluid/framework/selected_rows.h index daf5e95304fb84eaba26a30c45414d5021e7ffcb..e1bdba9b46a4cbdb664b70c7419f567ef95bdf31 100644 --- a/paddle/fluid/framework/selected_rows.h +++ b/paddle/fluid/framework/selected_rows.h @@ -32,8 +32,7 @@ namespace framework { class SelectedRows { /* * @brief We can use the SelectedRows structure to reproduce a sparse table. - * A sparse table is a key-value structure that the key is an `int64_t` - * number, + * A sparse table is a key-value structure that the key is an `int64_t`, * and the value is a Tensor which the first dimension is 0. * You can use the following interface to operate the sparse table, and you * can find @@ -105,7 +104,7 @@ class SelectedRows { * the value */ void Get(const framework::Tensor& ids, framework::Tensor* value, - bool auto_grown = false); + bool auto_grown = false, bool is_test = false); /* * @brief Get the index of the key from id_to_index_ map. If the key not @@ -118,10 +117,24 @@ class SelectedRows { * * @return index of the key. */ - int64_t AutoGrownIndex(int64_t key, bool auto_grown); + int64_t AutoGrownIndex(int64_t key, bool auto_grown, bool is_test = false); - void SyncIndex(); + /* + * @brief Get the index of the key from id_to_index_ map. + */ + inline int64_t GetIndexFromId(int64_t key) { + auto iter = id_to_index_.find(key); + if (iter == id_to_index_.end()) { + return -1; + } else { + return iter->second; + } + } + void SyncIndex(); + /* + * @brief Get complete Dims before + */ DDim GetCompleteDims() const { std::vector dims = vectorize(value_->dims()); dims[0] = height_; @@ -133,9 +146,10 @@ class SelectedRows { // SelectedRows are simply concated when adding together. Until a // SelectedRows add a Tensor, will the duplicate rows be handled. Vector rows_; - std::unordered_map id_to_index_; + std::unordered_map + id_to_index_; // should not be used when rows_ has duplicate member std::unique_ptr value_{nullptr}; - int64_t height_; + int64_t height_; // height indicates the underline tensor's height std::unique_ptr rwlock_{nullptr}; }; diff --git a/paddle/fluid/framework/selected_rows_test.cc b/paddle/fluid/framework/selected_rows_test.cc index 9c427a4ae4c9660b107ca891a60db306cb09301f..3b0509e0344efedf08ab21cac0a075049617ca97 100644 --- a/paddle/fluid/framework/selected_rows_test.cc +++ b/paddle/fluid/framework/selected_rows_test.cc @@ -84,10 +84,14 @@ TEST(SelectedRows, SparseTable) { data[i * embedding_width + j] = static_cast(i); } } - ASSERT_EQ(table.AutoGrownIndex(10, true), 0); - ASSERT_EQ(table.AutoGrownIndex(8, true), 1); - ASSERT_EQ(table.AutoGrownIndex(8, true), 1); - ASSERT_EQ(table.AutoGrownIndex(6, true), 2); + ASSERT_EQ(table.AutoGrownIndex(10, true, false), 0); + ASSERT_EQ(table.AutoGrownIndex(8, true, false), 1); + ASSERT_EQ(table.AutoGrownIndex(8, true, false), 1); + ASSERT_EQ(table.AutoGrownIndex(6, true, false), 2); + for (int64_t i = 11; i < 20; i++) { + ASSERT_EQ(table.AutoGrownIndex(i, true, true), -1); + ASSERT_TRUE(!table.HasKey(i)); + } ASSERT_TRUE(table.HasKey(10)); ASSERT_TRUE(table.HasKey(8)); ASSERT_TRUE(table.HasKey(6)); diff --git a/paddle/fluid/framework/shape_inference.cc b/paddle/fluid/framework/shape_inference.cc index ddff2c7c261746ac9986e79cff3da7e0a9654adc..4ac872ac3d3bf918678f5294a4c35097c3fb18ab 100644 --- a/paddle/fluid/framework/shape_inference.cc +++ b/paddle/fluid/framework/shape_inference.cc @@ -22,20 +22,6 @@ limitations under the License. */ namespace paddle { namespace framework { -DDim InferShapeContext::GetInputDim(const std::string &name) const { - const std::vector &arg_names = Inputs(name); - PADDLE_ENFORCE_EQ(arg_names.size(), 1UL, - "Input(%s) should hold one element, but now it holds %d", - name, arg_names.size()); - return this->GetDim(arg_names[0]); -} - -std::vector InferShapeContext::GetInputsDim( - const std::string &name) const { - const std::vector &arg_names = Inputs(name); - return GetDims(arg_names); -} - std::vector InferShapeContext::GetReaderDims( const std::string &name) const { const std::vector &arg_names = Inputs(name); @@ -46,26 +32,6 @@ std::vector InferShapeContext::GetReaderDims( return this->GetRepeatedDims(arg_names[0]); } -DDim InferShapeContext::GetInputsElementDim(const std::string &name, - int idx) const { - const std::vector &names = Inputs(name); - return this->GetDim(names[idx]); -} - -void InferShapeContext::SetOutputDim(const std::string &name, const DDim &dim) { - auto &arg_names = Outputs(name); - PADDLE_ENFORCE_EQ(arg_names.size(), 1UL, - "Output(%s) should hold one element, but now it holds %d", - name, arg_names.size()); - SetDim(arg_names[0], dim); -} - -void InferShapeContext::SetOutputsDim(const std::string &name, - const std::vector &dims) { - auto &names = Outputs(name); - SetDims(names, dims); -} - void InferShapeContext::SetReaderDims(const std::string &name, const std::vector &dims) { const std::vector &arg_names = Outputs(name); @@ -76,69 +42,5 @@ void InferShapeContext::SetReaderDims(const std::string &name, return this->SetRepeatedDims(arg_names[0], dims); } -std::vector InferShapeContext::GetInputVarPtrs( - const std::string &name) { - const std::vector arg_names = Inputs(name); - std::vector res; - res.reserve(arg_names.size()); - std::transform( - arg_names.begin(), arg_names.end(), std::back_inserter(res), - [this](const std::string &name) { return this->GetVarPtr(name); }); - return res; -} - -std::vector InferShapeContext::GetOutputVarPtrs( - const std::string &name) { - const std::vector arg_names = Outputs(name); - std::vector res; - res.reserve(arg_names.size()); - std::transform( - arg_names.begin(), arg_names.end(), std::back_inserter(res), - [this](const std::string &name) { return this->GetVarPtr(name); }); - return res; -} - -std::vector InferShapeContext::GetDims( - const std::vector &names) const { - std::vector ret; - ret.reserve(names.size()); - std::transform( - names.begin(), names.end(), std::back_inserter(ret), - [this](const std::string &name) { return this->GetDim(name); }); - return ret; -} - -void InferShapeContext::SetDims(const std::vector &names, - const std::vector &dims) { - size_t length = names.size(); - PADDLE_ENFORCE_EQ(length, dims.size()); - for (size_t i = 0; i < length; ++i) { - if (names[i] == framework::kEmptyVarName) { - continue; - } - SetDim(names[i], dims[i]); - } -} - -std::vector InferShapeContext::GetInputsVarType( - const std::string &name) const { - return GetVarTypes(Inputs(name)); -} - -std::vector InferShapeContext::GetOutputsVarType( - const std::string &name) const { - return GetVarTypes(Outputs(name)); -} - -std::vector InferShapeContext::GetVarTypes( - const std::vector &names) const { - std::vector retv; - retv.resize(names.size()); - std::transform(names.begin(), names.end(), retv.begin(), - std::bind(std::mem_fn(&InferShapeContext::GetVarType), this, - std::placeholders::_1)); - return retv; -} - } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/shape_inference.h b/paddle/fluid/framework/shape_inference.h index 280bc19dce7b604d67aefdc572de96b479b8d2d7..e0a848273b8d6b50eb1706998e368141a0d1f7f3 100644 --- a/paddle/fluid/framework/shape_inference.h +++ b/paddle/fluid/framework/shape_inference.h @@ -25,6 +25,8 @@ limitations under the License. */ namespace paddle { namespace framework { +class OperatorBase; + using InferShapeVarPtr = boost::variant; class InferShapeContext { @@ -33,22 +35,23 @@ class InferShapeContext { virtual bool HasInput(const std::string &name) const = 0; virtual bool HasOutput(const std::string &name) const = 0; - std::vector GetInputsVarType( - const std::string &name) const; - std::vector GetOutputsVarType( - const std::string &name) const; + virtual std::vector GetInputsVarType( + const std::string &name) const = 0; + virtual std::vector GetOutputsVarType( + const std::string &name) const = 0; virtual bool HasInputs(const std::string &name) const = 0; virtual bool HasOutputs(const std::string &name) const = 0; - DDim GetInputDim(const std::string &name) const; - std::vector GetInputsDim(const std::string &name) const; - std::vector GetReaderDims(const std::string &name) const; - DDim GetInputsElementDim(const std::string &name, int idx) const; + virtual DDim GetInputDim(const std::string &name) const = 0; + virtual std::vector GetInputsDim(const std::string &name) const = 0; + virtual std::vector GetReaderDims(const std::string &name) const; - void SetOutputDim(const std::string &name, const DDim &dim); - void SetOutputsDim(const std::string &name, const std::vector &dims); - void SetReaderDims(const std::string &name, const std::vector &dims); + virtual void SetOutputDim(const std::string &name, const DDim &dim) = 0; + virtual void SetOutputsDim(const std::string &name, + const std::vector &dims) = 0; + virtual void SetReaderDims(const std::string &name, + const std::vector &dims); virtual AttrReader Attrs() const = 0; virtual const std::vector &Inputs( @@ -62,29 +65,20 @@ class InferShapeContext { virtual void ShareLoD(const std::string &in, const std::string &out, size_t i = 0, size_t j = 0) const = 0; - virtual bool IsRuntime() const = 0; + virtual void DecreaseLoDLevel(const std::string &in, const std::string &out, + size_t i = 0, size_t j = 0) const = 0; - std::vector GetInputVarPtrs(const std::string &name); - std::vector GetOutputVarPtrs(const std::string &name); - virtual InferShapeVarPtr GetVarPtr(const std::string &name) = 0; + virtual bool IsRuntime() const = 0; - // Note: In while op, we need this to be public - void SetDims(const std::vector &names, - const std::vector &dims); + virtual std::vector GetInputVarPtrs( + const std::string &name) = 0; + virtual std::vector GetOutputVarPtrs( + const std::string &name) = 0; protected: - virtual DDim GetDim(const std::string &name) const = 0; - virtual void SetDim(const std::string &name, const DDim &dim) = 0; virtual std::vector GetRepeatedDims(const std::string &name) const = 0; virtual void SetRepeatedDims(const std::string &name, const std::vector &dims) = 0; - - std::vector GetDims(const std::vector &names) const; - - std::vector GetVarTypes( - const std::vector &names) const; - - virtual proto::VarType::Type GetVarType(const std::string &name) const = 0; }; } // namespace framework diff --git a/paddle/fluid/framework/tensor.cc b/paddle/fluid/framework/tensor.cc index b6ba0df033af12d48e88eb57a3b97b559077250d..ea7f8c496a9fc3ff78fce06b69fb21e44e5be9ee 100644 --- a/paddle/fluid/framework/tensor.cc +++ b/paddle/fluid/framework/tensor.cc @@ -13,10 +13,11 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/framework/tensor.h" +#include "paddle/fluid/framework/var_type.h" namespace paddle { namespace framework { -extern size_t SizeOfType(std::type_index type); +extern size_t SizeOfType(proto::VarType::Type type); void Tensor::check_memory_size() const { PADDLE_ENFORCE_NOT_NULL( holder_, "Tensor holds no memory. Call Tensor::mutable_data first."); @@ -27,15 +28,16 @@ void Tensor::check_memory_size() const { "or maybe the required data-type mismatches the data already stored."); } +Tensor::Tensor(const proto::VarType::Type& dtype) : type_(dtype), offset_(0) {} + size_t Tensor::memory_size() const { return holder_ == nullptr ? 0UL : holder_->size() - offset_; } -void* Tensor::mutable_data(platform::Place place, std::type_index type, +void* Tensor::mutable_data(platform::Place place, proto::VarType::Type type, + memory::Allocator::Attr attr, size_t requested_size) { - if (holder_ != nullptr) { - holder_->set_type(type); - } + type_ = type; PADDLE_ENFORCE_GE(numel(), 0, "When calling this method, the Tensor's numel must be " "equal or larger than zero. " @@ -48,35 +50,18 @@ void* Tensor::mutable_data(platform::Place place, std::type_index type, /* some versions of boost::variant don't have operator!= */ if (holder_ == nullptr || !(holder_->place() == place) || holder_->size() < size + offset_) { - if (platform::is_cpu_place(place)) { - holder_.reset(new PlaceholderImpl( - boost::get(place), size, type)); - } else if (platform::is_gpu_place(place) || - platform::is_cuda_pinned_place(place)) { -#ifndef PADDLE_WITH_CUDA - PADDLE_THROW( - "CUDAPlace or CUDAPinnedPlace is not supported in CPU-only mode."); - } -#else - if (platform::is_gpu_place(place)) { - holder_.reset(new PlaceholderImpl( - boost::get(place), size, type)); - } else if (platform::is_cuda_pinned_place(place)) { - holder_.reset(new PlaceholderImpl( - boost::get(place), size, type)); - } - } -#endif + holder_ = memory::AllocShared(place, size, attr); offset_ = 0; } return reinterpret_cast(reinterpret_cast(holder_->ptr()) + offset_); } -void* Tensor::mutable_data(platform::Place place, size_t requested_size) { +void* Tensor::mutable_data(platform::Place place, memory::Allocator::Attr attr, + size_t requested_size) { PADDLE_ENFORCE(this->holder_ != nullptr, "Cannot invoke mutable data if current hold nothing."); - return mutable_data(place, holder_->type(), requested_size); + return mutable_data(place, type_, attr, requested_size); } Tensor& Tensor::ShareDataWith(const Tensor& src) { @@ -85,7 +70,7 @@ Tensor& Tensor::ShareDataWith(const Tensor& src) { return *this; } -Tensor Tensor::Slice(int begin_idx, int end_idx) const { +Tensor Tensor::Slice(int64_t begin_idx, int64_t end_idx) const { check_memory_size(); PADDLE_ENFORCE_GE(begin_idx, 0, "The start row index must be greater than 0."); @@ -101,6 +86,7 @@ Tensor Tensor::Slice(int begin_idx, int end_idx) const { Tensor dst; dst.holder_ = holder_; dst.set_layout(layout_); + dst.type_ = type_; DDim dst_dims = dims_; dst_dims[0] = end_idx - begin_idx; dst.Resize(dst_dims); @@ -118,5 +104,12 @@ const DDim& Tensor::dims() const { return dims_; } int64_t Tensor::numel() const { return product(dims_); } +void Tensor::ResetHolder(std::shared_ptr holder) { + if (holder_) { + PADDLE_ENFORCE_EQ(numel() * SizeOfType(type()), holder->size()); + } + holder_ = holder; +} + } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/tensor.h b/paddle/fluid/framework/tensor.h index f1d268548578fea12082e2edb213a3749eccbfaf..0fa76f943ec1417dc712771565f7ff2b263e6365 100644 --- a/paddle/fluid/framework/tensor.h +++ b/paddle/fluid/framework/tensor.h @@ -18,10 +18,11 @@ limitations under the License. */ #include #include #include +#include #include - #include "paddle/fluid/framework/data_layout.h" #include "paddle/fluid/framework/ddim.h" +#include "paddle/fluid/framework/framework.pb.h" #include "paddle/fluid/memory/memory.h" #include "paddle/fluid/platform/device_context.h" #include "paddle/fluid/platform/enforce.h" @@ -67,12 +68,9 @@ class Tensor { friend struct EigenVector; public: - Tensor() : offset_(0) {} + Tensor() : type_(proto::VarType::FP32), offset_(0) {} - /*! Constructor with place should only be used in pybind. */ - explicit Tensor(const platform::Place& place) : offset_(0) { - holder_->set_place(place); - } + explicit Tensor(const proto::VarType::Type&); /*! Return a pointer to mutable memory block. */ template @@ -89,12 +87,17 @@ class Tensor { * @note If not exist, then allocation. */ template - T* mutable_data(platform::Place place, size_t requested_size = 0); + T* mutable_data(platform::Place place, + memory::Allocator::Attr attr = memory::Allocator::kDefault, + size_t requested_size = 0); - void* mutable_data(platform::Place place, std::type_index type, + void* mutable_data(platform::Place place, proto::VarType::Type type, + memory::Allocator::Attr attr = memory::Allocator::kDefault, size_t requested_size = 0); - void* mutable_data(platform::Place place, size_t requested_size = 0); + void* mutable_data(platform::Place place, + memory::Allocator::Attr attr = memory::Allocator::kDefault, + size_t requested_size = 0); /** * @brief Return a pointer to mutable memory block. @@ -106,7 +109,9 @@ class Tensor { * @note If not exist, then allocation. */ template - T* mutable_data(DDim dims, platform::Place place, size_t requested_size = 0); + T* mutable_data(DDim dims, platform::Place place, + memory::Allocator::Attr attr = memory::Allocator::kDefault, + size_t requested_size = 0); /*! Return the dimensions of the memory block. */ const DDim& dims() const; @@ -128,7 +133,7 @@ class Tensor { * @param[in] end_idx The index of the end row(exclusive) to slice. * The index number begins from 0. */ - Tensor Slice(int begin_idx, int end_idx) const; + Tensor Slice(int64_t begin_idx, int64_t end_idx) const; platform::Place place() const { PADDLE_ENFORCE_NOT_NULL( @@ -136,10 +141,10 @@ class Tensor { return holder_->place(); } - std::type_index type() const { + proto::VarType::Type type() const { PADDLE_ENFORCE_NOT_NULL( holder_, "Tensor not initialized yet when Tensor::type() is called."); - return holder_->type(); + return type_; } // memory size returns the holding memory size in byte. @@ -153,56 +158,19 @@ class Tensor { void clear() { holder_ = nullptr; } - private: - /** - * @note Placeholder hides type T, so it doesn't appear as a template - * parameter of Variable. - */ - struct Placeholder { - virtual ~Placeholder() = default; - virtual void* ptr() const = 0; - virtual size_t size() const = 0; - virtual std::type_index type() const = 0; - virtual platform::Place place() const = 0; - virtual void set_type(std::type_index type) = 0; - virtual void set_place(platform::Place place) = 0; - }; - - template - struct PlaceholderImpl : public Placeholder { - PlaceholderImpl(Place place, size_t size, std::type_index type) - : ptr_(static_cast(memory::Alloc(place, size)), - memory::PODDeleter(place)), - place_(place), - size_(size), - type_(type) { - PADDLE_ENFORCE_NOT_NULL(ptr_, "Insufficient %s memory to allocation.", - (is_cpu_place(place_) ? "CPU" : "GPU")); - } - - virtual size_t size() const { return size_; } - virtual platform::Place place() const { return place_; } - virtual void* ptr() const { return static_cast(ptr_.get()); } - virtual std::type_index type() const { return type_; } - virtual void set_type(std::type_index type) { type_ = type; } - virtual void set_place(platform::Place place) { place_ = place; } - - /*! the pointer of memory block. */ - std::unique_ptr> ptr_; - - /*! the place of memory block. */ - platform::Place place_; - - /*! the size of memory block. */ - size_t size_; - - /* the current type of memory */ - std::type_index type_; - }; + const std::shared_ptr& Holder() const { return holder_; } + size_t offset() const { return offset_; } - /*! holds the memory block if allocated. */ - std::shared_ptr holder_; + std::shared_ptr MoveMemoryHolder() { + return std::move(holder_); + } + + void ResetHolder(std::shared_ptr holder); + private: + /*! holds the memory block if allocated. */ + std::shared_ptr holder_; + proto::VarType::Type type_; /** * @brief points to elements dimensions. * diff --git a/paddle/fluid/framework/tensor_impl.h b/paddle/fluid/framework/tensor_impl.h index 6d3047c95d6cf30c2a5308d4f69ded367066d78c..ef5404e4755817cefc925acbf4882ff86d1f0ba3 100644 --- a/paddle/fluid/framework/tensor_impl.h +++ b/paddle/fluid/framework/tensor_impl.h @@ -23,10 +23,10 @@ namespace framework { template inline const T* Tensor::data() const { check_memory_size(); - bool valid = std::is_same::value || - holder_->type() == std::type_index(typeid(T)); - PADDLE_ENFORCE(valid, "Tensor holds the wrong type, it holds %s", - this->holder_->type().name()); + bool valid = + std::is_same::value || type_ == DataTypeTrait::DataType; + PADDLE_ENFORCE(valid, "Tensor holds the wrong type, it holds %d", + DataTypeToString(type_)); return reinterpret_cast( reinterpret_cast(holder_->ptr()) + offset_); @@ -37,26 +37,29 @@ inline bool Tensor::IsInitialized() const { return holder_ != nullptr; } template inline T* Tensor::data() { check_memory_size(); - bool valid = std::is_same::value || - holder_->type() == std::type_index(typeid(T)); - PADDLE_ENFORCE(valid, "Tensor holds the wrong type, it holds %s", - this->holder_->type().name()); + bool valid = + std::is_same::value || type_ == DataTypeTrait::DataType; + PADDLE_ENFORCE(valid, "Tensor holds the wrong type, it holds %s", type_); return reinterpret_cast(reinterpret_cast(holder_->ptr()) + offset_); } template inline T* Tensor::mutable_data(DDim dims, platform::Place place, + memory::Allocator::Attr attr, size_t requested_size) { static_assert(std::is_pod::value, "T must be POD"); Resize(dims); - return mutable_data(place, requested_size); + return mutable_data(place, attr, requested_size); } template -inline T* Tensor::mutable_data(platform::Place place, size_t requested_size) { +inline T* Tensor::mutable_data(platform::Place place, + memory::Allocator::Attr attr, + size_t requested_size) { static_assert(std::is_pod::value, "T must be POD"); - return reinterpret_cast(mutable_data(place, typeid(T), requested_size)); + return reinterpret_cast( + mutable_data(place, DataTypeTrait::DataType, attr, requested_size)); } inline Tensor ReshapeToMatrix(const Tensor& src, int num_col_dims) { diff --git a/paddle/fluid/framework/tensor_test.cc b/paddle/fluid/framework/tensor_test.cc index cb2061c06a429d8e8116001a4aa4e8c46ea13428..83dea8639010f77619a6fc2a81e092ae513c6e79 100644 --- a/paddle/fluid/framework/tensor_test.cc +++ b/paddle/fluid/framework/tensor_test.cc @@ -74,6 +74,35 @@ TEST(Tensor, MutableData) { p2 = src_tensor.mutable_data(framework::make_ddim({2, 2}), platform::CPUPlace()); EXPECT_EQ(p1, p2); + + float* p3 = nullptr; + float* p4 = nullptr; + // set src_tensor a different type but smaller size. + // memory block is supposed to be unchanged. + auto* tmp = src_tensor.mutable_data(framework::make_ddim({2, 2}), + platform::CPUPlace()); + p3 = reinterpret_cast(tmp); + EXPECT_EQ(p1, p3); + + // set src_tensor a different type but bigger size. + // memory block is supposed to be changed. + auto* tmp2 = src_tensor.mutable_data( + framework::make_ddim({2, 2, 3}), platform::CPUPlace()); + p4 = reinterpret_cast(tmp2); + EXPECT_NE(p1, p4); + } + // Not sure if it's desired, but currently, Tensor type can be changed. + { + framework::Tensor src_tensor; + int8_t* p1 = src_tensor.mutable_data(framework::make_ddim({1}), + platform::CPUPlace()); + EXPECT_NE(p1, nullptr); + *p1 = 1; + + uint8_t* p2 = src_tensor.mutable_data(framework::make_ddim({1}), + platform::CPUPlace()); + EXPECT_NE(p2, nullptr); + EXPECT_EQ(static_cast(p2[0]), 1); } #ifdef PADDLE_WITH_CUDA diff --git a/paddle/fluid/framework/tensor_util.cc b/paddle/fluid/framework/tensor_util.cc index 69bcbc0e5891f95af4de8dfd49a25648ca920ab1..a7f09df4917532e7261cee471c711897c8eb3447 100644 --- a/paddle/fluid/framework/tensor_util.cc +++ b/paddle/fluid/framework/tensor_util.cc @@ -14,8 +14,11 @@ #include "paddle/fluid/framework/tensor_util.h" #include #include +#include +#include #include #include "paddle/fluid/framework/data_type.h" +#include "paddle/fluid/platform/profiler.h" namespace paddle { namespace framework { @@ -135,16 +138,19 @@ void TensorCopySync(const Tensor& src, const platform::Place& dst_place, #ifdef PADDLE_WITH_CUDA else if (platform::is_gpu_place(src_place) && // NOLINT platform::is_cpu_place(dst_place)) { + platform::RecordEvent record_event("TensorCopy:GPU->CPU"); auto src_gpu_place = boost::get(src_place); auto dst_cpu_place = boost::get(dst_place); memory::Copy(dst_cpu_place, dst_ptr, src_gpu_place, src_ptr, size, nullptr); } else if (platform::is_cpu_place(src_place) && platform::is_gpu_place(dst_place)) { + platform::RecordEvent record_event("TensorCopy:CPU->GPU"); auto src_cpu_place = boost::get(src_place); auto dst_gpu_place = boost::get(dst_place); memory::Copy(dst_gpu_place, dst_ptr, src_cpu_place, src_ptr, size, nullptr); } else if (platform::is_gpu_place(src_place) && platform::is_gpu_place(dst_place)) { + platform::RecordEvent record_event("TensorCopy:GPU->GPU"); if (src_ptr == dst_ptr && platform::is_same_place(src_place, dst_place)) { VLOG(3) << "Skip copy the same data from " << src_place << " to " << dst_place; @@ -153,6 +159,13 @@ void TensorCopySync(const Tensor& src, const platform::Place& dst_place, auto src_gpu_place = boost::get(src_place); auto dst_gpu_place = boost::get(dst_place); memory::Copy(dst_gpu_place, dst_ptr, src_gpu_place, src_ptr, size, nullptr); + } else if (platform::is_cuda_pinned_place(src_place) && + platform::is_gpu_place(dst_place)) { + platform::RecordEvent record_event("TensorCopy:CUDAPinned->GPU"); + auto src_pinned_place = boost::get(src_place); + auto dst_gpu_place = boost::get(dst_place); + memory::Copy(dst_gpu_place, dst_ptr, src_pinned_place, src_ptr, size, + nullptr); } #endif } @@ -180,8 +193,8 @@ struct AnyDTypeVisitor { template inline void AnyImpl(Predicate predicate, const framework::Tensor& tensor, const DevCtx& ctx, framework::Tensor* out) { - VisitDataType(ToDataType(tensor.type()), AnyDTypeVisitor( - predicate, tensor, ctx, out)); + VisitDataType(tensor.type(), AnyDTypeVisitor( + predicate, tensor, ctx, out)); } template @@ -373,7 +386,7 @@ void TensorToStream(std::ostream& os, const Tensor& tensor, // int32_t size // void* protobuf message proto::VarType::TensorDesc desc; - desc.set_data_type(framework::ToDataType(tensor.type())); + desc.set_data_type(tensor.type()); auto dims = framework::vectorize(tensor.dims()); auto* pb_dims = desc.mutable_dims(); pb_dims->Resize(static_cast(dims.size()), 0); @@ -455,9 +468,7 @@ void TensorFromStream(std::istream& is, Tensor* tensor, tensor->Resize(framework::make_ddim(dims)); void* buf; auto ctx = platform::CPUDeviceContext(); - size_t size = - tensor->numel() * - framework::SizeOfType(framework::ToTypeIndex(desc.data_type())); + size_t size = tensor->numel() * framework::SizeOfType(desc.data_type()); if (platform::is_gpu_place(dev_ctx.GetPlace())) { #ifdef PADDLE_WITH_CUDA Tensor cpu_tensor; diff --git a/paddle/fluid/framework/tensor_util.h b/paddle/fluid/framework/tensor_util.h index cab6d9b67e4e64335be0a386bfffb7ebe4373b3e..1ffd357e62b4bdc72dbec627c463730aa2c8f720 100644 --- a/paddle/fluid/framework/tensor_util.h +++ b/paddle/fluid/framework/tensor_util.h @@ -19,6 +19,7 @@ limitations under the License. */ #include "paddle/fluid/framework/framework.pb.h" #include "paddle/fluid/framework/tensor.h" #include "paddle/fluid/platform/device_context.h" +#include "paddle/fluid/platform/temporary_allocator.h" namespace paddle { namespace framework { @@ -150,6 +151,5 @@ void TensorToVector(const Tensor& src, std::vector* dst) { memory::Copy(dst_place, dst_ptr, boost::get(src.place()), src_ptr, size); } - } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/tensor_util_test.cc b/paddle/fluid/framework/tensor_util_test.cc index 793ccfc79fe56707f226477b9d50b1d972ab6a59..17c55378178325b40e394f4b422c22c1c10bd130 100644 --- a/paddle/fluid/framework/tensor_util_test.cc +++ b/paddle/fluid/framework/tensor_util_test.cc @@ -379,7 +379,9 @@ TEST(Tensor, FromAndToStream) { TensorToStream(oss, gpu_tensor, gpu_ctx); std::istringstream iss(oss.str()); - TensorFromStream(iss, &dst_tensor, gpu_ctx); + TensorFromStream( + iss, &dst_tensor, + *platform::DeviceContextPool::Instance().Get(platform::CPUPlace())); int* dst_ptr = dst_tensor.mutable_data(platform::CPUPlace()); for (int i = 0; i < 6; ++i) { diff --git a/paddle/fluid/framework/threadpool.cc b/paddle/fluid/framework/threadpool.cc index 18cdca3a658a6a89e6ab637a7f38825756acfea8..d34f826c1abb99198fd4dbe9537495edff7b63af 100644 --- a/paddle/fluid/framework/threadpool.cc +++ b/paddle/fluid/framework/threadpool.cc @@ -25,7 +25,6 @@ DEFINE_int32(dist_threadpool_size, 0, namespace paddle { namespace framework { - std::unique_ptr ThreadPool::threadpool_(nullptr); std::once_flag ThreadPool::init_flag_; @@ -47,8 +46,7 @@ void ThreadPool::Init() { } } -ThreadPool::ThreadPool(int num_threads) - : total_threads_(num_threads), idle_threads_(num_threads), running_(true) { +ThreadPool::ThreadPool(int num_threads) : running_(true) { threads_.resize(num_threads); for (auto& thread : threads_) { // TODO(Yancey1989): binding the thread on the specify CPU number @@ -59,9 +57,10 @@ ThreadPool::ThreadPool(int num_threads) ThreadPool::~ThreadPool() { { // notify all threads to stop running + std::unique_lock l(mutex_); running_ = false; - scheduled_.notify_all(); } + scheduled_.notify_all(); for (auto& t : threads_) { t->join(); @@ -69,36 +68,29 @@ ThreadPool::~ThreadPool() { } } -void ThreadPool::Wait() { - std::unique_lock lock(mutex_); - completed_.wait(lock, [=] { return Done() == true; }); -} - void ThreadPool::TaskLoop() { - while (running_) { - std::unique_lock lock(mutex_); - scheduled_.wait(lock, [=] { return !tasks_.empty() || !running_; }); - - if (!running_) { - break; - } - // pop a task from the task queue - auto task = std::move(tasks_.front()); - tasks_.pop(); - - --idle_threads_; - lock.unlock(); - - // run the task - task(); + while (true) { + Task task; { std::unique_lock lock(mutex_); - ++idle_threads_; - if (Done()) { - completed_.notify_all(); + scheduled_.wait( + lock, [this] { return !this->tasks_.empty() || !this->running_; }); + + if (!running_ && tasks_.empty()) { + return; + } + + if (tasks_.empty()) { + PADDLE_THROW("This thread has no task to Run"); } + + // pop a task from the task queue + task = std::move(tasks_.front()); + tasks_.pop(); } + // run the task + task(); } } diff --git a/paddle/fluid/framework/threadpool.h b/paddle/fluid/framework/threadpool.h index 94111ee335b1a5df327b3e46d62069b4735c54f6..7a51d18fbbf65f68725aa86a6a0ce4d15dff5673 100644 --- a/paddle/fluid/framework/threadpool.h +++ b/paddle/fluid/framework/threadpool.h @@ -57,17 +57,8 @@ class ThreadPool { ~ThreadPool(); - // Returns the number of threads created by the constructor. - size_t Threads() const { return total_threads_; } - - // Returns the number of currently idle threads. - size_t IdleThreads() { - std::unique_lock lock(mutex_); - return idle_threads_; - } - // Run pushes a function to the task queue and returns a std::future - // object. To wait for the completion of the task, call + // object. To wait for the completion of the task, call // std::future::wait(). template std::future Run(Callback fn) { @@ -78,7 +69,6 @@ class ThreadPool { template std::future> RunAndGetException( Callback fn) { - std::unique_lock lock(mutex_); Task task([fn]() -> std::unique_ptr { try { fn(); @@ -93,26 +83,20 @@ class ThreadPool { return nullptr; }); std::future> f = task.get_future(); - tasks_.push(std::move(task)); - lock.unlock(); + { + std::unique_lock lock(mutex_); + if (!running_) { + PADDLE_THROW("enqueue on stopped ThreadPool"); + } + tasks_.push(std::move(task)); + } scheduled_.notify_one(); return f; } - // Wait until all the tasks are completed. - void Wait(); - private: DISABLE_COPY_AND_ASSIGN(ThreadPool); - // If the task queue is empty and avaialbe is equal to the number of - // threads, means that all tasks are completed. Note: this function - // is not thread-safe. Returns true if all tasks are completed. - // Note: don't delete the data member total_threads_ and use - // threads_.size() instead; because you'd need to lock the mutex - // before accessing threads_. - bool Done() { return tasks_.empty() && idle_threads_ == total_threads_; } - // The constructor starts threads to run TaskLoop, which retrieves // and runs tasks from the queue. void TaskLoop(); @@ -125,14 +109,11 @@ class ThreadPool { static std::once_flag init_flag_; std::vector> threads_; - const size_t total_threads_; - size_t idle_threads_; std::queue tasks_; std::mutex mutex_; bool running_; std::condition_variable scheduled_; - std::condition_variable completed_; }; class ThreadPoolIO : ThreadPool { diff --git a/paddle/fluid/framework/threadpool_test.cc b/paddle/fluid/framework/threadpool_test.cc index 27a4ffd4fcbf293a3dea1744b29384d0bee0c137..884d61e23428a0ad758946295ca9c470767e93ef 100644 --- a/paddle/fluid/framework/threadpool_test.cc +++ b/paddle/fluid/framework/threadpool_test.cc @@ -19,10 +19,11 @@ limitations under the License. */ namespace framework = paddle::framework; -void do_sum(framework::ThreadPool* pool, std::atomic* sum, int cnt) { - std::vector> fs; +void do_sum(std::vector>* fs, std::mutex* mu, + std::atomic* sum, int cnt) { for (int i = 0; i < cnt; ++i) { - fs.push_back(framework::Async([sum]() { sum->fetch_add(1); })); + std::lock_guard l(*mu); + fs->push_back(framework::Async([sum]() { sum->fetch_add(1); })); } } @@ -40,18 +41,21 @@ TEST(ThreadPool, ConcurrentInit) { } TEST(ThreadPool, ConcurrentRun) { - framework::ThreadPool* pool = framework::ThreadPool::GetInstance(); std::atomic sum(0); std::vector threads; + std::vector> fs; + std::mutex fs_mu; int n = 50; // sum = (n * (n + 1)) / 2 for (int i = 1; i <= n; ++i) { - std::thread t(do_sum, pool, &sum, i); + std::thread t(do_sum, &fs, &fs_mu, &sum, i); threads.push_back(std::move(t)); } for (auto& t : threads) { t.join(); } - pool->Wait(); + for (auto& t : fs) { + t.wait(); + } EXPECT_EQ(sum, ((n + 1) * n) / 2); } diff --git a/paddle/fluid/framework/trainer.cc b/paddle/fluid/framework/trainer.cc new file mode 100644 index 0000000000000000000000000000000000000000..644bd33a1420aa0ff54e34005eedd10c28342665 --- /dev/null +++ b/paddle/fluid/framework/trainer.cc @@ -0,0 +1,23 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/trainer.h" + +namespace paddle { +namespace framework { + +void TrainerBase::SetScope(Scope* root_scope) { root_scope_ = root_scope; } + +} // end namespace framework +} // end namespace paddle diff --git a/paddle/fluid/framework/trainer.h b/paddle/fluid/framework/trainer.h new file mode 100644 index 0000000000000000000000000000000000000000..b29736cfbbebc183d969dcf1863a6a1d097d2358 --- /dev/null +++ b/paddle/fluid/framework/trainer.h @@ -0,0 +1,95 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include // NOLINT +#include +#include // NOLINT +#include + +#include "paddle/fluid/framework/data_feed.h" +#include "paddle/fluid/framework/data_set.h" +#include "paddle/fluid/framework/device_worker.h" +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/program_desc.h" +#include "paddle/fluid/framework/reader.h" +#include "paddle/fluid/framework/trainer_desc.pb.h" +#include "paddle/fluid/framework/variable_helper.h" +#include "paddle/fluid/operators/reader/blocking_queue.h" +#include "paddle/fluid/platform/port.h" + +namespace paddle { +namespace framework { + +class TrainerBase { + public: + TrainerBase() {} + virtual ~TrainerBase() {} + // model memory are hosted in root_scope + void SetScope(Scope* root_scope); + void SetDebug(const bool debug) { debug_ = debug; } + void SetDataset(Dataset* dataset_ptr) { dataset_ptr_ = dataset_ptr; } + virtual void Initialize(const TrainerDesc& trainer_desc, + Dataset* data_set) = 0; + virtual void InitTrainerEnv(const ProgramDesc& main_program, + const platform::Place& place) = 0; + virtual void InitOtherEnv(const ProgramDesc& main_program) = 0; + virtual void Run() = 0; + virtual void Finalize() = 0; + + protected: + Scope* root_scope_; + bool debug_; + Dataset* dataset_ptr_; +}; + +// general trainer for async execution +// local trainer and distributed trainer are supported +// depends on the assigned device_worker +class MultiTrainer : public TrainerBase { + public: + MultiTrainer() {} + virtual ~MultiTrainer() {} + virtual void Initialize(const TrainerDesc& trainer_desc, Dataset* data_set); + virtual void InitTrainerEnv(const ProgramDesc& main_program, + const platform::Place& place); + virtual void InitOtherEnv(const ProgramDesc& main_program) {} + virtual void Run(); + virtual void Finalize(); + + protected: + int thread_num_; + std::vector threads_; + std::vector> readers_; + std::vector> workers_; +}; + +class DistMultiTrainer : public MultiTrainer { + public: + DistMultiTrainer() {} + virtual ~DistMultiTrainer() {} + virtual void Initialize(const TrainerDesc& trainer_desc, Dataset* data_set); + virtual void InitOtherEnv(const ProgramDesc& main_program); + virtual void Run(); + virtual void Finalize(); + + protected: + std::shared_ptr pull_dense_worker_; +}; + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/trainer_desc.proto b/paddle/fluid/framework/trainer_desc.proto new file mode 100644 index 0000000000000000000000000000000000000000..4fc05ccf5c9be37e80b4ae7263166ad76eb6d6a7 --- /dev/null +++ b/paddle/fluid/framework/trainer_desc.proto @@ -0,0 +1,92 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +syntax = "proto2"; +import "data_feed.proto"; +package paddle.framework; + +message TrainerDesc { + // class name for create trainer desc + // the matchness of trainer name and device worker name + // will be checked in python API + optional string class_name = 1; + // class name for creating device worker + optional string device_worker_name = 2; + // thread number + optional int32 thread_num = 3; + // if we need to binding cpu + optional bool binding_cpu = 4 [ default = false ]; + repeated string filelist = 5; + optional bool debug = 6 [ default = false ]; + optional FetchConfig fetch_config = 7; + + // device worker parameters + optional HogwildWorkerParameter hogwild_param = 101; + optional DownpourWorkerParameter downpour_param = 103; + optional PullDenseWorkerParameter pull_dense_param = 102; + // datafeed desc + optional DataFeedDesc data_desc = 201; +} + +message HogwildWorkerParameter { repeated string skip_ops = 1; } + +message DownpourWorkerParameter { + repeated TableParameter sparse_table = 1; + repeated TableParameter dense_table = 2; + repeated string skip_ops = 3; + repeated ProgramConfig program_config = 4; + optional bool push_sparse = 5 [ default = true ]; + optional bool push_dense = 6 [ default = true ]; +} + +message FetchConfig { + enum Method { PRINT = 0; } + repeated string fetch_var_names = 1; + repeated string fetch_var_str_format = 2; + optional int32 print_period = 3 [ default = 100 ]; + optional Method method = 4 [ default = PRINT ]; +} + +message ProgramConfig { + required string program_id = 1; + repeated int32 push_sparse_table_id = 2; + repeated int32 push_dense_table_id = 3; + repeated int32 pull_sparse_table_id = 4; + repeated int32 pull_dense_table_id = 5; +} + +message PullDenseWorkerParameter { + // dense table only and specialized usage + optional int32 threshold = 1 [ default = 1 ]; + optional int32 device_num = 2; + optional int32 sleep_time_ms = 3 [ default = 2 ]; + repeated TableParameter dense_table = 4; +} + +message TableParameter { + // dense table only + optional uint64 table_id = 1; + repeated string dense_value_name = 2; + repeated string dense_grad_name = 3; + repeated int32 push_dense_wait_times = 5; + // sparse table only + repeated string sparse_key_name = 6; + repeated string sparse_value_name = 7; + repeated string sparse_grad_name = 8; + repeated int32 push_sparse_wait_times = 9; + // sparse table only and specialized usage + optional int32 emb_dim = 10; + optional int32 fea_dim = 11; + optional string label_var_name = 12; +} diff --git a/paddle/fluid/framework/trainer_factory.cc b/paddle/fluid/framework/trainer_factory.cc new file mode 100644 index 0000000000000000000000000000000000000000..6b4461c0c429d5b1809dd69d91390421cc8b14ad --- /dev/null +++ b/paddle/fluid/framework/trainer_factory.cc @@ -0,0 +1,67 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/trainer_factory.h" +#include +#include +#include + +#include "paddle/fluid/framework/trainer.h" + +namespace paddle { +namespace framework { + +typedef std::shared_ptr (*CreatetrainerFunction)(); +typedef std::unordered_map trainerMap; +trainerMap g_trainer_map; + +#define REGISTER_TRAINER_CLASS(trainer_class) \ + namespace { \ + std::shared_ptr Creator_##trainer_class() { \ + return std::shared_ptr(new trainer_class); \ + } \ + class __Registerer_##trainer_class { \ + public: \ + __Registerer_##trainer_class() { \ + g_trainer_map[#trainer_class] = &Creator_##trainer_class; \ + } \ + }; \ + __Registerer_##trainer_class g_registerer_##trainer_class; \ + } // namespace + +std::string TrainerFactory::TrainerTypeList() { + std::string trainer_types; + for (auto iter = g_trainer_map.begin(); iter != g_trainer_map.end(); ++iter) { + if (iter != g_trainer_map.begin()) { + trainer_types += ", "; + } + trainer_types += iter->first; + } + return trainer_types; +} + +std::shared_ptr TrainerFactory::CreateTrainer( + std::string trainer_class) { + if (g_trainer_map.count(trainer_class) < 1) { + LOG(WARNING) << "Trainer class: " << trainer_class << " not defined"; + LOG(WARNING) << TrainerTypeList(); + exit(-1); + } + return g_trainer_map[trainer_class](); +} + +REGISTER_TRAINER_CLASS(MultiTrainer); +REGISTER_TRAINER_CLASS(DistMultiTrainer); +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/trainer_factory.h b/paddle/fluid/framework/trainer_factory.h new file mode 100644 index 0000000000000000000000000000000000000000..9c772a4f19ed9ba50f704ed62ef361555b1285fb --- /dev/null +++ b/paddle/fluid/framework/trainer_factory.h @@ -0,0 +1,30 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include "paddle/fluid/framework/trainer.h" + +namespace paddle { +namespace framework { + +class TrainerFactory { + public: + static std::string TrainerTypeList(); + static std::shared_ptr CreateTrainer(std::string trainer_class); +}; +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/trainer_test.cc b/paddle/fluid/framework/trainer_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..f689679d48696ced2ff1fe5c2d3706e3ed2190a4 --- /dev/null +++ b/paddle/fluid/framework/trainer_test.cc @@ -0,0 +1,27 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/trainer.h" +#include + +namespace paddle { +namespace framework { +TEST() { + // create multi trainer + // create hogwild device worker + // create dataset + // train for a while +} +} +} diff --git a/paddle/fluid/framework/transfer_scope_cache.cc b/paddle/fluid/framework/transfer_scope_cache.cc new file mode 100644 index 0000000000000000000000000000000000000000..e52a8317e2113a9489f8c05bcf47bc96bea33c64 --- /dev/null +++ b/paddle/fluid/framework/transfer_scope_cache.cc @@ -0,0 +1,72 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/transfer_scope_cache.h" + +namespace paddle { +namespace framework { + +std::unordered_map& global_transfer_data_cache() { + thread_local auto* x = new std::unordered_map; + return *x; +} + +std::unordered_set& global_transfer_scope_cache() { + thread_local auto* x = new std::unordered_set; + return *x; +} + +Scope* TryCreateTransferScope(OpKernelType type0, OpKernelType type1, + const Scope* scope) { + Scope* new_scope{nullptr}; + size_t infer_cache_key = + CombineHash(OpKernelType::Hash()(type0), OpKernelType::Hash()(type1)); + infer_cache_key = + CombineHash(infer_cache_key, std::hash()(scope)); + + auto it = global_transfer_data_cache().find(infer_cache_key); + if (it != global_transfer_data_cache().end()) { + new_scope = global_transfer_data_cache()[infer_cache_key]; + } else { + new_scope = &scope->NewScope(); + global_transfer_data_cache()[infer_cache_key] = new_scope; + } + global_transfer_scope_cache().insert(new_scope); + return new_scope; +} + +void RemoveKidsFromTransferScopeCache(Scope* scope) { + auto it = global_transfer_scope_cache().find(scope); + if (it != global_transfer_scope_cache().end()) { + global_transfer_scope_cache().erase(it); + } + for (auto* s : scope->kids()) { + auto it = global_transfer_scope_cache().find(s); + if (it != global_transfer_scope_cache().end()) { + global_transfer_scope_cache().erase(it); + } + } + + // remove global transfer data cache + auto& cache = global_transfer_data_cache(); + for (auto it = cache.begin(); it != cache.end();) { + if (it->second == scope) + it = cache.erase(it); + else + it++; + } +} + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/transfer_scope_cache.h b/paddle/fluid/framework/transfer_scope_cache.h new file mode 100644 index 0000000000000000000000000000000000000000..86fc0bf52972ae22a9bf269d71291f7fcd593880 --- /dev/null +++ b/paddle/fluid/framework/transfer_scope_cache.h @@ -0,0 +1,41 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include // NOLINT +#include +#include +#include "paddle/fluid/framework/op_kernel_type.h" +#include "paddle/fluid/framework/scope.h" + +namespace paddle { +namespace framework { + +std::unordered_map& global_transfer_data_cache(); + +std::unordered_set& global_transfer_scope_cache(); + +// Combine two hash values to a single hash. +static size_t CombineHash(size_t seed, size_t a) { + return (seed ^ a) + 0x9e3779b9 + (seed << 6) + (seed >> 2); +} + +Scope* TryCreateTransferScope(OpKernelType type0, OpKernelType type1, + const Scope* scope); + +void RemoveKidsFromTransferScopeCache(Scope* scope); + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/type_defs.h b/paddle/fluid/framework/type_defs.h index e099e40f121ff13657e563eb608feecbca0551be..4ae6a272d5b043f25015ad8d5cfc2139d394ed5c 100644 --- a/paddle/fluid/framework/type_defs.h +++ b/paddle/fluid/framework/type_defs.h @@ -27,16 +27,21 @@ namespace framework { class OperatorBase; class OpDesc; class InferShapeContext; +class InferVarTypeContext; class BlockDesc; +class Variable; +class NoNeedBufferVarsInference; using VariableNameMap = std::map>; +// TODO(panyx0718): Replace vector with something like gtl::Vector. +using VariableValueMap = std::map>; // The order should be as same as framework.proto using Attribute = boost::variant, std::vector, std::vector, bool, std::vector, BlockDesc*, int64_t, - std::vector>; + std::vector, std::vector>; using AttributeMap = std::unordered_map; @@ -50,9 +55,16 @@ using GradOpMakerFN = std::function>( const std::vector& grad_block)>; using InferVarTypeFN = - std::function; + std::function; using InferShapeFN = std::function; +using InplacePair = std::unordered_map; +using InferInplaceOpFN = std::function; + +using InferNoNeedBufferVarsFN = std::function( + const VariableNameMap& /*inputs*/, const VariableNameMap& /*outputs*/, + const AttributeMap& /*attrs*/)>; + } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/unroll_array_ops.h b/paddle/fluid/framework/unroll_array_ops.h new file mode 100644 index 0000000000000000000000000000000000000000..731da74eff4d22da6730e589a1af919514f1c4b7 --- /dev/null +++ b/paddle/fluid/framework/unroll_array_ops.h @@ -0,0 +1,179 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include "paddle/fluid/platform/hostdevice.h" + +namespace paddle { +namespace framework { + +namespace detail { + +template +struct UnrollFillConstant { + template + HOSTDEVICE inline static void Run(T *data, T val) { + data[kStart] = val; + UnrollFillConstant::Run(data, val); + } +}; + +template +struct UnrollFillConstant { + template + HOSTDEVICE inline static void Run(T *data, T val) {} +}; + +template +struct UnrollAssign { + template + HOSTDEVICE inline static void Run(const Tin *d1, Tout *d2) { + d2[kStart] = static_cast(d1[kStart]); + UnrollAssign::Run(d1, d2); + } +}; + +template +struct UnrollAssign { + template + HOSTDEVICE inline static void Run(const Tin *d1, Tout *d2) {} +}; + +template +struct UnrollVarArgsAssignImpl { + template + HOSTDEVICE inline static void Run(T *d, T val, Args... args) { + static_assert(sizeof...(args) + 1 == kEnd - kStart, "Wrong argument"); + d[kStart] = val; + UnrollVarArgsAssignImpl::Run( + d, args...); + } +}; + +template +struct UnrollVarArgsAssignImpl { + HOSTDEVICE inline static void Run(T *d) {} +}; + +template +struct UnrollVarArgsAssign { + template + HOSTDEVICE inline static void Run(T *d, Args... args) { + UnrollVarArgsAssignImpl::Run( + d, args...); + } +}; + +template +struct UnrollCompare { + template + HOSTDEVICE inline static bool Run(const T *d1, const T *d2) { + return d1[kStart] == d2[kStart] && + UnrollCompare::Run(d1, d2); + } +}; + +template +struct UnrollCompare { + template + HOSTDEVICE inline constexpr static bool Run(const T *d1, const T *d2) { + return true; + } +}; + +template +struct UnrollAdd { + template + HOSTDEVICE inline static void Run(const T *d1, const T *d2, T *d3) { + d3[kStart] = d1[kStart] + d2[kStart]; + UnrollAdd::Run(d1, d2, d3); + } +}; + +template +struct UnrollAdd { + template + HOSTDEVICE inline static void Run(const T *d1, const T *d2, T *d3) {} +}; + +template +struct UnrollMul { + template + HOSTDEVICE inline static void Run(const T *d1, const T *d2, T *d3) { + d3[kStart] = d1[kStart] * d2[kStart]; + UnrollMul::Run(d1, d2, d3); + } +}; + +template +struct UnrollMul { + template + HOSTDEVICE inline static void Run(const T *d1, const T *d2, T *d3) {} +}; + +template +struct UnrollProduct { + template + HOSTDEVICE inline static T Run(const T *d) { + return d[kStart] * + UnrollProduct::Run(d); + } + + template + HOSTDEVICE inline static T Run(const T *d1, const T *d2) { + return d1[kStart] * d2[kStart] + + UnrollProduct::Run(d1, d2); + } +}; + +template +struct UnrollProduct { + template + HOSTDEVICE inline constexpr static T Run(const T *d) { + return 1; + } + + template + HOSTDEVICE inline constexpr static T Run(const T *d1, const T *d2) { + return 0; + } +}; + +} // namespace detail + +template +using UnrollFillConstant = detail::UnrollFillConstant<0, N, N == 0>; + +template +using UnrollAssign = detail::UnrollAssign<0, N, N == 0>; + +template +using UnrollVarArgsAssign = detail::UnrollVarArgsAssign; + +template +using UnrollCompare = detail::UnrollCompare<0, N, N == 0>; + +template +using UnrollAdd = detail::UnrollAdd<0, N, N == 0>; + +template +using UnrollMul = detail::UnrollMul<0, N, N == 0>; + +template +using UnrollProduct = detail::UnrollProduct<0, N, N == 0>; + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/unroll_array_ops_test.cc b/paddle/fluid/framework/unroll_array_ops_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..51433c83c801765d8df10590abdd319ba60e4873 --- /dev/null +++ b/paddle/fluid/framework/unroll_array_ops_test.cc @@ -0,0 +1,108 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/unroll_array_ops.h" +#include +#include +#include +#include + +namespace paddle { +namespace framework { + +template +bool CheckEquality(const T* p, size_t n, T val) { + return std::all_of(p, p + n, [val](const T& v) { return v == val; }); +} + +template +bool FillConstantTestMain() { + static_assert(D1 >= D2, ""); + std::array arr; + arr.fill(0); + + UnrollFillConstant::Run(arr.data(), 1); + return CheckEquality(arr.data(), D2, 1) && + CheckEquality(arr.data() + D2, arr.size() - D2, 0); +} + +TEST(unroll_ops, fill_constant) { + EXPECT_TRUE((FillConstantTestMain<9, 0>())); + EXPECT_TRUE((FillConstantTestMain<9, 1>())); + EXPECT_TRUE((FillConstantTestMain<9, 4>())); + EXPECT_TRUE((FillConstantTestMain<9, 9>())); +} + +TEST(unroll_ops, assign) { + const int a[] = {1, 2, 3, 4, 5}; + int b[] = {0, 0, 0, 0, 0}; + UnrollAssign<3>::Run(a, b); + EXPECT_EQ(b[0], 1); + EXPECT_EQ(b[1], 2); + EXPECT_EQ(b[2], 3); + EXPECT_EQ(b[3], 0); + EXPECT_EQ(b[4], 0); +} + +TEST(unroll_ops, var_args_assign) { + int a[] = {0, 0, 0}; + UnrollVarArgsAssign::Run(a, 1, 2); + EXPECT_EQ(a[0], 1); + EXPECT_EQ(a[1], 2); + EXPECT_EQ(a[2], 0); +} + +TEST(unroll_ops, compare) { + int a[] = {1, 2, 3}; + int b[] = {1, 2, 4}; + EXPECT_TRUE(UnrollCompare<2>::Run(a, b)); + EXPECT_FALSE(UnrollCompare<3>::Run(a, b)); + + b[0] = -1; + EXPECT_TRUE(UnrollCompare<0>::Run(a, b)); + EXPECT_FALSE(UnrollCompare<1>::Run(a, b)); +} + +TEST(unroll_ops, add) { + int a[] = {2, 3, 4}; + int b[] = {5, 10, 102}; + int c[] = {0, 0, 0}; + UnrollAdd<2>::Run(a, b, c); + EXPECT_EQ(a[0] + b[0], c[0]); + EXPECT_EQ(a[1] + b[1], c[1]); + EXPECT_EQ(c[2], 0); +} + +TEST(unroll_ops, mul) { + int a[] = {2, 3, 4}; + int b[] = {5, 10, 102}; + int c[] = {0, 0, 0}; + UnrollMul<2>::Run(a, b, c); + EXPECT_EQ(a[0] * b[0], c[0]); + EXPECT_EQ(a[1] * b[1], c[1]); + EXPECT_EQ(c[2], 0); +} + +TEST(unroll_ops, product) { + int a[] = {2, 3, 4}; + int b[] = {5, 10, 102}; + + EXPECT_EQ(UnrollProduct<3>::Run(a), a[0] * a[1] * a[2]); + + EXPECT_EQ(UnrollProduct<3>::Run(a, b), + a[0] * b[0] + a[1] * b[1] + a[2] * b[2]); +} + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/var_type.h b/paddle/fluid/framework/var_type.h index 3b6f1cdb8f24ab20bfc80eeeba88891d7b41d1f9..73be446f71f193bea203c986b482e6b98a9826c5 100644 --- a/paddle/fluid/framework/var_type.h +++ b/paddle/fluid/framework/var_type.h @@ -19,52 +19,50 @@ limitations under the License. */ #include "paddle/fluid/framework/lod_tensor_array.h" #include "paddle/fluid/framework/reader.h" #include "paddle/fluid/framework/selected_rows.h" +#include "paddle/fluid/framework/var_type_traits.h" #include "paddle/fluid/framework/variable.h" namespace paddle { namespace framework { template -inline bool IsType(const std::type_index& type_index) { - return type_index == std::type_index(typeid(T)); +inline bool IsType(const std::type_index& type) { + return type == typeid(T); } -inline proto::VarType::Type ToVarType(std::type_index type) { - if (IsType(type)) { - return proto::VarType_Type_LOD_TENSOR; - } else if (IsType(type)) { - return proto::VarType_Type_LOD_RANK_TABLE; - } else if (IsType(type)) { - return proto::VarType_Type_LOD_TENSOR_ARRAY; - } else if (IsType(type)) { - return proto::VarType_Type_SELECTED_ROWS; - } else if (IsType(type)) { - return proto::VarType_Type_READER; - } else { - PADDLE_THROW("ToVarType:Unsupported type %s", type.name()); +inline proto::VarType::Type ToVarType(int type) { + switch (type) { + case proto::VarType::LOD_TENSOR: + case proto::VarType::SELECTED_ROWS: + case proto::VarType::LOD_RANK_TABLE: + case proto::VarType::LOD_TENSOR_ARRAY: + case proto::VarType::READER: + return static_cast(type); + default: + PADDLE_THROW("ToVarType:Unsupported type %d", type); } } template inline void VisitVarType(const framework::Variable& var, Visitor visitor) { - switch (ToVarType(var.Type())) { - case proto::VarType_Type_LOD_TENSOR: + switch (var.Type()) { + case proto::VarType::LOD_TENSOR: visitor(var.Get()); return; - case proto::VarType_Type_LOD_RANK_TABLE: + case proto::VarType::LOD_RANK_TABLE: visitor(var.Get()); return; - case proto::VarType_Type_LOD_TENSOR_ARRAY: + case proto::VarType::LOD_TENSOR_ARRAY: visitor(var.Get()); return; - case proto::VarType_Type_SELECTED_ROWS: + case proto::VarType::SELECTED_ROWS: visitor(var.Get()); return; - case proto::VarType_Type_READER: + case proto::VarType::READER: visitor(var.Get()); return; default: - PADDLE_THROW("Not supported visit type, %d", ToVarType(var.Type())); + PADDLE_THROW("Not supported visit type, %s", ToTypeName(var.Type())); } } diff --git a/paddle/fluid/framework/var_type_inference.h b/paddle/fluid/framework/var_type_inference.h index f3035cd712bdea517068b4c172bb2794d5fccddb..2e9c64d3e6854bf70c0aee06128b9f1b7c8c7439 100644 --- a/paddle/fluid/framework/var_type_inference.h +++ b/paddle/fluid/framework/var_type_inference.h @@ -13,15 +13,139 @@ See the License for the specific language governing permissions and limitations under the License. */ #pragma once +#include +#include +#include +#include "paddle/fluid/framework/block_desc.h" +#include "paddle/fluid/framework/op_desc.h" #include "paddle/fluid/framework/type_defs.h" namespace paddle { namespace framework { +class OpDesc; +class BlockDesc; +// default infer var type context +class InferVarTypeContext { + public: + InferVarTypeContext(const OpDesc* op, BlockDesc* block) + : op_(op), block_(block) {} + + virtual ~InferVarTypeContext() {} + + virtual Attribute GetAttr(const std::string& name) const { + PADDLE_ENFORCE_NOT_NULL(op_); + return op_->GetAttr(name); + } + + virtual bool HasVar(const std::string& name) const { + PADDLE_ENFORCE_NOT_NULL(block_); + return block_->FindVarRecursive(name) != nullptr; + } + + virtual bool HasInput(const std::string& name) const { + PADDLE_ENFORCE_NOT_NULL(op_); + return op_->Inputs().count(name) > 0; + } + + virtual bool HasOutput(const std::string& name) const { + PADDLE_ENFORCE_NOT_NULL(op_); + return op_->Outputs().count(name) > 0; + } + + virtual const std::vector& Input(const std::string& name) const { + PADDLE_ENFORCE_NOT_NULL(op_); + return op_->Input(name); + } + + virtual const std::vector& Output( + const std::string& name) const { + PADDLE_ENFORCE_NOT_NULL(op_); + return op_->Output(name); + } + + virtual proto::VarType::Type GetType(const std::string& name) const { + PADDLE_ENFORCE_NOT_NULL(block_); + return block_->FindRecursiveOrCreateVar(name).GetType(); + } + + virtual void SetType(const std::string& name, proto::VarType::Type type) { + PADDLE_ENFORCE_NOT_NULL(block_); + block_->FindRecursiveOrCreateVar(name).SetType(type); + } + + virtual proto::VarType::Type GetDataType(const std::string& name) const { + PADDLE_ENFORCE_NOT_NULL(block_); + return block_->FindRecursiveOrCreateVar(name).GetDataType(); + } + + virtual void SetDataType(const std::string& name, proto::VarType::Type type) { + PADDLE_ENFORCE_NOT_NULL(block_); + block_->FindRecursiveOrCreateVar(name).SetDataType(type); + } + + virtual std::vector GetDataTypes( + const std::string& name) const { + PADDLE_ENFORCE_NOT_NULL(block_); + return block_->FindRecursiveOrCreateVar(name).GetDataTypes(); + } + + virtual void SetDataTypes( + const std::string& name, + const std::vector& multiple_data_type) { + PADDLE_ENFORCE_NOT_NULL(block_); + block_->FindRecursiveOrCreateVar(name).SetDataTypes(multiple_data_type); + } + + virtual std::vector GetShape(const std::string& name) const { + PADDLE_ENFORCE_NOT_NULL(block_); + return block_->FindRecursiveOrCreateVar(name).GetShape(); + } + + virtual void SetShape(const std::string& name, + const std::vector& dims) { + PADDLE_ENFORCE_NOT_NULL(block_); + block_->FindRecursiveOrCreateVar(name).SetShape(dims); + } + + virtual int32_t GetLoDLevel(const std::string& name) const { + PADDLE_ENFORCE_NOT_NULL(block_); + return block_->FindRecursiveOrCreateVar(name).GetLoDLevel(); + } + + virtual void SetLoDLevel(const std::string& name, int32_t lod_level) { + PADDLE_ENFORCE_NOT_NULL(block_); + block_->FindRecursiveOrCreateVar(name).SetLoDLevel(lod_level); + } + + protected: + const OpDesc* op_; + BlockDesc* block_; +}; + class VarTypeInference { public: virtual ~VarTypeInference() {} - virtual void operator()(const OpDesc& op_desc, BlockDesc* block) const = 0; + virtual void operator()(InferVarTypeContext* context) const = 0; // NOLINT +}; + +class PassInDtypeAndVarTypeToOutput : public framework::VarTypeInference { + public: + void operator()(framework::InferVarTypeContext* ctx) const final { // NOLINT + auto in_out_var_names = this->GetInputOutputWithSameType(); + + for (auto& i_o_n : in_out_var_names) { + auto& x_name = ctx->Input(i_o_n.first).at(0); + auto& out_name = ctx->Output(i_o_n.second).at(0); + + ctx->SetType(out_name, ctx->GetType(x_name)); + ctx->SetDataType(out_name, ctx->GetDataType(x_name)); + } + } + + protected: + virtual std::unordered_map + GetInputOutputWithSameType() const = 0; }; } // namespace framework diff --git a/paddle/fluid/framework/var_type_inference_test.cc b/paddle/fluid/framework/var_type_inference_test.cc index 7842168f603885ce7dc87d2a01dfa4f544389faa..6bbb25a573d076d5ec6d6fd960a304639e9e3d49 100644 --- a/paddle/fluid/framework/var_type_inference_test.cc +++ b/paddle/fluid/framework/var_type_inference_test.cc @@ -44,20 +44,20 @@ class SumOpMaker : public OpProtoAndCheckerMaker { class SumOpVarTypeInference : public VarTypeInference { public: - void operator()(const OpDesc &op_desc, BlockDesc *block) const override { - auto &inputs = op_desc.Input("X"); + void operator()(framework::InferVarTypeContext *ctx) const override { + auto &inputs = ctx->Input("X"); auto default_var_type = proto::VarType::SELECTED_ROWS; bool any_input_is_lod_tensor = std::any_of( - inputs.begin(), inputs.end(), [block](const std::string &name) { - return block->Var(name)->GetType() == proto::VarType::LOD_TENSOR; + inputs.begin(), inputs.end(), [&ctx](const std::string &name) { + return ctx->GetType(name) == proto::VarType::LOD_TENSOR; }); if (any_input_is_lod_tensor) { default_var_type = proto::VarType::LOD_TENSOR; } - auto out_var_name = op_desc.Output("Out").front(); - block->Var(out_var_name)->SetType(default_var_type); + auto out_var_name = ctx->Output("Out").front(); + ctx->SetType(out_var_name, default_var_type); } }; } // namespace framework @@ -108,7 +108,7 @@ TEST(InferVarType, sum_op_without_infer_var_type) { op->InferVarType(prog.MutableBlock(0)); - ASSERT_EQ(proto::VarType_Type_LOD_TENSOR, + ASSERT_EQ(proto::VarType::LOD_TENSOR, prog.MutableBlock(0)->Var("test2_out")->GetType()); } diff --git a/paddle/fluid/framework/var_type_traits.cc b/paddle/fluid/framework/var_type_traits.cc new file mode 100644 index 0000000000000000000000000000000000000000..a37b1fbab8cfd0642beaf725c02941002b2176b3 --- /dev/null +++ b/paddle/fluid/framework/var_type_traits.cc @@ -0,0 +1,121 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/var_type_traits.h" +#include "paddle/fluid/framework/lod_rank_table.h" +#include "paddle/fluid/framework/reader.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/framework/selected_rows.h" +#include "paddle/fluid/operators/reader/lod_tensor_blocking_queue.h" +#include "paddle/fluid/platform/macros.h" +#ifdef PADDLE_WITH_CUDA +#ifndef _WIN32 +#include "paddle/fluid/operators/nccl/nccl_gpu_common.h" +#endif +#include +#include "paddle/fluid/operators/conv_cudnn_op_cache.h" +#include "paddle/fluid/operators/cudnn_rnn_cache.h" +#endif + +namespace paddle { +namespace framework { + +// Besides registering variable type id, it is helpful to register a +// var_id -> std::type_index map (for example, get type names according to id) +namespace detail { + +template +struct VarIdToTypeIndexMapInitializerImpl { + template + static void Init(MapType1 *id_to_type, MapType2 *type_to_id) { + using Type = + typename std::tuple_element::type; + static_assert(!std::is_same::value, "Type cannot be void"); + constexpr int kId = VarTypeTrait::kId; + auto type = std::type_index(typeid(Type)); + PADDLE_ENFORCE(id_to_type->count(kId) == 0, + "Registered duplicate type id %d for type %s", kId, + type.name()); + PADDLE_ENFORCE(type_to_id->count(type) == 0, + "Registered duplicate type_index %s for id %d", type.name(), + kId); + id_to_type->emplace(kId, type); + type_to_id->emplace(type, kId); + VarIdToTypeIndexMapInitializerImpl::Init(id_to_type, + type_to_id); + } +}; + +template +struct VarIdToTypeIndexMapInitializerImpl { + template + static void Init(MapType1 *, MapType2 *) {} +}; + +// VarIdToTypeIndexMapInitializer is designed to initialize var_id -> +// std::type_index map and std::type_index -> var_id map +using VarIdToTypeIndexMapInitializer = + VarIdToTypeIndexMapInitializerImpl<0, VarTypeRegistry::kRegisteredTypeNum, + VarTypeRegistry::kRegisteredTypeNum == + 0>; + +struct VarIdToTypeIndexMapHolder { + DISABLE_COPY_AND_ASSIGN(VarIdToTypeIndexMapHolder); + + public: + static const std::type_index &ToTypeIndex(int var_id) { + auto it = Instance().id_to_type_map_.find(var_id); + PADDLE_ENFORCE(it != Instance().id_to_type_map_.end(), + "VarId %d is not registered.", var_id); + return it->second; + } + + static int ToTypeId(const std::type_index &type) { + auto it = Instance().type_to_id_map_.find(type); + PADDLE_ENFORCE(it != Instance().type_to_id_map_.end(), + "VarType %s is not registered.", type.name()); + return it->second; + } + + private: + VarIdToTypeIndexMapHolder() { + VarIdToTypeIndexMapInitializer::Init(&id_to_type_map_, &type_to_id_map_); + } + + static const VarIdToTypeIndexMapHolder &Instance() { + static const VarIdToTypeIndexMapHolder instance; + return instance; + } + + std::unordered_map id_to_type_map_; + std::unordered_map type_to_id_map_; +}; + +} // namespace detail + +const std::type_index &VarTraitIdToTypeIndex(int var_id) { + return detail::VarIdToTypeIndexMapHolder::ToTypeIndex(var_id); +} + +const char *ToTypeName(int var_id) { + return VarTraitIdToTypeIndex(var_id).name(); +} + +int TypeIndexToVarTraitId(const std::type_index &type) { + return detail::VarIdToTypeIndexMapHolder::ToTypeId(type); +} + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/var_type_traits.h b/paddle/fluid/framework/var_type_traits.h new file mode 100644 index 0000000000000000000000000000000000000000..fa77b96a7bdfa28ed982db022e8e5ecaef0b443c --- /dev/null +++ b/paddle/fluid/framework/var_type_traits.h @@ -0,0 +1,190 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include +#include "paddle/fluid/framework/framework.pb.h" +#include "paddle/fluid/framework/lod_tensor_array.h" +#include "paddle/fluid/platform/place.h" +#ifdef PADDLE_WITH_CUDA +#include +#ifndef _WIN32 +#include +#endif +#endif + +// Users should add forward declarations here +namespace paddle { + +namespace platform { +#ifdef PADDLE_WITH_CUDA +#ifndef _WIN32 +class Communicator; +#endif +#endif +} // namespace platform + +namespace framework { +class Tensor; +class LoDTensor; +class SelectedRows; +class LoDRankTable; +class ReaderHolder; +class Scope; +} // namespace framework + +namespace operators { + +class CudnnRNNCache; + +namespace reader { +class LoDTensorBlockingQueueHolder; +} // namespace reader +} // namespace operators + +} // namespace paddle + +namespace paddle { +namespace framework { + +const char *ToTypeName(int var_id); +const std::type_index &VarTraitIdToTypeIndex(int var_id); +int TypeIndexToVarTraitId(const std::type_index &type); + +namespace detail { + +template +struct TypePosFinderImpl { + static constexpr int kPos = + std::is_same::value + ? kStart + : TypePosFinderImpl::kPos; +}; + +template +struct TypePosFinderImpl { + static constexpr int kPos = std::is_same::value ? kStart : -1; +}; + +// TypePosFinder helps to find the position in which T is inside Args... +// If T is not inside Args..., kPos would be -1 +template +struct TypePosFinder { + static constexpr int kPos = + TypePosFinderImpl::kPos; +}; + +template +struct VarTypeRegistryImpl { + static constexpr size_t kRegisteredTypeNum = sizeof...(Args); + using ArgTuple = std::tuple; + + // TypePos() returns the position in which T is inside Args... + // If T is not inside Args..., return -1 + template + static constexpr int TypePos() { + return TypePosFinder::kPos; + } + + // IsRegistered() returns whether T is registered inside RegistryImpl + template + static constexpr bool IsRegistered() { + return TypePos() >= 0; + } +}; + +} // namespace detail + +#define REG_PROTO_VAR_TYPE_TRAIT(type, proto_id) \ + template <> \ + struct VarTypeTrait { \ + static_assert(VarTypeRegistry::IsRegistered(), \ + "Must be registered type"); \ + using Type = type; \ + static constexpr int kId = static_cast(proto_id); \ + } + +/** + * The following codes are designed to register variable types. + * Only registered types can be stored in Variable. + * This registry mechanism is designed to speed up Variable. + * + * Caution: If you want to add more var types, please consider carefully + * whether you really need to add it. + */ + +// Users should add other variable types below. +// Paddle would generate unique Ids for each registered variable types. +using VarTypeRegistry = detail::VarTypeRegistryImpl< + Tensor, LoDTensor, SelectedRows, std::vector, LoDRankTable, + LoDTensorArray, platform::PlaceList, ReaderHolder, std::string, Scope *, + std::map, operators::reader::LoDTensorBlockingQueueHolder, +#ifdef PADDLE_WITH_CUDA +#ifndef _WIN32 + ncclUniqueId, platform::Communicator, +#endif + operators::CudnnRNNCache, +#endif + int, float>; + +template +struct VarTypeTrait { + static_assert(VarTypeRegistry::IsRegistered(), "Must be registered type"); + using Type = T; + /** + * Unique VarType Id generation. + * + * The auto-generated id should not be the same as any protobuf id defined in + * framework.proto. Therefore, we generate id by adding the type pos and + * maximum protobuf id (i.e., proto::VarType::TUPLE). + * + * However, we may need more protobuf id in the future. + * To avoid changing this auto id generation algorithm frequently, we + * generate id by adding the type pos and twice of maximum protobuf id (i.e., + * proto::VarType::TUPLE). + */ + static constexpr int kId = VarTypeRegistry::TypePos() + + static_cast(proto::VarType::TUPLE) * 2; +}; + +// Users should set some of variable type ids to be what is defined in +// framework.proto below +REG_PROTO_VAR_TYPE_TRAIT(LoDTensor, proto::VarType::LOD_TENSOR); +REG_PROTO_VAR_TYPE_TRAIT(SelectedRows, proto::VarType::SELECTED_ROWS); +REG_PROTO_VAR_TYPE_TRAIT(std::vector, proto::VarType::STEP_SCOPES); +REG_PROTO_VAR_TYPE_TRAIT(LoDRankTable, proto::VarType::LOD_RANK_TABLE); +REG_PROTO_VAR_TYPE_TRAIT(LoDTensorArray, proto::VarType::LOD_TENSOR_ARRAY); +REG_PROTO_VAR_TYPE_TRAIT(platform::PlaceList, proto::VarType::PLACE_LIST); +REG_PROTO_VAR_TYPE_TRAIT(ReaderHolder, proto::VarType::READER); +REG_PROTO_VAR_TYPE_TRAIT(int, proto::VarType::INT32); +REG_PROTO_VAR_TYPE_TRAIT(float, proto::VarType::FP32); + +/** End of variable type registration */ + +template +inline constexpr bool IsRegisteredVarType() { + return VarTypeRegistry::IsRegistered(); +} + +#undef REG_PROTO_VAR_TYPE_TRAIT +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/var_type_traits_test.cc b/paddle/fluid/framework/var_type_traits_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..a47275e1ca25a4f66e67b4986ec78e49ea952a51 --- /dev/null +++ b/paddle/fluid/framework/var_type_traits_test.cc @@ -0,0 +1,121 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include +#include +#include + +#include "paddle/fluid/framework/lod_rank_table.h" +#include "paddle/fluid/framework/reader.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/framework/selected_rows.h" +#include "paddle/fluid/framework/var_type_traits.h" +#include "paddle/fluid/operators/reader/lod_tensor_blocking_queue.h" +#ifdef PADDLE_WITH_CUDA +#ifndef _WIN32 +#include "paddle/fluid/operators/nccl/nccl_gpu_common.h" +#endif +#include "paddle/fluid/operators/conv_cudnn_op_cache.h" +#include "paddle/fluid/operators/cudnn_rnn_cache.h" +#endif + +namespace paddle { +namespace framework { + +template +struct TypeIndexChecker { + template + static void Check(SetType1 *var_id_set, SetType2 *type_index_set) { + using Type = + typename std::tuple_element::type; + static_assert(std::is_same::Type, Type>::value, + "Type must be the same"); + constexpr auto kId = VarTypeTrait::kId; + std::type_index actual_type(typeid(Type)); + EXPECT_EQ(std::string(ToTypeName(kId)), std::string(actual_type.name())); + EXPECT_EQ(VarTraitIdToTypeIndex(kId), actual_type); + EXPECT_EQ(TypeIndexToVarTraitId(actual_type), kId); + EXPECT_EQ(VarTraitIdToTypeIndex(TypeIndexToVarTraitId(actual_type)), + actual_type); + EXPECT_EQ(TypeIndexToVarTraitId(VarTraitIdToTypeIndex(kId)), kId); + + EXPECT_TRUE(var_id_set->count(kId) == 0); // NOLINT + EXPECT_TRUE(type_index_set->count(actual_type) == 0); // NOLINT + var_id_set->insert(kId); + type_index_set->insert(std::type_index(typeid(Type))); + TypeIndexChecker::Check(var_id_set, + type_index_set); + } +}; + +template +struct TypeIndexChecker { + template + static void Check(SetType1 *, SetType2 *) {} +}; + +TEST(var_type_traits, check_no_duplicate_registry) { + constexpr size_t kRegisteredNum = VarTypeRegistry::kRegisteredTypeNum; + std::unordered_set var_id_set; + std::unordered_set type_index_set; + TypeIndexChecker<0, kRegisteredNum, kRegisteredNum == 0>::Check( + &var_id_set, &type_index_set); +} + +template +bool CheckVarId(int proto_id) { + static_assert(std::is_same::Type, T>::value, + "Type must be the same"); + return VarTypeTrait::kId == proto_id; +} + +TEST(var_type_traits, check_proto_type_id) { + ASSERT_TRUE(CheckVarId(proto::VarType::LOD_TENSOR)); + ASSERT_TRUE(CheckVarId(proto::VarType::SELECTED_ROWS)); + ASSERT_TRUE(CheckVarId>(proto::VarType::STEP_SCOPES)); + ASSERT_TRUE(CheckVarId(proto::VarType::LOD_RANK_TABLE)); + ASSERT_TRUE(CheckVarId(proto::VarType::LOD_TENSOR_ARRAY)); + ASSERT_TRUE(CheckVarId(proto::VarType::PLACE_LIST)); + ASSERT_TRUE(CheckVarId(proto::VarType::READER)); + ASSERT_TRUE(CheckVarId(proto::VarType::INT32)); + ASSERT_TRUE(CheckVarId(proto::VarType::FP32)); + + ASSERT_EQ(proto::VarType_Type_LOD_TENSOR, proto::VarType::LOD_TENSOR); + ASSERT_EQ(proto::VarType_Type_SELECTED_ROWS, proto::VarType::SELECTED_ROWS); + ASSERT_EQ(proto::VarType_Type_STEP_SCOPES, proto::VarType::STEP_SCOPES); + ASSERT_EQ(proto::VarType_Type_LOD_RANK_TABLE, proto::VarType::LOD_RANK_TABLE); + ASSERT_EQ(proto::VarType_Type_LOD_TENSOR_ARRAY, + proto::VarType::LOD_TENSOR_ARRAY); + ASSERT_EQ(proto::VarType_Type_PLACE_LIST, proto::VarType::PLACE_LIST); + ASSERT_EQ(proto::VarType_Type_READER, proto::VarType::READER); + ASSERT_EQ(proto::VarType_Type_FEED_MINIBATCH, proto::VarType::FEED_MINIBATCH); + ASSERT_EQ(proto::VarType_Type_FETCH_LIST, proto::VarType::FETCH_LIST); + ASSERT_EQ(proto::VarType_Type_RAW, proto::VarType::RAW); + ASSERT_EQ(proto::VarType_Type_TUPLE, proto::VarType::TUPLE); + ASSERT_EQ(proto::VarType_Type_INT32, proto::VarType::INT32); + ASSERT_EQ(proto::VarType_Type_FP32, proto::VarType::FP32); +} + +TEST(var_type_traits, test_registry) { + using Registry = detail::VarTypeRegistryImpl; + ASSERT_TRUE(Registry::TypePos() == 0); + ASSERT_TRUE(Registry::TypePos() == 1); + ASSERT_TRUE(Registry::TypePos() == 2); + ASSERT_TRUE(Registry::TypePos() == 3); + ASSERT_TRUE(Registry::TypePos() == -1); +} + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/variable.h b/paddle/fluid/framework/variable.h index 873e1b20a584df3ba90cf5c1a62a3879bf98ce5c..b9d07da822cf1eb42859e1d7d84437582fada8ff 100644 --- a/paddle/fluid/framework/variable.h +++ b/paddle/fluid/framework/variable.h @@ -18,7 +18,7 @@ #include #include -#include "paddle/fluid/platform/enforce.h" +#include "paddle/fluid/framework/var_type_traits.h" namespace paddle { namespace framework { @@ -27,10 +27,14 @@ class Variable { public: template const T& Get() const { + static_assert( + IsRegisteredVarType(), + "Not registered type. Please register T inside var_type_traits.h"); PADDLE_ENFORCE(holder_ != nullptr, "Variable must hold some thing"); - PADDLE_ENFORCE(IsType(), + PADDLE_ENFORCE(holder_->Type() == VarTypeTrait::kId, "Variable must be type %s, the holding type is %s", - typeid(T).name(), holder_->Type().name()); + ToTypeName(VarTypeTrait::kId), + ToTypeName(holder_->Type())); return *static_cast(holder_->Ptr()); } @@ -39,61 +43,61 @@ class Variable { template T* GetMutable() { if (!holder_) { - holder_.reset(new PlaceholderImpl(new T())); + holder_.reset(new PlaceholderImpl()); } else { - PADDLE_ENFORCE(IsType(), + PADDLE_ENFORCE(holder_->Type() == VarTypeTrait::kId, "Variable must be type %s, the holding type is %s", - typeid(T).name(), holder_->Type().name()); + ToTypeName(VarTypeTrait::kId), + ToTypeName(holder_->Type())); } return static_cast(holder_->Ptr()); } template bool IsType() const { - return holder_ != nullptr && - std::type_index(typeid(T)) == std::type_index(holder_->Type()); + return holder_ && holder_->Type() == VarTypeTrait::kId; } void Clear() { holder_.reset(); } - std::type_index Type() const { + int Type() const { PADDLE_ENFORCE(holder_ != nullptr, "Must hold memory"); return holder_->Type(); } private: struct Placeholder { - virtual ~Placeholder() {} - virtual const std::type_info& Type() const = 0; - virtual void* Ptr() const = 0; + virtual ~Placeholder() = default; + + inline int Type() const { return type_; } + inline const void* Ptr() const { return ptr_; } + inline void* Ptr() { return ptr_; } + + protected: + inline void Init(void* p, int type) { + ptr_ = p; + type_ = type; + } + + void* ptr_; + int type_; }; // Placeholder hides type T, so it doesn't appear as a template // parameter of Variable. template struct PlaceholderImpl : public Placeholder { - explicit PlaceholderImpl(T* ptr) : ptr_(ptr), type_(typeid(T)) {} - - virtual const std::type_info& Type() const { return type_; } - virtual void* Ptr() const { return static_cast(ptr_.get()); } + static_assert( + IsRegisteredVarType(), + "Not registered type. Please register T inside var_type_traits.h"); + PlaceholderImpl() { this->Init(&obj_, VarTypeTrait::kId); } - std::unique_ptr ptr_; - const std::type_info& type_; + private: + T obj_; }; - std::unique_ptr - holder_; // pointers to a PlaceholderImpl object indeed. - - // name_ is only meaningful with a Scope and accessible by it. - // - // NOTE: Please don't expose name_ by adding methods like - // Variable::Name or Scope::VarName! A variable could have a human - // readable name or an auto-generated scope-unique name. In the - // former case, the caller knows the name and doesn't need to access - // the name; in the latter case, the variable should be identified - // by its address but not the unreadable name. - friend class Scope; - const std::string* name_; + // pointers to a PlaceholderImpl object indeed. + std::unique_ptr holder_; }; } // namespace framework diff --git a/paddle/fluid/framework/variable_helper.cc b/paddle/fluid/framework/variable_helper.cc new file mode 100644 index 0000000000000000000000000000000000000000..65c939af173a8a2a22d69c636de355293f95dec6 --- /dev/null +++ b/paddle/fluid/framework/variable_helper.cc @@ -0,0 +1,83 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/variable_helper.h" + +#include + +#include "paddle/fluid/framework/feed_fetch_type.h" +#include "paddle/fluid/framework/lod_rank_table.h" +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/lod_tensor_array.h" +#include "paddle/fluid/framework/reader.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/framework/selected_rows.h" +#include "paddle/fluid/platform/place.h" + +namespace paddle { +namespace framework { + +void InitializeVariable(Variable *var, proto::VarType::Type var_type) { + if (var_type == proto::VarType::LOD_TENSOR) { + var->GetMutable(); + } else if (var_type == proto::VarType::SELECTED_ROWS) { + var->GetMutable(); + } else if (var_type == proto::VarType::FEED_MINIBATCH) { + var->GetMutable(); + } else if (var_type == proto::VarType::FETCH_LIST) { + var->GetMutable(); + } else if (var_type == proto::VarType::STEP_SCOPES) { + var->GetMutable>(); + } else if (var_type == proto::VarType::LOD_RANK_TABLE) { + var->GetMutable(); + } else if (var_type == proto::VarType::LOD_TENSOR_ARRAY) { + var->GetMutable(); + } else if (var_type == proto::VarType::PLACE_LIST) { + var->GetMutable(); + } else if (var_type == proto::VarType::READER) { + var->GetMutable(); + } else if (var_type == proto::VarType::RAW) { + // GetMutable will be called in operator + } else { + PADDLE_THROW( + "Variable type %d is not in " + "[LOD_TENSOR, SELECTED_ROWS, FEED_MINIBATCH, FETCH_LIST, " + "LOD_RANK_TABLE, PLACE_LIST, READER, RAW]", + var_type); + } +} + +void CopyVariable(const Variable &src_var, Variable *dst_var) { + // only support cpu now + auto cpu_place = platform::CPUPlace(); + + if (src_var.IsType()) { + auto *tmp_grad_tensor = dst_var->GetMutable(); + auto &src_tensor = src_var.Get(); + tmp_grad_tensor->set_lod(src_tensor.lod()); + framework::TensorCopy(src_tensor, cpu_place, tmp_grad_tensor); + } else if (src_var.IsType()) { + auto &src_slr = src_var.Get(); + auto *tmp_grad_slr = dst_var->GetMutable(); + tmp_grad_slr->set_rows(src_slr.rows()); + tmp_grad_slr->set_height(src_slr.height()); + auto &src_t = src_slr.value(); + auto *dst_t = tmp_grad_slr->mutable_value(); + framework::TensorCopy(src_t, cpu_place, dst_t); + } else { + PADDLE_THROW("unknown var type to copy"); + } +} +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/variable_helper.h b/paddle/fluid/framework/variable_helper.h new file mode 100644 index 0000000000000000000000000000000000000000..5a2c267b7388f6c2de89054dc480fd74b4544bed --- /dev/null +++ b/paddle/fluid/framework/variable_helper.h @@ -0,0 +1,25 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ +#pragma once + +#include "paddle/fluid/framework/framework.pb.h" +#include "paddle/fluid/framework/variable.h" +namespace paddle { +namespace framework { + +void InitializeVariable(Variable* var, proto::VarType::Type var_type); +void CopyVariable(const Variable& src_var, Variable* dst_var); + +} // end namespace framework +} // end namespace paddle diff --git a/paddle/fluid/framework/variable_test.cc b/paddle/fluid/framework/variable_test.cc index 003dcfd3dfe5ecfd563a686bb72b061aff602f73..511c9c52146ece4b90905cc9d49565103589c1ec 100644 --- a/paddle/fluid/framework/variable_test.cc +++ b/paddle/fluid/framework/variable_test.cc @@ -16,27 +16,28 @@ #include #include "gtest/gtest.h" +#include "paddle/fluid/framework/tensor.h" #include "paddle/fluid/framework/variable.h" -TEST(Variable, GetMutable) { - using paddle::framework::Variable; - - struct Tensor { - int content_; - }; +namespace paddle { +namespace framework { +TEST(Variable, GetMutable) { std::unique_ptr v(new Variable()); - Tensor* t = v->GetMutable(); - t->content_ = 1234; + auto* t = v->GetMutable(); + *t = "1234"; - const Tensor& tt = v->Get(); - EXPECT_EQ(1234, tt.content_); + const auto& tt = v->Get(); + EXPECT_EQ("1234", tt); try { - v->GetMutable(); + v->GetMutable(); } catch (std::exception& e) { return; } EXPECT_TRUE(false); } + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/imperative/CMakeLists.txt b/paddle/fluid/imperative/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..e52a0283f726640eb56b24a2978af6ee44e658ff --- /dev/null +++ b/paddle/fluid/imperative/CMakeLists.txt @@ -0,0 +1,9 @@ +if(WITH_PYTHON) +cc_library(layer SRCS layer.cc DEPS proto_desc operator device_context blas pybind) +cc_library(tracer SRCS tracer.cc DEPS proto_desc device_context pybind) +cc_library(engine SRCS engine.cc) +cc_library(imperative_profiler SRCS profiler.cc) +cc_library(nccl_context SRCS nccl_context.cc DEPS device_context) + +cc_test(nccl_context_test SRCS nccl_context_test.cc DEPS nccl_context) +endif() diff --git a/paddle/fluid/imperative/README.md b/paddle/fluid/imperative/README.md new file mode 100644 index 0000000000000000000000000000000000000000..4c4d619b35a9fd67231071ecca791c9df670fea1 --- /dev/null +++ b/paddle/fluid/imperative/README.md @@ -0,0 +1,212 @@ +# Overview + +Imperative Programming is easier to learn, debug and try new ideas. + +# Related Works + +## Pytorch +https://pytorch.org/ + +## TensorFlow Eager +https://www.tensorflow.org/guide/eager + +# Design + +## API +```python +class Layer(object): + + def __call__(inputs): + # build some parameter once. + # ... + return self.apply(inputs): + + def forward(inputs): + # forward logic with paddle operators. backward auto-generated. + + +class PyLayer(core.PyLayer): + + def __call__(cls, inputs): + # trace the logic. + + @staticmethod + def forward(inputs): + # any forward logic implemented with numpy io. + + @staticmethod + def backward(inputs): + # any backward logic implemented with numpy io. + +``` + + +## Tracer + +Current: Python Variable -> C++ VarBase -> C++ Variable -> C++ Tensor + +Longer term. +```python + +# Parent class. +class PyVarBase(object): + pass + +# Current python variable. +class Variable(PyVarBase): + pass + +class IVariable(PyVarBase): + def __init__(self): + self._ivar = core.VarBase() + + # Move var to a device. + def to(device): pass + # Get var value. + def value(): pass + # Trigger backward. + def backward(): pass + # Get var's gradient value. + def gradient_value(): pass + # operators to override. +``` + + + +```cpp +class Tracer { + public: + explicit Tracer(framework::BlockDesc* root_block) : root_block_(root_block) {} + + virtual ~Tracer() {} + + void Trace(OpBase* op, + const std::map>& inputs, + const std::map>& outputs, + framework::BlockDesc* block, const bool stop_gradient = false); + + std::vector PyTrace(OpBase* op, const std::vector& inputs, + bool stop_gradient = false); +}; +``` + +* Trace forward operations +* Perform quick shape/type infer, push kernel execution engine and return to user. +* Perform autograd to generate gradients. +* Clear trace. +* Apply gradients with optimizers + +## Autodiff + +Lots of research already. +https://autodiff-workshop.github.io/ +https://en.wikipedia.org/wiki/Automatic_differentiation + +Basically, trace the forward execution, and perform autodiff +when needed. + +* Can be triggered by `backward()`. +* Can select a block of code to trace and autodiff. +* Use `require_grad` to drop some forward subgraph that doesn't need autodiff. + +## Execution Engine + +Lazy execution of pushed C++ operations. + +## Device Placement + +* Operator executes on the inputs' device. +* All inputs should live on the same device. +* use `Var.to()` to explicitly move var to a device. + +## Save/Load Models + +TODO + +## I/O + +TODO + +## Refactor + +* All function layers with parameters converted to class Layers. +* Existing models converted to imperative mode. +* All op tests run once in static graph, once in imperative mode. + +# Examples + +```python +class MyLayer(fluid.imperative.Layer): + def __init__(self): + super(MyLayer, self).__init__() + + def forward(self, inputs): + x = fluid.layers.relu(inputs) + x = fluid.layers.elementwise_mul(x, x) + x = fluid.layers.reduce_sum(x) + return [x] + + +class MyPyLayer(fluid.imperative.PyLayer): + def __init__(self): + super(MyPyLayer, self).__init__() + + @staticmethod + def forward(inputs): + return np.tanh(inputs[0]) + + @staticmethod + def backward(inputs): + return np.array(dout) * (1 - np.square(np.array(out))) + + +np_inp = np.ones([2, 2], np.float32) +with fluid.imperative.guard(): + my_py_layer = MyPyLayer() + outs = my_py_layer(np_inp) + dy_out = np.sum(outs[0]._numpy()) + outs[0]._backward() + dy_grad = var_inp._gradient() + + +class MLP(fluid.imperative.Layer): + def __init__(self): + super(MLP, self).__init__() + self._fc1 = FC(3, + fluid.ParamAttr( + initializer=fluid.initializer.Constant(value=0.1))) + self._fc2 = FC(4, + fluid.ParamAttr( + initializer=fluid.initializer.Constant(value=0.1))) + + def forward(self, inputs): + x = self._fc1(inputs) + x = self._fc2(x) + x = fluid.layers.reduce_sum(x) + return x + + + np_inp = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32) + with fluid.imperative.guard(): + var_inp = fluid.imperative.base.to_variable(np_inp) + mlp = MLP() + out = mlp(var_inp) + dy_out = out._numpy() + out._backward() +``` + +# Plan + +2.1,3 fulltime, Can run a few simple models. (Currently, 2 20% engs) + +4.1, 4 fulltime, Can run 6 models, Performance 70% Pytorch. Release alpha. + +6.1, 5 fulltime, Performance close to Pytorch, can run multi-devices. Release Beta. + +8.1, 5 fulltime, Works in general. Update existing models. Can compile to static graph, support more optimizations. + +12.1 Done. + +# Discussion + +TODO. diff --git a/paddle/fluid/imperative/engine.cc b/paddle/fluid/imperative/engine.cc new file mode 100644 index 0000000000000000000000000000000000000000..de7ab0e5918281579728ef48d1517be2cd530af7 --- /dev/null +++ b/paddle/fluid/imperative/engine.cc @@ -0,0 +1,53 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/imperative/engine.h" + +#include // NOLINT +#include + +#include "glog/logging.h" + +namespace paddle { +namespace imperative { + +static std::once_flag init_engine; +static Engine* engine; + +class DummyEngine : public Engine { + public: + void Enqueue(Runnable* runnable) override { + queued_runnables_.push_back(runnable); + } + + size_t Size() const override { return queued_runnables_.size(); } + + void Sync() override { + for (Runnable* l : queued_runnables_) { + LOG(INFO) << "running " << reinterpret_cast(l); + } + queued_runnables_.clear(); + } + + private: + std::vector queued_runnables_; +}; + +Engine* GetEngine() { + std::call_once(init_engine, []() { engine = new DummyEngine(); }); + return engine; +} + +} // namespace imperative +} // namespace paddle diff --git a/paddle/fluid/imperative/engine.h b/paddle/fluid/imperative/engine.h new file mode 100644 index 0000000000000000000000000000000000000000..a1dfa5bda38d0c419aa4ccbea77b32eb7e0d5b23 --- /dev/null +++ b/paddle/fluid/imperative/engine.h @@ -0,0 +1,39 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include + +namespace paddle { +namespace imperative { + +struct Runnable {}; + +class Engine { + public: + virtual ~Engine() {} + + virtual void Enqueue(Runnable* runnable) = 0; + + virtual size_t Size() const = 0; + + virtual void Sync() = 0; +}; + +Engine* GetEngine(); + +} // namespace imperative +} // namespace paddle diff --git a/paddle/fluid/imperative/layer.cc b/paddle/fluid/imperative/layer.cc new file mode 100644 index 0000000000000000000000000000000000000000..bc03285a4c5fe6db2abf2b271d6ddc86e75a9412 --- /dev/null +++ b/paddle/fluid/imperative/layer.cc @@ -0,0 +1,426 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/imperative/layer.h" + +#include +#include +#include +#include +#include +#include + +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/framework/tensor_util.h" +#include "paddle/fluid/operators/math/blas.h" +#include "paddle/fluid/platform/device_context.h" +#include "paddle/fluid/string/printf.h" + +namespace paddle { +namespace imperative { + +const char* PyLayer::kFwdInp = "X"; +const char* PyLayer::kFwdOut = "Out"; + +std::map py_funcs_; + +using framework::Variable; + +namespace detail { + +template +class TensorAddToFunctor : public boost::static_visitor<> { + public: + TensorAddToFunctor(int64_t numel, const T* x, T* y) + : numel_(numel), x_(x), y_(y) {} + + void operator()(const platform::CPUPlace& place) { + platform::CPUDeviceContext* ctx = dynamic_cast( + platform::DeviceContextPool::Instance().Get(place)); + auto blas = operators::math::GetBlas(*ctx); + blas.AXPY(numel_, 1., x_, y_); + } + +#ifdef PADDLE_WITH_CUDA + void operator()(const platform::CUDAPlace& place) { + platform::CUDADeviceContext* ctx = + dynamic_cast( + platform::DeviceContextPool::Instance().Get(place)); + auto blas = operators::math::GetBlas(*ctx); + blas.AXPY(numel_, 1., x_, y_); + } +#else + void operator()(const platform::CUDAPlace& place) { + PADDLE_THROW("Do NOT support gradient merge in place %s", place); + } +#endif + + // there is NO blas in CUDAPinnedPlace + void operator()(const platform::CUDAPinnedPlace& place) { + PADDLE_THROW("Do NOT support gradient merge in place %s", place); + } + + private: + int64_t numel_; + const T* x_; + T* y_; +}; + +} // namespace detail + +void AddTo(Variable* src, Variable* dst, platform::Place place) { + framework::Tensor* dst_tensor = dst->GetMutable(); + framework::Tensor* src_tensor = src->GetMutable(); + + // FIXME(minqiyang): loss_grad op will pass a zero grad of label + // ugly fix for it + if (src_tensor->numel() == 0) { + return; + } + + PADDLE_ENFORCE(dst_tensor->numel() == src_tensor->numel(), + "dst_numel %lld vs. src_numel %lld", dst_tensor->numel(), + src_tensor->numel()); + + detail::TensorAddToFunctor func( + src_tensor->numel(), src_tensor->data(), + dst_tensor->mutable_data(place)); + boost::apply_visitor(func, place); +} + +class Autograd { + public: + Autograd() {} + + void RunBackward(VarBase* var) { + if (var->IsStopGradient()) { + return; + } + VLOG(3) << "start autograd"; + + std::deque ready; + ready.push_back(var->PreOp()); + + std::map dep_counts = ComputeDepCounts(var->PreOp()); + + while (!ready.empty()) { + OpBase* ready_op = ready.front(); + ready.pop_front(); + std::map> input_grads = + ready_op->ApplyGrad(); + + for (auto it = input_grads.rbegin(); it != input_grads.rend(); ++it) { + const std::vector& ingrads = it->second; + for (size_t i = 0; i < ingrads.size(); ++i) { + if (!ingrads[i]) continue; + if (ready_op->input_vars_[it->first][i]->IsStopGradient()) { + continue; + } + OpBase* pre_op = ready_op->pre_ops_[it->first][i]; + if (!pre_op) continue; + + dep_counts[pre_op] -= 1; + PADDLE_ENFORCE(dep_counts[pre_op] >= 0); + bool pre_op_ready = dep_counts[pre_op] == 0; + if (pre_op_ready) { + ready.push_back(pre_op); + } + } + } + + ready_op->InvokeBackwardHooks(); + } + } + + private: + std::map ComputeDepCounts(OpBase* op) { + std::map ret; + + std::deque queue; + queue.push_back(op); + std::unordered_set visited; + visited.insert(op); + while (!queue.empty()) { + OpBase* candidate = queue.front(); + queue.pop_front(); + for (auto it : candidate->pre_ops_) { + for (OpBase* pre_op : it.second) { + if (!pre_op) continue; + VLOG(5) << "op dep " << candidate->Type() << " trace id " + << candidate->trace_id_ << " <---- " << it.first << " <---- " + << pre_op->Type() << " trace id " << pre_op->trace_id_; + if (visited.find(pre_op) == visited.end()) { + visited.insert(pre_op); + queue.push_back(pre_op); + } + ret[pre_op] += 1; + } + } + } + return ret; + } +}; + +std::unique_ptr VarBase::NewVarBase(const platform::Place& dst_place, + const bool blocking) const { + PADDLE_ENFORCE(var_->IsInitialized(), + "Variable must be initialized when getting numpy tensor"); + + // TODO(minqiyang): change this after move unique_name generator to CXX + const framework::LoDTensor& self_tensor = var_->Get(); + std::unique_ptr new_var(new VarBase( + "Itmp", self_tensor.type(), self_tensor.dims(), dst_place, true, false)); + framework::LoDTensor* tensor = + new_var->var_->GetMutable(); + tensor->set_lod(var_->Get().lod()); + + if (blocking) { + platform::DeviceContext* dev_ctx = + platform::DeviceContextPool::Instance().Get(dst_place); + + framework::TensorCopySync(var_->Get(), dst_place, + tensor); + + dev_ctx->Wait(); + } else { + framework::TensorCopy(var_->Get(), dst_place, tensor); + } + + if (platform::is_gpu_place(dst_place)) { + VLOG(3) << "copy tensor " << Name() << " from gpu"; + } + + return new_var; +} + +framework::LoDTensor& VarBase::GradValue() { + VLOG(3) << "get var grad " << Name(); + PADDLE_ENFORCE_NOT_NULL(grads_, + "Could not get grad value from no grad variable"); + return *(grads_->var_->GetMutable()); +} + +std::map> OpBase::ApplyGrad() { + PADDLE_ENFORCE(!grad_op_descs_.empty() || backward_id_ > 0, + "%s has no backward implementation", Type()); + + VLOG(3) << "apply op grad: " << Type(); + std::vector tmp_grad_outputs; + if (backward_id_ > 0) { + VLOG(3) << "py_layer_grad"; + tmp_grad_outputs.resize(1); + tmp_grad_outputs[0][framework::GradVarName(PyLayer::kFwdOut)] = + PyLayer::ApplyGrad( + backward_id_, + grad_input_vars_[0][framework::GradVarName(PyLayer::kFwdInp)]); + } else { + const size_t grad_op_count = grad_op_descs_.size(); + + tmp_grad_outputs.resize(grad_op_count); + for (size_t k = 0; k < grad_op_count; ++k) { + framework::OpDesc* grad_op_desc = grad_op_descs_[k]; + auto& grad_output_variable_map = grad_output_vars_[k]; + + VLOG(3) << "apply grad op " << grad_op_desc->Type(); + + // Allocate tmp grad output variable + for (const auto& it : grad_output_variable_map) { + auto& outputs = tmp_grad_outputs[k][it.first]; + outputs.reserve(it.second.size()); + for (size_t i = 0; i < it.second.size(); ++i) { + VarBase* origin_grad_var_base = it.second[i]; + + // Allocate a new variable + VarBase* tmp_grad_var_base = new VarBase( + string::Sprintf("%s@IGrad", origin_grad_var_base->Name()), + origin_grad_var_base->DataType(), origin_grad_var_base->Dims(), + place_, true, false); + outputs.emplace_back(tmp_grad_var_base); + } + } + + // No need to do compile time infer shape here. + // grad_op_desc_->InferShape(*block_); + // grad_op_desc->InferVarType(block_); + + std::unique_ptr opbase = + framework::OpRegistry::CreateOp(*grad_op_desc); + + auto& info = framework::OpInfoMap::Instance().Get(grad_op_desc->Type()); + if (info.infer_var_type_) { + RuntimeInferVarTypeContext infer_var_type_ctx( + &grad_input_vars_[k], &tmp_grad_outputs[k], &attrs_); + info.infer_var_type_(&infer_var_type_ctx); + } + + framework::OperatorWithKernel* op_kernel = + dynamic_cast(opbase.get()); + PADDLE_ENFORCE_NOT_NULL(op_kernel, "only support op with kernel"); + + // Run grad op + framework::VariableValueMap grad_invars_map; + framework::VariableValueMap grad_outvars_map; + + for (const auto& it : grad_input_vars_[k]) { + auto& grad_invars = grad_invars_map[it.first]; + grad_invars.reserve(it.second.size()); + for (const VarBase* grad_inp : it.second) { + PADDLE_ENFORCE_NOT_NULL(grad_inp->var_, "op %s input %s nullptr", + grad_op_desc->Type(), grad_inp->Name()); + + grad_invars.emplace_back(grad_inp->var_); + } + } + + for (const auto& it : tmp_grad_outputs[k]) { + auto& grad_outvars = grad_outvars_map[it.first]; + grad_outvars.reserve(it.second.size()); + for (VarBase* grad_out : it.second) { + PADDLE_ENFORCE_NOT_NULL(grad_out->var_, "op %s output %s nullptr", + grad_op_desc->Type(), grad_out->Name()); + + grad_outvars.emplace_back(grad_out->var_); + } + } + + framework::RuntimeContext ctx(grad_invars_map, grad_outvars_map); + framework::Scope scope; + PreparedOp p = PreparedOp::Prepare(ctx, *op_kernel, place_); + p.op.RuntimeInferShape(scope, place_, ctx); + p.func( + framework::ExecutionContext(p.op, scope, *p.dev_ctx, p.ctx, nullptr)); + } + } + + // Add tmp grad outputs to original grad vars + for (size_t k = 0; k < grad_output_vars_.size(); ++k) { + for (const auto& it : grad_output_vars_[k]) { + auto& outputs = tmp_grad_outputs[k][it.first]; + const auto& origin_outputs = it.second; + PADDLE_ENFORCE_EQ(outputs.size(), origin_outputs.size()); + + for (size_t i = 0; i < outputs.size(); ++i) { + framework::Variable* grad = outputs[i]->var_; + framework::Variable* orig_grad = origin_outputs[i]->var_; + VLOG(3) << "AddTo Called with orig_grad is: " + << origin_outputs[i]->name_ << " Grad to be added is " + << outputs[i]->name_; + AddTo(grad, orig_grad, place_); + delete grad; + } + } + } + + return input_vars_; +} + +void OpBase::InvokeBackwardHooks() { + VLOG(3) << "call backward hooks, hooks num: " << backward_hooks_.size(); + + // call backward hooks + for (py::object& callable : backward_hooks_) { + callable(this); + } +} + +void OpBase::RegisterBackwardHooks(const py::object& callable) { + VLOG(3) << "Register backward hooks " << trace_id_; + + // TODO(minqiyang): check the callable format + backward_hooks_.push_back(callable); +} + +void VarBase::RunBackward() { + if (!pre_op_) return; + + VLOG(3) << "start backward"; + auto grads_t = grads_->var_->GetMutable(); + operators::math::set_constant( + *(platform::DeviceContextPool::Instance().Get( + var_->GetMutable()->place())), + grads_t, 1.0); + + PADDLE_ENFORCE( + grads_ == + pre_op_->output_vars_[pre_op_out_name_][pre_op_out_idx_]->grads_); + Autograd().RunBackward(this); +} + +void PyLayer::RegisterFunc(int func_id, const py::object& py_func) { + py_funcs_[func_id] = py_func; +} + +int PyLayer::NumFuncs() { return py_funcs_.size(); } + +std::vector PyLayer::Apply( + int func_id, const std::vector& inputs) { + PADDLE_ENFORCE(py_funcs_.find(func_id) != py_funcs_.end()); + return CallPythonFunc(py_funcs_[func_id], inputs); +} + +std::vector PyLayer::ApplyGrad(int func_id, + const std::vector& inputs) { + PADDLE_ENFORCE(py_funcs_.find(func_id) != py_funcs_.end()); + auto rets = CallPythonFunc(py_funcs_[func_id], inputs); + + std::vector outs; + outs.reserve(rets.size()); + for (size_t i = 0U; i != rets.size(); ++i) { + outs.emplace_back(new VarBase( + string::Sprintf("%s_out_%d", framework::GradVarName(PyLayer::kFwdOut), + i), + rets[i], nullptr, true)); + } + + return outs; +} + +std::vector PyLayer::CallPythonFunc( + const py::object& callable, const std::vector& ins) { + py::gil_scoped_acquire guard; + py::tuple in_args(ins.size()); + for (size_t i = 0; i < ins.size(); ++i) { + const framework::LoDTensor& t = ins[i]->var_->Get(); + in_args[i] = t.IsInitialized() ? py::cast(t) : py::cast(nullptr); + } + VLOG(3) << "pyfunc in " << py::len(in_args); + + // TODO(panyx0718): Who owns the returned LoDTensor. + auto ret = callable(in_args); + auto ret_tuple = py::cast(ret); + size_t ret_num = py::len(ret_tuple); + std::vector outs; + outs.reserve(ret_num); + VLOG(3) << "pyfunc out " << ret_num; + for (size_t i = 0; i < ret_num; ++i) { + try { + auto* py_out_tensor = py::cast(ret_tuple[i]); + PADDLE_ENFORCE_NOT_NULL(py_out_tensor, + "Output tensor %d should not be nullptr", i); + auto* var = new framework::Variable(); + auto* tensor = var->GetMutable(); + tensor->ShareDataWith(*py_out_tensor); + tensor->set_lod(py_out_tensor->lod()); + outs.emplace_back(var); + } catch (py::cast_error&) { + PADDLE_THROW("The %d-th output must be LoDTensor", i); + } + } + return outs; +} + +} // namespace imperative +} // namespace paddle diff --git a/paddle/fluid/imperative/layer.h b/paddle/fluid/imperative/layer.h new file mode 100644 index 0000000000000000000000000000000000000000..72c548d5e92dec3ec2638904f508c2777ee327c6 --- /dev/null +++ b/paddle/fluid/imperative/layer.h @@ -0,0 +1,518 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +// clang-format off +#include "paddle/fluid/framework/python_headers.h" +// clang-format on + +#include // NOLINT +#include // NOLINT +#include // NOLINT +#include // NOLINT +#include // NOLINT + +#include "paddle/fluid/framework/op_desc.h" +#include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/framework/var_desc.h" +#include "paddle/fluid/framework/var_type_inference.h" +#include "paddle/fluid/platform/enforce.h" +#include "paddle/fluid/platform/device_context.h" +#include "paddle/fluid/operators/math/math_function.h" + +#include "paddle/fluid/imperative/type_defs.h" + +namespace paddle { +namespace imperative { + +class VarBase; + +namespace py = ::pybind11; + +class PreparedOp { + public: + PreparedOp(const framework::OperatorBase& op, + const framework::RuntimeContext& ctx, + framework::OperatorWithKernel::OpKernelFunc func, + platform::DeviceContext* dev_ctx, + std::vector* kernel_configs) + : op(op), + ctx(ctx), + func(func), + dev_ctx(dev_ctx), + kernel_configs(kernel_configs) {} + + static PreparedOp Prepare(const framework::RuntimeContext& ctx, + const framework::OperatorWithKernel& op, + const platform::Place& place) { + platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance(); + auto* dev_ctx = pool.Get(place); + + // check if op[type] has kernel registered. + auto& all_op_kernels = op.AllOpKernels(); + auto kernels_iter = all_op_kernels.find(op.Type()); + if (kernels_iter == all_op_kernels.end()) { + PADDLE_THROW( + "There are no kernels which are registered in the %s operator.", + op.Type()); + } + + framework::OperatorWithKernel::OpKernelMap& kernels = kernels_iter->second; + + auto expected_kernel_key = + op.GetExpectedKernelType(framework::ExecutionContext( + op, framework::Scope(), *dev_ctx, ctx, nullptr)); + VLOG(3) << "expected_kernel_key:" << expected_kernel_key; + + auto kernel_iter = kernels.find(expected_kernel_key); +#ifdef PADDLE_WITH_MKLDNN + // workaround for missing MKLDNN kernel when FLAGS_use_mkldnn env var is set + if (kernel_iter == kernels.end() && + expected_kernel_key.library_type_ == framework::LibraryType::kMKLDNN) { + VLOG(3) << "missing MKLDNN kernel: fallbacking to PLAIN one"; + expected_kernel_key.library_type_ = framework::LibraryType::kPlain; + expected_kernel_key.data_layout_ = framework::DataLayout::kAnyLayout; + kernel_iter = kernels.find(expected_kernel_key); + } +#endif + if (kernel_iter == kernels.end()) { + PADDLE_THROW("op %s does not have kernel for %s", op.Type(), + KernelTypeToString(expected_kernel_key)); + } + std::vector* kernel_configs = + op.GetKernelConfig(expected_kernel_key); + return PreparedOp(op, ctx, kernel_iter->second, dev_ctx, kernel_configs); + } + + inline platform::DeviceContext* GetDeviceContext() const { return dev_ctx; } + + const framework::OperatorBase& op; + const framework::RuntimeContext& ctx; + framework::OperatorWithKernel::OpKernelFunc func; + platform::DeviceContext* dev_ctx; + std::vector* kernel_configs; +}; + +class OpBase; + +/* The wrapper for Variable which holds a Variable and a VarBase of its + * gradient. This object should be managed totally by Python intepreter. + * + * Nearly all interface should be implemented in C++. + */ +class VarBase { + public: + // Internal interface, create VarBase from exist variable + VarBase(const std::string& name, framework::Variable* var, VarBase* grad, + bool stop_gradient) + : VarBase(name, var->Get().type(), + var->Get().dims(), + var->Get().place(), var, grad, + stop_gradient, false) {} + + // Python interface + VarBase(const std::string& name, const framework::proto::VarType::Type dtype, + const std::vector& shape, const platform::Place& place, + bool stop_gradient, bool persistable) + : VarBase(name, dtype, framework::make_ddim(shape), place, stop_gradient, + persistable) {} + + // Internal interface, create VarBase from with ddim + VarBase(const std::string& name, const framework::proto::VarType::Type dtype, + const framework::DDim& shape, const platform::Place& place, + bool stop_gradient, bool persistable) + : VarBase(name, dtype, shape, place, nullptr, nullptr, stop_gradient, + persistable) {} + + private: + // TODO(minqiyang): need support SelectedRows + VarBase(const std::string& name, framework::proto::VarType::Type dtype, + const framework::DDim& shape, const platform::Place& place, + framework::Variable* var, VarBase* grad, bool stop_gradient, + bool persistable) + : name_(name), + type_(framework::proto::VarType::LOD_TENSOR), + var_(var), + grads_(grad), + stop_gradient_(stop_gradient), + persistable_(persistable), + pre_op_(nullptr), + pre_op_out_name_(), + pre_op_out_idx_(-1) { + if (!var_) { + var_ = new framework::Variable(); + } + auto tensor = var_->GetMutable(); + tensor->Resize(shape); + tensor->mutable_data(place, dtype); + VLOG(10) << "create varbase: " << name_ << " type: " << dtype + << " place: " << place; + } + + public: + virtual ~VarBase() { + if (var_) { + delete var_; + var_ = nullptr; + } + + if (grads_) { + delete grads_; + grads_ = nullptr; + } + + pre_op_ = nullptr; + pre_op_out_idx_ = -1; + } + + inline void SetName(const std::string& name) { name_ = name; } + inline std::string Name() const { return name_; } + + inline std::vector Shape() const { + if (var_->IsInitialized()) { + return framework::vectorize(var_->Get().dims()); + } else { + return {}; + } + } + + inline framework::DDim Dims() const { + return var_->Get().dims(); + } + + // data type. e.g.. FP32 + inline void SetDataType(framework::proto::VarType::Type type) { + auto tensor = var_->GetMutable(); + tensor->mutable_data(tensor->place(), type); + } + inline framework::proto::VarType::Type DataType() const { + auto tensor = var_->Get(); + return tensor.type(); + } + + // tensor type. e.g.. LoDTensor + inline void SetType(framework::proto::VarType::Type type) { type_ = type; } + inline framework::proto::VarType::Type Type() const { return type_; } + + inline void SetStopGradient(bool stop_gradient) { + stop_gradient_ = stop_gradient; + } + inline bool IsStopGradient() const { return stop_gradient_; } + + inline void SetPersistable(bool persistable) { persistable_ = persistable; } + inline bool IsPersistable() const { return persistable_; } + + inline OpBase* PreOp() const { return pre_op_; } + inline int PreOpOutIdx() const { return pre_op_out_idx_; } + + void RunBackward(); + + inline void ResetPreOp(OpBase* op) { + if (op == pre_op_) { + // clear pre_op info when op equals to var's pre_op + pre_op_ = nullptr; + pre_op_out_idx_ = -1; + } + } + + void TrackPreOp(OpBase* pre_op, const std::string& pre_op_out_name, + int pre_op_out_idx, bool pre_op_stop_gradient) { + pre_op_ = pre_op; + pre_op_out_name_ = pre_op_out_name; + pre_op_out_idx_ = pre_op_out_idx; + if (pre_op_stop_gradient) { + stop_gradient_ = pre_op_stop_gradient; + } + } + + void ClearGradient() { + VLOG(1) << "clear gradient of " << Name(); + if (grads_ && grads_->var_ && grads_->var_->IsInitialized()) { + auto grads_t = grads_->var_->GetMutable(); + operators::math::set_constant( + *(platform::DeviceContextPool::Instance().Get( + grads_->var_->Get().place())), + grads_t, 0.0); + } + } + + framework::LoDTensor& GradValue(); + + std::unique_ptr NewVarBase(const platform::Place& dst_place, + const bool blocking) const; + + inline std::string GradName() const { + return string::Sprintf("%s@IGrad", Name()); + } + + std::string name_; + framework::proto::VarType::Type type_; + platform::Place place_; + + framework::Variable* var_; + VarBase* grads_; + + private: + bool stop_gradient_; + bool persistable_; + + OpBase* pre_op_; + std::string pre_op_out_name_; + int pre_op_out_idx_; +}; + +/* The wrapper for OpDesc which holds a OpDesc and a OpDesc of its + * gradient. This object should be managed totally by Python intepreter. + */ +class PYBIND11_HIDDEN OpBase { + public: + OpBase(const std::string& type) + : type_(type), + trace_id_(-1), + forward_id_(-1), + backward_id_(-1), + place_(platform::CPUPlace()), + backward_hooks_() {} + + virtual ~OpBase() { + // TODO(minqiyang): remove op_desc from block_desc in tracer + // + // reset all output vars' pre op + for (auto iter : output_vars_) { + for (VarBase* var : iter.second) { + var->ResetPreOp(this); + } + } + + // release resource + for (framework::OpDesc* desc : grad_op_descs_) { + delete desc; + } + } + + std::map> ApplyGrad(); + + inline std::string Type() const { return type_; } + inline std::string GradOpType(size_t index) const { + PADDLE_ENFORCE_NOT_NULL(grad_op_descs_[index]); + return grad_op_descs_[index]->Type(); + } + + void RegisterBackwardHooks(const py::object& callable); + + void InvokeBackwardHooks(); + + void TrackPreOp(const std::string& inp_name, + const std::vector& inputs) { + auto& pre_ops_list = pre_ops_[inp_name]; + pre_ops_list.reserve(inputs.size()); + auto& pre_ops_out_idx_list = pre_ops_out_idx_[inp_name]; + for (VarBase* inp_var : inputs) { + if (inp_var->PreOp() && !inp_var->IsStopGradient()) { + VLOG(3) << "add pre op " << inp_var->PreOp()->Type() << " in slot " + << inp_name; + pre_ops_list.emplace_back(inp_var->PreOp()); + pre_ops_out_idx_list.push_back(inp_var->PreOpOutIdx()); + } else { + VLOG(3) << "no pre op in slot " << inp_name + << " input var stop_gradient: " << inp_var->IsStopGradient(); + pre_ops_list.emplace_back(nullptr); + // pre_ops_out_idx_list.push_back(-1); + } + } + } + + std::string type_; + // One of `trace_id_` or `forward_id_` is set, not both. + // For pure python PyLayer, use `forward_id_`, otherwise, use trace_id_. + int trace_id_; + int forward_id_; + + // When has backward, one of `grad_op_descs_` or `backward_id_` is set, + // not both. + // Note: each fwd op corresponds to a vector of bwd ops. + std::vector grad_op_descs_; + int backward_id_; + + platform::Place place_; + + VarBasePtrMap input_vars_; + VarBasePtrMap output_vars_; + OpBasePtrMap pre_ops_; + std::map> pre_ops_out_idx_; + + // Inputs to a vector of bwd ops. + std::vector grad_input_vars_; + // Outputs to a vector of bwd ops. + std::vector grad_output_vars_; + + std::vector backward_hooks_; + + framework::AttributeMap attrs_; +}; + +class Layer { + public: + virtual ~Layer() {} + + virtual std::vector Forward(const std::vector& inputs) { + std::vector vars; + return vars; + } +}; + +class PyLayer { + public: + virtual ~PyLayer() {} + + static const char* kFwdInp; + static const char* kFwdOut; + + static void RegisterFunc(int func_id, const py::object& py_func); + + static int NumFuncs(); + + static std::vector Apply( + int func_id, const std::vector& inputs); + + static std::vector ApplyGrad(int func_id, + const std::vector& inputs); + + private: + static std::vector CallPythonFunc( + const py::object& callable, const std::vector& ins); +}; + +// infer var type context for imperative mode +class PYBIND11_HIDDEN RuntimeInferVarTypeContext + : public framework::InferVarTypeContext { + public: + RuntimeInferVarTypeContext(const imperative::VarBasePtrMap* inputs, + imperative::VarBasePtrMap* outputs, + const framework::AttributeMap* attrs_map) + : InferVarTypeContext(nullptr, nullptr), + inputs_(inputs), + outputs_(outputs), + attrs_(attrs_map), + input_names_(), + output_names_(), + var_set_() { + input_names_.reserve(inputs_->size()); + for (auto& it : *inputs_) { + for (imperative::VarBase* var : it.second) { + input_names_[it.first].emplace_back(var->Name()); + var_set_[var->Name()] = var; + } + } + + output_names_.reserve(outputs_->size()); + for (auto& it : *outputs_) { + for (imperative::VarBase* var : it.second) { + output_names_[it.first].emplace_back(var->Name()); + var_set_[var->Name()] = var; + } + } + } + + virtual ~RuntimeInferVarTypeContext() {} + + framework::Attribute GetAttr(const std::string& name) const override { + PADDLE_ENFORCE_NOT_NULL(attrs_); + return attrs_->at(name); + } + + bool HasVar(const std::string& name) const override { + return var_set_.count(name) > 0; + } + + bool HasInput(const std::string& name) const override { + PADDLE_ENFORCE_NOT_NULL(inputs_); + return inputs_->count(name) > 0; + } + + bool HasOutput(const std::string& name) const override { + PADDLE_ENFORCE_NOT_NULL(outputs_); + return outputs_->count(name) > 0; + } + + const std::vector& Input( + const std::string& name) const override { + return input_names_.at(name); + } + + const std::vector& Output( + const std::string& name) const override { + return output_names_.at(name); + } + + framework::proto::VarType::Type GetType( + const std::string& name) const override { + return var_set_.at(name)->Type(); + } + + void SetType(const std::string& name, + framework::proto::VarType::Type type) override { + var_set_[name]->SetType(type); + } + + framework::proto::VarType::Type GetDataType( + const std::string& name) const override { + return var_set_.at(name)->DataType(); + } + + void SetDataType(const std::string& name, + framework::proto::VarType::Type type) override { + var_set_[name]->SetDataType(type); + } + + std::vector GetDataTypes( + const std::string& name) const override { + PADDLE_THROW("GetDataTypes is not supported in runtime InferVarType"); + } + + void SetDataTypes(const std::string& name, + const std::vector& + multiple_data_type) override { + PADDLE_THROW("SetDataTypes is not supported in runtime InferVarType"); + } + + std::vector GetShape(const std::string& name) const override { + PADDLE_THROW("Do not handle Shape in runtime InferVarType"); + } + + void SetShape(const std::string& name, + const std::vector& dims) override { + PADDLE_THROW("Do not handle Shape in runtime InferVarType"); + } + + int32_t GetLoDLevel(const std::string& name) const override { + PADDLE_THROW("Do not handle LoDLevel in runtime InferVarType"); + } + + void SetLoDLevel(const std::string& name, int32_t lod_level) override { + PADDLE_THROW("Do not handle LoDLevel in runtime InferVarType"); + } + + private: + const imperative::VarBasePtrMap* inputs_; + imperative::VarBasePtrMap* outputs_; + const framework::AttributeMap* attrs_; + std::unordered_map> input_names_; + std::unordered_map> output_names_; + std::unordered_map var_set_; +}; + +} // namespace imperative +} // namespace paddle diff --git a/paddle/fluid/imperative/nccl_context.cc b/paddle/fluid/imperative/nccl_context.cc new file mode 100644 index 0000000000000000000000000000000000000000..f96c83936df590e5bd3abe89b7e7c2a6ddf92d01 --- /dev/null +++ b/paddle/fluid/imperative/nccl_context.cc @@ -0,0 +1,133 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/imperative/nccl_context.h" + +namespace paddle { +namespace imperative { +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) +void NCCLParallelContext::RecvNCCLID(const std::string &ep, + ncclUniqueId *nccl_id) { + auto addr = paddle::string::Split(ep, ':'); + PADDLE_ENFORCE_EQ(addr.size(), 2UL, + "The endpoint should contain host and port: %s", ep); + std::string host = addr[0]; + int port = std::stoi(addr[1]); + + int server_fd, new_socket; + struct sockaddr_in address; + int addrlen = sizeof(address); + char buffer[1024] = {0}; + int opt = 0; + // creating socket fd + if ((server_fd = socket(AF_INET, SOCK_STREAM, 0)) == 0) + PADDLE_THROW("create server fd failed"); + if (setsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt))) + PADDLE_THROW("set socket opt failed"); + + address.sin_family = AF_INET; + address.sin_addr.s_addr = INADDR_ANY; + address.sin_port = htons(port); + + if (bind(server_fd, (struct sockaddr *)&address, sizeof(address)) < 0) + PADDLE_THROW("binding failed on ep: %s", ep); + VLOG(3) << "listening on: " << ep; + if (listen(server_fd, 3) < 0) PADDLE_THROW("listen on server fd failed"); + + if ((new_socket = + accept(server_fd, reinterpret_cast(&address), + reinterpret_cast(&addrlen))) < 0) + PADDLE_THROW("accept the new socket fd failed"); + + if (read(new_socket, buffer, 1024) < 0) + PADDLE_THROW("reading the ncclUniqueId from socket failed"); + VLOG(3) << "recevived the ncclUniqueId"; + memcpy(nccl_id, buffer, NCCL_UNIQUE_ID_BYTES); + + VLOG(3) << "closing the socket server: " << ep; + close(server_fd); +} + +void NCCLParallelContext::SendNCCLID(const std::string &ep, + ncclUniqueId *nccl_id) { + auto addr = paddle::string::Split(ep, ':'); + PADDLE_ENFORCE_EQ(addr.size(), 2UL, + "The endpoint should contain host and port: %s", ep); + std::string host = addr[0]; + int port = std::stoi(addr[1]); + // struct sockaddr_in address; + int sock = 0; + struct sockaddr_in serv_addr; + char buffer[1024] = {0}; + + memcpy(buffer, nccl_id, NCCL_UNIQUE_ID_BYTES); + if ((sock = socket(AF_INET, SOCK_STREAM, 0)) < 0) + PADDLE_THROW("create socket failed"); + + memset(&serv_addr, '0', sizeof(serv_addr)); + serv_addr.sin_family = AF_INET; + serv_addr.sin_port = htons(port); + + if (inet_pton(AF_INET, host.c_str(), &serv_addr.sin_addr) <= 0) + PADDLE_THROW("invalied address: %s", ep); + + while (true) { + if (connect(sock, (struct sockaddr *)&serv_addr, sizeof(serv_addr)) < 0) { + VLOG(0) << "worker: " << ep + << " is not ready, will retry after 3 seconds..."; + std::this_thread::sleep_for(std::chrono::seconds(3)); + continue; + } + VLOG(3) << "sending the ncclUniqueId to " << ep; + send(sock, buffer, NCCL_UNIQUE_ID_BYTES, 0); + break; + } +} + +void NCCLParallelContext::BcastNCCLId(ncclUniqueId *nccl_id, int root) { + if (strategy_.local_rank_ == root) { + for (auto ep : strategy_.trainer_endpoints_) { + if (ep != strategy_.current_endpoint_) SendNCCLID(ep, nccl_id); + } + } else { + RecvNCCLID(strategy_.current_endpoint_, nccl_id); + } +} + +void NCCLParallelContext::Init() { + ncclUniqueId nccl_id; + ncclComm_t comm; + if (strategy_.local_rank_ == 0) { + // generate the unique ncclid on the root worker + platform::dynload::ncclGetUniqueId(&nccl_id); + BcastNCCLId(&nccl_id, 0); + } else { + BcastNCCLId(&nccl_id, 0); + } + int gpu_id = boost::get(place_).device; + VLOG(0) << "init nccl context nranks: " << strategy_.nranks_ + << " local rank: " << strategy_.local_rank_ << " gpu id: " << gpu_id; + + PADDLE_ENFORCE(cudaSetDevice(gpu_id)); + PADDLE_ENFORCE(platform::dynload::ncclCommInitRank( + &comm, strategy_.nranks_, nccl_id, strategy_.local_rank_)); + + platform::DeviceContextPool &pool = platform::DeviceContextPool::Instance(); + auto *dev_ctx = static_cast(pool.Get(place_)); + dev_ctx->set_nccl_comm(comm); +} +#endif + +} // namespace imperative +} // namespace paddle diff --git a/paddle/fluid/imperative/nccl_context.h b/paddle/fluid/imperative/nccl_context.h new file mode 100644 index 0000000000000000000000000000000000000000..b4f44e56405a51082e60afd69fb6f011dab44b86 --- /dev/null +++ b/paddle/fluid/imperative/nccl_context.h @@ -0,0 +1,81 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. +#pragma once + +// network header files +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) +#include +#include +#include +#include +#endif + +#include +#include + +#include "paddle/fluid/framework/variable.h" +#include "paddle/fluid/platform/device_context.h" +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) +#include "paddle/fluid/platform/dynload/nccl.h" +#endif +#include "paddle/fluid/platform/place.h" +#include "paddle/fluid/string/split.h" + +namespace paddle { +namespace imperative { + +struct ParallelStrategy { + int nranks_{1}; + int local_rank_{0}; + std::vector trainer_endpoints_{}; + std::string current_endpoint_{""}; +}; + +class ParallelContext { + public: + explicit ParallelContext(const ParallelStrategy& strategy, + const platform::Place& place) + : strategy_(strategy), place_(place) {} + + virtual ~ParallelContext() {} + + virtual void Init() = 0; + + protected: + ParallelStrategy strategy_; + platform::Place place_; +}; + +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) +class NCCLParallelContext : ParallelContext { + public: + explicit NCCLParallelContext(const ParallelStrategy& strategy, + const platform::Place& place) + : ParallelContext(strategy, place) {} + + ~NCCLParallelContext() {} + + void BcastNCCLId(ncclUniqueId* nccl_id, int root); + + void Init() override; + + protected: + void RecvNCCLID(const std::string& endpoint, ncclUniqueId* nccl_id); + + void SendNCCLID(const std::string& endpoint, ncclUniqueId* nccl_id); +}; +#endif + +} // namespace imperative +} // namespace paddle diff --git a/paddle/fluid/imperative/nccl_context_test.cc b/paddle/fluid/imperative/nccl_context_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..74a74ebe921378e2994a6a4cb2087d0acde950b1 --- /dev/null +++ b/paddle/fluid/imperative/nccl_context_test.cc @@ -0,0 +1,52 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/imperative/nccl_context.h" +#include "gtest/gtest.h" +#include "paddle/fluid/platform/device_context.h" + +namespace imperative = paddle::imperative; +namespace platform = paddle::platform; + +imperative::ParallelStrategy GetStrategy(int local_rank) { + std::vector eps = {"127.0.0.1:9866", "127.0.0.1:9867"}; + imperative::ParallelStrategy strategy; + strategy.trainer_endpoints_ = eps; + strategy.current_endpoint_ = eps[local_rank]; + strategy.nranks_ = 2; + strategy.local_rank_ = local_rank; + return strategy; +} + +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) +void BcastNCCLId(int local_rank, ncclUniqueId *nccl_id) { + auto strategy = GetStrategy(local_rank); + platform::CUDAPlace gpu(local_rank); + imperative::NCCLParallelContext ctx(strategy, gpu); + ctx.BcastNCCLId(nccl_id, 0); +} + +TEST(BcastNCCLId, Run) { + ncclUniqueId nccl_id; + platform::dynload::ncclGetUniqueId(&nccl_id); + std::thread t(BcastNCCLId, 0, &nccl_id); + + ncclUniqueId recv_nccl_id; + BcastNCCLId(1, &recv_nccl_id); + + t.join(); + EXPECT_EQ(0, std::memcmp(nccl_id.internal, recv_nccl_id.internal, + NCCL_UNIQUE_ID_BYTES)); +} +#endif diff --git a/paddle/fluid/imperative/profiler.cc b/paddle/fluid/imperative/profiler.cc new file mode 100644 index 0000000000000000000000000000000000000000..34570b3a60ec83fdeb1577789271942125b16eb1 --- /dev/null +++ b/paddle/fluid/imperative/profiler.cc @@ -0,0 +1,62 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/imperative/profiler.h" + +#ifdef WITH_GPERFTOOLS +#include "gperftools/profiler.h" +#endif +#include +#include +#include // NOLINT +#include // NOLINT + +DEFINE_string( + tracer_profile_fname, "xxgperf", + "Profiler filename for imperative tracer, which generated by gperftools." + "Only valid when compiled `WITH_PROFILER=ON`. Empty if disable."); + +namespace paddle { +namespace imperative { + +static std::once_flag gTracerProfileOnce; +#ifdef WITH_GPERFTOOLS +static bool gTracerProfilerStarted = false; +#endif + +void StartProfile() { + if (!FLAGS_tracer_profile_fname.empty()) { + std::call_once(gTracerProfileOnce, [] { +#ifdef WITH_GPERFTOOLS + ProfilerStart(FLAGS_tracer_profile_fname.c_str()); + gTracerProfilerStarted = true; +#else + LOG(WARNING) << "Paddle is not compiled with gperftools. " + "FLAGS_tracer_profile_fname will be ignored"; +#endif + }); + } +} + +void StopProfile() { +#ifdef WITH_GPERFTOOLS + ProfilerFlush(); +#else + LOG(WARNING) << "Paddle is not compiled with gperftools. " + "FLAGS_tracer_profile_fname will be ignored"; +#endif +} + +} // namespace imperative +} // namespace paddle diff --git a/paddle/fluid/imperative/profiler.h b/paddle/fluid/imperative/profiler.h new file mode 100644 index 0000000000000000000000000000000000000000..d52aeed4e81755cfa285616d7b0a7e79061c6af8 --- /dev/null +++ b/paddle/fluid/imperative/profiler.h @@ -0,0 +1,25 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +namespace paddle { +namespace imperative { + +extern void StartProfile(); + +extern void StopProfile(); + +} // namespace imperative +} // namespace paddle diff --git a/paddle/fluid/imperative/tracer.cc b/paddle/fluid/imperative/tracer.cc new file mode 100644 index 0000000000000000000000000000000000000000..7c495ddd68221acfed8537fd72e9a582e891f8db --- /dev/null +++ b/paddle/fluid/imperative/tracer.cc @@ -0,0 +1,344 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/imperative/tracer.h" + +#include +#include +#include +#include + +#include "paddle/fluid/framework/var_type_inference.h" +#include "paddle/fluid/operators/math/math_function.h" +#include "paddle/fluid/platform/device_context.h" +#include "paddle/fluid/platform/enforce.h" + +namespace paddle { +namespace imperative { + +void CreateGradOp(const framework::OpDesc& op_desc, + const std::unordered_set& no_grad_set, + const std::vector& grad_sub_block, + std::vector* grad_op_descs, + std::unordered_map* grad_to_var) { + PADDLE_ENFORCE(grad_op_descs->empty()); + const framework::OpInfo& op_info = + framework::OpInfoMap::Instance().Get(op_desc.Type()); + if (!op_info.grad_op_maker_) return; + + std::vector> descs = + op_info.GradOpMaker()(op_desc, no_grad_set, grad_to_var, grad_sub_block); + for (auto& desc : descs) { + grad_op_descs->emplace_back(desc.release()); + } +} + +void InitGrad(VarBase* var, platform::DeviceContext* dev_ctx) { + PADDLE_ENFORCE_NOT_NULL(var, "Could not get valid var base"); + PADDLE_ENFORCE_NOT_NULL(dev_ctx, + "Could not get valid device from forward op"); + + if (var->grads_ == nullptr) { + auto& var_t = var->var_->Get(); + var->grads_ = new VarBase(var->GradName(), framework::proto::VarType::FP32, + framework::vectorize(var_t.dims()), + dev_ctx->GetPlace(), true, false); + auto grad_t = var->grads_->var_->GetMutable(); + operators::math::set_constant(*dev_ctx, grad_t, 0.0); + } +} + +platform::Place GetExpectedPlace(platform::Place place, VarBasePtrMap inputs) { + platform::Place result = place; + for (auto it : inputs) { + for (VarBase* var : it.second) { + platform::Place tmp_place = + var->var_->Get().place(); + if (!platform::is_same_place(tmp_place, result)) { + PADDLE_THROW( + "Input variable should keep in the same place: %s, but get place: " + "%s of input %s instead", + result, tmp_place, it.first); + } + } + } + + return result; +} + +framework::VariableNameMap CreateInputVarNameMap( + const OpBase* op, const VarBasePtrMap& varbase_map) { + framework::VariableNameMap result; + + auto& info_map = framework::OpInfoMap::Instance(); + auto* op_info = info_map.GetNullable(op->Type()); + if (op_info == nullptr || op_info->proto_ == nullptr) { + return result; + } + + for (auto& in : op_info->Proto().inputs()) { + auto it = varbase_map.find(in.name()); + if (it == varbase_map.end()) { + PADDLE_ENFORCE(in.dispensable()); + result[in.name()] = {}; + } else { + auto var_vector = it->second; + std::vector args; + args.reserve(var_vector.size()); + for (VarBase* var_base : var_vector) { + args.emplace_back(var_base->Name()); + } + result[in.name()] = args; + } + } + return result; +} + +framework::VariableNameMap CreateOutputVarNameMap( + const OpBase* op, const VarBasePtrMap& varbase_map) { + framework::VariableNameMap result; + + auto& info_map = framework::OpInfoMap::Instance(); + auto* op_info = info_map.GetNullable(op->Type()); + if (op_info == nullptr || op_info->proto_ == nullptr) { + return result; + } + + for (auto& out : op_info->Proto().outputs()) { + auto it = varbase_map.find(out.name()); + if (it == varbase_map.end()) { + PADDLE_ENFORCE(out.dispensable()); + result[out.name()] = {}; + } else { + auto var_vector = it->second; + std::vector args; + args.reserve(var_vector.size()); + for (VarBase* var_base : var_vector) { + args.emplace_back(var_base->Name()); + } + result[out.name()] = args; + } + } + return result; +} + +Tracer::Tracer(framework::BlockDesc* root_block) : root_block_(root_block) {} + +std::set Tracer::Trace(OpBase* op, const VarBasePtrMap& inputs, + VarBasePtrMap* outputs, + framework::AttributeMap attrs_map, + const platform::Place expected_place, + const bool stop_gradient) { + framework::VariableValueMap invars_map; + framework::VariableValueMap outvars_map; + + // Construct input_vars_map and output_vars_map + std::map current_vars_map; + op->input_vars_ = inputs; + for (auto it : op->input_vars_) { + auto& invars = invars_map[it.first]; + invars.reserve(it.second.size()); + for (VarBase* inp : it.second) { + PADDLE_ENFORCE_NOT_NULL(inp->var_, "op %s input %s nullptr", op->Type(), + inp->Name()); + + invars.emplace_back(inp->var_); + if (!stop_gradient) { + current_vars_map[inp->Name()] = inp; + } + VLOG(3) << "input var name: " << inp->Name() + << " inited: " << inp->var_->IsInitialized() + << " stop_grad: " << inp->IsStopGradient(); + } + op->TrackPreOp(it.first, it.second); + } + + op->output_vars_ = *outputs; + for (auto it : op->output_vars_) { + auto& outvars = outvars_map[it.first]; + const std::vector& outputs = it.second; + outvars.reserve(outputs.size()); + for (size_t i = 0U; i < outputs.size(); ++i) { + VarBase* out = outputs[i]; + outvars.emplace_back(out->var_); + out->TrackPreOp(op, it.first, i, stop_gradient); + if (!stop_gradient) { + current_vars_map[out->Name()] = out; + } + + VLOG(3) << "output var name: " << out->Name() + << " inited: " << out->var_->IsInitialized() + << " stop_grad: " << out->IsStopGradient(); + } + } + + // Check attrs and create op + framework::VariableNameMap invars_name_map = + CreateInputVarNameMap(op, inputs); + framework::VariableNameMap outvars_name_map = + CreateOutputVarNameMap(op, *outputs); + + auto& info = framework::OpInfoMap::Instance().Get(op->Type()); + if (info.Checker() != nullptr) { + info.Checker()->Check(&attrs_map); + } + + std::unique_ptr op_base = + framework::OpRegistry::CreateOp(op->Type(), invars_name_map, + outvars_name_map, attrs_map); + + if (info.infer_var_type_) { + RuntimeInferVarTypeContext infer_var_type_ctx(&inputs, outputs, &attrs_map); + info.infer_var_type_(&infer_var_type_ctx); + } + + // TODO(minqiyang): Support infer var type in imperative mode + // Run forward op + VLOG(3) << "tracer running " << op->Type(); + framework::RuntimeContext ctx(invars_map, outvars_map); + + // TODO(panyx0718): Cache p. + framework::OperatorWithKernel* op_kernel = + dynamic_cast(op_base.get()); + PADDLE_ENFORCE_NOT_NULL(op_kernel, "only support op with kernel"); + + framework::Scope scope; + op->place_ = GetExpectedPlace(expected_place, inputs); + + PreparedOp prepared_op = PreparedOp::Prepare(ctx, *op_kernel, op->place_); + prepared_op.op.RuntimeInferShape(scope, op->place_, ctx); + prepared_op.func( + framework::ExecutionContext(prepared_op.op, scope, *prepared_op.dev_ctx, + prepared_op.ctx, prepared_op.kernel_configs)); + + // construct backward op + std::set vars_saved_for_backward; + if (!stop_gradient) { + VLOG(5) << "start construct backward op"; + + // construct grad op descs + op->attrs_ = attrs_map; + std::unique_ptr fwd_op_desc(new framework::OpDesc( + op->Type(), invars_name_map, outvars_name_map, attrs_map)); + std::unique_ptr> grad_to_var( + new std::unordered_map()); + // NOTE(minqiyang): We don't support control flow op in imperative now + // Add grad_block_ when we want to support it + CreateGradOp(*fwd_op_desc, {}, {}, &op->grad_op_descs_, grad_to_var.get()); + + VLOG(5) << "create grad op desc: " << op->grad_op_descs_[0]->Type(); + + const size_t grad_op_count = op->grad_op_descs_.size(); + + op->grad_input_vars_.resize(grad_op_count); + op->grad_output_vars_.resize(grad_op_count); + + for (size_t i = 0; i < grad_op_count; ++i) { + framework::OpDesc* grad_op_desc = op->grad_op_descs_[i]; + for (auto it : grad_op_desc->Inputs()) { + auto& grad_in_vars = op->grad_input_vars_[i][it.first]; + grad_in_vars.reserve(it.second.size()); + for (const std::string& grad_invar : it.second) { + auto var_it = grad_to_var->find(grad_invar); + if (var_it == grad_to_var->end()) { + auto fwd_var_it = current_vars_map.find(grad_invar); + PADDLE_ENFORCE(fwd_var_it != current_vars_map.end()); + // Forward inputs or outputs. + grad_in_vars.emplace_back(fwd_var_it->second); + } else { + VarBase* var = current_vars_map[var_it->second]; + InitGrad(var, prepared_op.GetDeviceContext()); + // Douts. + grad_in_vars.emplace_back(var->grads_); + } + + vars_saved_for_backward.insert(it.first); + } + } + + for (auto it : grad_op_desc->Outputs()) { + auto& grad_out_vars = op->grad_output_vars_[i][it.first]; + for (const std::string& grad_outvar : it.second) { + auto var_it = grad_to_var->find(grad_outvar); + PADDLE_ENFORCE(var_it != grad_to_var->end(), + "Could not found the grad op output var, should this " + "operator %s's stop gradient be True", + op->Type()); + VarBase* var = current_vars_map[var_it->second]; + InitGrad(var, prepared_op.GetDeviceContext()); + grad_out_vars.push_back(var->grads_); + VLOG(3) << "grads output var name: " << var->name_; + } + } + } + } + + return vars_saved_for_backward; +} + +std::vector Tracer::PyTrace(OpBase* op, + const std::vector& inputs, + bool stop_gradient) { + VLOG(3) << "py_trace " << op->Type(); + + op->input_vars_[PyLayer::kFwdInp] = inputs; + + std::vector ret_vars = + PyLayer::Apply(op->forward_id_, inputs); + + op->TrackPreOp(PyLayer::kFwdInp, inputs); + + std::vector& outputs = op->output_vars_[PyLayer::kFwdOut]; + outputs.reserve(ret_vars.size()); + for (size_t i = 0U; i != ret_vars.size(); ++i) { + framework::Variable* v = ret_vars[i]; + VarBase* out = new VarBase(string::Sprintf("%s_out_%d", op->Type(), i), v, + nullptr, stop_gradient); + outputs.emplace_back(out); + out->TrackPreOp(op, PyLayer::kFwdOut, i, stop_gradient); + } + + if (!stop_gradient) { + VLOG(5) << "start construct backward op"; + op->grad_input_vars_.resize(1); + op->grad_output_vars_.resize(1); + auto& grad_input_vars = + op->grad_input_vars_[0][framework::GradVarName(PyLayer::kFwdInp)]; + auto& grad_output_vars = + op->grad_output_vars_[0][framework::GradVarName(PyLayer::kFwdOut)]; + + for (VarBase* inp : inputs) { + grad_input_vars.push_back(inp); + } + for (VarBase* out : outputs) { + grad_input_vars.push_back(out); + } + + // TODO(minqiyang): Add GPU support for PyLayer, only support CPU now + platform::CPUPlace place; + for (VarBase* out : outputs) { + InitGrad(out, platform::DeviceContextPool::Instance().Get(place)); + grad_input_vars.push_back(out->grads_); + } + + for (VarBase* inp : inputs) { + InitGrad(inp, platform::DeviceContextPool::Instance().Get(place)); + grad_output_vars.push_back(inp->grads_); + } + } + return outputs; +} + +} // namespace imperative +} // namespace paddle diff --git a/paddle/fluid/imperative/tracer.h b/paddle/fluid/imperative/tracer.h new file mode 100644 index 0000000000000000000000000000000000000000..a87f3b8009dd552626c6c03fba3b0bbf3a78bb83 --- /dev/null +++ b/paddle/fluid/imperative/tracer.h @@ -0,0 +1,66 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include +#include + +#include "paddle/fluid/framework/op_desc.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/imperative/engine.h" +#include "paddle/fluid/imperative/layer.h" +#include "paddle/fluid/platform/place.h" + +namespace paddle { +namespace imperative { + +void CreateGradOp(const framework::OpDesc& op_desc, + const std::unordered_set& no_grad_set, + const std::vector& grad_sub_block, + framework::OpDesc** grad_op_desc, + std::unordered_map* grad_to_var); + +void InitVar(const VarBase* var, framework::Variable* grad_var, + platform::DeviceContext* dev_ctx); + +platform::Place GetExpectedPlace(platform::Place place, VarBasePtrMap inputs); + +class Tracer { + public: + explicit Tracer(framework::BlockDesc* root_block); + + virtual ~Tracer() {} + + std::set Trace(OpBase* op, const VarBasePtrMap& inputs, + VarBasePtrMap* outputs, // NOLINT + framework::AttributeMap attrs_map, + const platform::Place expected_place, + const bool stop_gradient = false); + + std::vector PyTrace(OpBase* op, const std::vector& inputs, + bool stop_gradient = false); + + private: + platform::Place GetPlace(const VarBasePtrMap& inputs); + + framework::BlockDesc* root_block_; +}; + +} // namespace imperative +} // namespace paddle diff --git a/paddle/fluid/imperative/type_defs.h b/paddle/fluid/imperative/type_defs.h new file mode 100644 index 0000000000000000000000000000000000000000..c51ce931defbc87231a2f8c6c07f99d9853fb283 --- /dev/null +++ b/paddle/fluid/imperative/type_defs.h @@ -0,0 +1,32 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include + +namespace paddle { +namespace imperative { + +class VarBase; +class OpBase; + +typedef std::map> VarBasePtrMap; +typedef std::map> ConstVarBasePtrMap; +typedef std::map> OpBasePtrMap; + +} // namespace imperative +} // namespace paddle diff --git a/paddle/fluid/inference/CMakeLists.txt b/paddle/fluid/inference/CMakeLists.txt index 9794a193bcfaae19552b1f6fbdf2dab2898033d5..fb433ff2a2bd113358152248120d0d2be94bd927 100644 --- a/paddle/fluid/inference/CMakeLists.txt +++ b/paddle/fluid/inference/CMakeLists.txt @@ -1,8 +1,5 @@ -# analysis and tensorrt must be added before creating static library, -# otherwise, there would be undefined reference to them in static library. -add_subdirectory(analysis) -if (TENSORRT_FOUND) - add_subdirectory(tensorrt) +if(WITH_TESTING) + include(tests/test.cmake) # some generic cmake funtion for inference endif() set(FLUID_CORE_MODULES proto_desc memory lod_tensor executor) @@ -10,27 +7,60 @@ set(FLUID_CORE_MODULES proto_desc memory lod_tensor executor) # TODO(panyx0718): Should this be called paddle_fluid_inference_api_internal? cc_library(paddle_fluid_api SRCS io.cc - DEPS ${FLUID_CORE_MODULES} ${GLOB_OP_LIB}) + DEPS ${FLUID_CORE_MODULES} ${GLOB_OP_LIB} ${GLOB_OPERATOR_DEPS}) + +# analysis and tensorrt must be added before creating static library, +# otherwise, there would be undefined reference to them in static library. +add_subdirectory(analysis) +add_subdirectory(utils) +if (TENSORRT_FOUND) + add_subdirectory(tensorrt) +endif() + +if (ANAKIN_FOUND) + add_subdirectory(anakin) +endif() get_property(fluid_modules GLOBAL PROPERTY FLUID_MODULES) +get_property(cuda_modules GLOBAL PROPERTY CUDA_MODULES) +get_property(fluid_third_partys GLOBAL PROPERTY FLUID_THRID_PARTYS) +if (WIN32) +list(APPEND fluid_third_partys gflags glog protobuf cblas) +endif(WIN32) # paddle_fluid_origin exclude inference api interface -cc_library(paddle_fluid_origin DEPS ${fluid_modules} paddle_fluid_api) +if(WIN32) + sep_library(paddle_fluid_origin DEPS ${fluid_modules} paddle_fluid_api) +else(WIN32) + cc_library(paddle_fluid_origin DEPS ${fluid_modules} paddle_fluid_api) +endif(WIN32) add_subdirectory(api) +if(WITH_MKLDNN) + set(mkldnn_quantizer_src ${CMAKE_CURRENT_SOURCE_DIR}/api/mkldnn_quantizer.cc) + set(mkldnn_quantizer_cfg mkldnn_quantizer_config) +endif() + set(STATIC_INFERENCE_APIS paddle_fluid_api paddle_inference_api analysis_predictor) set(SHARED_INFERENCE_SRCS io.cc ${CMAKE_CURRENT_SOURCE_DIR}/api/api.cc ${CMAKE_CURRENT_SOURCE_DIR}/api/api_impl.cc ${CMAKE_CURRENT_SOURCE_DIR}/api/analysis_predictor.cc + ${mkldnn_quantizer_src} ${CMAKE_CURRENT_SOURCE_DIR}/api/details/zero_copy_tensor.cc) -if (WITH_GPU AND TENSORRT_FOUND) - set(STATIC_INFERENCE_APIS ${STATIC_INFERENCE_APIS} paddle_inference_tensorrt_subgraph_engine) - set(SHARED_INFERENCE_SRCS ${SHARED_INFERENCE_SRCS} ${CMAKE_CURRENT_SOURCE_DIR}/api/api_tensorrt_subgraph_engine.cc) + +# FIXME(gongwb): hidden libdgc.a +if(WITH_GPU AND NOT WIN32) + set(fluid_modules ${fluid_modules} dgc) endif() -# Create static library -cc_library(paddle_fluid DEPS ${fluid_modules} ${STATIC_INFERENCE_APIS} zero_copy_tensor) +if(WIN32) + sep_library(paddle_fluid DEPS ${fluid_modules} ${STATIC_INFERENCE_APIS} zero_copy_tensor reset_tensor_array + analysis_config ${mkldnn_quantizer_cfg} paddle_pass_builder) +else(WIN32) + cc_library(paddle_fluid DEPS ${fluid_modules} ${STATIC_INFERENCE_APIS} + zero_copy_tensor reset_tensor_array analysis_config ${mkldnn_quantizer_cfg} paddle_pass_builder) +endif(WIN32) if(NOT APPLE) # TODO(liuyiqu: Temporarily disable the link flag because it is not support on Mac. @@ -39,11 +69,20 @@ if(NOT APPLE) endif() # Create shared library -cc_library(paddle_fluid_shared SHARED SRCS ${SHARED_INFERENCE_SRCS} - DEPS ${fluid_modules} paddle_fluid_api) +if(WIN32) + sep_library(paddle_fluid_shared SHARED SRCS ${SHARED_INFERENCE_SRCS} + DEPS ${fluid_modules} paddle_fluid_api reset_tensor_array + analysis_config ${mkldnn_quantizer_cfg} paddle_pass_builder) +else(WIN32) + cc_library(paddle_fluid_shared SHARED SRCS ${SHARED_INFERENCE_SRCS} + DEPS ${fluid_modules} paddle_fluid_api reset_tensor_array + analysis_config ${mkldnn_quantizer_cfg} paddle_pass_builder) +endif() +get_property(os_dependency_modules GLOBAL PROPERTY OS_DEPENDENCY_MODULES) +target_link_libraries(paddle_fluid_shared ${os_dependency_modules}) set_target_properties(paddle_fluid_shared PROPERTIES OUTPUT_NAME paddle_fluid) -if(NOT APPLE) +if(NOT APPLE AND NOT WIN32) # TODO(liuyiqun): Temporarily disable the link flag because it is not support on Mac. set(LINK_FLAGS "-Wl,--version-script ${CMAKE_CURRENT_SOURCE_DIR}/paddle_fluid.map") set_target_properties(paddle_fluid_shared PROPERTIES LINK_FLAGS "${LINK_FLAGS}") @@ -66,5 +105,5 @@ if(WITH_TESTING) add_subdirectory(tests/book) if(WITH_INFERENCE_API_TEST) add_subdirectory(tests/api) - endif() + endif() endif() diff --git a/paddle/fluid/inference/anakin/CMakeLists.txt b/paddle/fluid/inference/anakin/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..e8fb56590563f49f920bfe71d160ec822cb3ca30 --- /dev/null +++ b/paddle/fluid/inference/anakin/CMakeLists.txt @@ -0,0 +1,5 @@ +cc_library(anakin_engine SRCS engine.cc DEPS framework_proto) +cc_library(anakin_op_teller SRCS op_teller.cc DEPS framework_proto) +target_link_libraries(anakin_engine anakin anakin_saber_common) +cc_test(test_anakin_engine SRCS test_anakin_engine.cc DEPS anakin_engine) +add_subdirectory(convert) diff --git a/paddle/fluid/inference/anakin/convert/CMakeLists.txt b/paddle/fluid/inference/anakin/convert/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..d3d1522dccf0d8af4f26eec4e0c57257279880e0 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/CMakeLists.txt @@ -0,0 +1,18 @@ +cc_library(anakin_op_converter SRCS fc.cc conv2d.cc conv2d_fusion.cc elementwise.cc activation.cc pool2d.cc concat.cc split.cc relu.cc softmax.cc batch_norm.cc reshape.cc flatten.cc transpose.cc density_prior_box.cc detection_out.cc scale.cc dropout.cc im2sequence.cc sum.cc DEPS anakin_engine framework_proto scope op_registry) + +cc_test(test_anakin_fc SRCS test_fc_op.cc DEPS anakin_op_converter mul_op SERIAL) +cc_test(test_anakin_conv2d SRCS test_conv2d_op.cc DEPS anakin_op_converter conv_op im2col vol2col depthwise_conv SERIAL) +cc_test(test_anakin_activation SRCS test_activation_op.cc DEPS activation_op anakin_op_converter SERIAL) +cc_test(test_anakin_pool2d SRCS test_pool2d_op.cc DEPS anakin_op_converter pool_op pooling SERIAL) +cc_test(test_anakin_concat SRCS test_concat_op.cc DEPS anakin_op_converter concat_op concat_and_split SERIAL) +cc_test(test_anakin_split SRCS test_split_op.cc DEPS anakin_op_converter split_op concat_and_split SERIAL) +cc_test(test_anakin_elementwise SRCS test_elementwise_op.cc DEPS anakin_op_converter elementwise_add_op elementwise_mul_op SERIAL) +cc_test(test_anakin_relu SRCS test_relu_op.cc DEPS activation_op anakin_op_converter SERIAL SERIAL) +cc_test(test_anakin_softmax SRCS test_softmax_op.cc DEPS anakin_op_converter softmax_op softmax SERIAL) +cc_test(test_anakin_reshape SRCS test_reshape_op.cc DEPS anakin_op_converter reshape_op SERIAL) +cc_test(test_anakin_flatten SRCS test_flatten_op.cc DEPS anakin_op_converter flatten_op reshape_op SERIAL) +cc_test(test_anakin_transpose SRCS test_transpose_op.cc DEPS anakin_op_converter transpose_op SERIAL) +cc_test(test_anakin_batch_norm SRCS test_batch_norm_op.cc DEPS anakin_op_converter batch_norm_op SERIAL) +cc_test(test_anakin_dropout SRCS test_dropout_op.cc DEPS anakin_op_converter dropout_op SERIAL) +#cc_test(test_anakin_im2sequence SRCS test_im2sequence_op.cc DEPS anakin_op_converter im2sequence_op im2col) +cc_test(test_anakin_sum SRCS test_sum_op.cc DEPS anakin_op_converter sum_op selected_rows_functor SERIAL) diff --git a/paddle/fluid/inference/anakin/convert/activation.cc b/paddle/fluid/inference/anakin/convert/activation.cc new file mode 100644 index 0000000000000000000000000000000000000000..a9aeb19ffd5f04c03df593e8f48976e7fa6155ab --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/activation.cc @@ -0,0 +1,56 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/anakin/convert/activation.h" +#include +#include + +using anakin::graph::GraphGlobalMem; +using anakin::AK_FLOAT; +using anakin::saber::NV; +using anakin::saber::Shape; + +namespace paddle { +namespace inference { +namespace anakin { + +ActivationOpConverter::ActivationOpConverter(const std::string &op_type) + : op_type_(op_type) { + auto it = anakin_op_types_.find(op_type_); + PADDLE_ENFORCE(it != anakin_op_types_.end(), + "activation op type is not support"); + anakin_op_type_ = it->second; +} + +void ActivationOpConverter::operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) { + framework::OpDesc op_desc(op, nullptr); + PADDLE_ENFORCE_EQ(op_desc.Input("X").size(), 1); + PADDLE_ENFORCE_EQ(op_desc.Output("Out").size(), 1); + + auto op_name = op_desc.Type() + ":" + op_desc.Output("Out").front(); + auto input_name = op_desc.Input("X").front(); + auto output_name = op_desc.Output("Out").front(); + engine_->AddOp(op_name, "Activation", {input_name}, {output_name}); + engine_->AddOpAttr(op_name, "type", anakin_op_type_); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +REGISTER_ANAKIN_OP_CONVERTER(sigmoid, SigmoidOpConverter); +REGISTER_ANAKIN_OP_CONVERTER(tanh, TanhOpConverter); diff --git a/paddle/fluid/inference/anakin/convert/activation.h b/paddle/fluid/inference/anakin/convert/activation.h new file mode 100644 index 0000000000000000000000000000000000000000..592a3d5bd9d1272aae8a13d0d0acc77f8990c6b3 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/activation.h @@ -0,0 +1,53 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include "paddle/fluid/inference/anakin/convert/op_converter.h" + +namespace paddle { +namespace inference { +namespace anakin { + +class ActivationOpConverter : public AnakinOpConverter { + public: + explicit ActivationOpConverter(const std::string &op_type); + + virtual void operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) override; + virtual ~ActivationOpConverter() {} + + private: + std::string op_type_; + std::string anakin_op_type_; + std::map anakin_op_types_{{"tanh", "TanH"}, + {"sigmoid", "Sigmoid"}}; +}; + +class TanhOpConverter : public ActivationOpConverter { + public: + TanhOpConverter() : ActivationOpConverter("tanh") {} +}; + +class SigmoidOpConverter : public ActivationOpConverter { + public: + SigmoidOpConverter() : ActivationOpConverter("sigmoid") {} +}; +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/convert/batch_norm.cc b/paddle/fluid/inference/anakin/convert/batch_norm.cc new file mode 100644 index 0000000000000000000000000000000000000000..38cf6172027b3b200a378a61b6d5b395cc571de7 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/batch_norm.cc @@ -0,0 +1,128 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/anakin/convert/batch_norm.h" +#include +#include +#include +#include +#include + +using anakin::graph::GraphGlobalMem; +using anakin::AK_FLOAT; +using anakin::saber::NV; +using anakin::saber::Shape; + +namespace paddle { +namespace inference { +namespace anakin { + +void BatchNormOpConverter::operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) { + framework::OpDesc op_desc(op, nullptr); + PADDLE_ENFORCE_EQ(op_desc.Output("Y").size(), 1); + std::map inputs; + for (auto k : {"X", "Scale", "Bias", "Mean", "Variance"}) { + PADDLE_ENFORCE_EQ(op_desc.Input(k).size(), 1UL); + auto v = op_desc.Input(k).front(); + inputs.insert({k, v}); + } + + auto output = op_desc.Output("Y").front(); + auto op_name = op_desc.Type() + ":" + op_desc.Output("Y").front(); + auto epsilon = boost::get(op_desc.GetAttr("epsilon")); + // auto momentum = boost::get(op_desc.GetAttr("momentum")); + + auto bn_op_name = op_name + ":bn"; + auto bn_output = bn_op_name + "_output"; + engine_->AddOp(bn_op_name, "BatchNorm", {inputs["X"]}, {bn_output}); + engine_->AddOpAttr(bn_op_name, "epsilon", epsilon); + engine_->AddOpAttr(bn_op_name, "momentum", static_cast(1.0)); + + auto scale_op_name = op_name + ":scale"; + auto get_lod_tensor = [this, &scope, &op_name](const std::string &var_name, + framework::LoDTensor *tensor) { + auto *v = scope.FindVar(var_name); + PADDLE_ENFORCE_NOT_NULL(v); + auto *t = v->GetMutable(); + tensor->Resize(t->dims()); + TensorCopySync(*t, platform::CPUPlace(), tensor); + }; + + framework::LoDTensor bias_t; + framework::LoDTensor mean_t; + framework::LoDTensor scale_t; + framework::LoDTensor variance_t; + get_lod_tensor(inputs["Bias"], &bias_t); + get_lod_tensor(inputs["Mean"], &mean_t); + get_lod_tensor(inputs["Scale"], &scale_t); + get_lod_tensor(inputs["Variance"], &variance_t); + + auto fill_shape = [](size_t n, std::vector shape) { + shape.insert(shape.begin(), 1); + if (shape.size() < n) { + shape.insert(shape.end(), n - shape.size(), 1); + } + return shape; + }; + Shape shape1(fill_shape(4, framework::vectorize2int(mean_t.dims()))); + Shape shape2(fill_shape(4, framework::vectorize2int(variance_t.dims()))); + auto *weight1 = + GraphGlobalMem::Global().template new_block(shape1); + auto *mean_data = static_cast(weight1->h_tensor().mutable_data()); + std::copy_n(mean_t.data(), mean_t.numel(), mean_data); + engine_->AddOpAttr(bn_op_name, "weight_1", *weight1); + + auto *weight2 = + GraphGlobalMem::Global().template new_block(shape2); + auto *variance_data = + static_cast(weight2->h_tensor().mutable_data()); + std::copy_n(variance_t.data(), variance_t.numel(), variance_data); + engine_->AddOpAttr(bn_op_name, "weight_2", *weight2); + + Shape shape3(std::vector({1, 1, 1, 1})); + auto *weight3 = + GraphGlobalMem::Global().template new_block(shape3); + auto *alpha_data = static_cast(weight3->h_tensor().mutable_data()); + float weight3_data[] = {1}; + std::copy(std::begin(weight3_data), std::end(weight3_data), alpha_data); + engine_->AddOpAttr(bn_op_name, "weight_3", *weight3); + + Shape scale_shape(fill_shape(4, framework::vectorize2int(scale_t.dims()))); + auto *scale = + GraphGlobalMem::Global().template new_block(scale_shape); + auto *scale_data = static_cast(scale->h_tensor().mutable_data()); + std::copy_n(scale_t.data(), scale_t.numel(), scale_data); + + Shape bias_shape(fill_shape(4, framework::vectorize2int(bias_t.dims()))); + auto *bias = + GraphGlobalMem::Global().template new_block(bias_shape); + auto *bias_data = static_cast(bias->h_tensor().mutable_data()); + std::copy_n(bias_t.data(), bias_t.numel(), bias_data); + + engine_->AddOp(scale_op_name, "Scale", {bn_output}, {output}); + engine_->AddOpAttr(scale_op_name, "axis", 1); + engine_->AddOpAttr(scale_op_name, "num_axes", 1); + engine_->AddOpAttr(scale_op_name, "bias_term", true); + engine_->AddOpAttr(scale_op_name, "weight_1", *scale); + engine_->AddOpAttr(scale_op_name, "weight_2", *bias); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +REGISTER_ANAKIN_OP_CONVERTER(batch_norm, BatchNormOpConverter); diff --git a/paddle/fluid/inference/anakin/convert/batch_norm.h b/paddle/fluid/inference/anakin/convert/batch_norm.h new file mode 100644 index 0000000000000000000000000000000000000000..c56735f15b435b46cf9f623bd284b5731a36c327 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/batch_norm.h @@ -0,0 +1,36 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/inference/anakin/convert/op_converter.h" + +namespace paddle { +namespace inference { +namespace anakin { + +class BatchNormOpConverter : public AnakinOpConverter { + public: + BatchNormOpConverter() = default; + + virtual void operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) override; + virtual ~BatchNormOpConverter() {} +}; + +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/convert/concat.cc b/paddle/fluid/inference/anakin/convert/concat.cc new file mode 100644 index 0000000000000000000000000000000000000000..ae90c083690da6e108a05460de68be2eb0cd9b48 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/concat.cc @@ -0,0 +1,52 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/anakin/convert/concat.h" +#include + +using anakin::graph::GraphGlobalMem; +using anakin::AK_FLOAT; +using anakin::Precision; +using anakin::saber::NV; +using anakin::saber::X86; +using anakin::saber::Shape; +using anakin::PBlock; +using anakin::PTuple; + +namespace paddle { +namespace inference { +namespace anakin { + +void ConcatOpConverter::operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) { + framework::OpDesc op_desc(op, nullptr); + int axis = boost::get(op_desc.GetAttr("axis")); + auto input_names = op_desc.Input("X"); + // PADDLE_ENFORCE(axis > 0, + // "The axis attr of Concat op should be large than 0 for trt"); + + auto y_name = op_desc.Output("Out").front(); + auto op_name = op_desc.Type() + ":" + op_desc.Output("Out").front(); + + engine_->AddOp(op_name, "Concat", input_names, {y_name}); + engine_->AddOpAttr(op_name, "axis", axis); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +REGISTER_ANAKIN_OP_CONVERTER(concat, ConcatOpConverter); diff --git a/paddle/fluid/inference/anakin/convert/concat.h b/paddle/fluid/inference/anakin/convert/concat.h new file mode 100644 index 0000000000000000000000000000000000000000..974ff689bfef681f8993d5dbb0dbbbdde91f33bd --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/concat.h @@ -0,0 +1,38 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/inference/anakin/convert/op_converter.h" + +namespace paddle { +namespace inference { +namespace anakin { + +class ConcatOpConverter : public AnakinOpConverter { + public: + ConcatOpConverter() = default; + + virtual void operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) override; + virtual ~ConcatOpConverter() {} + + private: +}; + +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/convert/conv2d.cc b/paddle/fluid/inference/anakin/convert/conv2d.cc new file mode 100644 index 0000000000000000000000000000000000000000..308f14604b9c83f2278499359328109d31f9ff17 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/conv2d.cc @@ -0,0 +1,88 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/anakin/convert/conv2d.h" +#include +#include +#include + +using anakin::graph::GraphGlobalMem; +using anakin::AK_FLOAT; +using anakin::saber::NV; +using anakin::saber::Shape; +using anakin::PTuple; + +namespace paddle { +namespace inference { +namespace anakin { + +void Conv2dOpConverter::operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) { + framework::OpDesc op_desc(op, nullptr); + PADDLE_ENFORCE_EQ(op_desc.Input("Input").size(), 1UL); + PADDLE_ENFORCE_EQ(op_desc.Input("Filter").size(), 1UL); + PADDLE_ENFORCE_EQ(op_desc.Output("Output").size(), 1UL); + + auto input_name = op_desc.Input("Input").front(); + auto output_name = op_desc.Output("Output").front(); + auto op_name = op_desc.Type() + ":" + op_desc.Output("Output").front(); + engine_->AddOp(op_name, "Convolution", {input_name}, {output_name}); + + auto *filter_v = scope.FindVar(op_desc.Input("Filter").front()); + PADDLE_ENFORCE_NOT_NULL(filter_v); + auto *filter_t = filter_v->GetMutable(); + std::unique_ptr weight_tensor( + new framework::LoDTensor()); + weight_tensor->Resize(filter_t->dims()); + TensorCopySync((*filter_t), platform::CPUPlace(), weight_tensor.get()); + + PADDLE_ENFORCE_EQ(weight_tensor->dims().size(), 4UL); + + // const int n_output = weight_tensor->dims()[0]; + // const int n_input = weight_tensor->dims()[1]; + const int filter_h = weight_tensor->dims()[2]; + const int filter_w = weight_tensor->dims()[3]; + // auto filter_num = n_input * filter_h * filter_w ; + auto filter_num = weight_tensor->dims()[0]; + engine_->AddOpAttr(op_name, "filter_num", filter_num); + engine_->AddOpAttr>(op_name, "kernel_size", {filter_h, filter_w}); + auto strides = boost::get>(op_desc.GetAttr("strides")); + engine_->AddOpAttr>(op_name, "strides", strides); + auto paddings = boost::get>(op_desc.GetAttr("paddings")); + engine_->AddOpAttr>(op_name, "padding", paddings); + auto dilations = boost::get>(op_desc.GetAttr("dilations")); + engine_->AddOpAttr>(op_name, "dilation_rate", dilations); + const int groups = boost::get(op_desc.GetAttr("groups")); + engine_->AddOpAttr(op_name, "group", groups); + engine_->AddOpAttr(op_name, "axis", 1); + engine_->AddOpAttr(op_name, "bias_term", false); + + auto weight_shape = framework::vectorize2int(filter_t->dims()); + Shape anakin_shape(weight_shape); + auto *weight1 = + GraphGlobalMem::Global().template new_block(anakin_shape); + float *cpu_data = static_cast(weight1->h_tensor().mutable_data()); + std::copy_n(weight_tensor->data(), weight_tensor->numel(), cpu_data); + weight1->d_tensor().set_shape(anakin_shape); + weight1->d_tensor().copy_from(weight1->h_tensor()); + engine_->AddOpAttr(op_name, "weight_1", *weight1); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +REGISTER_ANAKIN_OP_CONVERTER(conv2d, Conv2dOpConverter); diff --git a/paddle/fluid/inference/anakin/convert/conv2d.h b/paddle/fluid/inference/anakin/convert/conv2d.h new file mode 100644 index 0000000000000000000000000000000000000000..dca5d19f468ac6d6e2f4bcda8ecaa3922d80e6b1 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/conv2d.h @@ -0,0 +1,36 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/inference/anakin/convert/op_converter.h" + +namespace paddle { +namespace inference { +namespace anakin { + +class Conv2dOpConverter : public AnakinOpConverter { + public: + Conv2dOpConverter() = default; + + virtual void operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) override; + virtual ~Conv2dOpConverter() {} +}; + +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/convert/conv2d_fusion.cc b/paddle/fluid/inference/anakin/convert/conv2d_fusion.cc new file mode 100644 index 0000000000000000000000000000000000000000..fa1ab0efeeb5cacd112ca1b644735eaaf49e55f8 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/conv2d_fusion.cc @@ -0,0 +1,114 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/anakin/convert/conv2d_fusion.h" +#include +#include +#include + +using anakin::graph::GraphGlobalMem; +using anakin::AK_FLOAT; +using anakin::saber::NV; +using anakin::saber::Shape; +using anakin::PTuple; + +namespace paddle { +namespace inference { +namespace anakin { + +void Conv2dFusionOpConverter::operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) { + framework::OpDesc op_desc(op, nullptr); + PADDLE_ENFORCE_EQ(op_desc.Input("Input").size(), 1UL); + PADDLE_ENFORCE_EQ(op_desc.Input("Filter").size(), 1UL); + PADDLE_ENFORCE_EQ(op_desc.Input("Bias").size(), 1UL); + PADDLE_ENFORCE_EQ(op_desc.Output("Output").size(), 1UL); + + auto input_name = op_desc.Input("Input").front(); + auto output_name = op_desc.Output("Output").front(); + auto op_name = op_desc.Type() + ":" + op_desc.Output("Output").front(); + engine_->AddOp(op_name, "Convolution", {input_name}, {output_name}); + + auto *filter_v = scope.FindVar(op_desc.Input("Filter").front()); + PADDLE_ENFORCE_NOT_NULL(filter_v); + auto *filter_t = filter_v->GetMutable(); + + auto *b_v = scope.FindVar(op_desc.Input("Bias").front()); + PADDLE_ENFORCE_NOT_NULL(b_v); + auto *b_t = b_v->GetMutable(); + + std::unique_ptr weight_tensor( + new framework::LoDTensor()); + weight_tensor->Resize(filter_t->dims()); + TensorCopySync((*filter_t), platform::CPUPlace(), weight_tensor.get()); + + PADDLE_ENFORCE_EQ(weight_tensor->dims().size(), 4UL); + + // const int n_output = weight_tensor->dims()[0]; + // const int n_input = weight_tensor->dims()[1]; + const int filter_h = weight_tensor->dims()[2]; + const int filter_w = weight_tensor->dims()[3]; + // auto filter_num = n_input * filter_h * filter_w ; + auto filter_num = weight_tensor->dims()[0]; + engine_->AddOpAttr(op_name, "filter_num", filter_num); + engine_->AddOpAttr>(op_name, "kernel_size", {filter_h, filter_w}); + auto strides = boost::get>(op_desc.GetAttr("strides")); + engine_->AddOpAttr>(op_name, "strides", strides); + auto paddings = boost::get>(op_desc.GetAttr("paddings")); + engine_->AddOpAttr>(op_name, "padding", paddings); + auto dilations = boost::get>(op_desc.GetAttr("dilations")); + engine_->AddOpAttr>(op_name, "dilation_rate", dilations); + const int groups = boost::get(op_desc.GetAttr("groups")); + engine_->AddOpAttr(op_name, "group", groups); + engine_->AddOpAttr(op_name, "axis", 1); + engine_->AddOpAttr(op_name, "bias_term", true); + + auto weight_shape = framework::vectorize2int(filter_t->dims()); + Shape anakin_shape(weight_shape); + auto *weight1 = + GraphGlobalMem::Global().template new_block(anakin_shape); + float *cpu_data = static_cast(weight1->h_tensor().mutable_data()); + std::copy_n(weight_tensor->data(), weight_tensor->numel(), cpu_data); + weight1->d_tensor().set_shape(anakin_shape); + weight1->d_tensor().copy_from(weight1->h_tensor()); + engine_->AddOpAttr(op_name, "weight_1", *weight1); + + auto bias_shape = framework::vectorize2int(b_t->dims()); + framework::LoDTensor bias_tensor; + bias_tensor.Resize(b_t->dims()); + TensorCopySync((*b_t), platform::CPUPlace(), &bias_tensor); + auto *bias_data = bias_tensor.data(); + bias_shape.insert(bias_shape.begin(), 1); + bias_shape.insert(bias_shape.begin(), 1); + bias_shape.insert(bias_shape.begin(), 1); + // bias_shape.push_back(1); + // bias_shape.push_back(1); + Shape anakin_bias_shape(bias_shape); + + auto *weight2 = GraphGlobalMem::Global().template new_block( + anakin_bias_shape); + float *cpu_data2 = static_cast(weight2->h_tensor().mutable_data()); + std::copy_n(bias_data, bias_tensor.numel(), cpu_data2); + weight2->d_tensor().set_shape(anakin_bias_shape); + weight2->d_tensor().copy_from(weight2->h_tensor()); + engine_->AddOpAttr(op_name, "weight_2", *weight2); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +REGISTER_ANAKIN_OP_CONVERTER(conv2d_fusion, Conv2dFusionOpConverter); diff --git a/paddle/fluid/inference/anakin/convert/conv2d_fusion.h b/paddle/fluid/inference/anakin/convert/conv2d_fusion.h new file mode 100644 index 0000000000000000000000000000000000000000..0d9ef28183b309c4b50714fcbe64e24c5d9dfbaa --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/conv2d_fusion.h @@ -0,0 +1,36 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/inference/anakin/convert/op_converter.h" + +namespace paddle { +namespace inference { +namespace anakin { + +class Conv2dFusionOpConverter : public AnakinOpConverter { + public: + Conv2dFusionOpConverter() = default; + + virtual void operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) override; + virtual ~Conv2dFusionOpConverter() {} +}; + +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/convert/density_prior_box.cc b/paddle/fluid/inference/anakin/convert/density_prior_box.cc new file mode 100644 index 0000000000000000000000000000000000000000..30796f7592427191a4396a154be62838b7e666ad --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/density_prior_box.cc @@ -0,0 +1,107 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/anakin/convert/density_prior_box.h" +#include +#include +#include + +using anakin::graph::GraphGlobalMem; +using anakin::AK_FLOAT; +using anakin::saber::NV; +using anakin::saber::Shape; +using anakin::PTuple; + +namespace paddle { +namespace inference { +namespace anakin { + +void DensityPriorBoxOpConverter::operator()( + const framework::proto::OpDesc& op, const framework::BlockDesc& block_desc, + const framework::Scope& scope, bool test_mode) { + framework::OpDesc op_desc(op, nullptr); + auto input_name = op_desc.Input("Input").front(); + auto image_name = op_desc.Input("Image").front(); + auto output_name = op_desc.Output("Boxes").front(); + auto op_type = op_desc.Type(); + auto op_name = op_type + ":" + op_desc.Output("Boxes").front(); + + // only for density_prior_box + std::vector fixed_sizes = {}; + std::vector fixed_ratios = {}; + std::vector densities = {}; + + std::vector min_sizes = {}; + std::vector max_sizes = {}; + std::vector aspect_ratios = {}; + bool is_clip = false; + bool is_flip = false; + + if (op_type == "density_prior_box") { + fixed_sizes = + boost::get>(op_desc.GetAttr("fixed_sizes")); + fixed_ratios = + boost::get>(op_desc.GetAttr("fixed_ratios")); + densities = boost::get>(op_desc.GetAttr("densities")); + is_clip = boost::get(op_desc.GetAttr("clip")); + } else if (op_type == "prior_box") { + min_sizes = boost::get>(op_desc.GetAttr("min_sizes")); + max_sizes = boost::get>(op_desc.GetAttr("max_sizes")); + aspect_ratios = + boost::get>(op_desc.GetAttr("aspect_ratios")); + is_clip = boost::get(op_desc.GetAttr("clip")); + is_flip = boost::get(op_desc.GetAttr("flip")); + } + std::vector dens; + for (auto& ele : densities) { + dens.push_back(static_cast(ele)); + } + + auto variances = boost::get>(op_desc.GetAttr("variances")); + + // lack img_h, img_w + auto step_h = boost::get(op_desc.GetAttr("step_h")); + auto step_w = boost::get(op_desc.GetAttr("step_w")); + auto offset = boost::get(op_desc.GetAttr("offset")); + PTuple t_order; + t_order.push_back("MIN"); + t_order.push_back("COM"); + t_order.push_back("MAX"); + + std::vector temp_v = {}; + + engine_->AddOp(op_name, "PriorBox", {input_name, image_name}, {output_name}); + engine_->AddOpAttr>(op_name, "min_size", min_sizes); + engine_->AddOpAttr>(op_name, "max_size", max_sizes); + engine_->AddOpAttr>(op_name, "aspect_ratio", aspect_ratios); + engine_->AddOpAttr>(op_name, "fixed_size", fixed_sizes); + engine_->AddOpAttr>(op_name, "fixed_ratio", fixed_ratios); + engine_->AddOpAttr>(op_name, "density", dens); + engine_->AddOpAttr(op_name, "is_flip", is_flip); + engine_->AddOpAttr(op_name, "is_clip", is_clip); + engine_->AddOpAttr>(op_name, "variance", variances); + engine_->AddOpAttr(op_name, "img_h", static_cast(0)); + engine_->AddOpAttr(op_name, "img_w", static_cast(0)); + engine_->AddOpAttr(op_name, "step_h", step_h); + engine_->AddOpAttr(op_name, "step_w", step_w); + engine_->AddOpAttr(op_name, "offset", offset); + engine_->AddOpAttr>(op_name, "order", t_order); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +REGISTER_ANAKIN_OP_CONVERTER(density_prior_box, DensityPriorBoxOpConverter); +REGISTER_ANAKIN_OP_CONVERTER(prior_box, DensityPriorBoxOpConverter); diff --git a/paddle/fluid/inference/anakin/convert/density_prior_box.h b/paddle/fluid/inference/anakin/convert/density_prior_box.h new file mode 100644 index 0000000000000000000000000000000000000000..bf9210711a0f69595c241803cd40d42770ccd5d7 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/density_prior_box.h @@ -0,0 +1,38 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include "paddle/fluid/inference/anakin/convert/op_converter.h" + +namespace paddle { +namespace inference { +namespace anakin { + +class DensityPriorBoxOpConverter : public AnakinOpConverter { + public: + DensityPriorBoxOpConverter() = default; + + virtual void operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) override; + virtual ~DensityPriorBoxOpConverter() {} +}; + +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/convert/detection_out.cc b/paddle/fluid/inference/anakin/convert/detection_out.cc new file mode 100644 index 0000000000000000000000000000000000000000..262ad28a654609cddde979d387621bb0c7c1a7f9 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/detection_out.cc @@ -0,0 +1,73 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/anakin/convert/detection_out.h" +#include +#include + +using anakin::graph::GraphGlobalMem; +using anakin::AK_FLOAT; +using anakin::saber::NV; +using anakin::saber::Shape; + +namespace paddle { +namespace inference { +namespace anakin { + +void DetectionOutOpConverter::operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) { + framework::OpDesc op_desc(op, nullptr); + auto target_name = op_desc.Input("TargetBox").front(); + auto prior_box_name = op_desc.Input("PriorBox").front(); + auto scores_name = op_desc.Input("Scores").front(); + auto output_name = op_desc.Output("Out").front(); + + auto op_name = op_desc.Type() + ":" + op_desc.Output("Out").front(); + + auto code_type = boost::get(op_desc.GetAttr("code_type")); + auto background_label = boost::get(op_desc.GetAttr("background_label")); + auto score_threshold = boost::get(op_desc.GetAttr("score_threshold")); + auto nms_top_k = boost::get(op_desc.GetAttr("nms_top_k")); + auto nms_threshold = boost::get(op_desc.GetAttr("nms_threshold")); + auto nms_eta = boost::get(op_desc.GetAttr("nms_eta")); + auto keep_top_k = boost::get(op_desc.GetAttr("keep_top_k")); + std::string anakin_code_type; + if (code_type == "decode_center_size") { + anakin_code_type = "CENTER_SIZE"; + } else if (code_type == "encode_center_size") { + PADDLE_THROW( + "Not support encode_center_size code_type in DetectionOut of anakin"); + } + + engine_->AddOp(op_name, "DetectionOutput", + {target_name, scores_name, prior_box_name}, {output_name}); + engine_->AddOpAttr(op_name, "share_location", true); + engine_->AddOpAttr(op_name, "variance_encode_in_target", false); + engine_->AddOpAttr(op_name, "class_num", static_cast(0)); + engine_->AddOpAttr(op_name, "background_id", background_label); + engine_->AddOpAttr(op_name, "keep_top_k", keep_top_k); + engine_->AddOpAttr(op_name, "code_type", anakin_code_type); + engine_->AddOpAttr(op_name, "conf_thresh", score_threshold); + engine_->AddOpAttr(op_name, "nms_top_k", nms_top_k); + engine_->AddOpAttr(op_name, "nms_thresh", nms_threshold); + engine_->AddOpAttr(op_name, "nms_eta", nms_eta); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +REGISTER_ANAKIN_OP_CONVERTER(detection_out, DetectionOutOpConverter); diff --git a/paddle/fluid/inference/anakin/convert/detection_out.h b/paddle/fluid/inference/anakin/convert/detection_out.h new file mode 100644 index 0000000000000000000000000000000000000000..ca78f10fdc2a7c7064ae0399e7f1afff1383ce67 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/detection_out.h @@ -0,0 +1,38 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include "paddle/fluid/inference/anakin/convert/op_converter.h" + +namespace paddle { +namespace inference { +namespace anakin { + +class DetectionOutOpConverter : public AnakinOpConverter { + public: + DetectionOutOpConverter() = default; + + virtual void operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) override; + virtual ~DetectionOutOpConverter() {} +}; + +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/convert/dropout.cc b/paddle/fluid/inference/anakin/convert/dropout.cc new file mode 100644 index 0000000000000000000000000000000000000000..bc9b26dcf2733369e558cde2954e9d0caaba86b0 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/dropout.cc @@ -0,0 +1,67 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/anakin/convert/dropout.h" +#include +#include +#include + +using anakin::graph::GraphGlobalMem; +using anakin::AK_FLOAT; +using anakin::Precision; +using anakin::saber::NV; +using anakin::saber::X86; +using anakin::saber::Shape; +using anakin::PBlock; +using anakin::PTuple; + +namespace paddle { +namespace inference { +namespace anakin { + +void DropoutOpConverter::operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) { + framework::OpDesc op_desc(op, nullptr); + PADDLE_ENFORCE_EQ(op_desc.Input("X").size(), 1); + PADDLE_ENFORCE_EQ(op_desc.Output("Mask").size(), 1); + PADDLE_ENFORCE_EQ(op_desc.Output("Out").size(), 1); + + auto x_name = op_desc.Input("X").front(); + auto out_name = op_desc.Output("Out").front(); + auto op_name = op_desc.Type() + ":" + op_desc.Output("Out").front(); + + engine_->AddOp(op_name, "Scale", {x_name}, {out_name}); + + auto dropout_prob = boost::get(op_desc.GetAttr("dropout_prob")); + auto factor = 1 - dropout_prob; + Shape shape1(std::vector({1, 1, 1, 1})); + auto *weight1 = + GraphGlobalMem::Global().template new_block(shape1); + auto *factor_data = static_cast(weight1->h_tensor().mutable_data()); + float weight1_data[] = {factor}; + std::copy(std::begin(weight1_data), std::end(weight1_data), factor_data); + + engine_->AddOpAttr(op_name, "weight_1", *weight1); + engine_->AddOpAttr(op_name, "axis", 0); + engine_->AddOpAttr(op_name, "num_axes", 0); + engine_->AddOpAttr(op_name, "bias_term", false); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +REGISTER_ANAKIN_OP_CONVERTER(dropout, DropoutOpConverter); diff --git a/paddle/fluid/inference/anakin/convert/dropout.h b/paddle/fluid/inference/anakin/convert/dropout.h new file mode 100644 index 0000000000000000000000000000000000000000..11412e217ef5fa77bd22d7530d88be1347f2616f --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/dropout.h @@ -0,0 +1,38 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/inference/anakin/convert/op_converter.h" + +namespace paddle { +namespace inference { +namespace anakin { + +class DropoutOpConverter : public AnakinOpConverter { + public: + DropoutOpConverter() = default; + + virtual void operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) override; + virtual ~DropoutOpConverter() {} + + private: +}; + +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/convert/elementwise.cc b/paddle/fluid/inference/anakin/convert/elementwise.cc new file mode 100644 index 0000000000000000000000000000000000000000..fe9a896d8266e06250b712be0c75290c039e9a08 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/elementwise.cc @@ -0,0 +1,88 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/anakin/convert/elementwise.h" +#include +#include +#include + +using anakin::graph::GraphGlobalMem; +using anakin::AK_FLOAT; +using anakin::Precision; +using anakin::saber::NV; +using anakin::saber::X86; +using anakin::saber::Shape; +using anakin::PBlock; +using anakin::PTuple; + +namespace paddle { +namespace inference { +namespace anakin { + +void ElementwiseAddOpConverter::operator()( + const framework::proto::OpDesc &op, const framework::BlockDesc &block_desc, + const framework::Scope &scope, bool test_mode) { + framework::OpDesc op_desc(op, nullptr); + PADDLE_ENFORCE_EQ(op_desc.Input("X").size(), 1); + PADDLE_ENFORCE_EQ(op_desc.Input("Y").size(), 1); + PADDLE_ENFORCE_EQ(op_desc.Output("Out").size(), 1); + + auto x_name = op_desc.Input("X").front(); + auto y_name = op_desc.Input("Y").front(); + auto out_name = op_desc.Output("Out").front(); + auto op_name = op_desc.Type() + ":" + op_desc.Output("Out").front(); + + engine_->AddOp(op_name, "Eltwise", {x_name, y_name}, {out_name}); + std::string elementwise_type = "Add"; + engine_->AddOpAttr(op_name, "type", elementwise_type); + std::vector coeff = {1.0, 1.0}; + engine_->AddOpAttr>(op_name, "coeff", coeff); +} + +void ElementwiseMulOpConverter::operator()( + const framework::proto::OpDesc &op, const framework::BlockDesc &block_desc, + const framework::Scope &scope, bool test_mode) { + framework::OpDesc op_desc(op, nullptr); + PADDLE_ENFORCE_EQ(op_desc.Input("X").size(), 1); + PADDLE_ENFORCE_EQ(op_desc.Input("Y").size(), 1); + PADDLE_ENFORCE_EQ(op_desc.Output("Out").size(), 1); + + auto x_name = op_desc.Input("X").front(); + auto y_name = op_desc.Input("Y").front(); + auto out_name = op_desc.Output("Out").front(); + auto op_name = op_desc.Type() + ":" + op_desc.Output("Out").front(); + + engine_->AddOp(op_name, "Scale", {x_name, y_name}, {out_name}); + // Fill a number to weight_1 as a placeholder. + Shape shape1(std::vector({1, 1, 1, 1})); + auto *weight1 = + GraphGlobalMem::Global().template new_block(shape1); + auto *placeholder_data = + static_cast(weight1->h_tensor().mutable_data()); + float weight1_data[] = {1}; + std::copy(std::begin(weight1_data), std::end(weight1_data), placeholder_data); + engine_->AddOpAttr(op_name, "weight_1", *weight1); + + auto axis = boost::get(op_desc.GetAttr("axis")); + engine_->AddOpAttr(op_name, "axis", axis); + engine_->AddOpAttr(op_name, "num_axes", 1); + engine_->AddOpAttr(op_name, "bias_term", false); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +REGISTER_ANAKIN_OP_CONVERTER(elementwise_add, ElementwiseAddOpConverter); +REGISTER_ANAKIN_OP_CONVERTER(elementwise_mul, ElementwiseMulOpConverter); diff --git a/paddle/fluid/inference/anakin/convert/elementwise.h b/paddle/fluid/inference/anakin/convert/elementwise.h new file mode 100644 index 0000000000000000000000000000000000000000..e4664493a9d3ce1ed9a0c79a05fb466c4e781b3e --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/elementwise.h @@ -0,0 +1,51 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/inference/anakin/convert/op_converter.h" + +namespace paddle { +namespace inference { +namespace anakin { + +class ElementwiseAddOpConverter : public AnakinOpConverter { + public: + ElementwiseAddOpConverter() = default; + + virtual void operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) override; + virtual ~ElementwiseAddOpConverter() {} + + private: +}; + +class ElementwiseMulOpConverter : public AnakinOpConverter { + public: + ElementwiseMulOpConverter() = default; + + virtual void operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) override; + virtual ~ElementwiseMulOpConverter() {} + + private: +}; + +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/convert/fc.cc b/paddle/fluid/inference/anakin/convert/fc.cc new file mode 100644 index 0000000000000000000000000000000000000000..a80a1a47e91aa085935b5febb3858e028f396091 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/fc.cc @@ -0,0 +1,122 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/anakin/convert/fc.h" +#include +#include +#include + +using anakin::graph::GraphGlobalMem; +using anakin::AK_FLOAT; +using anakin::saber::NV; +using anakin::saber::Shape; + +namespace paddle { +namespace inference { +namespace anakin { + +void FcBaseOpConverter::operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) { + framework::OpDesc op_desc(op, nullptr); + auto input_names = op_desc.InputNames(); + bool with_bias = input_names.size() == 3; + + std::string w_name = "Y"; + std::string i_name = "X"; + if (with_bias) { + w_name = "W"; + i_name = "Input"; + } + + auto op_name = op_desc.Type() + ":" + op_desc.Output("Out").front(); + + // get weights + auto *y_v = scope.FindVar(op_desc.Input(w_name).front()); + PADDLE_ENFORCE_NOT_NULL(y_v); + auto *y_t = y_v->GetMutable(); + + auto input_name = op_desc.Input(i_name).front(); + auto output_name = op_desc.Output("Out").front(); + + engine_->AddOp(op_name, "Dense", {input_name}, {output_name}); + engine_->AddOpAttr(op_name, "bias_term", with_bias); + engine_->AddOpAttr(op_name, "axis", 1); + + auto weight_shape = framework::vectorize2int(y_t->dims()); + int out_dim = weight_shape[1]; + engine_->AddOpAttr(op_name, "out_dim", out_dim); + const int w_m = weight_shape[0]; + const int w_k = weight_shape[1]; + + if (weight_shape.size() < 4UL) { + weight_shape.insert(weight_shape.begin(), 4UL - weight_shape.size(), 1); + } + Shape anakin_shape(weight_shape); + + framework::LoDTensor weight_tensor; + weight_tensor.Resize(y_t->dims()); + TensorCopySync((*y_t), platform::CPUPlace(), &weight_tensor); + auto *weight_data = weight_tensor.data(); + PADDLE_ENFORCE(w_m * w_k == weight_tensor.numel()); + + std::vector trans_weight_data(weight_tensor.numel()); + for (int i = 0; i < w_m; i++) { + for (int j = 0; j < w_k; j++) { + trans_weight_data[i + j * w_m] = weight_data[i * w_k + j]; + } + } + auto *weight1 = + GraphGlobalMem::Global().template new_block(anakin_shape); + float *cpu_data = static_cast(weight1->h_tensor().mutable_data()); + std::copy_n(trans_weight_data.data(), weight_tensor.numel(), cpu_data); + weight1->d_tensor().set_shape(anakin_shape); + weight1->d_tensor().copy_from(weight1->h_tensor()); + engine_->AddOpAttr(op_name, "weight_1", *weight1); + + // get bias + if (with_bias) { + auto *b_v = scope.FindVar(op_desc.Input("Bias").front()); + PADDLE_ENFORCE_NOT_NULL(b_v); + auto *b_t = b_v->GetMutable(); + + auto bias_shape = framework::vectorize2int(b_t->dims()); + framework::LoDTensor bias_tensor; + bias_tensor.Resize(b_t->dims()); + TensorCopySync((*b_t), platform::CPUPlace(), &bias_tensor); + auto *bias_data = bias_tensor.data(); + bias_shape.insert(bias_shape.begin(), 1); + bias_shape.insert(bias_shape.begin(), 1); + bias_shape.insert(bias_shape.begin(), 1); + // bias_shape.push_back(1); + // bias_shape.push_back(1); + Shape anakin_bias_shape(bias_shape); + + auto *weight2 = GraphGlobalMem::Global().template new_block( + anakin_bias_shape); + float *cpu_data2 = static_cast(weight2->h_tensor().mutable_data()); + std::copy_n(bias_data, bias_tensor.numel(), cpu_data2); + weight2->d_tensor().set_shape(anakin_bias_shape); + weight2->d_tensor().copy_from(weight2->h_tensor()); + engine_->AddOpAttr(op_name, "weight_2", *weight2); + } +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +REGISTER_ANAKIN_OP_CONVERTER(mul, MulOpConverter); +REGISTER_ANAKIN_OP_CONVERTER(fc, FcOpConverter); diff --git a/paddle/fluid/inference/anakin/convert/fc.h b/paddle/fluid/inference/anakin/convert/fc.h new file mode 100644 index 0000000000000000000000000000000000000000..fb461908b35e0111065e1a46c52306c64ace7d7c --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/fc.h @@ -0,0 +1,48 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/inference/anakin/convert/op_converter.h" + +namespace paddle { +namespace inference { +namespace anakin { + +class FcBaseOpConverter : public AnakinOpConverter { + public: + FcBaseOpConverter() = default; + + virtual void operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) override; + virtual ~FcBaseOpConverter() {} +}; + +// with bias +class FcOpConverter : public FcBaseOpConverter { + public: + FcOpConverter() = default; +}; + +// without bias +class MulOpConverter : public FcBaseOpConverter { + public: + MulOpConverter() = default; +}; + +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/convert/flatten.cc b/paddle/fluid/inference/anakin/convert/flatten.cc new file mode 100644 index 0000000000000000000000000000000000000000..7f5c1510960d1014c33bd565939812fe7c7dfc06 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/flatten.cc @@ -0,0 +1,52 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/anakin/convert/flatten.h" +#include + +using anakin::graph::GraphGlobalMem; +using anakin::AK_FLOAT; +using anakin::saber::NV; +using anakin::saber::Shape; +using anakin::PTuple; + +namespace paddle { +namespace inference { +namespace anakin { + +void FlattenOpConverter::operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) { + framework::OpDesc op_desc(op, nullptr); + PADDLE_ENFORCE_EQ(op_desc.Input("X").size(), 1UL); + PADDLE_ENFORCE_EQ(op_desc.Output("Out").size(), 1UL); + + auto input = op_desc.Input("X").front(); + auto output = op_desc.Output("Out").front(); + int axis = boost::get(op_desc.GetAttr("axis")); + PADDLE_ENFORCE(axis == 1, + "the anakin flatten op converter now only support aixs == 1."); + + std::vector out_dims = {0, -1, 1, 1}; + auto op_name = op_desc.Type() + ":" + op_desc.Output("Out").front(); + engine_->AddOp(op_name, "Reshape", {input}, {output}); + engine_->AddOpAttr>(op_name, "dims", out_dims); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +REGISTER_ANAKIN_OP_CONVERTER(flatten, FlattenOpConverter); diff --git a/paddle/fluid/inference/anakin/convert/flatten.h b/paddle/fluid/inference/anakin/convert/flatten.h new file mode 100644 index 0000000000000000000000000000000000000000..c9cc0006eb2448917bbcc0952f5e2cae72b73de1 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/flatten.h @@ -0,0 +1,36 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/inference/anakin/convert/op_converter.h" + +namespace paddle { +namespace inference { +namespace anakin { + +class FlattenOpConverter : public AnakinOpConverter { + public: + FlattenOpConverter() = default; + + virtual void operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) override; + virtual ~FlattenOpConverter() {} +}; + +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/convert/im2sequence.cc b/paddle/fluid/inference/anakin/convert/im2sequence.cc new file mode 100644 index 0000000000000000000000000000000000000000..2cc330c3829f6033229748523c3df750b951626f --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/im2sequence.cc @@ -0,0 +1,63 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/anakin/convert/im2sequence.h" +#include +#include +#include + +using anakin::graph::GraphGlobalMem; +using anakin::AK_FLOAT; +using anakin::Precision; +using anakin::saber::NV; +using anakin::saber::X86; +using anakin::saber::Shape; +using anakin::PBlock; +using anakin::PTuple; + +namespace paddle { +namespace inference { +namespace anakin { + +void Im2SequenceConverter::operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) { + framework::OpDesc op_desc(op, nullptr); + PADDLE_ENFORCE_EQ(op_desc.Input("X").size(), 1); + PADDLE_ENFORCE_EQ(op_desc.Output("Y").size(), 0); + PADDLE_ENFORCE_EQ(op_desc.Output("Out").size(), 1); + + auto x_name = op_desc.Input("X").front(); + auto out_name = op_desc.Output("Out").front(); + auto op_name = op_desc.Type() + ":" + op_desc.Output("Out").front(); + + engine_->AddOp(op_name, "Im2Sequence", {x_name}, {out_name}); + + std::vector dilations = {1, 1}; + auto paddings = boost::get>(op_desc.GetAttr("paddings")); + auto strides = boost::get>(op_desc.GetAttr("strides")); + auto kernels = boost::get>(op_desc.GetAttr("kernels")); + + engine_->AddOpAttr>(op_name, "paddings", paddings); + engine_->AddOpAttr>(op_name, "strides", strides); + engine_->AddOpAttr>(op_name, "window_size", kernels); + engine_->AddOpAttr>(op_name, "dilations", dilations); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +REGISTER_ANAKIN_OP_CONVERTER(im2sequence, Im2SequenceConverter); diff --git a/paddle/fluid/inference/anakin/convert/im2sequence.h b/paddle/fluid/inference/anakin/convert/im2sequence.h new file mode 100644 index 0000000000000000000000000000000000000000..714679c1d9601136f1f54287bb58d611e852f3fe --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/im2sequence.h @@ -0,0 +1,38 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/inference/anakin/convert/op_converter.h" + +namespace paddle { +namespace inference { +namespace anakin { + +class Im2SequenceConverter : public AnakinOpConverter { + public: + Im2SequenceConverter() = default; + + virtual void operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) override; + virtual ~Im2SequenceConverter() {} + + private: +}; + +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/convert/op_converter.h b/paddle/fluid/inference/anakin/convert/op_converter.h new file mode 100644 index 0000000000000000000000000000000000000000..1ca62658ef26ffebcc068c91ece7d9bbed0a348f --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/op_converter.h @@ -0,0 +1,167 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include +#include +#include "framework/core/types.h" +#include "paddle/fluid/framework/block_desc.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/inference/anakin/engine.h" +#include "paddle/fluid/inference/utils/singleton.h" +#include "saber/saber_types.h" + +namespace paddle { +namespace inference { +namespace anakin { + +using AnakinNvEngine = + AnakinEngine<::anakin::saber::NV, ::anakin::Precision::FP32>; + +class AnakinOpConverter { + public: + AnakinOpConverter() = default; + + virtual void operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, bool test_mode) {} + void ConvertOp(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const std::unordered_set ¶meters, + const framework::Scope &scope, AnakinNvEngine *engine, + bool test_mode = false) { + framework::OpDesc op_desc(op, nullptr); + std::string op_type = op_desc.Type(); + AnakinOpConverter *it = nullptr; + if (op_type == "depthwise_conv2d") op_type = "conv2d"; + if (op_type == "reshape2") op_type = "reshape"; + if (op_type == "transpose2") op_type = "transpose"; + if (op_type == "flatten2") op_type = "flatten"; + + if (!it) { + it = Registry::Global().Lookup(op_type); + } + PADDLE_ENFORCE_NOT_NULL(it, "no OpConverter for optype [%s]", op_type); + it->SetEngine(engine); + (*it)(op, block_desc, scope, test_mode); + } + + void ConvertBlock(framework::BlockDesc *block_desc, + const std::unordered_set ¶meters, + const framework::Scope &scope, AnakinNvEngine *engine) { + std::unique_lock lock(mutex_); + framework::proto::BlockDesc *block = block_desc->Proto(); + for (auto i = 0; i < block->ops_size(); i++) { + auto &op = block->ops(i); + ConvertOp(op, *block_desc, parameters, scope, engine); + } + } + + // The scope here should be inited with the parameter vars. + void ConvertBlockToAnakinEngine( + framework::BlockDesc *block_desc, framework::Scope *scope, + const std::vector &inputs, + const std::unordered_set ¶meters, + const std::vector &outputs, AnakinNvEngine *engine) { + ConvertBlock(block_desc, parameters, *scope, engine); + engine->Freeze(); + // if the max_batch size + int max_batch_size = engine->GetMaxBatchSize(); + PADDLE_ENFORCE(max_batch_size > 0, + "the max_batch_size setted from config->EnableAnakinEngine " + "must largger than 0"); + // If the user does not specify this variable, we use the input shape from + // the block_desc. + auto max_input_shape = engine->GetMaxInputShape(); + std::map> temp_max_input_shape; + + for (auto &input : inputs) { + if (parameters.count(input)) continue; + std::vector input_shape; + input_shape.resize(4); + input_shape[0] = max_batch_size; + if (max_input_shape.count(input)) { + PADDLE_ENFORCE(max_input_shape[input].size() == 4, + "the dimensions of max_input_shape setted from " + "config->EnableAnakinEngine must be 4"); + for (int i = 1; i < 4; i++) { + input_shape[i] = max_input_shape[input][i]; + } + } else { + auto *var = block_desc->FindVar(input); + PADDLE_ENFORCE(var, "no variable called %s", input); + + auto var_shape = var->GetShape(); + std::cout << "input :" << input << std::endl; + PADDLE_ENFORCE(var_shape.size() == 4); + + for (size_t i = 1; i < var_shape.size(); i++) { + input_shape[i] = var_shape[i]; + } + } + temp_max_input_shape[input] = input_shape; + engine->SetInputShape(input, input_shape); + engine->Graph()->RegistVar(input); // For share from data. + } + engine->SetMaxInputShape(temp_max_input_shape); + engine->Optimize(); + + // For anakin share with fluid tensor. + engine->AllocTmpMem(); + engine->InitGraph(); + } + + void SetEngine(AnakinNvEngine *engine) { engine_ = engine; } + virtual ~AnakinOpConverter() {} + + protected: + bool test_mode_; + AnakinNvEngine *engine_{nullptr}; + + private: + std::unordered_map converters_; + framework::Scope *scope_{nullptr}; + std::mutex mutex_; +}; + +} // namespace anakin +} // namespace inference +} // namespace paddle + +#define REGISTER_ANAKIN_OP_CONVERTER(op_type__, Converter__) \ + struct anakin_##op_type__##_converter \ + : public ::paddle::framework::Registrar { \ + anakin_##op_type__##_converter() { \ + LOG(INFO) << "register convert " << #op_type__; \ + ::paddle::inference::Registry< \ + ::paddle::inference::anakin::AnakinOpConverter>::Global() \ + .Register<::paddle::inference::anakin::Converter__>(#op_type__); \ + } \ + }; \ + anakin_##op_type__##_converter anakin_##op_type__##_converter__; \ + int TouchConverterRegister_anakin_##op_type__() { \ + anakin_##op_type__##_converter__.Touch(); \ + return 0; \ + } + +#define USE_ANAKIN_CONVERTER(op_type__) \ + extern int TouchConverterRegister_anakin_##op_type__(); \ + int use_op_converter_anakin_##op_type__ __attribute__((unused)) = \ + TouchConverterRegister_anakin_##op_type__(); diff --git a/paddle/fluid/inference/anakin/convert/pool2d.cc b/paddle/fluid/inference/anakin/convert/pool2d.cc new file mode 100644 index 0000000000000000000000000000000000000000..87eefe712a5ad2acd8c9b5abe521c832ad2c1ef2 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/pool2d.cc @@ -0,0 +1,81 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/anakin/convert/pool2d.h" +#include +#include +#include + +using anakin::graph::GraphGlobalMem; +using anakin::AK_FLOAT; +using anakin::Precision; +using anakin::saber::NV; +using anakin::saber::X86; +using anakin::saber::Shape; +using anakin::PBlock; +using anakin::PTuple; + +namespace paddle { +namespace inference { +namespace anakin { + +void Pool2dOpConverter::operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) { + framework::OpDesc op_desc(op, nullptr); + PADDLE_ENFORCE_EQ(op_desc.Input("X").size(), 1); + PADDLE_ENFORCE_EQ(op_desc.Output("Out").size(), 1); + + auto x_name = op_desc.Input("X").front(); + auto y_name = op_desc.Output("Out").front(); + auto op_name = op_desc.Type() + ":" + op_desc.Output("Out").front(); + + bool global_pooling = boost::get(op_desc.GetAttr("global_pooling")); + std::string pool_type = + boost::get(op_desc.GetAttr("pooling_type")); + std::vector ksize = + boost::get>(op_desc.GetAttr("ksize")); + std::vector strides = + boost::get>(op_desc.GetAttr("strides")); + std::vector paddings = + boost::get>(op_desc.GetAttr("paddings")); + bool ceil_mode = boost::get(op_desc.GetAttr("ceil_mode")); + std::string anakin_pool_type; + if (pool_type == "max") { + anakin_pool_type = "MAX"; + } else if (pool_type == "avg") { + if (paddings[0] || paddings[1]) { + anakin_pool_type = "AVGEXC"; + } else { + anakin_pool_type = "AVG"; + } + } else { + PADDLE_THROW("TensorRT unsupported pooling type!"); + } + + engine_->AddOp(op_name, "Pooling", {x_name}, {y_name}); + engine_->AddOpAttr>(op_name, "pool_size", ksize); + engine_->AddOpAttr>(op_name, "strides", strides); + engine_->AddOpAttr>(op_name, "padding", paddings); + engine_->AddOpAttr(op_name, "method", anakin_pool_type); + engine_->AddOpAttr(op_name, "global_pooling", global_pooling); + engine_->AddOpAttr(op_name, "cmp_out_shape_floor_as_conv", !ceil_mode); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +REGISTER_ANAKIN_OP_CONVERTER(pool2d, Pool2dOpConverter); diff --git a/paddle/fluid/inference/anakin/convert/pool2d.h b/paddle/fluid/inference/anakin/convert/pool2d.h new file mode 100644 index 0000000000000000000000000000000000000000..ec28e48ac848eff1d37c39063725624bf7d65723 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/pool2d.h @@ -0,0 +1,38 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/inference/anakin/convert/op_converter.h" + +namespace paddle { +namespace inference { +namespace anakin { + +class Pool2dOpConverter : public AnakinOpConverter { + public: + Pool2dOpConverter() = default; + + virtual void operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) override; + virtual ~Pool2dOpConverter() {} + + private: +}; + +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/convert/relu.cc b/paddle/fluid/inference/anakin/convert/relu.cc new file mode 100644 index 0000000000000000000000000000000000000000..993437d014b1f951dac94da7a3179b4bcb63466d --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/relu.cc @@ -0,0 +1,48 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/anakin/convert/relu.h" +#include +#include + +using anakin::graph::GraphGlobalMem; +using anakin::AK_FLOAT; +using anakin::saber::NV; +using anakin::saber::Shape; + +namespace paddle { +namespace inference { +namespace anakin { + +void ReluOpConverter::operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) { + framework::OpDesc op_desc(op, nullptr); + PADDLE_ENFORCE_EQ(op_desc.Input("X").size(), 1); + PADDLE_ENFORCE_EQ(op_desc.Output("Out").size(), 1); + + auto op_name = op_desc.Type() + ":" + op_desc.Output("Out").front(); + auto input_name = op_desc.Input("X").front(); + auto output_name = op_desc.Output("Out").front(); + + engine_->AddOp(op_name, "ReLU", {input_name}, {output_name}); + engine_->AddOpAttr(op_name, "alpha", 0); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +REGISTER_ANAKIN_OP_CONVERTER(relu, ReluOpConverter); diff --git a/paddle/fluid/inference/anakin/convert/relu.h b/paddle/fluid/inference/anakin/convert/relu.h new file mode 100644 index 0000000000000000000000000000000000000000..6ede506511917c80faa59d40ee0a7bfff194da97 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/relu.h @@ -0,0 +1,38 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include "paddle/fluid/inference/anakin/convert/op_converter.h" + +namespace paddle { +namespace inference { +namespace anakin { + +class ReluOpConverter : public AnakinOpConverter { + public: + ReluOpConverter() = default; + + virtual void operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) override; + virtual ~ReluOpConverter() {} +}; + +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/convert/reshape.cc b/paddle/fluid/inference/anakin/convert/reshape.cc new file mode 100644 index 0000000000000000000000000000000000000000..17e0a1acb5f4e08e848e91bbb051757d85796c0a --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/reshape.cc @@ -0,0 +1,53 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/anakin/convert/reshape.h" +#include + +using anakin::graph::GraphGlobalMem; +using anakin::AK_FLOAT; +using anakin::saber::NV; +using anakin::saber::Shape; +using anakin::PTuple; + +namespace paddle { +namespace inference { +namespace anakin { + +void ReshapeOpConverter::operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) { + framework::OpDesc op_desc(op, nullptr); + PADDLE_ENFORCE_EQ(op_desc.Input("X").size(), 1UL); + PADDLE_ENFORCE_EQ(op_desc.Output("Out").size(), 1UL); + + auto input = op_desc.Input("X").front(); + auto output = op_desc.Output("Out").front(); + + auto op_name = op_desc.Type() + ":" + op_desc.Output("Out").front(); + engine_->AddOp(op_name, "Reshape", {input}, {output}); + + auto shape = boost::get>(op_desc.GetAttr("shape")); + if (shape.size() < 4) { + shape.insert(shape.end(), 4 - shape.size(), 1); + } + engine_->AddOpAttr>(op_name, "dims", shape); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +REGISTER_ANAKIN_OP_CONVERTER(reshape, ReshapeOpConverter); diff --git a/paddle/fluid/inference/anakin/convert/reshape.h b/paddle/fluid/inference/anakin/convert/reshape.h new file mode 100644 index 0000000000000000000000000000000000000000..9ce2ea2a4f3f8802225fe8ca8ed602c9f7d27968 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/reshape.h @@ -0,0 +1,36 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/inference/anakin/convert/op_converter.h" + +namespace paddle { +namespace inference { +namespace anakin { + +class ReshapeOpConverter : public AnakinOpConverter { + public: + ReshapeOpConverter() = default; + + virtual void operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) override; + virtual ~ReshapeOpConverter() {} +}; + +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/convert/scale.cc b/paddle/fluid/inference/anakin/convert/scale.cc new file mode 100644 index 0000000000000000000000000000000000000000..dd68af4f79a6d1e8add04bde6a6890bca1b00d14 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/scale.cc @@ -0,0 +1,57 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/anakin/convert/scale.h" +#include +#include + +using anakin::graph::GraphGlobalMem; +using anakin::AK_FLOAT; +using anakin::saber::NV; +using anakin::saber::Shape; + +namespace paddle { +namespace inference { +namespace anakin { + +void ScaleOpConverter::operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) { + framework::OpDesc op_desc(op, nullptr); + PADDLE_ENFORCE_EQ(op_desc.Input("X").size(), 1); + PADDLE_ENFORCE_EQ(op_desc.Output("Out").size(), 1); + + auto op_name = op_desc.Type() + ":" + op_desc.Output("Out").front(); + + auto input_name = op_desc.Input("X").front(); + auto output_name = op_desc.Output("Out").front(); + float scale = boost::get(op_desc.GetAttr("scale")); + float bias = boost::get(op_desc.GetAttr("bias")); + float bias_after_scale = + boost::get(op_desc.GetAttr("bias_after_scale")); + PADDLE_ENFORCE(bias_after_scale, + "The anakin scale layer only support bias after scale now."); + + engine_->AddOp(op_name, "Power", {input_name}, {output_name}); + engine_->AddOpAttr(op_name, "shift", bias); + engine_->AddOpAttr(op_name, "scale", scale); + engine_->AddOpAttr(op_name, "power", static_cast(1.0)); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +REGISTER_ANAKIN_OP_CONVERTER(scale, ScaleOpConverter); diff --git a/paddle/fluid/inference/anakin/convert/scale.h b/paddle/fluid/inference/anakin/convert/scale.h new file mode 100644 index 0000000000000000000000000000000000000000..ba3bcdd21494a4eeb6190aa8383e17e1b828b5f3 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/scale.h @@ -0,0 +1,38 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include "paddle/fluid/inference/anakin/convert/op_converter.h" + +namespace paddle { +namespace inference { +namespace anakin { + +class ScaleOpConverter : public AnakinOpConverter { + public: + ScaleOpConverter() = default; + + virtual void operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) override; + virtual ~ScaleOpConverter() {} +}; + +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/convert/softmax.cc b/paddle/fluid/inference/anakin/convert/softmax.cc new file mode 100644 index 0000000000000000000000000000000000000000..a6c1e971b16fa7fe6a074bcb2cdf391410f8871f --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/softmax.cc @@ -0,0 +1,52 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/anakin/convert/softmax.h" + +using anakin::graph::GraphGlobalMem; +using anakin::AK_FLOAT; +using anakin::saber::NV; +using anakin::saber::Shape; + +namespace paddle { +namespace inference { +namespace anakin { + +void SoftMaxOpConverter::operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) { + framework::OpDesc op_desc(op, nullptr); + PADDLE_ENFORCE_EQ(op_desc.Input("X").size(), 1UL); + + auto input = op_desc.Input("X").front(); + auto output = op_desc.Output("Out").front(); + auto op_name = op_desc.Type() + ":" + op_desc.Output("Out").front(); + + auto input_var_desc = block_desc.FindVar(input); + PADDLE_ENFORCE(input_var_desc, + "Cant find %s variable When runing Anakin Softmax converter.", + input); + auto input_shape_in_fluid = input_var_desc->GetShape(); + size_t input_dims = input_shape_in_fluid.size(); + + engine_->AddOp(op_name, "Softmax", {input}, {output}); + engine_->AddOpAttr(op_name, "axis", static_cast(input_dims - 1)); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +REGISTER_ANAKIN_OP_CONVERTER(softmax, SoftMaxOpConverter); diff --git a/paddle/fluid/inference/anakin/convert/softmax.h b/paddle/fluid/inference/anakin/convert/softmax.h new file mode 100644 index 0000000000000000000000000000000000000000..a16356d5bb61ac2f3b4f7751e257ce36ca604bf1 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/softmax.h @@ -0,0 +1,36 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/inference/anakin/convert/op_converter.h" + +namespace paddle { +namespace inference { +namespace anakin { + +class SoftMaxOpConverter : public AnakinOpConverter { + public: + SoftMaxOpConverter() = default; + + virtual void operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) override; + virtual ~SoftMaxOpConverter() {} +}; + +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/convert/split.cc b/paddle/fluid/inference/anakin/convert/split.cc new file mode 100644 index 0000000000000000000000000000000000000000..ec582c1812623cd4bcefa2097015ba258f6bacbb --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/split.cc @@ -0,0 +1,64 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/anakin/convert/split.h" +#include +#include + +using anakin::graph::GraphGlobalMem; +using anakin::AK_FLOAT; +using anakin::Precision; +using anakin::saber::NV; +using anakin::saber::X86; +using anakin::saber::Shape; +using anakin::PBlock; +using anakin::PTuple; + +namespace paddle { +namespace inference { +namespace anakin { + +void SplitOpConverter::operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) { + framework::OpDesc op_desc(op, nullptr); + auto input_name = op_desc.Input("X").front(); + auto y_names = op_desc.Output("Out"); + auto op_name = op_desc.Type() + ":" + op_desc.Output("Out").front(); + int axis = boost::get(op_desc.GetAttr("axis")); + + std::vector output_lengths = + boost::get>(op_desc.GetAttr("sections")); + + int split_num = output_lengths.size(); + PADDLE_ENFORCE(split_num > 1, + "anakin split op converter: the split num should > 1"); + int num_sum = 0; + std::vector slice_point; + for (int i = 0; i < split_num - 1; i++) { + num_sum += output_lengths[i]; + slice_point.push_back(num_sum); + } + engine_->AddOp(op_name, "Slice", {input_name}, y_names); + engine_->AddOpAttr(op_name, "axis", axis); + engine_->AddOpAttr>(op_name, "slice_point", slice_point); + // slice_dim is useless in anakin + engine_->AddOpAttr(op_name, "slice_dim", 4); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle +REGISTER_ANAKIN_OP_CONVERTER(split, SplitOpConverter); diff --git a/paddle/fluid/inference/anakin/convert/split.h b/paddle/fluid/inference/anakin/convert/split.h new file mode 100644 index 0000000000000000000000000000000000000000..184112e589e2bbdb30bc7a5d2cd053b7f3732a58 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/split.h @@ -0,0 +1,38 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/inference/anakin/convert/op_converter.h" + +namespace paddle { +namespace inference { +namespace anakin { + +class SplitOpConverter : public AnakinOpConverter { + public: + SplitOpConverter() = default; + + virtual void operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) override; + virtual ~SplitOpConverter() {} + + private: +}; + +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/convert/sum.cc b/paddle/fluid/inference/anakin/convert/sum.cc new file mode 100644 index 0000000000000000000000000000000000000000..2a4178e2371389b44557d44ea526c7cc4a731d16 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/sum.cc @@ -0,0 +1,55 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/anakin/convert/sum.h" +#include +#include +#include + +using anakin::graph::GraphGlobalMem; +using anakin::AK_FLOAT; +using anakin::Precision; +using anakin::saber::NV; +using anakin::saber::X86; +using anakin::saber::Shape; +using anakin::PBlock; +using anakin::PTuple; + +namespace paddle { +namespace inference { +namespace anakin { + +void SumOpConverter::operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, bool test_mode) { + framework::OpDesc op_desc(op, nullptr); + PADDLE_ENFORCE_EQ(op_desc.Input("X").size(), 2); + PADDLE_ENFORCE_EQ(op_desc.Output("Out").size(), 1); + + auto input_names = op_desc.Input("X"); + auto out_name = op_desc.Output("Out").front(); + auto op_name = op_desc.Type() + ":" + op_desc.Output("Out").front(); + + std::vector coeff = {1, 1}; + std::string elementwise_type = "Add"; + engine_->AddOp(op_name, "Eltwise", input_names, {out_name}); + engine_->AddOpAttr>(op_name, "coeff", coeff); + engine_->AddOpAttr(op_name, "type", elementwise_type); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +REGISTER_ANAKIN_OP_CONVERTER(sum, SumOpConverter); diff --git a/paddle/fluid/inference/anakin/convert/sum.h b/paddle/fluid/inference/anakin/convert/sum.h new file mode 100644 index 0000000000000000000000000000000000000000..b5d402b77fcf555ffaf910f8c9d1b7337181a64b --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/sum.h @@ -0,0 +1,38 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/inference/anakin/convert/op_converter.h" + +namespace paddle { +namespace inference { +namespace anakin { + +class SumOpConverter : public AnakinOpConverter { + public: + SumOpConverter() = default; + + virtual void operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) override; + virtual ~SumOpConverter() {} + + private: +}; + +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/convert/test_activation_op.cc b/paddle/fluid/inference/anakin/convert/test_activation_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..8bedd4a749a645829658291310347eeed1c0ea49 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/test_activation_op.cc @@ -0,0 +1,53 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "paddle/fluid/inference/anakin/convert/activation.h" +#include "paddle/fluid/inference/anakin/convert/op_converter.h" +#include "paddle/fluid/inference/anakin/convert/ut_helper.h" + +namespace paddle { +namespace inference { +namespace anakin { + +static void test_activation_op(const std::string &op_type) { + auto *converter = Registry::Global().Lookup(op_type); + PADDLE_ENFORCE(converter != nullptr); + std::unordered_set parameters; + framework::Scope scope; + AnakinConvertValidation validator(parameters, &scope); + validator.DeclInputVar("act-X", {10, 6, 1, 1}); + validator.DeclOutputVar("act-Out", {10, 6, 1, 1}); + framework::OpDesc desc; + desc.SetType(op_type); + desc.SetInput("X", {"act-X"}); + desc.SetOutput("Out", {"act-Out"}); + + LOG(INFO) << "set OP"; + validator.SetOp(*desc.Proto()); + LOG(INFO) << "execute"; + + validator.Execute(5); +} + +TEST(sigm_op, test) { test_activation_op("sigmoid"); } +TEST(tanh_op, test) { test_activation_op("tanh"); } +} // namespace anakin +} // namespace inference +} // namespace paddle + +USE_OP(sigmoid); +USE_OP(tanh); +USE_ANAKIN_CONVERTER(sigmoid); +USE_ANAKIN_CONVERTER(tanh); diff --git a/paddle/fluid/inference/anakin/convert/test_batch_norm_op.cc b/paddle/fluid/inference/anakin/convert/test_batch_norm_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..2832e1c8d167c646c9049beebc57a82fe416e62c --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/test_batch_norm_op.cc @@ -0,0 +1,71 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "paddle/fluid/inference/anakin/convert/ut_helper.h" + +namespace paddle { +namespace inference { +namespace anakin { + +TEST(batch_norm_op, test) { + std::unordered_set parameters( + {"batch_norm_scale", "batch_norm_bias", "batch_norm_mean", + "batch_norm_variance"}); + framework::Scope scope; + AnakinConvertValidation validator(parameters, &scope); + std::vector param_shape{2}; + + validator.DeclInputVar("batch_norm_X", {1, 2, 5, 5}); + validator.DeclParamVar("batch_norm_scale", param_shape); + validator.DeclParamVar("batch_norm_bias", param_shape); + validator.DeclParamVar("batch_norm_mean", param_shape); + validator.DeclParamVar("batch_norm_variance", param_shape); + validator.DeclOutputVar("batch_norm_Y", {1, 2, 5, 5}); + validator.DeclOutputVar("batch_norm_save_mean", param_shape); + validator.DeclOutputVar("batch_norm_save_variance", param_shape); + + // Prepare Op description + framework::OpDesc desc; + + desc.SetType("batch_norm"); + desc.SetInput("X", {"batch_norm_X"}); + desc.SetInput("Scale", {"batch_norm_scale"}); + desc.SetInput("Bias", {"batch_norm_bias"}); + desc.SetInput("Mean", {"batch_norm_mean"}); + desc.SetInput("Variance", {"batch_norm_variance"}); + desc.SetOutput("Y", {"batch_norm_Y"}); + desc.SetOutput("MeanOut", {"batch_norm_mean"}); + desc.SetOutput("VarianceOut", {"batch_norm_variance"}); + desc.SetOutput("SavedMean", {"batch_norm_save_mean"}); + desc.SetOutput("SavedVariance", {"batch_norm_save_variance"}); + + float eps = 1e-5f; + bool is_test = true; + desc.SetAttr("epsilon", eps); + desc.SetAttr("is_test", is_test); + + validator.SetOp(*desc.Proto()); + + std::unordered_set neglected_output = { + "batch_norm_save_mean", "batch_norm_save_variance", "batch_norm_mean", + "batch_norm_variance"}; + validator.Execute(1, neglected_output); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle +USE_OP(batch_norm); +USE_ANAKIN_CONVERTER(batch_norm); diff --git a/paddle/fluid/inference/anakin/convert/test_concat_op.cc b/paddle/fluid/inference/anakin/convert/test_concat_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..ecf44def5a2429360f0bcb92f00a0423e1d491cd --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/test_concat_op.cc @@ -0,0 +1,74 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "paddle/fluid/inference/anakin/convert/concat.h" +#include "paddle/fluid/inference/anakin/convert/op_converter.h" +#include "paddle/fluid/inference/anakin/convert/ut_helper.h" + +namespace paddle { +namespace inference { +namespace anakin { + +TEST(concat_op, test) { + std::unordered_set parameters({""}); + framework::Scope scope; + AnakinConvertValidation validator(parameters, &scope); + validator.DeclInputVar("concat_x1", {1, 2, 1, 1}); + validator.DeclInputVar("concat_x2", {1, 3, 1, 1}); + validator.DeclInputVar("concat_x3", {1, 1, 1, 1}); + validator.DeclOutputVar("concat_out", {1, 6, 1, 1}); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("concat"); + desc.SetInput("X", {"concat_x1", "concat_x2", "concat_x3"}); + desc.SetOutput("Out", {"concat_out"}); + + int axis = 1; + desc.SetAttr("axis", axis); + + validator.SetOp(*desc.Proto()); + + validator.Execute(1); +} + +TEST(concat_op, test2) { + std::unordered_set parameters({""}); + framework::Scope scope; + AnakinConvertValidation validator(parameters, &scope); + validator.DeclInputVar("concat_x1", {1, 4}); + validator.DeclInputVar("concat_x2", {3, 4}); + validator.DeclInputVar("concat_x3", {2, 4}); + validator.DeclOutputVar("concat_out", {6, 4}); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("concat"); + desc.SetInput("X", {"concat_x1", "concat_x2", "concat_x3"}); + desc.SetOutput("Out", {"concat_out"}); + + int axis = 0; + desc.SetAttr("axis", axis); + + validator.SetOp(*desc.Proto()); + + validator.Execute(1); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle +USE_OP(concat); +USE_ANAKIN_CONVERTER(concat); diff --git a/paddle/fluid/inference/anakin/convert/test_conv2d_op.cc b/paddle/fluid/inference/anakin/convert/test_conv2d_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..6d93e50bc96b08b6ef7dd7c9d836038e335daae3 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/test_conv2d_op.cc @@ -0,0 +1,62 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "paddle/fluid/inference/anakin/convert/conv2d.h" +#include "paddle/fluid/inference/anakin/convert/op_converter.h" +#include "paddle/fluid/inference/anakin/convert/ut_helper.h" + +namespace paddle { +namespace inference { +namespace anakin { + +TEST(conv2d_op, test) { + auto* conv2d_converter = + Registry::Global().Lookup("conv2d"); + ASSERT_TRUE(conv2d_converter != nullptr); + std::unordered_set parameters({"conv2d-Y"}); + framework::Scope scope; + AnakinConvertValidation validator(parameters, &scope); + validator.DeclInputVar("conv2d-X", {1, 3, 3, 3}); + validator.DeclParamVar("conv2d-Y", {4, 3, 1, 1}); + validator.DeclOutputVar("conv2d-Out", {1, 4, 3, 3}); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("conv2d"); + desc.SetInput("Input", {"conv2d-X"}); + desc.SetInput("Filter", {"conv2d-Y"}); + desc.SetOutput("Output", {"conv2d-Out"}); + + const std::vector strides({1, 1}); + const std::vector paddings({0, 0}); + const std::vector dilations({1, 1}); + const int groups = 1; + + desc.SetAttr("strides", strides); + desc.SetAttr("paddings", paddings); + desc.SetAttr("dilations", dilations); + desc.SetAttr("groups", groups); + + validator.SetOp(*desc.Proto()); + + validator.Execute(3); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +USE_OP(conv2d); +USE_ANAKIN_CONVERTER(conv2d); diff --git a/paddle/fluid/inference/anakin/convert/test_dropout_op.cc b/paddle/fluid/inference/anakin/convert/test_dropout_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..b2de5ae0a6e58eb25a4588571686a25500fe546c --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/test_dropout_op.cc @@ -0,0 +1,53 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "paddle/fluid/inference/anakin/convert/dropout.h" +#include "paddle/fluid/inference/anakin/convert/op_converter.h" +#include "paddle/fluid/inference/anakin/convert/ut_helper.h" + +namespace paddle { +namespace inference { +namespace anakin { + +TEST(dropout_op, native) { + std::unordered_set parameters; + framework::Scope scope; + AnakinConvertValidation validator(parameters, &scope); + validator.DeclInputVar("x", {1, 1, 2, 2}); + validator.DeclOutputVar("out", {1, 1, 2, 2}); + validator.DeclOutputVar("mask", {1, 1, 2, 2}); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("dropout"); + desc.SetInput("X", {"x"}); + desc.SetOutput("Out", {"out"}); + desc.SetOutput("Mask", {"mask"}); + + float dropout_prob = 0.5; + desc.SetAttr("dropout_prob", dropout_prob); + desc.SetAttr("is_test", true); + + validator.SetOp(*desc.Proto()); + std::unordered_set neglected_output = {"mask"}; + validator.Execute(1, neglected_output); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +USE_OP(dropout); +USE_ANAKIN_CONVERTER(dropout); diff --git a/paddle/fluid/inference/anakin/convert/test_elementwise_op.cc b/paddle/fluid/inference/anakin/convert/test_elementwise_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..3a437f5fdb565609667b7a862c9b2bb13cdbeded --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/test_elementwise_op.cc @@ -0,0 +1,56 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "paddle/fluid/inference/anakin/convert/elementwise.h" +#include "paddle/fluid/inference/anakin/convert/op_converter.h" +#include "paddle/fluid/inference/anakin/convert/ut_helper.h" + +namespace paddle { +namespace inference { +namespace anakin { + +static void test_elementwise_op(const std::string &op_type) { + std::unordered_set parameters; + framework::Scope scope; + AnakinConvertValidation validator(parameters, &scope); + validator.DeclInputVar("x", {1, 1, 2, 2}); + validator.DeclInputVar("y", {1, 1, 2, 2}); + validator.DeclOutputVar("out", {1, 1, 2, 2}); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType(op_type); + desc.SetInput("X", {"x"}); + desc.SetInput("Y", {"y"}); + desc.SetOutput("Out", {"out"}); + + int axis = -1; + desc.SetAttr("axis", axis); + + validator.SetOp(*desc.Proto()); + validator.Execute(1); +} + +TEST(elementwise_op, native_add) { test_elementwise_op("elementwise_add"); } +TEST(elementwise_op, native_mul) { test_elementwise_op("elementwise_mul"); } + +} // namespace anakin +} // namespace inference +} // namespace paddle + +USE_OP(elementwise_add); +USE_ANAKIN_CONVERTER(elementwise_add); +USE_OP(elementwise_mul); +USE_ANAKIN_CONVERTER(elementwise_mul); diff --git a/paddle/fluid/inference/anakin/convert/test_fc_op.cc b/paddle/fluid/inference/anakin/convert/test_fc_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..ee6d1dc291fe3733ff2e9f66dd453120fa266a55 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/test_fc_op.cc @@ -0,0 +1,50 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "paddle/fluid/inference/anakin/convert/op_converter.h" +#include "paddle/fluid/inference/anakin/convert/ut_helper.h" + +namespace paddle { +namespace inference { +namespace anakin { + +TEST(fc_op, test) { + auto* fc_converter = Registry::Global().Lookup("fc"); + ASSERT_TRUE(fc_converter); + + std::unordered_set parameters({"mul_y"}); + framework::Scope scope; + AnakinConvertValidation validator(parameters, &scope); + validator.DeclInputVar("mul_x", {1, 1, 2, 2}); + validator.DeclParamVar("mul_y", {4, 2}); + validator.DeclOutputVar("mul_out", {1, 2}); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("mul"); + desc.SetInput("X", {"mul_x"}); + desc.SetInput("Y", {"mul_y"}); + desc.SetOutput("Out", {"mul_out"}); + validator.SetOp(*desc.Proto()); + + validator.Execute(10); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +USE_OP(mul); +USE_ANAKIN_CONVERTER(fc); diff --git a/paddle/fluid/inference/anakin/convert/test_flatten_op.cc b/paddle/fluid/inference/anakin/convert/test_flatten_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..d13281f11f03fdd75e585bce8b30e8780d81f7d7 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/test_flatten_op.cc @@ -0,0 +1,51 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "paddle/fluid/inference/anakin/convert/op_converter.h" +#include "paddle/fluid/inference/anakin/convert/ut_helper.h" + +namespace paddle { +namespace inference { +namespace anakin { + +TEST(flatten_op, test) { + auto *converter = Registry::Global().Lookup("flatten"); + ASSERT_TRUE(converter); + + std::unordered_set parameters; + framework::Scope scope; + AnakinConvertValidation validator(parameters, &scope); + validator.DeclInputVar("flatten-X", {3, 10, 10, 4}); + validator.DeclOutputVar("flatten-Out", {3, 400, 1, 1}); + framework::OpDesc desc; + desc.SetType("flatten"); + desc.SetInput("X", {"flatten-X"}); + desc.SetOutput("Out", {"flatten-Out"}); + desc.SetAttr("axis", 1); + + LOG(INFO) << "set OP"; + validator.SetOp(*desc.Proto()); + LOG(INFO) << "execute"; + + validator.Execute(5); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +USE_OP(reshape); +USE_OP_ITSELF(flatten); +USE_ANAKIN_CONVERTER(flatten); diff --git a/paddle/fluid/inference/anakin/convert/test_im2sequence_op.cc b/paddle/fluid/inference/anakin/convert/test_im2sequence_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..5e5764633125c867e27b0b52e0e6ef18714653b2 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/test_im2sequence_op.cc @@ -0,0 +1,55 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "paddle/fluid/inference/anakin/convert/im2sequence.h" +#include "paddle/fluid/inference/anakin/convert/op_converter.h" +#include "paddle/fluid/inference/anakin/convert/ut_helper.h" + +namespace paddle { +namespace inference { +namespace anakin { + +TEST(im2sequence_op, native) { + std::unordered_set parameters; + framework::Scope scope; + AnakinConvertValidation validator(parameters, &scope); + + std::vector kernels = {6, 1}; + std::vector strides = {1, 1}; + std::vector paddings = {0, 0, 0, 0}; + + validator.DeclInputVar("x", {1, 1, 2, 2}); + validator.DeclOutputVar("out", {1, 1 * kernels[0] * kernels[1]}); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("im2sequence"); + desc.SetInput("X", {"x"}); + desc.SetOutput("Out", {"out"}); + + desc.SetAttr("kernels", kernels); + desc.SetAttr("strides", strides); + desc.SetAttr("paddings", paddings); + + validator.SetOp(*desc.Proto()); + validator.Execute(1); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +USE_OP(im2sequence); +USE_ANAKIN_CONVERTER(im2sequence); diff --git a/paddle/fluid/inference/anakin/convert/test_pool2d_op.cc b/paddle/fluid/inference/anakin/convert/test_pool2d_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..1ac019467721605c539c7ada452d04d5134fa341 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/test_pool2d_op.cc @@ -0,0 +1,119 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. */ +#include +#include "paddle/fluid/inference/anakin/convert/op_converter.h" +#include "paddle/fluid/inference/anakin/convert/ut_helper.h" + +namespace paddle { +namespace inference { +namespace anakin { + +void test_pool2d(bool global_pooling, bool ceil_mode, + std::string pool_type = "max") { + auto* pool2d_converter = + Registry::Global().Lookup("pool2d"); + ASSERT_TRUE(pool2d_converter); + + framework::Scope scope; + std::unordered_set parameters; + AnakinConvertValidation validator(parameters, &scope); + + // The ITensor's Dims should not contain the batch size. + // So, the ITensor's Dims of input and output should be C * H * W. + validator.DeclInputVar("pool2d_x", {1, 3, 6, 7}); + if (global_pooling) + validator.DeclOutputVar("pool2d_out", {1, 3, 1, 1}); + else if (ceil_mode) + validator.DeclOutputVar("pool2d_out", {1, 3, 3, 4}); + else + validator.DeclOutputVar("pool2d_out", {1, 3, 3, 3}); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("pool2d"); + desc.SetInput("X", {"pool2d_x"}); + desc.SetOutput("Out", {"pool2d_out"}); + + std::vector ksize({2, 2}); + std::vector strides({2, 2}); + std::vector paddings({0, 0}); + std::string pooling_t = pool_type; + + desc.SetAttr("pooling_type", pooling_t); + desc.SetAttr("ksize", ksize); + desc.SetAttr("strides", strides); + desc.SetAttr("paddings", paddings); + desc.SetAttr("global_pooling", global_pooling); + desc.SetAttr("ceil_mode", ceil_mode); + + LOG(INFO) << "set OP"; + validator.SetOp(*desc.Proto()); + LOG(INFO) << "execute"; + + validator.Execute(1); +} + +void test_pool2d2(bool global_pooling, bool ceil_mode, + std::string pool_type = "max") { + auto* pool2d_converter = + Registry::Global().Lookup("pool2d"); + ASSERT_TRUE(pool2d_converter); + + framework::Scope scope; + std::unordered_set parameters; + AnakinConvertValidation validator(parameters, &scope); + + // The ITensor's Dims should not contain the batch size. + // So, the ITensor's Dims of input and output should be C * H * W. + validator.DeclInputVar("pool2d_x", {1, 1, 17, 17}); + validator.DeclOutputVar("pool2d_out", {1, 1, 17, 17}); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("pool2d"); + desc.SetInput("X", {"pool2d_x"}); + desc.SetOutput("Out", {"pool2d_out"}); + + std::vector ksize({3, 3}); + std::vector strides({1, 1}); + std::vector paddings({1, 1}); + std::string pooling_t = pool_type; + + desc.SetAttr("pooling_type", pooling_t); + desc.SetAttr("ksize", ksize); + desc.SetAttr("strides", strides); + desc.SetAttr("paddings", paddings); + desc.SetAttr("global_pooling", global_pooling); + desc.SetAttr("ceil_mode", true); + + LOG(INFO) << "set OP"; + validator.SetOp(*desc.Proto()); + LOG(INFO) << "execute"; + + validator.Execute(1); +} + +TEST(Pool2dOpConverter, normal) { test_pool2d(false, false); } +TEST(Pool2dOpConverter, test_global_pooling) { test_pool2d(true, false); } + +TEST(Pool2dOpConverter, max_ceil_test) { test_pool2d(false, true); } +TEST(Pool2dOpConverter, avg_ceil_test) { test_pool2d(false, true, "avg"); } +TEST(Pool2dOpConverter, avg_ceil_test2) { test_pool2d2(false, true, "avg"); } + +} // namespace anakin +} // namespace inference +} // namespace paddle + +USE_OP(pool2d); +USE_ANAKIN_CONVERTER(pool2d); diff --git a/paddle/fluid/inference/anakin/convert/test_relu_op.cc b/paddle/fluid/inference/anakin/convert/test_relu_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..04e624518a5a4477bbb41475b575f85be5a120d4 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/test_relu_op.cc @@ -0,0 +1,50 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "paddle/fluid/inference/anakin/convert/op_converter.h" +#include "paddle/fluid/inference/anakin/convert/relu.h" +#include "paddle/fluid/inference/anakin/convert/ut_helper.h" + +namespace paddle { +namespace inference { +namespace anakin { + +static void test_activation_op(const std::string &op_type) { + auto *converter = Registry::Global().Lookup(op_type); + PADDLE_ENFORCE(converter != nullptr); + std::unordered_set parameters; + framework::Scope scope; + AnakinConvertValidation validator(parameters, &scope); + validator.DeclInputVar("act-X", {10, 6, 1, 1}); + validator.DeclOutputVar("act-Out", {10, 6, 1, 1}); + framework::OpDesc desc; + desc.SetType(op_type); + desc.SetInput("X", {"act-X"}); + desc.SetOutput("Out", {"act-Out"}); + + LOG(INFO) << "set OP"; + validator.SetOp(*desc.Proto()); + LOG(INFO) << "execute"; + + validator.Execute(5); +} + +TEST(sigm_op, test) { test_activation_op("relu"); } +} // namespace anakin +} // namespace inference +} // namespace paddle + +USE_OP(relu); +USE_ANAKIN_CONVERTER(relu); diff --git a/paddle/fluid/inference/anakin/convert/test_reshape_op.cc b/paddle/fluid/inference/anakin/convert/test_reshape_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..306ebf510f29a87ca1ffa6df86e08f86b3f8ffbb --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/test_reshape_op.cc @@ -0,0 +1,74 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "paddle/fluid/inference/anakin/convert/op_converter.h" +#include "paddle/fluid/inference/anakin/convert/ut_helper.h" + +namespace paddle { +namespace inference { +namespace anakin { + +TEST(reshape, test) { + auto* converter = Registry::Global().Lookup("reshape"); + ASSERT_TRUE(converter); + framework::Scope scope; + std::unordered_set parameters; + AnakinConvertValidation validator(parameters, &scope); + + // validator.DeclInputVar("reshape-X", {2, 3, 3, 1}); + // validator.DeclOutputVar("reshape-Out", {3, 2, 1, 3}); + validator.DeclInputVar("reshape-X", {1, 2, 4, 1}); + validator.DeclOutputVar("reshape-Out", {1, 8, 1, 1}); + + framework::OpDesc desc; + desc.SetType("reshape"); + desc.SetInput("X", {"reshape-X"}); + desc.SetOutput("Out", {"reshape-Out"}); + // desc.SetAttr("shape", std::vector({3, 2, 1, 3})); + desc.SetAttr("shape", std::vector({1, 8, 1, 1})); + + LOG(INFO) << "set OP"; + validator.SetOp(*desc.Proto()); + LOG(INFO) << "execute"; + validator.Execute(1); +} + +TEST(reshape, test2) { + framework::Scope scope; + std::unordered_set parameters; + AnakinConvertValidation validator(parameters, &scope); + + validator.DeclInputVar("reshape-X", {1, 2, 4}); + validator.DeclOutputVar("reshape-Out", {1, 4, 2}); + + framework::OpDesc desc; + desc.SetType("reshape"); + desc.SetInput("X", {"reshape-X"}); + desc.SetOutput("Out", {"reshape-Out"}); + // desc.SetAttr("shape", std::vector({3, 2, 1, 3})); + desc.SetAttr("shape", std::vector({0, -1, 2})); + + LOG(INFO) << "set OP"; + validator.SetOp(*desc.Proto()); + LOG(INFO) << "execute"; + validator.Execute(1); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +USE_OP(reshape); +USE_ANAKIN_CONVERTER(reshape); diff --git a/paddle/fluid/inference/anakin/convert/test_softmax_op.cc b/paddle/fluid/inference/anakin/convert/test_softmax_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..8c14fae0a67b9e488cf072535868a34f6195ab71 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/test_softmax_op.cc @@ -0,0 +1,49 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "paddle/fluid/inference/anakin/convert/op_converter.h" +#include "paddle/fluid/inference/anakin/convert/ut_helper.h" + +namespace paddle { +namespace inference { +namespace anakin { + +TEST(softmax, test) { + auto* converter = Registry::Global().Lookup("softmax"); + ASSERT_TRUE(converter); + framework::Scope scope; + std::unordered_set parameters; + AnakinConvertValidation validator(parameters, &scope); + + validator.DeclInputVar("softmax-X", {1, 10, 2}); + validator.DeclOutputVar("softmax-Out", {1, 10, 2}); + + framework::OpDesc desc; + desc.SetType("softmax"); + desc.SetInput("X", {"softmax-X"}); + desc.SetOutput("Out", {"softmax-Out"}); + + LOG(INFO) << "set OP"; + validator.SetOp(*desc.Proto()); + LOG(INFO) << "execute"; + validator.Execute(1); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +USE_OP(softmax); +USE_ANAKIN_CONVERTER(softmax); diff --git a/paddle/fluid/inference/anakin/convert/test_split_op.cc b/paddle/fluid/inference/anakin/convert/test_split_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..aa61c01a511c2337944aadbbc3d47893487de683 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/test_split_op.cc @@ -0,0 +1,110 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "paddle/fluid/inference/anakin/convert/op_converter.h" +#include "paddle/fluid/inference/anakin/convert/split.h" +#include "paddle/fluid/inference/anakin/convert/ut_helper.h" + +namespace paddle { +namespace inference { +namespace anakin { + +template +void AnakinSliceTest(const std::vector &in_shape, + const std::vector §ions) { + std::unordered_set parameters({""}); + framework::Scope scope; + AnakinConvertValidation validator(parameters, &scope); + + validator.DeclInputVar("split_input", in_shape); + std::vector output_vars; + for (size_t i = 0; i < sections.size(); ++i) { + auto out_shape = in_shape; + out_shape[Axis] = sections[i]; + std::string output_name = "split_out" + std::to_string(i); + validator.DeclOutputVar(output_name, out_shape); + output_vars.push_back(output_name); + } + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("split"); + desc.SetInput("X", {"split_input"}); + desc.SetOutput("Out", output_vars); + + desc.SetAttr("axis", Axis); + desc.SetAttr("num", 0); + desc.SetAttr("sections", sections); + + validator.SetOp(*desc.Proto()); + + validator.Execute(1); +} + +// batch = 0, axis = 1, same shape +TEST(split_op, test_same_shape_axis1_batch1) { + AnakinSliceTest<1>({1, 4, 2, 2}, {2, 2}); +} +// batch = 0, axis = 1, different shape +TEST(split_op, test_different_shape_axis1_batch1) { + AnakinSliceTest<1>({1, 3, 2, 2}, {2, 1}); +} +// batch = 10, axis = 1, same shape +TEST(split_op, test_same_shape_axis1_batch10) { + AnakinSliceTest<1>({1, 4, 2, 2}, {2, 2}); +} +// batch = 10, axis = 1, different shape +TEST(split_op, test_different_shape_axis1_batch10) { + AnakinSliceTest<1>({1, 3, 2, 2}, {2, 1}); +} +// batch = 0, axis = 2, same shape +TEST(split_op, test_same_shape_axis2_batch1) { + AnakinSliceTest<2>({1, 3, 4, 2}, {2, 2}); +} +// batch = 0, axis = 2, different shape +TEST(split_op, test_different_shape_axis2_batch1) { + AnakinSliceTest<2>({1, 3, 3, 2}, {2, 1}); +} +// batch = 10, axis = 2, same shape +TEST(split_op, test_same_shape_axis2_batch10) { + AnakinSliceTest<2>({1, 3, 4, 2}, {2, 2}); +} +// batch = 10, axis = 2, different shape +TEST(split_op, test_different_shape_axis2_batch10) { + AnakinSliceTest<2>({1, 3, 3, 2}, {2, 1}); +} +// batch = 0, axis = 3, same shape +TEST(split_op, test_same_shape_axis3_batch1) { + AnakinSliceTest<3>({1, 3, 2, 4}, {2, 2}); +} +// batch = 0, axis = 3, different shape +TEST(split_op, test_different_shape_axis3_batch1) { + AnakinSliceTest<3>({1, 3, 2, 3}, {2, 1}); +} +// batch = 10, axis = 3, same shape +TEST(split_op, test_same_shape_axis3_batch10) { + AnakinSliceTest<3>({1, 3, 2, 4}, {2, 2}); +} +// batch = 10, axis = 3, different shape +TEST(split_op, test_different_shape_axis3_batch10) { + AnakinSliceTest<3>({1, 3, 2, 3}, {2, 1}); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +USE_OP(split); +USE_ANAKIN_CONVERTER(split); diff --git a/paddle/fluid/inference/anakin/convert/test_sum_op.cc b/paddle/fluid/inference/anakin/convert/test_sum_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..d6a59a0166be9239b480221cc076069239403429 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/test_sum_op.cc @@ -0,0 +1,48 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "paddle/fluid/inference/anakin/convert/op_converter.h" +#include "paddle/fluid/inference/anakin/convert/sum.h" +#include "paddle/fluid/inference/anakin/convert/ut_helper.h" +#include "paddle/fluid/operators/sum_op.h" + +namespace paddle { +namespace inference { +namespace anakin { + +TEST(sum, native) { + std::unordered_set parameters; + framework::Scope scope; + AnakinConvertValidation validator(parameters, &scope); + validator.DeclInputVar("sum_x1", {1, 2, 1, 2}); + validator.DeclInputVar("sum_x2", {1, 2, 1, 2}); + validator.DeclOutputVar("sum_out", {1, 2, 1, 2}); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("sum"); + desc.SetInput("X", {"sum_x1", "sum_x2"}); + desc.SetOutput("Out", {"sum_out"}); + + validator.SetOp(*desc.Proto()); + validator.Execute(1); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +USE_OP(sum); +USE_ANAKIN_CONVERTER(sum); diff --git a/paddle/fluid/inference/anakin/convert/test_transpose_op.cc b/paddle/fluid/inference/anakin/convert/test_transpose_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..016ed26f02f782fe5032d8368f7767a5c94dfe9f --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/test_transpose_op.cc @@ -0,0 +1,73 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "paddle/fluid/inference/anakin/convert/op_converter.h" +#include "paddle/fluid/inference/anakin/convert/ut_helper.h" + +namespace paddle { +namespace inference { +namespace anakin { + +TEST(transpose_op, test) { + auto* converter = Registry::Global().Lookup("transpose"); + ASSERT_TRUE(converter != nullptr); + std::unordered_set parameters; + framework::Scope scope; + AnakinConvertValidation validator(parameters, &scope); + validator.DeclInputVar("transpose-X", {2, 3, 4, 5}); + validator.DeclOutputVar("transpose-Out", {4, 2, 5, 3}); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("transpose"); + desc.SetInput("X", {"transpose-X"}); + desc.SetOutput("Out", {"transpose-Out"}); + desc.SetAttr("axis", std::vector({2, 0, 3, 1})); + + LOG(INFO) << "set OP"; + validator.SetOp(*desc.Proto()); + LOG(INFO) << "execute"; + + validator.Execute(3); +} + +// test input shape's dims < 4 +TEST(transpose_op, test2) { + std::unordered_set parameters; + framework::Scope scope; + AnakinConvertValidation validator(parameters, &scope); + validator.DeclInputVar("transpose-X", {3, 4, 5}); + validator.DeclOutputVar("transpose-Out", {3, 5, 4}); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("transpose"); + desc.SetInput("X", {"transpose-X"}); + desc.SetOutput("Out", {"transpose-Out"}); + desc.SetAttr("axis", std::vector({0, 2, 1})); + + LOG(INFO) << "set OP"; + validator.SetOp(*desc.Proto()); + LOG(INFO) << "execute"; + + validator.Execute(1); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +USE_OP(transpose); +USE_ANAKIN_CONVERTER(transpose); diff --git a/paddle/fluid/inference/anakin/convert/transpose.cc b/paddle/fluid/inference/anakin/convert/transpose.cc new file mode 100644 index 0000000000000000000000000000000000000000..f35372fe5c315ec68bc80a6d03c5931899ff7555 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/transpose.cc @@ -0,0 +1,56 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/anakin/convert/transpose.h" +#include +#include +#include + +using anakin::graph::GraphGlobalMem; +using anakin::AK_FLOAT; +using anakin::saber::NV; +using anakin::saber::Shape; +using anakin::PTuple; + +namespace paddle { +namespace inference { +namespace anakin { + +void TransposeOpConverter::operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) { + framework::OpDesc op_desc(op, nullptr); + PADDLE_ENFORCE_EQ(op_desc.Input("X").size(), 1); + PADDLE_ENFORCE_EQ(op_desc.Output("Out").size(), 1); + + auto input = op_desc.Input("X").front(); + auto output = op_desc.Output("Out").front(); + auto op_name = op_desc.Type() + ":" + op_desc.Output("Out").front(); + engine_->AddOp(op_name, "Permute", {input}, {output}); + + auto axis = boost::get>(op_desc.GetAttr("axis")); + size_t axis_size = axis.size(); + while (axis.size() < 4) { + axis.push_back(axis_size); + axis_size += 1; + } + engine_->AddOpAttr>(op_name, "dims", axis); +} + +} // namespace anakin +} // namespace inference +} // namespace paddle + +REGISTER_ANAKIN_OP_CONVERTER(transpose, TransposeOpConverter); diff --git a/paddle/fluid/inference/anakin/convert/transpose.h b/paddle/fluid/inference/anakin/convert/transpose.h new file mode 100644 index 0000000000000000000000000000000000000000..bacbf152bc12319e6296677500b17d55d9772412 --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/transpose.h @@ -0,0 +1,36 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/inference/anakin/convert/op_converter.h" + +namespace paddle { +namespace inference { +namespace anakin { + +class TransposeOpConverter : public AnakinOpConverter { + public: + TransposeOpConverter() = default; + + virtual void operator()(const framework::proto::OpDesc &op, + const framework::BlockDesc &block_desc, + const framework::Scope &scope, + bool test_mode) override; + virtual ~TransposeOpConverter() {} +}; + +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/convert/ut_helper.h b/paddle/fluid/inference/anakin/convert/ut_helper.h new file mode 100644 index 0000000000000000000000000000000000000000..029aff6704ff1015e5c2378a2202c94043df990d --- /dev/null +++ b/paddle/fluid/inference/anakin/convert/ut_helper.h @@ -0,0 +1,219 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include +#include +#include +#include +#include + +#include "paddle/fluid/framework/block_desc.h" +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/tensor_util.h" +#include "paddle/fluid/inference/anakin/convert/op_converter.h" +#include "paddle/fluid/inference/anakin/engine.h" +#include "paddle/fluid/inference/analysis/helper.h" +#include "paddle/fluid/inference/utils/singleton.h" +#include "paddle/fluid/platform/enforce.h" + +using anakin::graph::GraphGlobalMem; +using anakin::AK_FLOAT; +using anakin::Precision; +using anakin::saber::NV; +using anakin::saber::X86; +using anakin::saber::Shape; +using anakin::PBlock; +using anakin::PTuple; + +namespace paddle { +namespace inference { +namespace anakin { + +/* + * Get a random float value between [low, high] + */ +float random(float low, float high) { + static std::random_device rd; + static std::mt19937 mt(rd()); + std::uniform_real_distribution dist(low, high); + return dist(mt); +} + +void RandomizeTensor(framework::LoDTensor* tensor, const platform::Place& place, + const platform::DeviceContext& ctx) { + auto dims = tensor->dims(); + size_t num_elements = analysis::AccuDims(dims, dims.size()); + PADDLE_ENFORCE_GT(num_elements, 0); + + platform::CPUPlace cpu_place; + framework::LoDTensor temp_tensor; + temp_tensor.Resize(dims); + auto* temp_data = temp_tensor.mutable_data(cpu_place); + + for (size_t i = 0; i < num_elements; i++) { + *(temp_data + i) = random(0., 1.); + } + + TensorCopySync(temp_tensor, place, tensor); +} + +/* + * Help to validate the correctness between Fluid Op and the corresponding + * anakin + * layer. + */ +class AnakinConvertValidation { + using AnakinNvEngineT = AnakinEngine; + + public: + AnakinConvertValidation() = delete; + + AnakinConvertValidation(const std::unordered_set& parameters, + framework::Scope* scope) + : parameters_(parameters), scope_(scope), place_(0) { + PADDLE_ENFORCE_EQ(cudaStreamCreate(&stream_), 0); + engine_.reset(new AnakinEngine(true)); + } + + // Declare a Variable as input with random initialization. + void DeclInputVar(const std::string& name, + const std::vector tensor_dims) { + DeclVar(name, tensor_dims); + // should decalre anakin input here. + } + + void DeclParamVar(const std::string& name, const std::vector dim_vec) { + DeclVar(name, dim_vec); + } + + void DeclOutputVar(const std::string& name, const std::vector dim_vec) { + DeclVar(name, dim_vec); + // should declare anakin output here. + } + + void DeclVar(const std::string& name, const std::vector dim_vec) { + platform::CUDADeviceContext ctx(place_); + auto* x = scope_->Var(name); + auto* x_tensor = x->GetMutable(); + x_tensor->Resize(framework::make_ddim(dim_vec)); + RandomizeTensor(x_tensor, place_, ctx); + + std::vector dim_vec_int64; + for (auto& ele : dim_vec) { + dim_vec_int64.push_back(static_cast(ele)); + } + + // Add var_desc to block_desc + auto* block_desc = program_desc_.MutableBlock(framework::kRootBlockIndex); + + auto* var_desc = block_desc->Var(name); + var_desc->SetShape(dim_vec_int64); + } + + void SetOp(const framework::proto::OpDesc& desc) { + op_ = framework::OpRegistry::CreateOp(desc); + op_desc_.reset(new framework::OpDesc(desc, nullptr)); + // should init anakin engine here. + + auto& block_desc = program_desc_.Block(framework::kRootBlockIndex); + Singleton::Global().ConvertOp( + desc, block_desc, parameters_, *scope_, engine_.get(), + true /*test_mode*/); + engine_->Freeze(); + + std::map> temp_max_input_shape; + for (const auto& input : op_desc_->InputArgumentNames()) { + if (parameters_.count(input)) continue; + auto& t = inference::analysis::GetFromScope(*scope_, + input); + auto t_shape = framework::vectorize2int(t.dims()); + while (t_shape.size() < 4) { + t_shape.push_back(1); + } + engine_->SetInputShape(input, t_shape); + temp_max_input_shape[input] = t_shape; + } + engine_->SetMaxInputShape(temp_max_input_shape); + engine_->Optimize(); + engine_->InitGraph(); + } + + // We use the set 'neglected_output' here, because some Ops like batch norm, + // the outputs specified in the op des are only used during training, + // so we should neglect those output during inference. + void Execute(int batch_size, + std::unordered_set neglected_output = {}) { + // Execute Fluid Op + platform::CUDADeviceContext ctx(place_); + op_->Run(*scope_, place_); + + // std::vector input_vector; + // std::vector output_vector; + std::map inputs; + for (const auto& input : op_desc_->InputArgumentNames()) { + if (parameters_.count(input)) continue; + auto* var = scope_->FindVar(input); + auto tensor = var->GetMutable(); + inputs.insert({input, tensor}); + } + + std::map outputs; + std::vector> fluid_outputs; + for (const auto& output : op_desc_->OutputArgumentNames()) { + if (neglected_output.count(output)) continue; + std::vector fluid_out; + auto* var = scope_->FindVar(output); + auto tensor = var->GetMutable(); + framework::TensorToVector(*tensor, ctx, &fluid_out); + fluid_outputs.push_back(fluid_out); + + outputs.insert({output, tensor}); + } + + engine_->Execute(inputs, outputs, stream_); + int i_output = 0; + for (const auto& output : op_desc_->OutputArgumentNames()) { + if (neglected_output.count(output)) continue; + std::vector anakin_out; + auto* var = scope_->FindVar(output); + auto tensor = var->GetMutable(); + framework::TensorToVector(*tensor, ctx, &anakin_out); + + size_t anakin_out_size = anakin_out.size(); + auto fluid_out = fluid_outputs[i_output++]; + for (size_t i = 0; i < anakin_out_size; i++) { + EXPECT_LT(std::abs(fluid_out[i] - anakin_out[i]), 1e-3); + } + } + } + + private: + std::unique_ptr engine_{nullptr}; + cudaStream_t stream_; + std::unique_ptr op_; + std::unique_ptr op_desc_; + framework::ProgramDesc program_desc_; + const std::unordered_set& parameters_; + framework::Scope* scope_; + platform::CUDAPlace place_; +}; + +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/engine.cc b/paddle/fluid/inference/anakin/engine.cc new file mode 100644 index 0000000000000000000000000000000000000000..ba044c9401a5f0fb5a839c1766fdd9d412d42212 --- /dev/null +++ b/paddle/fluid/inference/anakin/engine.cc @@ -0,0 +1,136 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/anakin/engine.h" +#include +#include +#include +#include +#include "paddle/fluid/framework/ddim.h" + +using anakin::Precision; +using anakin::OpRunType; +using paddle::framework::LoDTensor; +template +using AnakinNetT = anakin::Net; + +template +using AnakinGraphT = anakin::graph::Graph; + +namespace paddle { +namespace inference { +namespace anakin { + +template +AnakinEngine::AnakinEngine( + bool need_summary, int device, int max_batch_size, + std::map> max_input_shape) + : graph_(new AnakinGraphT()), + net_(new AnakinNetT(need_summary)) { + device_ = device; + max_batch_size_ = max_batch_size; + max_input_shape_ = max_input_shape; +} + +template +AnakinEngine::~AnakinEngine() {} + +template +void AnakinEngine::SetInputShape( + const std::string &name, std::vector shape) { + graph_->AddOpAttr<::anakin::PTuple>(name, "input_shape", + std::move(shape)); +} + +template +void AnakinEngine::InitGraph() { + net_->init(*graph_); +} + +template +void AnakinEngine::AddOp( + const std::string &name, const std::string &type, + const std::vector &inputs, + const std::vector &outputs) { + PADDLE_ENFORCE(graph_->AddOp(name, type, inputs, outputs), "Add operation."); +} + +template +void AnakinEngine::Execute( + const std::map &inputs, + const std::map &outputs, + cudaStream_t stream) { + cudaDeviceSynchronize(); + for (const auto &input : inputs) { + auto *tensor = input.second; + auto *data = tensor->data(); + + auto fluid_input_shape = framework::vectorize2int(tensor->dims()); + while (fluid_input_shape.size() < 4) { + fluid_input_shape.push_back(1); + } + auto *anakin_input = net_->get_in(input.first); + std::vector max_input_shape = max_input_shape_[input.first]; + int max_shape_sum = + std::accumulate(max_input_shape.begin(), max_input_shape.end(), 1, + std::multiplies()); + + PADDLE_ENFORCE(max_shape_sum >= tensor->numel(), + "The anakin input max shape should be greater than" + " or equal to the real input shape, Please set the max " + "input shape using EnableAnakinEngine"); + anakin_input->reshape(fluid_input_shape); + ::anakin::saber::Tensor tmp_anakin_tensor(data, TargetT(), 0, + fluid_input_shape); + anakin_input->copy_from(tmp_anakin_tensor); + } + net_->prediction(); + cudaDeviceSynchronize(); + for (const auto &output : outputs) { + platform::CUDAPlace gpu_place(device_); + auto *tensor = output.second; + auto *anakin_output = net_->get_out(output.first); + auto *anakin_data = anakin_output->data(); + auto anakin_output_shape = anakin_output->valid_shape(); + tensor->Resize(framework::make_ddim(anakin_output_shape)); + auto *fluid_data = tensor->mutable_data(gpu_place); + memory::Copy(gpu_place, static_cast(fluid_data), gpu_place, + static_cast(anakin_data), + tensor->numel() * sizeof(float), stream); + } + cudaDeviceSynchronize(); +} + +template +void AnakinEngine::Freeze() { + PADDLE_ENFORCE(graph_->Freeze_v3(), "Freeze anakin subgraph."); +} + +template +void AnakinEngine::Optimize() { + PADDLE_ENFORCE(graph_->Optimize(), "Graph optimization."); +} + +template +std::unique_ptr> +AnakinEngine::Clone() { + auto *engine = new AnakinEngine(); + engine->net_ = std::move(net_->Clone()); + return std::unique_ptr(engine); +} + +template class AnakinEngine<::anakin::saber::NV, ::anakin::Precision::FP32>; +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/engine.h b/paddle/fluid/inference/anakin/engine.h new file mode 100644 index 0000000000000000000000000000000000000000..4845ffdf5b9dcfa99d1f421d47328beb4b196298 --- /dev/null +++ b/paddle/fluid/inference/anakin/engine.h @@ -0,0 +1,143 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include +#include +#include +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/inference/engine.h" +#include "paddle/fluid/inference/utils/singleton.h" + +#include "framework/core/net/net.h" +#include "framework/core/types.h" +#include "framework/graph/graph.h" +#include "framework/graph/graph_global_mem.h" +#include "saber/saber_types.h" + +using anakin::Precision; +using anakin::saber::NV; + +namespace anakin { + +template +class Net; + +namespace graph { +template +class Graph; +} // namespace graph +} // namespace anakin + +namespace paddle { +namespace inference { +namespace anakin { + +template +class AnakinEngine { + using NetT = ::anakin::Net; + using GraphT = ::anakin::graph::Graph; + + public: + explicit AnakinEngine( + bool need_summary = false, int device = 0, int max_batch_size = 1, + std::map> max_input_shape = {}); + ~AnakinEngine(); + void InitGraph(); + void SetInputShape(const std::string &name, std::vector shape); + void AddOp(const std::string &name, const std::string &type, + const std::vector &inputs, + const std::vector &outputs); + + template + void AddOpAttr(const std::string &op_name, const std::string &attr_name, + const T &attr_value) { + PADDLE_ENFORCE(graph_->AddOpAttr(op_name, attr_name, attr_value), + "Add operation's attribution."); + } + NetT *Net() { return net_.get(); } + GraphT *Graph() { return graph_.get(); } + std::unique_ptr Clone(); + const std::map> &GetMaxInputShape() { + return max_input_shape_; + } + void SetMaxInputShape(std::map> shape) { + max_input_shape_ = shape; + } + int GetMaxBatchSize() { return max_batch_size_; } + void Freeze(); + void Optimize(); + void AllocTmpMem() { + PADDLE_ENFORCE(net_->alloc_memory_first(*graph_), + "anakin alloc temp memory first failed"); + } + void Save(std::string path) { graph_->save(path); } + + bool IsInit() { return initialized_; } + int GetDevice() { return device_; } + void Execute(const std::map &inputs, + const std::map &outputs, + cudaStream_t stream); + + private: + bool initialized_{false}; + int max_batch_size_; + std::map> max_input_shape_; + int device_; + std::unique_ptr graph_; + std::unique_ptr net_; +}; + +class AnakinEngineManager { + using AnakinNvEngineT = AnakinEngine; + + public: + bool HasEngine(const std::string &name) const { + if (engines_.count(name) == 0) return false; + return engines_.at(name).get() != nullptr; + } + AnakinNvEngineT *Get(const std::string &name) const { + return engines_.at(name).get(); + } + + AnakinNvEngineT *Create( + bool need_summary, int device, int max_batch_size, + std::map> max_input_shape, + std::string engine_name) { + std::unique_lock lk(mut_); + auto *p = new AnakinEngine( + need_summary, device, max_batch_size, max_input_shape); + engines_[engine_name].reset(p); + return p; + } + + void DeleteALL() { + for (auto &item : engines_) { + item.second.reset(nullptr); + } + } + + private: + std::unordered_map> engines_; + std::mutex mut_; +}; +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/op_teller.cc b/paddle/fluid/inference/anakin/op_teller.cc new file mode 100644 index 0000000000000000000000000000000000000000..2042fb18ea41f8b41fc35543c7e1b642c4f2fa7c --- /dev/null +++ b/paddle/fluid/inference/anakin/op_teller.cc @@ -0,0 +1,69 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/anakin/op_teller.h" + +namespace paddle { +namespace inference { +namespace anakin { + +// Just tell by the op_types. +struct SimpleOpTypeSetTeller : public Teller { + SimpleOpTypeSetTeller() { + teller_set.insert("mul"); + teller_set.insert("fc"); + teller_set.insert("conv2d_fusion"); + teller_set.insert("split"); + teller_set.insert("relu"); + teller_set.insert("pool2d"); + teller_set.insert("elementwise_add"); + teller_set.insert("elementwise_mul"); + teller_set.insert("concat"); + teller_set.insert("tanh"); + teller_set.insert("conv2d"); + teller_set.insert("batch_norm"); + teller_set.insert("softmax"); + teller_set.insert("flatten2"); + teller_set.insert("reshape2"); + teller_set.insert("transpose2"); + teller_set.insert("density_prior_box"); + teller_set.insert("detection_out"); + teller_set.insert("dropout"); + teller_set.insert("sigmoid"); + teller_set.insert("sum"); + teller_set.insert("depthwise_conv2d"); + teller_set.insert("prior_box"); + } + + bool operator()(const std::string& op_type, + const framework::OpDesc& desc) override { + return teller_set.count(op_type); + } + + private: + std::unordered_set teller_set; +}; + +bool OpTeller::Tell(const std::string& op_type, const framework::OpDesc& desc) { + for (auto& teller : tellers_) { + if ((*teller)(op_type, desc)) return true; + } + return false; +} + +OpTeller::OpTeller() { tellers_.emplace_back(new SimpleOpTypeSetTeller); } + +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/op_teller.h b/paddle/fluid/inference/anakin/op_teller.h new file mode 100644 index 0000000000000000000000000000000000000000..15a42067b8438e60851a50e454abde95782d90ee --- /dev/null +++ b/paddle/fluid/inference/anakin/op_teller.h @@ -0,0 +1,70 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include +#include +#include "paddle/fluid/framework/op_desc.h" + +namespace paddle { +namespace inference { +namespace anakin { + +/* + * Single Op teller definition. + * One can override this and define a more complex tell logic, considerring more + * issues such as op_desc. + */ +struct Teller { + virtual bool operator()(const std::string& op_type, + const framework::OpDesc& desc) = 0; + + virtual ~Teller() = default; +}; +/* + * A real example: + * + * struct SomeTeller : public Teller { + * bool operator()(const std::string& op_type, + * const framework::OpDesc& desc) override { + * return op_type == "fc" && desc.Inputs().size() == 2; + * } + *}; + */ + +/* + * class OpTeller helps to tell whether a fluid + * operator can be transformed to a TensorRT layer. + */ +class OpTeller { + public: + static OpTeller& Global() { + static std::unique_ptr x(new OpTeller); + return *x; + } + + bool Tell(const std::string& op_type, const framework::OpDesc& desc); + + private: + OpTeller(); + + private: + std::vector> tellers_; +}; + +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/anakin/test_anakin_engine.cc b/paddle/fluid/inference/anakin/test_anakin_engine.cc new file mode 100644 index 0000000000000000000000000000000000000000..8fd6b8bec9ada6dec67fd24a2457713203431ebf --- /dev/null +++ b/paddle/fluid/inference/anakin/test_anakin_engine.cc @@ -0,0 +1,95 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include + +#include + +#include "paddle/fluid/inference/anakin/engine.h" + +using anakin::graph::GraphGlobalMem; +using anakin::AK_FLOAT; +using anakin::Precision; +using anakin::saber::NV; +using anakin::saber::X86; +using anakin::saber::Shape; +using anakin::PBlock; +using anakin::PTuple; +namespace paddle { +namespace inference { +namespace anakin { + +class TestAnakinEngine : public ::testing::Test { + protected: + void SetUp() override; + void TearDown() override {} + + protected: + using AnakinNvEngineT = AnakinEngine; + std::unique_ptr engine_{nullptr}; +}; + +void TestAnakinEngine::SetUp() { + engine_.reset(new AnakinEngine(true)); +} + +TEST_F(TestAnakinEngine, Execute) { + engine_->AddOp("op1", "Dense", {"x"}, {"y"}); + engine_->AddOpAttr("op1", "out_dim", 2); + engine_->AddOpAttr("op1", "bias_term", false); + engine_->AddOpAttr("op1", "axis", 1); + std::vector shape = {1, 1, 1, 2}; + Shape tmp_shape(shape); + // PBlock weight1(tmp_shape); + auto *weight1 = + GraphGlobalMem::Global().template new_block(tmp_shape); + // auto *weight1 = new PBlock(tmp_shape, AK_FLOAT); + + float *cpu_data = static_cast(weight1->h_tensor().mutable_data()); + cpu_data[0] = 2.; + weight1->d_tensor().set_shape(tmp_shape); + weight1->d_tensor().copy_from(weight1->h_tensor()); + engine_->AddOpAttr("op1", "weight_1", *weight1); + + engine_->Freeze(); + // PTuple input_shape = {1}; + // engine_->AddOpAttr("x", "input_shape", input_shape); + engine_->SetInputShape("x", {1, 1, 1, 1}); + engine_->Optimize(); + engine_->InitGraph(); + framework::LoDTensor x; + framework::LoDTensor y; + x.Resize({1, 1, 1, 1}); + y.Resize({1, 1, 1, 2}); + auto *x_data = x.mutable_data(platform::CUDAPlace()); + float x_data_cpu[] = {1.}; + cudaMemcpy(x_data, x_data_cpu, sizeof(float), cudaMemcpyHostToDevice); + + std::map inputs = {{"x", &x}}; + auto *y_data = y.mutable_data(platform::CUDAPlace()); + std::map outputs = {{"y", &y}}; + + cudaStream_t stream; + + engine_->Execute(inputs, outputs, stream); + auto *y_data_gpu = y_data; + float y_data_cpu[2]; + cudaMemcpy(y_data_cpu, y_data_gpu, sizeof(float) * 2, cudaMemcpyDeviceToHost); + LOG(INFO) << "output value: " << y_data_cpu[0] << ", " << y_data_cpu[1]; +} + +} // namespace anakin +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/analysis/CMakeLists.txt b/paddle/fluid/inference/analysis/CMakeLists.txt index d4d2fd4634f9e11f3f002e11e177c332ced49885..7a795bda820dc34f78f33191558fd6c0ccd2cb24 100644 --- a/paddle/fluid/inference/analysis/CMakeLists.txt +++ b/paddle/fluid/inference/analysis/CMakeLists.txt @@ -1,51 +1,40 @@ -cc_library(ir_pass_manager SRCS ir_pass_manager.cc DEPS graph pass) -set(analysis_deps - framework_proto proto_desc ir_pass_manager graph pass paddle_fluid_api executor pretty_log) +unset(analysis_deps CACHE) +set(analysis_deps # analysis_deps can be extended accross the project + framework_proto proto_desc graph pass paddle_fluid_api executor pretty_log + ir_pass_manager + CACHE INTERNAL "") -cc_library(analysis SRCS pass_manager.cc node.cc data_flow_graph.cc graph_traits.cc subgraph_splitter.cc +add_subdirectory(ir_passes) +add_subdirectory(passes) + +cc_library(analysis_helper SRCS helper.cc DEPS framework_proto proto_desc graph paddle_fluid_api) + +cc_library(ir_pass_manager SRCS ir_pass_manager.cc DEPS graph pass ${INFER_IR_PASSES} analysis_helper) + +cc_library(argument SRCS argument.cc DEPS scope proto_desc) +cc_library(analysis_pass SRCS analysis_pass.cc DEPS proto_desc) + +cc_library(analysis SRCS analyzer.cc - helper.cc - # passes - analysis_pass.cc - fluid_to_data_flow_graph_pass.cc - data_flow_graph_to_fluid_pass.cc - dfg_graphviz_draw_pass.cc - tensorrt_subgraph_pass.cc - tensorrt_subgraph_node_mark_pass.cc - fluid_to_ir_pass.cc - model_store_pass.cc - DEPS ${analysis_deps}) - -cc_test(test_node SRCS node_tester.cc DEPS analysis) + analysis_pass + DEPS ${analysis_deps} analysis_helper + ${INFER_IR_PASSES} + ) + cc_test(test_dot SRCS dot_tester.cc DEPS analysis) -cc_binary(inference_analyzer SRCS analyzer_main.cc DEPS analysis paddle_fluid) - -function (inference_analysis_test TARGET) - if(WITH_TESTING) - set(options "") - set(oneValueArgs "") - set(multiValueArgs SRCS ARGS EXTRA_DEPS) - cmake_parse_arguments(analysis_test "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) - set(mem_opt "") - if(WITH_GPU) - set(mem_opt "--fraction_of_gpu_memory_to_use=0.5") - endif() - cc_test(${TARGET} - SRCS "${analysis_test_SRCS}" - DEPS analysis pass ${GLOB_PASS_LIB} ${analysis_test_EXTRA_DEPS} - ARGS --inference_model_dir=${PYTHON_TESTS_DIR}/book/word2vec.inference.model ${mem_opt} ${analysis_test_ARGS}) - set_tests_properties(${TARGET} PROPERTIES DEPENDS test_word2vec) - endif(WITH_TESTING) + +function(inference_analysis_test TARGET) + if(WITH_TESTING) + set(options "") + set(oneValueArgs "") + set(multiValueArgs SRCS ARGS EXTRA_DEPS) + cmake_parse_arguments(analysis_test "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) + inference_base_test(${TARGET} + SRCS ${analysis_test_SRCS} + DEPS analysis pass ${GLOB_PASS_LIB} ${analysis_test_EXTRA_DEPS} + ARGS --inference_model_dir=${WORD2VEC_MODEL_DIR} ${analysis_test_ARGS}) + endif() endfunction(inference_analysis_test) -inference_analysis_test(test_analyzer SRCS analyzer_tester.cc EXTRA_DEPS paddle_inference_api) -inference_analysis_test(test_data_flow_graph SRCS data_flow_graph_tester.cc) -inference_analysis_test(test_data_flow_graph_to_fluid_pass SRCS data_flow_graph_to_fluid_pass_tester.cc) -inference_analysis_test(test_fluid_to_ir_pass SRCS fluid_to_ir_pass_tester.cc) -inference_analysis_test(test_fluid_to_data_flow_graph_pass SRCS fluid_to_data_flow_graph_pass_tester.cc) -inference_analysis_test(test_subgraph_splitter SRCS subgraph_splitter_tester.cc) -inference_analysis_test(test_dfg_graphviz_draw_pass SRCS dfg_graphviz_draw_pass_tester.cc) -inference_analysis_test(test_tensorrt_subgraph_pass SRCS tensorrt_subgraph_pass_tester.cc) -inference_analysis_test(test_pass_manager SRCS pass_manager_tester.cc) -inference_analysis_test(test_tensorrt_subgraph_node_mark_pass SRCS tensorrt_subgraph_node_mark_pass_tester.cc) -inference_analysis_test(test_model_store_pass SRCS model_store_pass_tester.cc) +inference_analysis_test(test_analyzer SRCS analyzer_tester.cc + EXTRA_DEPS reset_tensor_array paddle_inference_api) diff --git a/paddle/fluid/inference/analysis/analysis_pass.h b/paddle/fluid/inference/analysis/analysis_pass.h index 13805ea4acf936b242bcd86b2faf89813753a9fe..d5a972fab3beae4d4e2e512d1ccda3f0b8356682 100644 --- a/paddle/fluid/inference/analysis/analysis_pass.h +++ b/paddle/fluid/inference/analysis/analysis_pass.h @@ -19,42 +19,34 @@ limitations under the License. */ #include #include "paddle/fluid/framework/framework.pb.h" +#include "paddle/fluid/framework/ir/graph.h" #include "paddle/fluid/inference/analysis/argument.h" -#include "paddle/fluid/inference/analysis/data_flow_graph.h" #include "paddle/fluid/inference/analysis/helper.h" -#include "paddle/fluid/inference/analysis/node.h" namespace paddle { namespace inference { namespace analysis { +/* + * AnalysisPass is a pass used to control the IR passes. + */ class AnalysisPass { public: AnalysisPass() = default; virtual ~AnalysisPass() = default; - // Mutable Pass. - virtual bool Initialize(Argument *argument) { return false; } - // Readonly Pass. - virtual bool Initialize(const Argument &argument) { return false; } - // Virtual method overriden by subclasses to do any necessary clean up after - // all passes have run. - virtual bool Finalize() { return false; } - - // Create a debugger Pass that draw the DFG by graphviz toolkit. - virtual AnalysisPass *CreateGraphvizDebugerPass() const { return nullptr; } - - // Run on a single DataFlowGraph. - virtual void Run(DataFlowGraph *x) = 0; + // Run on a single Graph. + void Run(Argument* argument) { RunImpl(argument); } // Human-readable short representation. virtual std::string repr() const = 0; // Human-readable long description. virtual std::string description() const { return "No DOC"; } -}; -// GraphPass processes on any GraphType. -class DataFlowGraphPass : public AnalysisPass {}; + protected: + // User should implement these. + virtual void RunImpl(Argument* argument) = 0; +}; } // namespace analysis } // namespace inference diff --git a/paddle/fluid/inference/analysis/analyzer.cc b/paddle/fluid/inference/analysis/analyzer.cc index d780592eb9f79e39e34fcd3bd6b086992eaa931f..d82a063d8808591a7ebf6b70e7421a401ce969f7 100644 --- a/paddle/fluid/inference/analysis/analyzer.cc +++ b/paddle/fluid/inference/analysis/analyzer.cc @@ -15,111 +15,26 @@ #include "paddle/fluid/inference/analysis/analyzer.h" #include #include - -#include "paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass.h" -#include "paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.h" -#include "paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.h" -#include "paddle/fluid/inference/analysis/fluid_to_ir_pass.h" -#include "paddle/fluid/inference/analysis/model_store_pass.h" -#include "paddle/fluid/inference/analysis/pass_manager.h" -#include "paddle/fluid/inference/analysis/tensorrt_subgraph_node_mark_pass.h" -#include "paddle/fluid/inference/analysis/tensorrt_subgraph_pass.h" - -DEFINE_bool(IA_enable_tensorrt_subgraph_engine, false, - "Enable subgraph to TensorRT engine for acceleration"); - -DEFINE_bool(IA_enable_ir, false, "Turn on IR support"); - -DEFINE_string(IA_graphviz_log_root, "./", - "Graphviz debuger for data flow graphs."); - -DEFINE_string(IA_output_storage_path, "", "optimized model output path"); +#include "paddle/fluid/inference/analysis/passes/passes.h" +#include "paddle/fluid/string/pretty_log.h" namespace paddle { namespace inference { namespace analysis { -class DfgPassManagerImpl final : public DfgPassManager { - public: - DfgPassManagerImpl() { - // TODO(Superjomn) set the key with pass reprs. - if (!FLAGS_IA_enable_ir) { - AddPass("fluid-to-data-flow-graph", new FluidToDataFlowGraphPass); - } else { - AddPass("fluid-to-ir-pass", new FluidToIrPass); - } - TryAddTensorRtPass(); - AddPass("data-flow-graph-to-fluid", new DataFlowGraphToFluidPass); - if (!FLAGS_IA_output_storage_path.empty()) { - AddPass("model-store-pass", new ModelStorePass); - } - } - - std::string repr() const override { return "dfg-pass-manager"; } - std::string description() const override { return "DFG pass manager."; } - - private: - void AddPass(const std::string& name, AnalysisPass* pass) { - VLOG(3) << "Adding pass " << name; - Register(name, pass); - AddGraphvizDebugerPass(pass); - } - - void TryAddTensorRtPass() { - if (FLAGS_IA_enable_tensorrt_subgraph_engine) { - auto trt_teller = [&](const Node* node) { - std::unordered_set teller_set( - {"mul", "conv2d", "pool2d", "relu", "softmax", "sigmoid", - "depthwise_conv2d", "batch_norm", "concat", "tanh", "pad", - "elementwise_add", "dropout"}); - if (!node->IsFunction()) return false; +Analyzer::Analyzer() {} - const auto* func = static_cast(node); - if (teller_set.count(func->func_type())) { - return true; - } else { - return false; - } - }; +void Analyzer::Run(Argument *argument) { RunAnalysis(argument); } - AddPass("tensorrt-subgraph-marker", - new TensorRTSubgraphNodeMarkPass(trt_teller)); - AddPass("tensorrt-subgraph", new TensorRTSubGraphPass(trt_teller)); - } +void Analyzer::RunAnalysis(Argument *argument) { + PADDLE_ENFORCE(argument->analysis_passes_valid(), + "analsis_passes is not valid in the argument."); + for (auto &pass : argument->analysis_passes()) { + string::PrettyLogH1("--- Running analysis [%s]", pass); + auto *ptr = PassRegistry::Global().Retreive(pass); + PADDLE_ENFORCE_NOT_NULL(ptr, "no analysis pass called %s", pass); + ptr->Run(argument); } - - // Add the graphviz debuger pass if the parent pass has one. - void AddGraphvizDebugerPass(AnalysisPass* pass) { - auto* debuger_pass = pass->CreateGraphvizDebugerPass(); - if (debuger_pass) { - Register(debuger_pass->repr(), debuger_pass); - } - } -}; - -Analyzer::Analyzer() { Register("manager1", new DfgPassManagerImpl); } - -void Analyzer::Run(Argument* argument) { - std::vector passes; - for (auto& pass : all_ir_passes_) { - if (!disabled_ir_passes_.count(pass)) { - passes.push_back(pass); - passes.push_back("graph_viz_pass"); // add graphviz for debug. - } - } - passes.push_back("graph_viz_pass"); - argument->Set(kFluidToIrPassesAttr, new std::vector(passes)); - - for (auto& x : data_) { - PADDLE_ENFORCE(x->Initialize(argument)); - x->RunAll(); - PADDLE_ENFORCE(x->Finalize()); - } -} - -Analyzer& Analyzer::DisableIrPasses(const std::vector& passes) { - disabled_ir_passes_.insert(passes.begin(), passes.end()); - return *this; } } // namespace analysis diff --git a/paddle/fluid/inference/analysis/analyzer.h b/paddle/fluid/inference/analysis/analyzer.h index 765145cb7da44ca13c5394ad1dc2e879e69d69d1..a6de18db60072fc2a5310893a885994d675dd8b6 100644 --- a/paddle/fluid/inference/analysis/analyzer.h +++ b/paddle/fluid/inference/analysis/analyzer.h @@ -40,47 +40,21 @@ limitations under the License. */ #include #include "paddle/fluid/inference/analysis/analysis_pass.h" #include "paddle/fluid/inference/analysis/flags.h" -#include "paddle/fluid/inference/analysis/pass_manager.h" namespace paddle { namespace inference { namespace analysis { -class Analyzer : public OrderedRegistry { +class Analyzer final { public: - // Register all the pass-managers. Analyzer(); void Run(Argument* argument); - Analyzer& DisableIrPasses(const std::vector& passes); - DISABLE_COPY_AND_ASSIGN(Analyzer); - private: - // All avaiable IR passes. - // The bigger fuse comes first, so that the small operators prefer to be - // merged in a larger fuse op. The small fusion will not break the pattern of - // larger fusion. - const std::vector all_ir_passes_{{ - // Manual update the passes here. - "infer_clean_graph_pass", // - "attention_lstm_fuse_pass", // - "embedding_fc_lstm_fuse_pass", // - "fc_lstm_fuse_pass", // - "mul_lstm_fuse_pass", // - "fc_gru_fuse_pass", // - "mul_gru_fuse_pass", // - "seq_concat_fc_fuse_pass", // - "fc_fuse_pass", // - "conv_bn_fuse_pass", // - "conv_eltwiseadd_bn_fuse_pass", // -#ifdef PADDLE_WITH_MKLDNN - "conv_relu_mkldnn_fuse_pass", // -#endif - }}; - - std::unordered_set disabled_ir_passes_; + protected: + void RunAnalysis(Argument* argument); }; } // namespace analysis diff --git a/paddle/fluid/inference/analysis/analyzer_main.cc b/paddle/fluid/inference/analysis/analyzer_main.cc deleted file mode 100644 index 5e1fe3eb797cdced56a61aa2db0c3d18601824f8..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/analyzer_main.cc +++ /dev/null @@ -1,33 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -/* - * This file implements analysizer -- an executation help to analyze and - * optimize trained model. - */ -#include "paddle/fluid/inference/analysis/analyzer.h" -#include -#include - -int main(int argc, char** argv) { - google::ParseCommandLineFlags(&argc, &argv, true); - using paddle::inference::analysis::Analyzer; - using paddle::inference::analysis::Argument; - - Argument argument; - Analyzer analyzer; - analyzer.Run(&argument); - - return 0; -} diff --git a/paddle/fluid/inference/analysis/analyzer_tester.cc b/paddle/fluid/inference/analysis/analyzer_tester.cc index 5430e5c1ef1c70d27295ebc1a9bd427cd95f006a..c814ce454840a2c6f3829599b86c9e127d07e4f4 100644 --- a/paddle/fluid/inference/analysis/analyzer_tester.cc +++ b/paddle/fluid/inference/analysis/analyzer_tester.cc @@ -19,6 +19,7 @@ #include "paddle/fluid/inference/analysis/ut_helper.h" #include "paddle/fluid/inference/api/paddle_inference_api.h" #include "paddle/fluid/inference/api/paddle_inference_pass.h" +#include "paddle/fluid/platform/port.h" namespace paddle { namespace inference { @@ -27,21 +28,27 @@ namespace analysis { using namespace framework; // NOLINT TEST(Analyzer, analysis_without_tensorrt) { - FLAGS_IA_enable_tensorrt_subgraph_engine = false; Argument argument; - argument.fluid_model_dir.reset(new std::string(FLAGS_inference_model_dir)); + argument.SetModelDir(FLAGS_inference_model_dir); + argument.SetIrAnalysisPasses({"infer_clean_graph_pass"}); + argument.SetUseGPU(false); + argument.SetAnalysisPasses({"ir_graph_build_pass", "ir_analysis_pass", + "ir_params_sync_among_devices_pass"}); + Analyzer analyser; analyser.Run(&argument); } TEST(Analyzer, analysis_with_tensorrt) { - FLAGS_IA_enable_tensorrt_subgraph_engine = true; Argument argument; - argument.Set("minimum_subgraph_size", new int(0)); - argument.Set("max_batch_size", new int(3)); - argument.Set("workspace_size", new int(1 << 20)); - argument.Set("precision_mode", new std::string("FP32")); - argument.fluid_model_dir.reset(new std::string(FLAGS_inference_model_dir)); + argument.SetTensorRtMaxBatchSize(3); + argument.SetTensorRtWorkspaceSize(1 << 20); + argument.SetModelDir(FLAGS_inference_model_dir); + argument.SetIrAnalysisPasses({"infer_clean_graph_pass"}); + argument.SetUseGPU(false); + argument.SetAnalysisPasses({"ir_graph_build_pass", "ir_analysis_pass", + "ir_params_sync_among_devices_pass"}); + Analyzer analyser; analyser.Run(&argument); } @@ -66,18 +73,19 @@ void TestWord2vecPrediction(const std::string& model_path) { std::vector outputs; CHECK(predictor->Run(slots, &outputs)); - PADDLE_ENFORCE(outputs.size(), 1UL); + PADDLE_ENFORCE_EQ(outputs.size(), 1UL); // Check the output buffer size and result of each tid. - PADDLE_ENFORCE(outputs.front().data.length(), 33168UL); + PADDLE_ENFORCE_EQ(outputs.front().data.length(), 33168UL); float result[5] = {0.00129761, 0.00151112, 0.000423564, 0.00108815, 0.000932706}; const size_t num_elements = outputs.front().data.length() / sizeof(float); // The outputs' buffers are in CPU memory. - for (size_t i = 0; i < std::min(5UL, num_elements); i++) { - LOG(INFO) << "data: " - << static_cast(outputs.front().data.data())[i]; - PADDLE_ENFORCE(static_cast(outputs.front().data.data())[i], - result[i]); + for (size_t i = 0; i < std::min(static_cast(5UL), num_elements); + i++) { + LOG(INFO) << "data: " << static_cast(outputs.front().data.data())[i] + << " result: " << result[i]; + EXPECT_NEAR(static_cast(outputs.front().data.data())[i], result[i], + 1e-3); } } diff --git a/paddle/fluid/inference/analysis/argument.h b/paddle/fluid/inference/analysis/argument.h index e8fb0775b45761f64fd6fd28306c35b76d1e40c4..a736ca393ccb7168a9faf650a6bce13f35fffca8 100644 --- a/paddle/fluid/inference/analysis/argument.h +++ b/paddle/fluid/inference/analysis/argument.h @@ -23,15 +23,31 @@ #pragma once +#include +#include #include +#include +#include +#include +#include + +#include "paddle/fluid/framework/ir/graph.h" #include "paddle/fluid/framework/program_desc.h" -#include "paddle/fluid/inference/analysis/data_flow_graph.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/inference/api/paddle_analysis_config.h" #include "paddle/fluid/platform/variant.h" namespace paddle { namespace inference { namespace analysis { +using framework::ir::Graph; + +#ifdef PADDLE_WITH_MKLDNN +using VarQuantScale = + std::unordered_map>; +#endif + /* * The argument definition of both Pass and PassManagers. * @@ -39,75 +55,144 @@ namespace analysis { */ struct Argument { Argument() = default; - explicit Argument(const std::string& fluid_model_dir) - : fluid_model_dir(new std::string(fluid_model_dir)) {} - // The directory of the trained model. - std::unique_ptr fluid_model_dir; - // The path of `__model__` and `param`, this is used when the file name of - // model and param is changed. - std::unique_ptr fluid_model_program_path; - std::unique_ptr fluid_model_param_path; - - // The graph that process by the Passes or PassManagers. - std::unique_ptr main_dfg; - - // The original program desc. - std::unique_ptr origin_program_desc; - - // The processed program desc. - std::unique_ptr transformed_program_desc; - - // The output storage path of ModelStorePass. - std::unique_ptr model_output_store_path; - - // Support for any other attributes. - template - void Set(const std::string& key, T* data) { - PADDLE_ENFORCE_NOT_NULL(data); - PADDLE_ENFORCE(!attrs_.count(key), "Duplicate set Argument's attr [%s]", - key); - attrs_[key] = data; - attr_deleters_[key] = [data, key]() { - VLOG(3) << "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"; - VLOG(3) << "argument delete attr: " << key; - delete data; - }; - } - - bool Has(const std::string& name) const { return attrs_.count(name); } - - template - T* Release(const std::string& key) { - PADDLE_ENFORCE(attrs_.count(key)); - auto* res = boost::any_cast(attrs_.at(key)); - attrs_.erase(key); - attr_deleters_.erase(key); - return res; - } - - template - T& Get(const std::string& key) { - PADDLE_ENFORCE(Has(key)); - return *boost::any_cast(attrs_.at(key)); - } - - ~Argument() { - for (auto& item : attr_deleters_) { - item.second(); - } - } + explicit Argument(const std::string& model_dir) { SetModelDir(model_dir); } + + using unique_ptr_t = std::unique_ptr>; + using fusion_statis_t = std::unordered_map; + using engine_opt_info_t = std::map; + using anakin_max_shape_t = std::map>; + + bool Has(const std::string& key) const { return valid_fields_.count(key); } + +#define DECL_ARGUMENT_FIELD(field__, Field, type__) \ + public: \ + type__& field__() { \ + PADDLE_ENFORCE(Has(#field__)); \ + return field__##_; \ + } \ + void Set##Field(const type__& x) { \ + field__##_ = x; \ + valid_fields_.insert(#field__); \ + } \ + DECL_ARGUMENT_FIELD_VALID(field__); \ + type__* field__##_ptr() { return &field__##_; } \ + \ + private: \ + type__ field__##_; + +#define DECL_ARGUMENT_FIELD_VALID(field__) \ + bool field__##_valid() { return Has(#field__); } + +#define DECL_ARGUMENT_UNIQUE_FIELD(field__, Field, type__) \ + public: \ + type__& field__() { \ + PADDLE_ENFORCE_NOT_NULL(field__##_); \ + PADDLE_ENFORCE(Has(#field__)); \ + return *static_cast(field__##_.get()); \ + } \ + void Set##Field(type__* x) { \ + field__##_ = \ + unique_ptr_t(x, [](void* x) { delete static_cast(x); }); \ + valid_fields_.insert(#field__); \ + } \ + void Set##Field##NotOwned(type__* x) { \ + valid_fields_.insert(#field__); \ + field__##_ = unique_ptr_t(x, [](void* x) {}); \ + } \ + DECL_ARGUMENT_FIELD_VALID(field__); \ + type__* field__##_ptr() { \ + PADDLE_ENFORCE(Has(#field__)); \ + return static_cast(field__##_.get()); \ + } \ + type__* Release##Field() { \ + PADDLE_ENFORCE(Has(#field__)); \ + valid_fields_.erase(#field__); \ + return static_cast(field__##_.release()); \ + } \ + \ + private: \ + unique_ptr_t field__##_; + + DECL_ARGUMENT_FIELD(predictor_id, PredictorID, int); + // Model path + DECL_ARGUMENT_FIELD(model_dir, ModelDir, std::string); + // Model specified with program and parameters files. + DECL_ARGUMENT_FIELD(model_program_path, ModelProgramPath, std::string); + DECL_ARGUMENT_FIELD(model_params_path, ModelParamsPath, std::string); + DECL_ARGUMENT_FIELD(model_from_memory, ModelFromMemory, bool); + DECL_ARGUMENT_FIELD(engine_opt_info, EngineOptInfo, engine_opt_info_t); + + // The overall graph to work on. + DECL_ARGUMENT_UNIQUE_FIELD(main_graph, MainGraph, framework::ir::Graph); + // The overall Scope to work on. + DECL_ARGUMENT_UNIQUE_FIELD(scope, Scope, framework::Scope); + + // The default program, loaded from disk. + DECL_ARGUMENT_UNIQUE_FIELD(main_program, MainProgram, framework::ProgramDesc); + + // The ir passes to perform in analysis phase. + DECL_ARGUMENT_FIELD(ir_analysis_passes, IrAnalysisPasses, + std::vector); + DECL_ARGUMENT_FIELD(analysis_passes, AnalysisPasses, + std::vector); + + // Pass a set of op types to enable its mkldnn kernel + DECL_ARGUMENT_FIELD(mkldnn_enabled_op_types, MKLDNNEnabledOpTypes, + std::unordered_set); + +#ifdef PADDLE_WITH_MKLDNN + // A set of op types to enable their quantized kernels + DECL_ARGUMENT_FIELD(quantize_enabled_op_types, QuantizeEnabledOpTypes, + std::unordered_set); + + // A set of op IDs to exclude from enabling their quantized kernels + DECL_ARGUMENT_FIELD(quantize_excluded_op_ids, QuantizeExcludedOpIds, + std::unordered_set); + + // Scales for variables to be quantized + DECL_ARGUMENT_FIELD(quant_var_scales, QuantVarScales, VarQuantScale); +#endif + + // Passed from config. + DECL_ARGUMENT_FIELD(use_gpu, UseGPU, bool); + DECL_ARGUMENT_FIELD(gpu_device_id, GPUDeviceId, int); + DECL_ARGUMENT_FIELD(use_tensorrt, UseTensorRT, bool); + DECL_ARGUMENT_FIELD(tensorrt_max_batch_size, TensorRtMaxBatchSize, int); + DECL_ARGUMENT_FIELD(tensorrt_workspace_size, TensorRtWorkspaceSize, int); + DECL_ARGUMENT_FIELD(tensorrt_min_subgraph_size, TensorRtMinSubgraphSize, int); + DECL_ARGUMENT_FIELD(tensorrt_precision_mode, TensorRtPrecisionMode, + AnalysisConfig::Precision); + DECL_ARGUMENT_FIELD(tensorrt_use_static_engine, TensorRtUseStaticEngine, + bool); + + DECL_ARGUMENT_FIELD(anakin_max_input_shape, AnakinMaxInputShape, + anakin_max_shape_t); + DECL_ARGUMENT_FIELD(anakin_max_batch_size, AnakinMaxBatchSize, int); + DECL_ARGUMENT_FIELD(anakin_min_subgraph_size, AnakinMinSubgraphSize, int); + DECL_ARGUMENT_FIELD(use_anakin, UseAnakin, bool); + + // Memory optimized related. + DECL_ARGUMENT_FIELD(enable_memory_optim, EnableMemoryOptim, bool); + DECL_ARGUMENT_FIELD(static_memory_optim, StaticMemoryOptim, bool); + DECL_ARGUMENT_FIELD(static_memory_optim_force_update, + StaticMemoryOptimForceUpdate, bool); + // Indicate which kind of sort algorithm is used for operators, the memory + // optimization relays on the sort algorithm. + DECL_ARGUMENT_FIELD(memory_optim_sort_kind, MemoryOptimSortKind, int); + + // The program transformed by IR analysis phase. + DECL_ARGUMENT_UNIQUE_FIELD(ir_analyzed_program, IrAnalyzedProgram, + framework::proto::ProgramDesc); + + DECL_ARGUMENT_FIELD(fusion_statis, FusionStatis, fusion_statis_t); private: - std::unordered_map attrs_; - std::unordered_map> attr_deleters_; + std::unordered_set valid_fields_; }; -#define UNLIKELY(condition) __builtin_expect(static_cast(condition), 0) -#define ANALYSIS_ARGUMENT_CHECK_FIELD(field__) \ - if (UNLIKELY(!(field__))) { \ - LOG(ERROR) << "field " << #field__ << " should be set."; \ - return false; \ - } +#define ARGUMENT_CHECK_FIELD(argument__, fieldname__) \ + PADDLE_ENFORCE(argument__->Has(#fieldname__), \ + "the argument field [%s] should be set", #fieldname__); } // namespace analysis } // namespace inference diff --git a/paddle/fluid/inference/analysis/data_flow_graph.cc b/paddle/fluid/inference/analysis/data_flow_graph.cc deleted file mode 100644 index 8c7d58678fd29cb25d13d64a08e6c6f26f242d8b..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/data_flow_graph.cc +++ /dev/null @@ -1,496 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/inference/analysis/data_flow_graph.h" -#include "paddle/fluid/inference/analysis/dot.h" -#include "paddle/fluid/inference/analysis/node.h" - -namespace paddle { -namespace inference { -namespace analysis { -using ir_node_t = framework::ir::Node; -using ir_graph_t = framework::ir::Graph; - -// It is a better idea that the inputs and outputs of this graph is set manually -// before, but there must be a Pass that helps to prune the unnecessary ops that -// do not contribute to the given targets, so in this pass, analysis and get the -// inputs and outputs is OK. -void DataFlowGraph::Build() { - inputs_.clear(); - outputs_.clear(); - std::unordered_set ins; - std::unordered_set outs; - for (auto &node : nodes.nodes()) { - for (auto *in : node->inlinks) { - ins.insert(in); - } - for (auto *out : node->outlinks) { - outs.insert(out); - } - } - - // The nodes that in ins but not in outs is the graph's inputs - // similarly, the nodes that in outs but not in ins is the graphs' outputs - for (auto *in : ins) { - if (!outs.count(in)) { - inputs_.push_back(in); - } - } - for (auto *out : outs) { - if (!ins.count(out)) { - outputs_.push_back(out); - } - } - - Clean(); -} - -void DataFlowGraph::Build(const framework::proto::ProgramDesc &prog) { - // insert vars - // The `var2id` keeps a map from a variable's name to its Node-id, the Node-id - // will keep updating to its latest alias during the graph-building. - std::unordered_map var2id; - auto &main_block = prog.blocks(framework::kRootBlockIndex); - for (int i = 0; i < main_block.vars_size(); i++) { - const auto &var = main_block.vars(i); - auto *v = nodes.Create(Node::Type::kValue); - v->SetName(var.name()); - v->SetPbDesc(const_cast(static_cast(&var))); - v->SetPbMsg(var.SerializeAsString()); - var2id[var.name()] = v->id(); - } - - // The variables in a SSA can only write once, so if a variable is written - // multiple times(quite common in our ProgramDesc design), multiple alias - // Nodes of this variable will be created, and each will just write once. - - // An set that keep all the names of the variables(the original, not alias) - // that have been written(as outputs). Once an Op's output variable hit the - // set, it should create a new alias and update the global alias for this - // variable. And that make a Data Flow Graph a SSA. - std::unordered_set unique_written_vars; - for (int i = 0; i < main_block.ops_size(); i++) { - const auto &op = main_block.ops(i); - auto *o = nodes.Create(Node::Type::kFunction); - o->SetName(op.type()); - static_cast(o)->SetFuncType(op.type()); - // Link to the original protobuf message's memory, make it easier to - // generate from a data flow graph to fluid ProgramDesc. - o->SetPbDesc(const_cast(static_cast(&op))); - o->SetPbMsg(op.SerializeAsString()); - - // set inputs and outputs - for (int j = 0; j < op.inputs_size(); j++) { - auto &in_var = op.inputs(j); - for (int k = 0; k < in_var.arguments_size(); k++) { - auto *in = nodes.GetMutable(var2id.at(in_var.arguments(k))); - in->outlinks.push_back(o); - o->inlinks.push_back(in); - unique_written_vars.insert(in); - } - } - for (int j = 0; j < op.outputs_size(); j++) { - auto &out_var = op.outputs(j); - for (int k = 0; k < out_var.arguments_size(); k++) { - auto *out = nodes.GetMutable(var2id[out_var.arguments(k)]); - if (unique_written_vars.count(out)) { - // Loop found, for example, a = op(a), use SSA, change to a1 = op(a). - auto *out_alias = nodes.Create(Node::Type::kValue); - out_alias->SetName(out->name()); - out_alias->SetPbDesc(out->pb_desc()); - out_alias->SetPbMsg(out->pb_msg()); - var2id[out_alias->name()] = - out_alias->id(); // update variable's alias Node - LOG(INFO) << "loop found in graph, create SSA alias node [" - << out_alias->repr() << "] for [" << out->repr() << "]"; - out = out_alias; - } - out->inlinks.push_back(o); - o->outlinks.push_back(out); - } - } - } - // Analysis and extract the inputs and outputs of this graph. - Build(); -} - -void DataFlowGraph::Build(const framework::ir::Graph &graph) { - // Create nodes - std::unordered_map ir_node_map; - for (auto *ir_node : graph.Nodes()) { - Node *x{nullptr}; - if (ir_node->IsOp()) { - PADDLE_ENFORCE(ir_node->Op()); - VLOG(4) << "get op " << ir_node << " " << ir_node->Name(); - x = nodes.Create(Node::Type::kFunction); - x->attr("ir_node").Pointer() = ir_node; - PADDLE_ENFORCE(ir_node->Op()->Proto()); - x->SetName(ir_node->Op()->Proto()->type()); - x->SetPbMsg(ir_node->Op()->Proto()->SerializeAsString()); - } else if (ir_node->IsVar()) { - // Not create a Node for IR ControlDepVar, considering Inference currently - // just used in single thread scenerio. - VLOG(4) << "get var " << ir_node->Name(); - x = nodes.Create(Node::Type::kValue); - x->attr("ir_node").Pointer() = ir_node; - x->SetName(ir_node->Name()); - // x->SetPbMsg(ir_node->Var()->Proto()->SerializeAsString()); - } else { - PADDLE_THROW("Failed to create an Node from IR, unknown type"); - } - ir_node_map.emplace(ir_node, x); - } - VLOG(4) << "finish creating Nodes"; - - VLOG(4) << "to create edge"; - // Create links - for (auto *ir_node : graph.Nodes()) { - auto it = ir_node_map.find(ir_node); - // Skip ControlDepVar. - if (it == ir_node_map.end()) continue; - auto *node = it->second; - for (auto *x : ir_node->inputs) { - if (!ir_node_map.count(x)) continue; - node->inlinks.push_back(ir_node_map.at(x)); - } - for (auto *x : ir_node->outputs) { - if (!ir_node_map.count(x)) continue; - node->outlinks.push_back(ir_node_map.at(x)); - } - } - - Build(); - PADDLE_ENFORCE(!inputs_.empty(), - "Can't deduce any inputs from the graph, Is the graph empty?"); - - ir_graph = &graph; - VLOG(3) << "finished build from IR"; -} - -void DataFlowGraph::Clean() { - for (auto &node : nodes.nodes()) { - std::unordered_set inlinks_set(node->inlinks.begin(), - node->inlinks.end()); - std::unordered_set outlinks_set(node->outlinks.begin(), - node->outlinks.end()); - if (inlinks_set.size() < node->inlinks.size()) { - node->inlinks.assign(inlinks_set.begin(), inlinks_set.end()); - } - if (outlinks_set.size() < node->outlinks.size()) { - node->outlinks.assign(outlinks_set.begin(), outlinks_set.end()); - } - } -} - -std::string DataFlowGraph::DotString() const { - Dot dot; - - // Add nodes - for (size_t i = 0; i < nodes.size(); i++) { - const Node &node = nodes.Get(i); - dot.AddNode(node.repr(), node.dot_attrs()); - } - - // Add edges - for (size_t i = 0; i < nodes.size(); i++) { - const Node &node = nodes.Get(i); - for (auto &in : node.inlinks) { - dot.AddEdge(in->repr(), node.repr(), {}); - } - } - return dot.Build(); -} - -std::string DataFlowGraph::HumanReadableInfo(bool show_values, - bool show_functions) const { - std::stringstream values, functions; - for (auto &n : nodes.nodes()) { - if (show_values && n->IsValue()) { - values << n->repr() << "\n"; - } - if (show_functions && n->IsFunction()) { - functions << n->repr() << "\n"; - } - } - return "Values:\n" + values.str() + "\n\n" + "Functions:\n" + functions.str(); -} - -// -// NodesBFSIterator -// - -GraphTraits::NodesBFSIterator::NodesBFSIterator( - const std::vector &source) - : queue_(source.begin(), source.end()) {} - -GraphTraits::NodesBFSIterator::NodesBFSIterator( - GraphTraits::NodesBFSIterator &&other) noexcept - : queue_(std::move(other.queue_)), - visited_(std::move(other.visited_)) {} - -GraphTraits::NodesBFSIterator::NodesBFSIterator( - const GraphTraits::NodesBFSIterator &other) - : queue_(other.queue_), visited_(other.visited_) {} - -Node &GraphTraits::NodesBFSIterator::operator*() { - PADDLE_ENFORCE(!queue_.empty()); - return *queue_.front(); -} - -Node *GraphTraits::NodesBFSIterator::operator->() { - PADDLE_ENFORCE(!queue_.empty()); - return queue_.front(); -} - -GraphTraits::NodesBFSIterator & -GraphTraits::NodesBFSIterator::operator=( - const GraphTraits::NodesBFSIterator &other) { - queue_ = other.queue_; - visited_ = other.visited_; - return *this; -} - -GraphTraits::NodesBFSIterator - &GraphTraits::NodesBFSIterator::operator++() { - PADDLE_ENFORCE(!queue_.empty()); - auto *cur = queue_.front(); - visited_.insert(cur); - queue_.pop_front(); - for (auto *output : cur->outlinks) { - if (!visited_.count(output)) { - queue_.push_back(output); - visited_.insert(output); - } - } - return *this; -} - -bool GraphTraits::NodesBFSIterator::operator==( - const GraphTraits::NodesBFSIterator &other) { - if (queue_.empty()) return other.queue_.empty(); - if ((!queue_.empty()) && (!other.queue_.empty())) { - return queue_.front() == other.queue_.front() && - visited_.size() == other.visited_.size(); - // equality of queue and - // visited. Just a light but week implementation. - } - return false; -} - -// -// NodesDFSIterator -// -GraphTraits::NodesDFSIterator::NodesDFSIterator( - const std::vector &source) { - for (auto *x : source) stack_.push(x); -} - -GraphTraits::NodesDFSIterator::NodesDFSIterator( - GraphTraits::NodesDFSIterator &&other) noexcept - : stack_(std::move(other.stack_)), - visited_(std::move(other.visited_)) {} - -GraphTraits::NodesDFSIterator::NodesDFSIterator( - const GraphTraits::NodesDFSIterator &other) - : stack_(other.stack_), visited_(other.visited_) {} - -Node &GraphTraits::NodesDFSIterator::operator*() { - PADDLE_ENFORCE(!stack_.empty()); - return *stack_.top(); -} - -GraphTraits::NodesDFSIterator - &GraphTraits::NodesDFSIterator::operator++() { - if (stack_.empty()) return *this; - visited_.insert(stack_.top()); - auto *cur = stack_.top(); - stack_.pop(); - for (auto *x : cur->outlinks) { - if (!visited_.count(x)) { - stack_.push(x); - visited_.insert(x); - } - } - return *this; -} -bool GraphTraits::NodesDFSIterator::operator==( - const GraphTraits::NodesDFSIterator &other) { - if (stack_.empty()) return other.stack_.empty(); - if ((!stack_.empty()) && (!other.stack_.empty())) { - return stack_.top() == other.stack_.top(); - } - return false; -} - -GraphTraits::NodesDFSIterator & -GraphTraits::NodesDFSIterator::operator=( - const GraphTraits::NodesDFSIterator &other) { - stack_ = other.stack_; - visited_ = other.visited_; - return *this; -} -Node *GraphTraits::NodesDFSIterator::operator->() { - return stack_.top(); -} - -inline bool CheckNodeIndegreeEquals(const Node &node, size_t n) { - return node.inlinks.size() == n; -} - -GraphTraits::NodesTSIterator::NodesTSIterator( - const std::vector &source) { - PADDLE_ENFORCE(!source.empty(), - "Start points of topological sorting should not be empty!"); - // CHECK all the inputs' in-degree is 0 - for (auto *node : source) { - PADDLE_ENFORCE(CheckNodeIndegreeEquals(*node, 0)); - } - - std::unordered_set visited; - std::unordered_set to_visit{source.begin(), source.end()}; - - std::vector inlink_visited; - while (!to_visit.empty()) { - std::vector queue(to_visit.begin(), to_visit.end()); - for (auto *p : queue) { - if (p->deleted()) { - visited.insert(p); - to_visit.erase(p); - continue; - } - inlink_visited.clear(); - - std::copy_if(p->inlinks.begin(), p->inlinks.end(), - std::back_inserter(inlink_visited), - [&](Node *x) { return visited.count(x); }); - - if (inlink_visited.size() == p->inlinks.size()) { - sorted_.push_back(p); - for (auto *_ : p->outlinks) { - if (!visited.count(_)) { - to_visit.insert(_); - } - } - - to_visit.erase(p); - visited.insert(p); - } - } - } -} - -GraphTraits::NodesTSIterator::NodesTSIterator( - const paddle::inference::analysis::GraphTraits< - DataFlowGraph>::NodesTSIterator &other) - : sorted_(other.sorted_), cursor_(other.cursor_) {} - -Node &GraphTraits::NodesTSIterator::operator*() { - PADDLE_ENFORCE_LT(cursor_, sorted_.size()); - return *sorted_[cursor_]; -} - -paddle::inference::analysis::GraphTraits::NodesTSIterator - &GraphTraits::NodesTSIterator::operator++() { - if (++cursor_ >= sorted_.size()) { - sorted_.clear(); - cursor_ = 0; - } - return *this; -} -paddle::inference::analysis::GraphTraits::NodesTSIterator & -GraphTraits::NodesTSIterator::operator=( - const paddle::inference::analysis::GraphTraits< - DataFlowGraph>::NodesTSIterator &other) { - cursor_ = other.cursor_; - sorted_ = other.sorted_; - return *this; -} - -bool GraphTraits::NodesTSIterator::operator==( - const paddle::inference::analysis::GraphTraits< - DataFlowGraph>::NodesTSIterator &other) { - return sorted_ == other.sorted_ && cursor_ == other.cursor_; -} - -Node *GraphTraits::NodesTSIterator::operator->() { - PADDLE_ENFORCE_LT(cursor_, sorted_.size()); - return sorted_[cursor_]; -} - -std::pair, std::vector> -ExtractInputAndOutputOfSubGraph(std::vector &graph) { // NOLINT - std::unordered_set nodes(graph.begin(), graph.end()); - std::unordered_set inputs; - std::unordered_set outputs; - // Input a Value, check whether its inlink is in the subgraph. - auto inlink_in_subgraph = [&](Node *n) { - for (auto *in : n->inlinks) { - if (nodes.count(in)) return true; - } - return false; - }; - - for (auto &node : graph) { - for (auto *in : node->inlinks) { - // The Value that is written by nodes inside a sub-graph shouldn't be the - // input of the sub-graph. - if (!nodes.count(in) && in->type() == Node::Type::kValue && - !inlink_in_subgraph(in)) { - inputs.insert(in); - } - } - for (auto *out : node->outlinks) { - if (!nodes.count(out) && out->type() == Node::Type::kValue) { - outputs.insert(out); - } - } - } - return std::make_pair(std::vector(inputs.begin(), inputs.end()), - std::vector(outputs.begin(), outputs.end())); -} - -// Filter the Intermediate results of the subgraph node. -void FilterRedundantOutputOfSubGraph(DataFlowGraph *graph) { - std::vector op_nodes; - for (auto &node : GraphTraits(*graph).nodes_in_TS()) { - if (node.type() == Node::Type::kValue || node.deleted()) { - continue; - } - op_nodes.push_back(&node); - } - size_t op_num = op_nodes.size(); - for (size_t i = 0; i < op_num; i++) { - if (op_nodes[i]->type() == Node::Type::kFunction) continue; - std::unordered_set follow_up_input_names; - for (size_t j = i + 1; j < op_num; j++) { - for (auto *in : op_nodes[j]->inlinks) { - follow_up_input_names.insert(in->name()); - } - } - std::vector filtered_subgraph_outlinks; - for (auto *out : op_nodes[i]->outlinks) { - if (follow_up_input_names.count(out->name())) { - filtered_subgraph_outlinks.push_back(out); - } else { - out->SetDeleted(); - } - } - // The filtered_subgraph_outlinks may be empty. - op_nodes[i]->outlinks = filtered_subgraph_outlinks; - } -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/data_flow_graph.h b/paddle/fluid/inference/analysis/data_flow_graph.h deleted file mode 100644 index 437e097acd24aad384df6712ce0de6106b3b5c65..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/data_flow_graph.h +++ /dev/null @@ -1,209 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -/* - * Data flow graph is an pass that build the basic graph. It contains a graph - * and the iterators that enable the iteration over the graph. - */ - -#pragma once - -#include -#include -#include -#include -#include -#include - -#include "paddle/fluid/framework/ir/graph.h" -#include "paddle/fluid/inference/analysis/graph_traits.h" -#include "paddle/fluid/inference/analysis/node.h" -#include "paddle/fluid/platform/enforce.h" - -namespace paddle { -namespace inference { -namespace analysis { - -/* - * DataFlowGraph - A container of Value and Function Nodes. - * - * This is the base graph for any other type of graphs, such as SSA or CFG. - */ -struct DataFlowGraph { - NodeMap nodes; - // inputs and outputs are deduced from the graph. - // Used to interact with IR. - const framework::ir::Graph *ir_graph{nullptr}; - - // Extract inputs and outputs of the graph. - void Build(); - - void Build(const framework::proto::ProgramDesc &prog); - - // Build a graph from ir::Graph. - void Build(const framework::ir::Graph &graph); - - // Get an attribute. - AnyAttr &Attr(const std::string &key) { return attrs_[key]; } - - // Output a DOT graph file for debug. - std::string DotString() const; - - std::string HumanReadableInfo(bool show_values = true, - bool show_functions = true) const; - - const std::vector &inputs() const { - PADDLE_ENFORCE(!inputs_.empty(), - "No inputs are deduced, need to Build() first."); - return inputs_; - } - const std::vector &outputs() const { - PADDLE_ENFORCE(!outputs_.empty(), - "No outputs are deduced, need to Build() first."); - return outputs_; - } - - private: - mutable std::vector inputs_; - mutable std::vector outputs_; - std::unordered_map attrs_; - - // Remove duplicate edges and so on. - void Clean(); -}; - -/* - * An graph trait help to traverse the graph using BFS. - * The BFS start from a graph's inputs, the graph should be fully-connected, so - * that the iterator can reach the end. - */ -template <> -struct GraphTraits { - // BFS iterator on nodes. - struct NodesBFSIterator - : public std::iterator { - NodesBFSIterator() = default; - explicit NodesBFSIterator(const std::vector &source); - NodesBFSIterator(NodesBFSIterator &&other) noexcept; - // NOTE Heavy to use. - NodesBFSIterator(const NodesBFSIterator &other); - - Node &operator*(); - NodesBFSIterator &operator++(); - Node *operator->(); - // TODO(Superjomn) current implementation just compare the first - // element, need to compare the graph and all the elements in the queue and - // set. - NodesBFSIterator &operator=(const NodesBFSIterator &other); - bool operator==(const NodesBFSIterator &other); - bool operator!=(const NodesBFSIterator &other) { return !(*this == other); } - - private: - std::deque queue_; - std::unordered_set visited_; - }; - - // DFS iterator on nodes. - struct NodesDFSIterator - : public std::iterator { - NodesDFSIterator() = default; - NodesDFSIterator(const std::vector &source); - NodesDFSIterator(NodesDFSIterator &&other) noexcept; - NodesDFSIterator(const NodesDFSIterator &other); - - Node &operator*(); - NodesDFSIterator &operator++(); - // TODO(Superjomn) current implementation just compare the first - // element, need to compare the graph and all the elements in the queue and - // set. - NodesDFSIterator &operator=(const NodesDFSIterator &other); - bool operator==(const NodesDFSIterator &other); - bool operator!=(const NodesDFSIterator &other) { return !(*this == other); } - Node *operator->(); - - private: - std::stack stack_; - std::unordered_set visited_; - }; - - // Topological sorting iterator on nodes. - struct NodesTSIterator - : public std::iterator { - NodesTSIterator() = default; - NodesTSIterator(const std::vector &source); - NodesTSIterator(NodesTSIterator &&other) - : sorted_(std::move(other.sorted_)), cursor_(other.cursor_) { - other.cursor_ = 0; - } - NodesTSIterator(const NodesTSIterator &other); - - Node &operator*(); - NodesTSIterator &operator++(); - // TODO(Superjomn) current implementation just compare the first - // element, need to compare the graph and all the elements in the queue and - // set. - NodesTSIterator &operator=(const NodesTSIterator &other); - bool operator==(const NodesTSIterator &other); - bool operator!=(const NodesTSIterator &other) { return !(*this == other); } - Node *operator->(); - - private: - std::vector sorted_; - size_t cursor_{0}; - }; - - explicit GraphTraits(const DataFlowGraph &graph) : graph_(graph) {} - - // default use BFS to visit the nodes. - iterator_range nodes() { - return iterator_range(nodes_bfs_begin(), nodes_bfs_end()); - } - iterator_range nodes_in_BFS() { - return iterator_range(nodes_bfs_begin(), nodes_bfs_end()); - } - iterator_range nodes_in_DFS() { - return iterator_range(nodes_dfs_begin(), nodes_dfs_end()); - } - iterator_range nodes_in_TS() { - return iterator_range(nodes_ts_begin(), nodes_ts_end()); - } - - private: - NodesBFSIterator nodes_bfs_begin() { - return NodesBFSIterator(graph_.inputs()); - } - NodesBFSIterator nodes_bfs_end() { return NodesBFSIterator(); } - - NodesDFSIterator nodes_dfs_begin() { - return NodesDFSIterator(graph_.inputs()); - } - NodesDFSIterator nodes_dfs_end() { return NodesDFSIterator(); } - - NodesTSIterator nodes_ts_begin() { return NodesTSIterator(graph_.inputs()); } - NodesTSIterator nodes_ts_end() { return NodesTSIterator(); } - - private: - const DataFlowGraph &graph_; -}; - -// Extract the inputs and outputs of a graph. The inputs and outputs of a -// sub-graph is the inputs nodes and output nodes that doesn't inside the -// sub-graph. -std::pair, std::vector> -ExtractInputAndOutputOfSubGraph(std::vector &graph); // NOLINT - -void FilterRedundantOutputOfSubGraph(DataFlowGraph *graph); -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/data_flow_graph_tester.cc b/paddle/fluid/inference/analysis/data_flow_graph_tester.cc deleted file mode 100644 index 1682011c3d8cc9927a4b026b370671798cace625..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/data_flow_graph_tester.cc +++ /dev/null @@ -1,165 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/inference/analysis/data_flow_graph.h" -#include "paddle/fluid/framework/program_desc.h" -#include "paddle/fluid/inference/analysis/ut_helper.h" - -namespace paddle { -namespace inference { -namespace analysis { - -TEST(DataFlowGraph, BFS) { - auto desc = LoadProgramDesc(FLAGS_inference_model_dir + "/__model__"); - auto dfg = ProgramDescToDFG(desc); - dfg.Build(); - - for (auto* in : dfg.inputs()) { - LOG(INFO) << "inputs: " << in->name() << " " - << static_cast(in->type()); - } - for (auto* out : dfg.outputs()) { - LOG(INFO) << "outputs: " << out->name() << " " - << static_cast(out->type()); - } - - size_t count = 0; - for (auto& node : GraphTraits(dfg).nodes()) { - LOG(INFO) << "visiting " << node.name(); - ++count; - } - ASSERT_EQ(count, dfg.nodes.size()); -} - -TEST(DataFlowGraph, DFS) { - auto desc = LoadProgramDesc(FLAGS_inference_model_dir + "/__model__"); - DataFlowGraph dfg; - dfg.Build(desc); - size_t count = 0; - for (auto& node : GraphTraits(dfg).nodes_in_DFS()) { - LOG(INFO) << "visiting " << node.name(); - ++count; - } - ASSERT_EQ(count, dfg.nodes.size()); -} - -// Topological sorting. -/* - * Graph topology - * inputs: 0, 1, 2 - * 0 -> 4 - * 0 -> 5 - * 1 -> 6 - * 2 -> 7 - * 4 -> 5 - * 4 -> 7 - * 4 -> 3 - * 7 -> 3 - */ -TEST(DataFlowGraph, TS) { - DataFlowGraph graph; - - for (int i = 0; i < 8; i++) { - auto* node = graph.nodes.Create(Node::Type::kValue); - node->SetName("node-" + std::to_string(i)); - } - - auto add_link = [&](int i, int j) { - Node* source = graph.nodes.GetMutable(i); - Node* target = graph.nodes.GetMutable(j); - target->inlinks.push_back(source); - source->outlinks.push_back(target); - }; - - add_link(0, 4); - add_link(0, 5); - add_link(1, 6); - add_link(2, 7); - add_link(4, 5); - add_link(4, 7); - add_link(4, 3); - add_link(7, 3); - graph.Build(); - - auto its = GraphTraits(graph).nodes_in_TS(); - std::vector sorted_ids; - for (auto it = its.begin(); it != its.end(); ++it) { - LOG(INFO) << it->name(); - sorted_ids.push_back(it->id()); - } - - // Assert a occurs prior to b in the sorted_ids. - auto assert_positive_sequence_pair = [&](int a, int b) { - auto a_offset = std::find(sorted_ids.begin(), sorted_ids.end(), a); - auto b_offset = std::find(sorted_ids.begin(), sorted_ids.end(), b); - ASSERT_LT(a_offset, b_offset); - }; - - assert_positive_sequence_pair(2, 7); - assert_positive_sequence_pair(7, 3); - assert_positive_sequence_pair(4, 3); - assert_positive_sequence_pair(0, 4); - assert_positive_sequence_pair(0, 5); - assert_positive_sequence_pair(1, 6); - assert_positive_sequence_pair(4, 5); - assert_positive_sequence_pair(4, 7); -} - -TEST(DataFlowGraph, Build_ProgramDesc) { - auto desc = LoadProgramDesc(FLAGS_inference_model_dir + "/__model__"); - DataFlowGraph graph; - graph.Build(desc); - ASSERT_EQ(graph.nodes.size(), 38UL); -} - -void SetOp(framework::ProgramDesc* prog, const std::string& type, - const std::vector& inputs, - const std::vector& outputs) { - auto* op = prog->MutableBlock(0)->AppendOp(); - op->SetType(type); - op->SetInput("Xs", inputs); - op->SetOutput("Xs", outputs); -} - -TEST(DataFlowGraph, Build_IR_Graph) { - framework::ProgramDesc prog; - for (auto& v : std::vector({"a", "b", "c", "d", "e", "f"})) { - auto* var = prog.MutableBlock(0)->Var(v); - var->SetType(framework::proto::VarType::SELECTED_ROWS); - if (v == "c") { - var->SetPersistable(true); - } - } - - SetOp(&prog, "OP0", std::vector({"a"}), - std::vector({"b"})); - SetOp(&prog, "OP1", std::vector({"a"}), - std::vector({"c"})); - SetOp(&prog, "mul", std::vector({"b", "c"}), - std::vector({"d"})); - SetOp(&prog, "elementwise_add", std::vector({"d", "e"}), - std::vector({"f"})); - - DataFlowGraph graph; - - framework::ir::Graph ir_graph(prog); - - graph.Build(ir_graph); - - ASSERT_EQ(graph.nodes.size(), ir_graph.Nodes().size()); -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass.cc b/paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass.cc deleted file mode 100644 index cb549f4b50cf56154a951d16b58b022dbad3e990..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass.cc +++ /dev/null @@ -1,284 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include "paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass.h" -#include -#include "paddle/fluid/framework/block_desc.h" -#include "paddle/fluid/framework/ir/fuse_pass_base.h" -#include "paddle/fluid/framework/op_desc.h" -#include "paddle/fluid/framework/proto_desc.h" -#include "paddle/fluid/inference/analysis/analyzer.h" -#include "paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.h" -#include "paddle/fluid/inference/io.h" - -namespace paddle { -namespace inference { - -namespace analysis { - -using framework::proto::ProgramDesc; - -std::vector ExtractParameters( - const std::vector> &nodes); - -bool DataFlowGraphToFluidPass::Initialize(Argument *argument) { - ANALYSIS_ARGUMENT_CHECK_FIELD(argument) - ANALYSIS_ARGUMENT_CHECK_FIELD(argument->origin_program_desc) - // The transformed_program_desc should inherit all the VarDesc and BlockDesc - // from the original program desc. The operators of the main block(the first - // block) should rewritten by data flow graph. - argument->transformed_program_desc.reset( - new ProgramDesc(*argument->origin_program_desc)); - argument->transformed_program_desc->mutable_blocks(framework::kRootBlockIndex) - ->clear_ops(); - desc_ = argument->transformed_program_desc.get(); - argument_ = argument; - return true; -} - -bool DataFlowGraphToFluidPass::Finalize() { return true; } - -void DataFlowGraphToFluidPass::Run(DataFlowGraph *graph) { - // FilterRedundantOutputOfSubGraph(graph); - for (auto &node : GraphTraits(*graph).nodes_in_TS()) { - if (node.deleted()) continue; - - switch (node.type()) { - case Node::Type::kFunction: { - AddFluidOp(&node); - } break; - case Node::Type::kFunctionBlock: { - AddEngineOp(&node); - } break; - default: - continue; - } - } - - if (argument_->Has(framework::ir::kParamScopeAttr)) { - LOG(WARNING) << "parameter changes in the scope takes effect"; - } - - PADDLE_ENFORCE(argument_->transformed_program_desc.get()); -} - -void DataFlowGraphToFluidPass::AddFluidOp(Node *node) { - PADDLE_ENFORCE(node); - PADDLE_ENFORCE(node->IsFunction()); - PADDLE_ENFORCE(node->pb_desc() || !node->pb_msg().empty(), - "node has invalid protobuf repr."); - - // currently only the main block is analyzed. - PADDLE_ENFORCE(desc_); - auto *main_block = desc_->mutable_blocks(framework::kRootBlockIndex); - auto *op = main_block->add_ops(); - - if (node->pb_desc()) { - auto *ori_op = static_cast(node->pb_desc()); - *op = - *ori_op; // copy the attributes, by default, these will not be changed - // by analysis phrase. - // The inputs and outputs of the existing ops are not changed by tensorrt - // subgraph pass. - // NOTE It might be changed by other passes in the long run. - } else { - op->ParseFromString(node->pb_msg()); - } -} - -void CreateTrtEngineOp(Node *node, Argument *argument, - framework::proto::BlockDesc *block) { - PADDLE_ENFORCE(argument->main_dfg.get()); - const DataFlowGraph &graph = *(argument->main_dfg); - static int counter{0}; - PADDLE_ENFORCE(node->IsFunctionBlock()); - framework::OpDesc desc; - auto *func = static_cast(node); - - // collect inputs - std::unordered_set input_names; - std::unordered_set input_names_with_id; - for (auto *x : func->inlinks) { - input_names.insert(x->name()); - input_names_with_id.insert(x->name() + std::to_string(x->id())); - } - desc.SetInput( - "Xs", std::vector(input_names.begin(), input_names.end())); - - std::unordered_set output_names; - std::unordered_set output_names_with_id; - for (auto *x : func->outlinks) { - output_names.insert(x->name()); - output_names_with_id.insert(x->name() + std::to_string(x->id())); - } - - desc.SetOutput( - "Ys", std::vector(output_names.begin(), output_names.end())); - desc.SetType("tensorrt_engine"); - - std::unordered_map output_name_map; - - // The following procedure is used to rename all the intermediate - // variables and the output variables of the subgraph. - // Why we do this? - // During the transition from fluid OP to tensorrt OP, we map - // the input and output Tensor(fluid data structure) of fluid OP - // to the correspondin ITensor (trt data structure) through the - // Tensor name. When we set up ITensor for an variable, we must - // ensure that it has not been set before. - // If there is variable in the fluid graph, which is not only the - // input of a OP, but also the output of a Op, there will be problems. - // So we have to rename the variable in the subgraph to make sure - // it is either an OP's input or an OP's output. - - auto subgraph_nodes = func->subgraph; - for (int index = 0; index < block->ops_size(); index++) { - framework::proto::OpDesc *op = block->mutable_ops(index); - auto correspond_node = subgraph_nodes[index]; - PADDLE_ENFORCE_EQ(correspond_node->name(), op->type()); - - std::unordered_map var2id; - for (auto *in_var : correspond_node->inlinks) { - var2id[in_var->name()] = in_var->id(); - } - // rename for the input variables of op inside subgraph - for (int i = 0; i < op->inputs_size(); i++) { - framework::proto::OpDesc_Var *in_var = op->mutable_inputs(i); - std::vector replaced_names; - for (int k = 0; k < in_var->arguments_size(); k++) { - std::string arg_value = in_var->arguments(k); - std::string arg_value_with_id = - arg_value + std::to_string(var2id[arg_value]); - if (input_names_with_id.count(arg_value_with_id)) { - replaced_names.push_back(arg_value); - } else { - replaced_names.push_back(arg_value_with_id); - } - } - in_var->clear_arguments(); - for (size_t k = 0; k < replaced_names.size(); k++) { - in_var->add_arguments(replaced_names[k]); - } - } - var2id.clear(); - for (auto out_var : correspond_node->outlinks) { - var2id[out_var->name()] = out_var->id(); - } - - // rename for the output variables of op inside subgraph - for (int i = 0; i < op->outputs_size(); i++) { - framework::proto::OpDesc_Var *out_var = op->mutable_outputs(i); - std::vector replaced_names; - for (int k = 0; k < out_var->arguments_size(); k++) { - std::string arg_value = out_var->arguments(k); - std::string arg_value_with_id = - arg_value + std::to_string(var2id[arg_value]); - if (output_names_with_id.count(arg_value_with_id)) { - output_name_map[arg_value] = arg_value_with_id; - } - replaced_names.push_back(arg_value_with_id); - } - out_var->clear_arguments(); - for (size_t k = 0; k < replaced_names.size(); k++) { - out_var->add_arguments(replaced_names[k]); - } - } - } - // When tensorrt engine runs at the end of the operation, - // output_mapping help us copy the data from the renamed ITensor - // to Tensor. - std::vector output_mapping; - for (auto name : output_names) { - PADDLE_ENFORCE(output_name_map.count(name) != 0); - output_mapping.push_back(output_name_map[name]); - } - - PADDLE_ENFORCE(!block->vars().empty(), "the block has no var-desc"); - // Set attrs - - SetAttr(desc.Proto(), "subgraph", block->SerializeAsString()); - SetAttr(desc.Proto(), "max_batch_size", argument->Get("max_batch_size")); - SetAttr(desc.Proto(), "workspace_size", argument->Get("workspace_size")); - SetAttr(desc.Proto(), "engine_uniq_key", "trt-" + std::to_string(counter++)); - SetAttr(desc.Proto(), "parameters", ExtractParameters(graph.nodes.nodes())); - SetAttr(desc.Proto(), "output_name_mapping", output_mapping); - node->SetPbMsg(desc.Proto()->SerializeAsString()); -} - -std::vector ExtractParameters( - const std::vector> &nodes) { - std::vector parameters; - for (const auto &node : nodes) { - if (!node->IsValue()) continue; - PADDLE_ENFORCE(!node->pb_msg().empty(), "pb_msg should be set first"); - framework::proto::VarDesc var; - var.ParseFromString(node->pb_msg()); - if (var.persistable()) { - parameters.push_back(var.name()); - } - } - return parameters; -} - -void DataFlowGraphToFluidPass::AddEngineOp(Node *node) { - // TODO(Superjomn) Here need to expose some arguments for default setting. - PADDLE_ENFORCE(node->IsFunctionBlock()); - auto *block_node = static_cast(node); - framework::proto::BlockDesc proto; - framework::BlockDesc block_desc(nullptr, &proto); - block_desc.Proto()->set_parent_idx(-1); - block_desc.Proto()->set_idx(0); - VLOG(4) << "origin variable size: " - << argument_->origin_program_desc->blocks(0).vars().size(); - VLOG(4) << "transformed variable size: " << block_desc.Proto()->vars().size(); - // copy ops. - - for (auto *node : block_node->subgraph) { - auto *op = block_desc.AppendOp(); - PADDLE_ENFORCE(!node->pb_msg().empty()); - op->Proto()->ParseFromString(node->pb_msg()); - } - - *block_desc.Proto()->mutable_vars() = - argument_->origin_program_desc->blocks(0).vars(); - PADDLE_ENFORCE(!block_desc.Proto()->vars().empty()); - CreateTrtEngineOp(node, argument_, block_desc.Proto()); - auto *main_block = desc_->mutable_blocks(framework::kRootBlockIndex); - auto *op = main_block->add_ops(); - PADDLE_ENFORCE(!node->pb_msg().empty(), "failed to set desc for block"); - op->ParseFromString(node->pb_msg()); -} - -namespace { -class DFG_DebuggerPass : public DFG_GraphvizDrawPass { - public: - using Config = DFG_GraphvizDrawPass::Config; - explicit DFG_DebuggerPass(const Config &config) - : DFG_GraphvizDrawPass(config) {} - - std::string repr() const override { return "dfg-to-fluid-debuger-pass"; } - - bool Finalize() override { return true; } -}; -} // namespace - -AnalysisPass *DataFlowGraphToFluidPass::CreateGraphvizDebugerPass() const { - return new DFG_DebuggerPass(DFG_GraphvizDrawPass::Config( - FLAGS_IA_graphviz_log_root, - "data_flow_graph_to_fluid_graphviz_debugger")); -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass.h b/paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass.h deleted file mode 100644 index 891c7226e245fa3b92892785362c186185a61f62..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass.h +++ /dev/null @@ -1,59 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. */ - -/* - * This file implements the transformation from fluid ProgramDesc to data flow - * graph. - */ - -#pragma once - -#include -#include "paddle/fluid/framework/program_desc.h" -#include "paddle/fluid/inference/analysis/analysis_pass.h" -#include "paddle/fluid/inference/analysis/data_flow_graph.h" - -namespace paddle { -namespace inference { - -namespace analysis { -class DataFlowGraphToFluidPass final : public DataFlowGraphPass { - public: - DataFlowGraphToFluidPass() = default; - - bool Initialize(Argument *argument) override; - bool Finalize() override; - - void Run(DataFlowGraph *graph) override; - - std::string repr() const override { return "DFG to fluid"; } - std::string description() const override { - return "Transform a DFG to a Fluid ProgramDesc"; - } - - AnalysisPass *CreateGraphvizDebugerPass() const override; - - protected: - // Add a Fluid Op into the ProgramDesc. - void AddFluidOp(Node *node); - // Add a EngineOp into the ProgramDesc. - void AddEngineOp(Node *node); - - private: - framework::proto::ProgramDesc *desc_; - Argument *argument_; -}; -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass_tester.cc b/paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass_tester.cc deleted file mode 100644 index 4ef381db295b986b91173a728b6d98640f6f4f51..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass_tester.cc +++ /dev/null @@ -1,48 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include "paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass.h" - -#include -#include -#include -#include "paddle/fluid/framework/executor.h" -#include "paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.h" -#include "paddle/fluid/inference/analysis/ut_helper.h" -#include "paddle/fluid/inference/io.h" - -namespace paddle { -namespace inference { -namespace analysis { - -TEST(DataFlowGraph, Test) { - Argument argument(FLAGS_inference_model_dir); - - FluidToDataFlowGraphPass pass0; - DataFlowGraphToFluidPass pass1; - ASSERT_TRUE(pass0.Initialize(&argument)); - ASSERT_TRUE(pass1.Initialize(&argument)); - - pass0.Run(argument.main_dfg.get()); - pass1.Run(argument.main_dfg.get()); - - pass0.Finalize(); - pass1.Finalize(); - - LOG(INFO) << argument.main_dfg->nodes.size(); -} - -}; // namespace analysis -}; // namespace inference -}; // namespace paddle diff --git a/paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.cc b/paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.cc deleted file mode 100644 index 648b8f7d6a6ec4bafbad2838c5631e776c8699b1..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.cc +++ /dev/null @@ -1,59 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.h" - -namespace paddle { -namespace inference { -namespace analysis { - -int DFG_GraphvizDrawPass::counter_{0}; - -void DFG_GraphvizDrawPass::Run(DataFlowGraph *graph) { - auto content = Draw(graph); - auto dot_path = GenDotPath(); - std::ofstream file(dot_path); - file.write(content.c_str(), content.size()); - file.close(); - - auto png_path = dot_path.substr(0, dot_path.size() - 4) + ".png"; - std::string message; - VLOG(3) << "draw to " << png_path; - ExecShellCommand("dot -Tpng " + dot_path + " -o " + png_path, &message); -} - -std::string DFG_GraphvizDrawPass::Draw(DataFlowGraph *graph) { - Dot dot; - // Add nodes - for (size_t i = 0; i < graph->nodes.size(); i++) { - const Node &node = graph->nodes.Get(i); - if (config_.display_deleted_node || !node.deleted()) { - dot.AddNode(node.repr(), node.dot_attrs()); - } - } - // Add edges - for (size_t i = 0; i < graph->nodes.size(); i++) { - const Node &node = graph->nodes.Get(i); - if (!config_.display_deleted_node && node.deleted()) continue; - for (auto &out : node.outlinks) { - if (!config_.display_deleted_node && out->deleted()) continue; - dot.AddEdge(node.repr(), out->repr(), {}); - } - } - return dot.Build(); -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.h b/paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.h deleted file mode 100644 index e537bfc0e64d4ff46b3d61499a1a0298ed83533f..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.h +++ /dev/null @@ -1,78 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -/* - * This file create an DFG_GraphvizDrawPass which helps to draw a data flow - * graph's structure using graphviz. - */ - -#pragma once - -#include -#include -#include "paddle/fluid/inference/analysis/analysis_pass.h" -#include "paddle/fluid/inference/analysis/dot.h" - -namespace paddle { -namespace inference { -namespace analysis { - -/* - * Output a dot file and write to some place. - */ -class DFG_GraphvizDrawPass : public DataFlowGraphPass { - public: - struct Config { - Config(const std::string &dir, const std::string &id, - bool display_deleted_node = false) - : dir(dir), id(id), display_deleted_node(display_deleted_node) {} - - // The directory to store the .dot or .png files. - const std::string dir; - // The identifier for this dot file. - const std::string id; - // Whether to display deleted nodes, default false. - const bool display_deleted_node; - }; - - explicit DFG_GraphvizDrawPass(const Config &config) : config_(config) {} - - bool Initialize(Argument *argument) override { return true; } - void Run(DataFlowGraph *graph) override; - bool Finalize() override { return true; } - - std::string repr() const override { return "DFG graphviz drawer"; } - std::string description() const override { - return "Debug a DFG by draw with graphviz"; - } - - protected: - // A counter to add a number prefix to the debugger image output so that they - // will sort in the triggered order. - static int counter_; - - // Path of the dot file to output. - std::string GenDotPath() const { - return config_.dir + "/" + std::to_string(counter_++) + "-graph_" + - config_.id + ".dot"; - } - - virtual std::string Draw(DataFlowGraph *graph); - - Config config_; -}; - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/dfg_graphviz_draw_pass_tester.cc b/paddle/fluid/inference/analysis/dfg_graphviz_draw_pass_tester.cc deleted file mode 100644 index 928be7917047382d9b86294f6039b26b0ebf6f49..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/dfg_graphviz_draw_pass_tester.cc +++ /dev/null @@ -1,54 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.h" - -#include -#include -#include -#include "paddle/fluid/inference/analysis/ut_helper.h" - -namespace paddle { -namespace inference { -namespace analysis { - -TEST(DFG_GraphvizDrawPass, dfg_graphviz_draw_pass_tester) { - Argument argument(FLAGS_inference_model_dir); - FluidToDataFlowGraphPass pass0; - ASSERT_TRUE(pass0.Initialize(&argument)); - pass0.Run(argument.main_dfg.get()); - - // auto dfg = ProgramDescToDFG(*argument.origin_program_desc); - - DFG_GraphvizDrawPass::Config config("./", "test"); - DFG_GraphvizDrawPass pass(config); - pass.Initialize(&argument); - pass.Run(argument.main_dfg.get()); - - // test content - std::ifstream file("./0-graph_test.dot"); - ASSERT_TRUE(file.is_open()); - - std::string line; - int no{0}; - while (std::getline(file, line)) { - no++; - } - // DFG is sensitive to ProgramDesc, be careful to change the existing models. - ASSERT_EQ(no, 83); -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/dot_tester.cc b/paddle/fluid/inference/analysis/dot_tester.cc index 56ceb9bd5d6f41a601d66f6124fb7b4099c9337e..c785a312bf96c3586ea990fd9028cfd3b930d577 100644 --- a/paddle/fluid/inference/analysis/dot_tester.cc +++ b/paddle/fluid/inference/analysis/dot_tester.cc @@ -16,7 +16,6 @@ #include #include -#include "paddle/fluid/inference/analysis/data_flow_graph.h" namespace paddle { namespace inference { diff --git a/paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.cc b/paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.cc deleted file mode 100644 index 2b7d632c839e735ca03c6e17b94307b40cc13374..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.cc +++ /dev/null @@ -1,76 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include -#include -#include - -#include "paddle/fluid/inference/analysis/analyzer.h" -#include "paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.h" -#include "paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.h" - -namespace paddle { -namespace inference { -namespace analysis { - -bool FluidToDataFlowGraphPass::Initialize(Argument *argument) { - ANALYSIS_ARGUMENT_CHECK_FIELD(argument); - if (argument->origin_program_desc) { - LOG(WARNING) << "argument's origin_program_desc is already set, might " - "duplicate called"; - } - if (!argument->fluid_model_program_path) { - ANALYSIS_ARGUMENT_CHECK_FIELD(argument->fluid_model_dir); - argument->fluid_model_program_path.reset( - new std::string(*argument->fluid_model_dir + "/__model__")); - } - ANALYSIS_ARGUMENT_CHECK_FIELD(argument->fluid_model_program_path); - auto program = LoadProgramDesc(*argument->fluid_model_program_path); - argument->origin_program_desc.reset( - new framework::proto::ProgramDesc(program)); - - if (!argument->main_dfg) { - argument->main_dfg.reset(new DataFlowGraph); - } - desc_ = argument->origin_program_desc.get(); - return true; -} - -bool FluidToDataFlowGraphPass::Finalize() { return true; } - -void FluidToDataFlowGraphPass::Run(DataFlowGraph *graph) { - PADDLE_ENFORCE(graph); - PADDLE_ENFORCE(desc_); - graph->Build(*desc_); -} - -namespace { -class DFG_DebuggerPass : public DFG_GraphvizDrawPass { - public: - using Config = DFG_GraphvizDrawPass::Config; - explicit DFG_DebuggerPass(const Config &config) - : DFG_GraphvizDrawPass(config) {} - std::string repr() const override { return "fluid-to-dfg-debuger-pass"; } - bool Finalize() override { return true; } -}; -} - -AnalysisPass *FluidToDataFlowGraphPass::CreateGraphvizDebugerPass() const { - return new DFG_DebuggerPass(DFG_GraphvizDrawPass::Config( - FLAGS_IA_graphviz_log_root, "fluid-to-dfg-debuger")); -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.h b/paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.h deleted file mode 100644 index b9e262020e9522e167b998d57e2be2ac19b48447..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.h +++ /dev/null @@ -1,57 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. */ - -/* - * This file implements the transformation from data flow graph to fluid - * ProgramDesc. - */ - -#pragma once - -#include - -#include "paddle/fluid/framework/program_desc.h" -#include "paddle/fluid/inference/analysis/analysis_pass.h" -#include "paddle/fluid/inference/analysis/data_flow_graph.h" - -namespace paddle { -namespace inference { -namespace analysis { - -/* - * Transform a FluidDesc to a SSA. - */ -class FluidToDataFlowGraphPass final : public DataFlowGraphPass { - public: - FluidToDataFlowGraphPass() = default; - - bool Initialize(Argument *argument) override; - bool Finalize() override; - - void Run(DataFlowGraph *graph) override; - - std::string repr() const override { return "fluid-to-data-flow-graph"; } - std::string description() const override { - return "transform a fluid ProgramDesc to a data flow graph."; - } - - AnalysisPass *CreateGraphvizDebugerPass() const override; - - private: - framework::proto::ProgramDesc const *desc_; -}; - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass_tester.cc b/paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass_tester.cc deleted file mode 100644 index 267a0a84ebf75615e0b390f4a1b3bf3b51793fc7..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass_tester.cc +++ /dev/null @@ -1,38 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include "paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.h" - -#include -#include "paddle/fluid/inference/analysis/ut_helper.h" - -namespace paddle { -namespace inference { -namespace analysis { - -TEST(FluidToDataFlowGraphPass, Test) { - FluidToDataFlowGraphPass pass; - Argument argument(FLAGS_inference_model_dir); - pass.Initialize(&argument); - pass.Run(argument.main_dfg.get()); - // Analysis is sensitive to ProgramDesc, careful to change the original model. - ASSERT_EQ(argument.main_dfg->nodes.size(), 38UL); - pass.Finalize(); - ASSERT_FALSE(argument.main_dfg->DotString().empty()); - EXPECT_FALSE(argument.main_dfg->inputs().empty()); -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/fluid_to_ir_pass.cc b/paddle/fluid/inference/analysis/fluid_to_ir_pass.cc deleted file mode 100644 index fc60ca3bd0bf706407defb2655a093d999aef7c2..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/fluid_to_ir_pass.cc +++ /dev/null @@ -1,60 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include "paddle/fluid/inference/analysis/fluid_to_ir_pass.h" -#include "paddle/fluid/framework/executor.h" -#include "paddle/fluid/framework/ir/fuse_pass_base.h" -#include "paddle/fluid/inference/io.h" -#include "paddle/fluid/platform/device_context.h" -#include "paddle/fluid/platform/place.h" - -namespace paddle { -namespace inference { -namespace analysis { - -void FluidToIrPass::EnableParamModify(const std::string &model_dir, - const std::string &prog_file, - const std::string ¶m_file) { - PADDLE_ENFORCE(argument_); - argument_->Set(framework::ir::kParamScopeAttr, new framework::Scope); - // Load parameters. - VLOG(3) << "Loading parameters from " << model_dir; - LoadParams(&argument_->Get(framework::ir::kParamScopeAttr), - model_dir, prog_file, param_file); -} - -bool FluidToIrPass::LoadParams(framework::Scope *scope, const std::string &dir, - const std::string &prog_file, - const std::string ¶m_file) { - platform::CPUPlace place; - platform::CPUDeviceContext ctx(place); - framework::Executor executor(place); - PADDLE_ENFORCE(argument_->origin_program_desc.get()); - framework::ProgramDesc program(*argument_->origin_program_desc); - if ((!prog_file.empty()) && (!param_file.empty())) { - LOG(INFO) << "load single model file from " << prog_file; - Load(&executor, scope, prog_file, param_file); - } else if (!dir.empty()) { - LOG(INFO) << "load from dir " << dir; - Load(&executor, scope, dir); - } else { - LOG(ERROR) << "failed to load parameters"; - return false; - } - return true; -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/fluid_to_ir_pass.h b/paddle/fluid/inference/analysis/fluid_to_ir_pass.h deleted file mode 100644 index c2599e218a2306f9353b843b7ea3f18aeacb008e..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/fluid_to_ir_pass.h +++ /dev/null @@ -1,128 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#pragma once - -#include -#include - -#include "paddle/fluid/framework/ir/fuse_pass_base.h" -#include "paddle/fluid/inference/analysis/analysis_pass.h" -#include "paddle/fluid/inference/analysis/flags.h" -#include "paddle/fluid/inference/analysis/ir_pass_manager.h" - -namespace paddle { -namespace inference { -namespace analysis { - -static const char kFluidToIrPassesAttr[] = "__fluid_to_ir_passes__"; - -class FluidToIrPass final : public DataFlowGraphPass { - public: - FluidToIrPass() = default; - - bool Initialize(Argument *argument) override { - ANALYSIS_ARGUMENT_CHECK_FIELD(argument); - PADDLE_ENFORCE(argument->Has(kFluidToIrPassesAttr), - "argument need the attr %s", kFluidToIrPassesAttr); - argument_ = argument; - if (argument->origin_program_desc) { - LOG(WARNING) << "argument's origin_program_desc is already set, might " - "duplicate called"; - } - // set fluid model program path - if (!argument->fluid_model_program_path) { - ANALYSIS_ARGUMENT_CHECK_FIELD(argument->fluid_model_dir); - argument->fluid_model_program_path.reset( - new std::string(*argument->fluid_model_dir + "/__model__")); - } - ANALYSIS_ARGUMENT_CHECK_FIELD(argument->fluid_model_program_path); - // Load program. - auto program = LoadProgramDesc(*argument->fluid_model_program_path); - argument->origin_program_desc.reset( - new framework::proto::ProgramDesc(program)); - // Create main data flow graph. - if (!argument->main_dfg) { - argument->main_dfg.reset(new DataFlowGraph); - } - argument->Set("ir_program_desc", new ProgramDesc(program)); - - LOG(INFO) << "Loading parameters"; - // Load parameters to argument if needed. - if (argument->fluid_model_dir || (argument->fluid_model_program_path && - argument->fluid_model_param_path)) { -#define SAFE_GET(ATTR) std::string ATTR = argument->ATTR ? *argument->ATTR : ""; - SAFE_GET(fluid_model_dir); - SAFE_GET(fluid_model_program_path); - SAFE_GET(fluid_model_param_path); -#undef SAFE_GET - EnableParamModify(fluid_model_dir, fluid_model_program_path, - fluid_model_param_path); - } - - return true; - } - - bool Finalize() override { return true; } - - void Run(DataFlowGraph *graph) override { - // Call all the IR Passes - IRPassManager ir_passes(argument_->Get("ir_program_desc"), - nullptr); - // Pass the scope from analysis to IR if needed. - if (argument_->Has(framework::ir::kParamScopeAttr)) { - // Here the address is passed, attention that IR doesn't own the scope, so - // the real scope in analysis should live during the IR phase. - ir_passes.graph().Set( - framework::ir::kParamScopeAttr, - new framework::Scope *(&argument_->Get( - framework::ir::kParamScopeAttr))); - } - - if (FLAGS_IA_enable_ir) { - const auto &ir_passes_to_apply = - argument_->Get>(kFluidToIrPassesAttr); - ir_passes.Apply(ir_passes_to_apply); - } - - PADDLE_ENFORCE(argument_->main_dfg.get()); - argument_->main_dfg->Build(ir_passes.graph()); - // inherit the arguments from ir. - if (ir_passes.graph().Has(framework::ir::kFuseStatisAttr)) { - argument_->Set( - framework::ir::kFuseStatisAttr, - new std::unordered_map( - ir_passes.graph().Get>( - framework::ir::kFuseStatisAttr))); - } - } - - void EnableParamModify(const std::string &model_dir, - const std::string &prog_file, - const std::string ¶m_file); - - std::string repr() const override { return "fluid-to-ir-pass"; } - - private: - // Load parameters from a single file or from a directory. - bool LoadParams(framework::Scope *scope, const std::string &dir, - const std::string &prog_file, const std::string ¶m_file); - - private: - Argument *argument_{nullptr}; -}; - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/fluid_to_ir_pass_tester.cc b/paddle/fluid/inference/analysis/fluid_to_ir_pass_tester.cc deleted file mode 100644 index 367c25805d05f8d10fb8341158760ac6356a5c48..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/fluid_to_ir_pass_tester.cc +++ /dev/null @@ -1,36 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include "paddle/fluid/inference/analysis/fluid_to_ir_pass.h" - -#include -#include "paddle/fluid/inference/analysis/ut_helper.h" -#include "paddle/fluid/inference/api/paddle_inference_pass.h" - -namespace paddle { -namespace inference { -namespace analysis { - -TEST(FluidToIrPass, Test) { - FluidToIrPass pass; - Argument argument(FLAGS_inference_model_dir); - argument.Set(kFluidToIrPassesAttr, - new std::vector({"infer_clean_graph_pass"})); - pass.Initialize(&argument); - pass.Run(argument.main_dfg.get()); -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/graph_traits.cc b/paddle/fluid/inference/analysis/graph_traits.cc deleted file mode 100644 index 2ea70a1d2060e03769d67060dc6f008207342b52..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/graph_traits.cc +++ /dev/null @@ -1,15 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/inference/analysis/graph_traits.h" diff --git a/paddle/fluid/inference/analysis/graph_traits.h b/paddle/fluid/inference/analysis/graph_traits.h deleted file mode 100644 index aed2b1e8e27d94b430201d70ecf09d4acc33c8fa..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/graph_traits.h +++ /dev/null @@ -1,63 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -/* - * This file defines the GraphTraits template class that should be specified - * by classes that want to be iteratable by generic graph iterators. - * - * This file also defines the marker class Inverse that is used to iterate over - * graphs in a graph defined, inverse ordering... - */ - -#pragma once - -#include "paddle/fluid/inference/analysis/helper.h" - -namespace paddle { -namespace inference { -namespace analysis { - -/* - * This class should be specialized by different graph types... - * That's why the base class is empty. - */ -template -struct GraphTraits { - // using NodesBFSIterator = xxx - - // NodesBFSIterator nodes_begin(); - // NodesBFSIterator nodes_end(); -}; - -/* - * Inverse - This class is used as a marker class to tell the graph iterator to - * iterate in a graph defined Inverse order. - */ -template -struct Inverse { - const GraphType &graph; - - explicit Inverse(const GraphType &graph) : graph(graph) {} -}; - -/* - * Provide a partial specialization of GraphTraits so that the inverse of an - * inverse turns into the original graph. - */ -template -struct GraphTraits>> : GraphTraits {}; - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/helper.cc b/paddle/fluid/inference/analysis/helper.cc index ca40c01fc57dbcc2ca16770a1b7d798de8b5625b..4f5c50d0d6b9ac94130cb82fb342ae5ee592f2c0 100644 --- a/paddle/fluid/inference/analysis/helper.cc +++ b/paddle/fluid/inference/analysis/helper.cc @@ -36,6 +36,14 @@ void SetAttr(framework::proto::OpDesc *op, const std::string &name, attr->set_i(data); } template <> +void SetAttr(framework::proto::OpDesc *op, const std::string &name, + const bool &data) { + auto *attr = op->add_attrs(); + attr->set_name(name); + attr->set_type(paddle::framework::proto::AttrType::BOOLEAN); + attr->set_b(data); +} +template <> void SetAttr(framework::proto::OpDesc *op, const std::string &name, const int64_t &data) { auto *attr = op->add_attrs(); diff --git a/paddle/fluid/inference/analysis/helper.h b/paddle/fluid/inference/analysis/helper.h index 5151e2b69ac199dea136535ba445e890596f6227..a48058400241b030f17557156a4d973fca92fd8d 100644 --- a/paddle/fluid/inference/analysis/helper.h +++ b/paddle/fluid/inference/analysis/helper.h @@ -17,15 +17,31 @@ limitations under the License. */ #include #include #include +#include +#include #include #include #include +#include #include #include "paddle/fluid/framework/framework.pb.h" #include "paddle/fluid/framework/scope.h" #include "paddle/fluid/framework/variable.h" #include "paddle/fluid/platform/enforce.h" +#include "paddle/fluid/platform/port.h" + +#ifdef _WIN32 +#include +#include +#define GCC_ATTRIBUTE(attr__) +#define MKDIR(path) _mkdir(path) +#else +#include +#define GCC_ATTRIBUTE(attr__) __attribute__((attr__)); +#define MKDIR(path) mkdir(path, S_IRWXU | S_IRWXG | S_IROTH | S_IXOTH) +#endif +#define __SHOULD_USE_RESULT__ GCC_ATTRIBUTE(warn_unused_result) namespace paddle { namespace inference { @@ -101,20 +117,20 @@ class OrderedRegistry { public: T *Register(const std::string &name, T *x) { PADDLE_ENFORCE(!dic_.count(name), "duplicate key [%s]", name); - dic_[name] = data_.size(); - data_.emplace_back(std::unique_ptr(x)); - return data_.back().get(); + dic_[name] = elements_.size(); + elements_.emplace_back(std::unique_ptr(x)); + return elements_.back().get(); } T *Lookup(const std::string &name) { auto it = dic_.find(name); if (it == dic_.end()) return nullptr; - return data_[it->second].get(); + return elements_[it->second].get(); } protected: std::unordered_map dic_; - std::vector> data_; + std::vector> elements_; }; template @@ -124,20 +140,6 @@ T &GetFromScope(const framework::Scope &scope, const std::string &name) { return *var->GetMutable(); } -static void ExecShellCommand(const std::string &cmd, std::string *message) { - char buffer[128]; - std::shared_ptr pipe(popen(cmd.c_str(), "r"), pclose); - if (!pipe) { - LOG(ERROR) << "error running command: " << cmd; - return; - } - while (!feof(pipe.get())) { - if (fgets(buffer, 128, pipe.get()) != nullptr) { - *message += buffer; - } - } -} - static framework::proto::ProgramDesc LoadProgramDesc( const std::string &model_path) { std::ifstream fin(model_path, std::ios::in | std::ios::binary); @@ -169,6 +171,83 @@ static bool PathExists(const std::string &path) { return false; } +static std::string GetDirRoot(const std::string &path) { + char sep = '/'; + +#ifdef _WIN32 + sep = '\\'; +#endif + + size_t i = path.rfind(sep, path.length()); + if (i != std::string::npos) { + return (path.substr(0, i)); + } + return path; +} + +static std::string GetOrCreateModelOptCacheDir(const std::string &model_root) { + std::string opt_cache_dir = model_root + "/_opt_cache/"; + if (!PathExists(opt_cache_dir)) { + PADDLE_ENFORCE(MKDIR(opt_cache_dir.c_str()) != -1, + "Can not create optimize cache directory: %s, Make sure you " + "have permission to write", + opt_cache_dir); + } + return opt_cache_dir; +} + +static std::string GetTrtCalibPath(const std::string &model_root, + const std::string &engine_key) { + return model_root + "/trt_calib_" + engine_key; +} + +// If there is no calib table data file in model_opt_cache_dir, return "". +static std::string GetTrtCalibTableData(const std::string &model_opt_cache_dir, + const std::string &engine_key, + bool enable_int8) { + std::string trt_calib_table_path = + GetTrtCalibPath(model_opt_cache_dir, engine_key); + if (enable_int8 && FileExists(trt_calib_table_path)) { + VLOG(3) << "Calibration table file: " << trt_calib_table_path + << "is found here"; + std::ifstream infile(trt_calib_table_path, std::ios::in); + std::stringstream buffer; + buffer << infile.rdbuf(); + std::string calibration_data(buffer.str()); + return calibration_data; + } + return ""; +} + +static std::string GetTrtEngineSerializedPath(const std::string &model_root, + const std::string &engine_key) { + return model_root + "/trt_serialized_" + engine_key; +} + +static std::string GetTrtEngineSerializedData( + const std::string &model_opt_cache_dir, const std::string &engine_key) { + std::string trt_serialized_path = + GetTrtEngineSerializedPath(model_opt_cache_dir, engine_key); + if (FileExists(trt_serialized_path)) { + VLOG(3) << "Trt serialized file: " << trt_serialized_path + << "is found here"; + std::ifstream infile(trt_serialized_path, std::ios::in); + std::stringstream buffer; + buffer << infile.rdbuf(); + std::string trt_engine_serialized_data(buffer.str()); + return trt_engine_serialized_data; + } + return ""; +} + +static void SaveTrtEngineSerializedDataToFile( + const std::string &trt_serialized_path, + const std::string &engine_serialized_data) { + std::ofstream outfile(trt_serialized_path); + outfile << engine_serialized_data; + outfile.close(); +} + } // namespace analysis } // namespace inference } // namespace paddle diff --git a/paddle/fluid/inference/analysis/ir_pass_manager.cc b/paddle/fluid/inference/analysis/ir_pass_manager.cc index e76708baf4b39afb0febbcf3ff71281dfbfc8627..78e502c670f0eb2480b560964cf31e247990a367 100644 --- a/paddle/fluid/inference/analysis/ir_pass_manager.cc +++ b/paddle/fluid/inference/analysis/ir_pass_manager.cc @@ -13,11 +13,18 @@ // limitations under the License. #include "paddle/fluid/inference/analysis/ir_pass_manager.h" +#include +#include #include +#include +#include +#include #include #include "paddle/fluid/framework/ir/fuse_pass_base.h" #include "paddle/fluid/framework/ir/graph.h" #include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/inference/analysis/argument.h" +#include "paddle/fluid/inference/analysis/ir_passes/subgraph_detector.h" #include "paddle/fluid/string/pretty_log.h" namespace paddle { @@ -27,31 +34,130 @@ using string::PrettyLogEndl; using string::PrettyLog; using string::Style; -IRPassManager::IRPassManager(const ProgramDesc &program, - framework::Scope *scope) - : program_(program) { - graph_.reset(new framework::ir::Graph(program)); - if (scope) - graph_->Set(framework::ir::kParamScopeAttr, new framework::Scope *(scope)); +IRPassManager::IRPassManager(Argument *argument) { + ARGUMENT_CHECK_FIELD(argument, main_program); + graph_ = std::unique_ptr(new Graph(argument->main_program())); + if (argument->Has("scope")) { + graph_->Set(framework::ir::kParamScopeAttr, + new framework::Scope *( + const_cast(&argument->scope()))); + } + + ARGUMENT_CHECK_FIELD(argument, ir_analysis_passes); + CreatePasses(argument, argument->ir_analysis_passes()); } -void IRPassManager::Apply(const std::vector &passes) { - // Apply all the passes +void IRPassManager::CreatePasses(Argument *argument, + const std::vector &passes) { std::string pre_pass; int pass_num = 0; for (const std::string &pass_name : passes) { - PrettyLogEndl(Style::H2(), "--- Running IR pass [%s]", pass_name); auto pass = framework::ir::PassRegistry::Instance().Get(pass_name); + if (pass_name == "graph_viz_pass") { std::string dot_file_path = std::to_string(pass_num) + "_ir_" + (pre_pass.empty() ? "origin" : pre_pass) + ".dot"; pass->Set("graph_viz_path", new std::string(std::move(dot_file_path))); pass_num++; + } else if (pass_name == "mkldnn_placement_pass") { + pass->Set("mkldnn_enabled_op_types", + new std::unordered_set( + argument->mkldnn_enabled_op_types())); +#ifdef PADDLE_WITH_MKLDNN + } else if (pass_name == "cpu_quantize_placement_pass") { + pass->Set("quantize_enabled_op_types", + new std::unordered_set( + argument->quantize_enabled_op_types())); + pass->Set( + "quantize_excluded_op_ids", + new std::unordered_set(argument->quantize_excluded_op_ids())); + } else if (pass_name == "cpu_quantize_pass") { + pass->Set("quant_var_scales", + new VarQuantScale(argument->quant_var_scales())); +#endif + } else if (pass_name == "tensorrt_subgraph_pass") { + pass->Set("workspace_size", new int(argument->tensorrt_workspace_size())); + pass->Set("max_batch_size", new int(argument->tensorrt_max_batch_size())); + pass->Set("min_subgraph_size", + new int(argument->tensorrt_min_subgraph_size())); + pass->Set("program", + new framework::ProgramDesc *(&argument->main_program())); + + bool enable_int8 = argument->tensorrt_precision_mode() == + AnalysisConfig::Precision::kInt8; + + pass->Set("enable_int8", new bool(enable_int8)); + + bool use_static_engine = argument->tensorrt_use_static_engine(); + bool model_from_memory = argument->model_from_memory(); + bool int8_valid = !(model_from_memory && enable_int8); + PADDLE_ENFORCE(int8_valid, + "TRT INT8 Now don't support model load from memory."); + + if ((!model_from_memory && use_static_engine) || enable_int8) { + std::string model_opt_cache_dir = + argument->Has("model_dir") + ? argument->model_dir() + : GetDirRoot(argument->model_program_path()); + pass->Set( + "model_opt_cache_dir", + new std::string(GetOrCreateModelOptCacheDir(model_opt_cache_dir))); + } + pass->Set("gpu_device_id", new int(argument->gpu_device_id())); + pass->Set("use_static_engine", new bool(use_static_engine)); + pass->Set("model_from_memory", new bool(argument->model_from_memory())); + pass->Set("engine_opt_info", new std::map( + argument->engine_opt_info())); } - graph_ = pass->Apply(std::move(graph_)); + + if (pass_name == "anakin_subgraph_pass") { + pass->Set("program", + new framework::ProgramDesc *(&argument->main_program())); + pass->Set("gpu_device_id", new int(argument->gpu_device_id())); + pass->Set("model_from_memory", new bool(argument->model_from_memory())); + pass->Set("engine_opt_info", new std::map( + argument->engine_opt_info())); + pass->Set("predictor_id", new int(argument->predictor_id())); + pass->Set("max_input_shape", new std::map>( + argument->anakin_max_input_shape())); + pass->Set("max_batch_size", new int(argument->anakin_max_batch_size())); + } + pre_pass = pass_name; + + passes_.emplace_back(std::move(pass)); + } +} + +std::unique_ptr IRPassManager::Apply(std::unique_ptr graph) { + if (passes_.empty()) { + return graph; + } + PADDLE_ENFORCE(graph.get()); + // Apply all the passes + for (const auto &pass : passes_) { + if (pass->Type() != "graph_viz_pass") { + PrettyLogEndl(Style::H2(), "--- Running IR pass [%s]", pass->Type()); + } + graph.reset(pass->Apply(graph.release())); } + return graph; +} + +framework::proto::ProgramDesc IRPassManager::AcquireProgram( + std::unique_ptr *graph, ProgramDesc *program) const { + auto pass = + framework::ir::PassRegistry::Instance().Get("graph_to_program_pass"); + + // Direct using ProgramDesc desc(argument->main_program()) may cause + // incomplete copies of information. + ProgramDesc desc; + desc.CopyFrom(*program->Proto()); + pass->SetNotOwned("program", &desc); + auto *the_graph = graph->release(); + graph->reset(pass->Apply(the_graph)); + return *desc.Proto(); } } // namespace analysis diff --git a/paddle/fluid/inference/analysis/ir_pass_manager.h b/paddle/fluid/inference/analysis/ir_pass_manager.h index bb230283b7c2cc783d0b68ea0aa3cca1cabc75e6..2d120679eedd392d78b4da66276297ff7280792b 100644 --- a/paddle/fluid/inference/analysis/ir_pass_manager.h +++ b/paddle/fluid/inference/analysis/ir_pass_manager.h @@ -20,27 +20,42 @@ * for inference. */ +#pragma once + +#include +#include +#include +#include +#include #include "paddle/fluid/framework/ir/graph.h" #include "paddle/fluid/framework/ir/pass.h" #include "paddle/fluid/framework/program_desc.h" #include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/inference/analysis/argument.h" +#include "paddle/fluid/inference/analysis/helper.h" namespace paddle { namespace inference { namespace analysis { using framework::ProgramDesc; +using framework::ir::Graph; class IRPassManager final { public: - IRPassManager(const ProgramDesc &program, framework::Scope *scope); + explicit IRPassManager(Argument *argument); + + std::unique_ptr Apply(std::unique_ptr graph); - void Apply(const std::vector &passes); + framework::proto::ProgramDesc AcquireProgram(std::unique_ptr *graph, + ProgramDesc *program) const; framework::ir::Graph &graph() const { return *graph_; } private: - std::unique_ptr graph_; - ProgramDesc program_; + void CreatePasses(Argument *argument, const std::vector &passes); + + std::unique_ptr graph_; + std::vector> passes_; }; } // namespace analysis diff --git a/paddle/fluid/inference/analysis/ir_passes/CMakeLists.txt b/paddle/fluid/inference/analysis/ir_passes/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..05a3d7ddfdb08c98866cc0a08ec4113866c7567d --- /dev/null +++ b/paddle/fluid/inference/analysis/ir_passes/CMakeLists.txt @@ -0,0 +1,28 @@ +cc_library(subgraph_detector SRCS subgraph_detector.cc subgraph_util.cc DEPS proto_desc) +if(WITH_TESTING) + add_dependencies(subgraph_detector gtest) +endif() + +if (WITH_GPU AND TENSORRT_FOUND) + cc_library(tensorrt_subgraph_pass SRCS tensorrt_subgraph_pass.cc DEPS subgraph_detector tensorrt_op_teller) + + set(analysis_deps ${analysis_deps} + subgraph_detector tensorrt_subgraph_pass + CACHE INTERNAL "") + + set(pass_file ${PADDLE_BINARY_DIR}/paddle/fluid/inference/api/paddle_inference_pass.h) + file(APPEND ${pass_file} "USE_PASS(tensorrt_subgraph_pass);\n") + set(INFER_IR_PASSES ${INFER_IR_PASSES} tensorrt_subgraph_pass CACHE INTERNAL "") +endif() + +if (ANAKIN_FOUND) + cc_library(anakin_subgraph_pass SRCS anakin_subgraph_pass.cc DEPS subgraph_detector anakin_op_teller) + + set(analysis_deps ${analysis_deps} + subgraph_detector anakin_subgraph_pass + CACHE INTERNAL "") + + set(pass_file ${PADDLE_BINARY_DIR}/paddle/fluid/inference/api/paddle_inference_pass.h) + file(APPEND ${pass_file} "USE_PASS(anakin_subgraph_pass);\n") + set(INFER_IR_PASSES ${INFER_IR_PASSES} anakin_subgraph_pass CACHE INTERNAL "") +endif() diff --git a/paddle/fluid/inference/analysis/ir_passes/anakin_subgraph_pass.cc b/paddle/fluid/inference/analysis/ir_passes/anakin_subgraph_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..b8d8b6fed8ca237e87cfc67979ec6ddd340b8916 --- /dev/null +++ b/paddle/fluid/inference/analysis/ir_passes/anakin_subgraph_pass.cc @@ -0,0 +1,217 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" +#include "paddle/fluid/inference/anakin/convert/op_converter.h" +#include "paddle/fluid/inference/anakin/op_teller.h" +#include "paddle/fluid/inference/analysis/helper.h" +#include "paddle/fluid/inference/analysis/ir_passes/anakin_subgraph_pass.h" +#include "paddle/fluid/inference/analysis/ir_passes/subgraph_detector.h" +#include "paddle/fluid/string/pretty_log.h" + +namespace paddle { +namespace inference { +namespace analysis { + +using framework::ir::Node; + +void analysis::AnakinSubgraphPass::ApplyImpl( + framework::ir::Graph *graph) const { + framework::ir::FusePassBase::Init("anakin_subgraph_pass", graph); + + auto teller = [](const framework::ir::Node *node) { + if (!node->IsOp() || !node->Op()) return false; + return anakin::OpTeller::Global().Tell(node->Op()->Type(), *node->Op()); + }; + + SubGraphFuser fuser(graph, teller, 6 /* min_subgraph_size */); + fuser(); + + std::vector graph_param_names = + ExtractParameters(graph->Nodes()); + + // those parameter already exist in anakin, and should not have another copy + // in fluid. + std::vector repetitive_params; + + for (auto *node : graph->Nodes()) { + if (node->IsOp() && !Agent(node).subgraph()->empty()) { + CreateAnakinOp(node, graph, graph_param_names, &repetitive_params); + std::unordered_set nodes2remove( + Agent(node).subgraph()->begin(), Agent(node).subgraph()->end()); + framework::ir::GraphSafeRemoveNodes(graph, nodes2remove); + } + } + + std::unordered_set nodes2remove; + for (auto *node : graph->Nodes()) { + if (node->IsOp() && Agent(node).deleted()) { + nodes2remove.insert(node); + } + } + framework::ir::GraphSafeRemoveNodes(graph, nodes2remove); + graph->Set(framework::ir::kRepetitiveParamAttr, + new std::vector(repetitive_params)); +} + +std::string GenerateAnakinEngineKey(const std::set &engine_inputs, + const std::set &engine_outputs, + std::string id) { + std::string engine_hash_key = ""; + for (auto name : engine_inputs) { + engine_hash_key += name; + } + for (auto name : engine_outputs) { + engine_hash_key += name; + } + engine_hash_key += id; + auto engine_key = std::to_string(std::hash()(engine_hash_key)); + return engine_key; +} + +void AnakinSubgraphPass::CreateAnakinOp( + framework::ir::Node *node, Graph *graph, + const std::vector &graph_params, + std::vector *repetitive_params) const { + auto *op_desc = node->Op(); + auto &subgraph = *Agent(node).subgraph(); + PADDLE_ENFORCE(!subgraph.empty()); + + framework::ProgramDesc *program_desc = + Get("program"); + // Add new block for TensorRTEngineOP + const framework::BlockDesc &main_block = + program_desc->Block(framework::kRootBlockIndex); + // const framework::BlockDesc& main_block = program_desc->Block(0); + framework::BlockDesc *new_block = program_desc->AppendBlock(main_block); + + // An fake block desc. + framework::proto::BlockDesc block_proto; + framework::BlockDesc block_desc(nullptr, &block_proto); + block_desc.Proto()->set_parent_idx(-1); + block_desc.Proto()->set_idx(0); + string::PrettyLogDetail("--- detect a sub-graph with %d nodes", + subgraph.size()); + + for (auto *node : subgraph) { + auto *new_block_op = new_block->AppendOp(); + auto *op = block_desc.AppendOp(); + *new_block_op->Proto() = *node->Op()->Proto(); + *op->Proto() = *node->Op()->Proto(); + } + + // Then, we will use the input_names_with_id and output_names_with_id to + // generate the eigine key. + // So, We use set instead of unordered_set here to ensure that the engine key + // is unique. + std::set input_names; + std::set input_names_with_id; + std::vector params; + for (auto *x : node->inputs) { + input_names.insert(x->Name()); + input_names_with_id.insert(x->Name() + std::to_string(x->id())); + if (std::count(graph_params.begin(), graph_params.end(), x->Name()) > 0) { + params.push_back(x->Name()); + } + } + std::copy(params.begin(), params.end(), + std::back_inserter(*repetitive_params)); + op_desc->SetInput( + "Xs", std::vector(input_names.begin(), input_names.end())); + + std::set output_names; + std::set output_names_with_id; + for (auto *x : node->outputs) { + output_names.insert(x->Name()); + output_names_with_id.insert(x->Name() + std::to_string(x->id())); + } + + op_desc->SetOutput( + "Ys", std::vector(output_names.begin(), output_names.end())); + op_desc->SetType("anakin_engine"); + + std::unordered_map output_name_map; + std::unordered_map graph_var_map; + + for (framework::ir::Node *node : graph->Nodes()) { + if (node->IsVar() && node->Var()) { + graph_var_map[node->Name()] = node; + } + } + auto &subgraph_nodes = *Agent(node).subgraph(); + + // The following procedure is used to rename all the intermediate + // variables and the output variables of the subgraph. + RenameAndGetOutputs(subgraph_nodes, &block_desc, input_names_with_id, + &output_names_with_id, &output_names, &output_name_map, + graph_var_map, false); + + // When anakin engine runs at the end of the operation, + // output_mapping help us copy the data from the renamed ITensor + // to Tensor. + std::vector output_mapping; + for (auto name : output_names) { + PADDLE_ENFORCE(output_name_map.count(name) != 0); + output_mapping.push_back(output_name_map[name]); + } + + PADDLE_ENFORCE(!block_desc.Proto()->vars().empty(), + "the block has no var-desc"); + PADDLE_ENFORCE(!output_mapping.empty()); + op_desc->SetBlockAttr("sub_block", new_block); + SetAttr(op_desc->Proto(), "subgraph", + block_desc.Proto()->SerializeAsString()); + // Set attrs + SetAttr(op_desc->Proto(), "parameters", params); + SetAttr(op_desc->Proto(), "output_name_mapping", output_mapping); + int predictor_id = Get("predictor_id"); + auto engine_key = GenerateAnakinEngineKey( + input_names_with_id, output_names_with_id, std::to_string(predictor_id)); + + SetAttr(op_desc->Proto(), "engine_key", engine_key); + auto max_input_shape = + Get>>("max_input_shape"); + auto max_batch_size = Get("max_batch_size"); + + auto *anakin_engine = + inference::Singleton::Global().Create( + true, Get("gpu_device_id"), max_batch_size, max_input_shape, + engine_key); + + auto *scope = param_scope(); + std::unordered_set param_set(params.begin(), params.end()); + framework::BlockDesc block_desc_temp(nullptr, block_desc.Proto()); + + inference::Singleton::Global() + .ConvertBlockToAnakinEngine( + &block_desc_temp, scope, + std::vector(input_names.begin(), input_names.end()), + param_set, output_mapping, anakin_engine); +} + +} // namespace analysis +} // namespace inference +} // namespace paddle + +REGISTER_PASS(anakin_subgraph_pass, + paddle::inference::analysis::AnakinSubgraphPass); diff --git a/paddle/fluid/inference/analysis/ir_passes/anakin_subgraph_pass.h b/paddle/fluid/inference/analysis/ir_passes/anakin_subgraph_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..e80b8bb612096a1da7cd5835c948085d51fdfe7a --- /dev/null +++ b/paddle/fluid/inference/analysis/ir_passes/anakin_subgraph_pass.h @@ -0,0 +1,43 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include +#include +#include "paddle/fluid/framework/ir/pass.h" +#include "paddle/fluid/inference/anakin/engine.h" +#include "paddle/fluid/inference/analysis/ir_passes/subgraph_util.h" + +using anakin::Precision; +using anakin::saber::NV; +namespace paddle { +namespace inference { +namespace analysis { + +class AnakinSubgraphPass : public framework::ir::FusePassBase { + public: + void ApplyImpl(framework::ir::Graph *graph) const override; + + private: + void CreateAnakinOp(framework::ir::Node *x, framework::ir::Graph *graph, + const std::vector &graph_params, + std::vector *repetitive_params) const; + void CleanIntermediateOutputs(framework::ir::Node *node); +}; + +} // namespace analysis +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/analysis/ir_passes/subgraph_detector.cc b/paddle/fluid/inference/analysis/ir_passes/subgraph_detector.cc new file mode 100644 index 0000000000000000000000000000000000000000..76b1671601eec95d64b36effc5727481dcd070e2 --- /dev/null +++ b/paddle/fluid/inference/analysis/ir_passes/subgraph_detector.cc @@ -0,0 +1,467 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/inference/analysis/ir_passes/subgraph_detector.h" +#include +#include +#include +#include +#include "paddle/fluid/framework/ir/graph_helper.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" +#include "paddle/fluid/framework/ir/node.h" + +namespace paddle { +namespace inference { +namespace analysis { + +using framework::ir::Node; + +std::pair, std::vector> +ExtractInputAndOutputOfSubGraph(std::vector &graph) { // NOLINT + std::unordered_set nodes(graph.begin(), graph.end()); + std::unordered_set inputs; + std::unordered_set outputs; + // Input a Value, check whether its inlink is in the subgraph. + auto inlink_in_subgraph = [&](Node *n) { + for (auto *in : n->inputs) { + if (nodes.count(in)) return true; + } + return false; + }; + + for (auto &node : graph) { + for (auto *in : node->inputs) { + // The Value that is written by nodes inside a sub-graph shouldn't be the + // input of the sub-graph. + if (!nodes.count(in) && in->IsVar() && !inlink_in_subgraph(in)) { + inputs.insert(in); + } + } + for (auto *out : node->outputs) { + if (!nodes.count(out) && out->IsVar()) { + outputs.insert(out); + } + } + } + return std::make_pair(std::vector(inputs.begin(), inputs.end()), + std::vector(outputs.begin(), outputs.end())); +} + +// Filter the Intermediate results of the subgraph node. +void FilterRedundantOutputOfSubGraph(Graph *graph) { + std::vector op_nodes; + for (auto &node : TopologicalSort(*graph)) { + if (node.IsVar() || Agent(&node).deleted()) { + continue; + } + op_nodes.push_back(&node); + } + size_t op_num = op_nodes.size(); + for (size_t i = 0; i < op_num; i++) { + if (op_nodes[i]->IsOp()) continue; + std::unordered_set follow_up_input_names; + for (size_t j = i + 1; j < op_num; j++) { + for (auto *in : op_nodes[j]->inputs) { + follow_up_input_names.insert(in->Name()); + } + } + std::vector filtered_subgraph_outlinks; + for (auto *out : op_nodes[i]->outputs) { + if (follow_up_input_names.count(out->Name())) { + filtered_subgraph_outlinks.push_back(out); + } else { + Agent(out).set_deleted(true); + } + } + // The filtered_subgraph_outlinks may be empty. + op_nodes[i]->outputs = filtered_subgraph_outlinks; + } +} + +std::vector> SubgraphDetector::operator()() { + MarkNodesInsideSubGraph(); + return ExtractSubGraphs(); +} + +// Mark the output variables inside a subgraph with the func. +inline void MarkOutLinksInSubGraph(const Node *func) { + for (auto *var : func->outputs) { + Agent(var).set_marked(true); + } +} + +void SubgraphDetector::MarkNodesInsideSubGraph() { + for (auto &node : framework::ir::GraphTraits::DFS(*graph_)) { + if (node_inside_subgraph_teller_(&node)) { + Agent(&node).set_marked(true); + if (node.IsOp()) { + // If a function is inside the sub-graph, mark all the output variables + // to be inside too, so that two marked functions will be inside a same + // sub-graph, lets take a example: A_function->var->B_function, if + // A_function is marked, var should also be marked, so that B_function + // will be in the same sub-graph with A_function if B_function is + // marked. + MarkOutLinksInSubGraph(&node); + } + } + } +} + +// Use the Union Find(UF) algorithm to find fully connected sub-graphs, if node +// a's output is node b, that is a and b is in the same sub-graph. The UF +// algorithm will group them to the same cluster. +using node_map_t = std::unordered_map; +// Find the ancestor id of a node. +int UnionFindGetAncestor(const node_map_t &node_map, size_t id) { + int tmp = id; + do { + tmp = Agent(node_map.at(tmp)).union_find_parent(); + } while (Agent(node_map.at(tmp)).union_find_parent() != tmp); + return tmp; +} +// Make this two node share the same ancestor. +// TODO(Superjom) bad performance, make a balanced tree latter. +void UnionFindCombine(const node_map_t &node_map, size_t a, size_t b) { + int a_ancestor = UnionFindGetAncestor(node_map, a); + int b_ancestor = UnionFindGetAncestor(node_map, b); + Agent(node_map.at(b_ancestor)).set_union_find_parent(a_ancestor); + Agent(node_map.at(a)).set_union_find_parent(a_ancestor); + Agent(node_map.at(b)).set_union_find_parent(a_ancestor); +} + +// This is a simple representation of a graph. +// The BriefNode hold the pointer of the Node. +// This is to avoid changing the original graph +// in the process of trt graph analysis. +struct BriefNode { + explicit BriefNode(Node *n) { node = n; } + Node *node; + std::vector inlinks; + std::vector outlinks; +}; + +// Union two adjacent BriefNode. +// Suppose we have two adjacent nodes src and dst. +// We will perform the following operations: +// 1. add all inputs(except src) of dst to src inlinks. +// 2. add all outputs of dst to src outlinks. +// 3. change all the dst's inputs and outputs +// corresponding inlinks and outlinks to src node. +// 4. delete all dst's inlinks and outlinks. +void UnionContractedNodes(const std::unordered_map &node_map, + int src_id, int dst_id) { + // merge the two adjacent nodes into one node. + BriefNode *src_node = node_map.at(src_id); + BriefNode *dst_node = node_map.at(dst_id); + + std::unordered_set inputs(src_node->inlinks.begin(), + src_node->inlinks.end()); + std::unordered_set outputs; + + for (auto *n : src_node->outlinks) { + if (n != dst_node) outputs.insert(n); + } + + // Add the inlinks and outlinks of dst node to src node. + std::vector dst_in_nodes = dst_node->inlinks; + for (BriefNode *node : dst_in_nodes) { + if (node != src_node) { + inputs.insert(node); + } + } + + std::vector dst_out_nodes = dst_node->outlinks; + for (BriefNode *node : dst_out_nodes) { + outputs.insert(node); + } + +// update the dst and src node's inlinks and outlinks. +#ifdef __clang__ + src_node->inlinks = std::vector(inputs.begin(), inputs.end()); + src_node->outlinks = std::vector(outputs.begin(), outputs.end()); + dst_node->inlinks.clear(); + dst_node->outlinks.clear(); +#else + src_node->inlinks = + std::move(std::vector(inputs.begin(), inputs.end())); + src_node->outlinks = + std::move(std::vector(outputs.begin(), outputs.end())); + dst_node->inlinks.clear(); + dst_node->outlinks.clear(); +#endif + + auto inlink_or_outlink_cleaner = [&](std::vector &nodes) { + for (auto *&n : nodes) { + if (n == src_node || n == dst_node) { + n = src_node; + } + } + }; + // Change all the dst inputs and outputs corresponding inlink and + // outlink to the src node. + for (auto *node : src_node->inlinks) { + inlink_or_outlink_cleaner(node->outlinks); + } + + for (auto *node : src_node->outlinks) { + inlink_or_outlink_cleaner(node->inlinks); + } +} + +// FlexibleDFS +// If reverse is true, do reverse dfs. +// If enter func is not nullptr, calls enter(node) before visiting any children +// of node. +// If leave func not nullptr, calls leave(node) after visiting all parents of +// node. +void FlexibleDFS(const std::vector &source, bool reverse, + const std::function &enter, + const std::function &leave) { + typedef struct { + const BriefNode *node; + bool leave; + } FNode; + + std::vector stack; + for (auto &node : source) { + stack.push_back(FNode{node, false}); + } + std::unordered_set visited; + while (!stack.empty()) { + auto fnode = stack.back(); + stack.pop_back(); + + if (fnode.leave) { + if (leave && !leave(fnode.node)) return; + } + if (visited.count(fnode.node)) continue; + visited.insert(fnode.node); + + if (enter && !enter(fnode.node)) return; + + if (leave) stack.push_back(FNode{fnode.node, true}); + const std::vector iter_nodes = + reverse == true ? fnode.node->inlinks : fnode.node->outlinks; + for (const BriefNode *node : iter_nodes) { + if (!visited.count(node)) { + stack.push_back(FNode{node, false}); + } + } + } +} + +std::vector> SubgraphDetector::ExtractSubGraphs() { + // Run the Extract algorithm to find all subgraphs. + std::vector marked_nodes; + // We use brief_node_map to represent the original graph in order to avoid + // changing the original graph. + std::unordered_map brief_node_map; + + std::unordered_set valid_node_ids; + for (auto *node : graph_->Nodes()) { + valid_node_ids.insert(node->id()); + } + + for (auto &node : framework::ir::GraphTraits::TS(*graph_)) { + brief_node_map[node.id()] = new BriefNode(&node); + if (Agent(&node).marked()) { + marked_nodes.push_back(&node); + } + } + + // extract sub-graphs in the marked node set, use Union Find algorithm. + node_map_t node_map; // id to ptr + for (auto *n : marked_nodes) { + // n's parent == n.id means it is the ancestor + Agent(n).set_union_find_parent(n->id()); + node_map[n->id()] = n; + } + + // create breif node map + for (auto &itr : brief_node_map) { + for (Node *node : itr.second->node->inputs) { + if (!valid_node_ids.count(node->id())) { + LOG(INFO) << "invalid node id " << node->id(); + continue; + } + itr.second->inlinks.push_back(brief_node_map.at(node->id())); + } + + for (Node *node : itr.second->node->outputs) { + if (!valid_node_ids.count(node->id())) { + LOG(INFO) << "invalid node id " << node->id(); + continue; + } + itr.second->outlinks.push_back(brief_node_map.at(node->id())); + } + } + + for (auto &itr : brief_node_map) { + BriefNode *brief_node = itr.second; + + if (!Agent(brief_node->node).marked()) { + VLOG(4) << brief_node->node->id() << " node not a trt candidate."; + continue; + } + + // Our algorithm must guarantee that: + // 1. The graph is always directed acyclic graph(DAG). + // 2. If there is a path in the subgraph from X to Y (X and Y are both + // nodes in the subgraph), then all paths from X to Y are in the + // subgraph. + // + // In order to achieve the above guarantee. + // For adjacent nodes src -> dst. + // 1. Get all dst input nodes except src. + // 2. Reverse DFS from those input nodes + // 3. If there is a path from input nodes to src, + // then the src and dst nodes can not be fused into one node, + // otherwise it can be done. + + while (true) { + std::unordered_set contract_nodes; + for (auto *out : brief_node->outlinks) { + // must be an trt candidate + if (!Agent(out->node).marked()) continue; + // get all dst input nodes except src. + std::vector source_nodes; + for (auto *n : out->inlinks) { + if (n != brief_node) { + source_nodes.push_back(n); + } + } + + // Reverse DFS from the source_nodes. + bool have_excess_path = false; + FlexibleDFS(source_nodes, true, nullptr, + [&have_excess_path, brief_node](const BriefNode *n) { + if (n == brief_node) { + have_excess_path = true; + return false; + } + return true; + }); + if (have_excess_path) continue; + contract_nodes.insert(out); + } + if (contract_nodes.empty()) break; + + for (auto dst_node : contract_nodes) { + UnionFindCombine(node_map, brief_node->node->id(), + dst_node->node->id()); + UnionContractedNodes(brief_node_map, brief_node->node->id(), + dst_node->node->id()); + } + } + } + + std::unordered_map> clusters; + for (auto *n : marked_nodes) { + if (n->IsOp()) { + clusters[UnionFindGetAncestor(node_map, Agent(n).union_find_parent())] + .push_back(n); + } + } + std::vector> result; + std::for_each(clusters.begin(), clusters.end(), + [&](const decltype(clusters)::value_type &it) { + result.push_back(it.second); + }); + + return result; +} + +void SubGraphFuser::operator()() { ReplaceNodesWithSubGraphs(); } + +void RemoveIntermediateOutputInSubgraph(const std::vector &subgraph, + Graph *graph, + std::vector *outputs) { + std::unordered_set subgraph_set(subgraph.begin(), subgraph.end()); + std::unordered_set valid_output; + + for (auto *output : *outputs) { + int num_used = 0; + for (auto *node : output->outputs) { + if (!subgraph_set.count(node)) ++num_used; + if (num_used > 0) valid_output.insert(output); + } + } + + outputs->assign(valid_output.begin(), valid_output.end()); +} + +void DetachDeletedNodes(framework::ir::Graph *graph) { + std::unordered_set nodes; + for (auto *node : graph->Nodes()) { + if (Agent(node).deleted()) { + node->inputs.clear(); + node->outputs.clear(); + } + } +} + +void SubGraphFuser::ReplaceNodesWithSubGraphs() { + auto subgraphs = SubgraphDetector(graph_, node_inside_subgraph_teller_)(); + for (auto &subgraph : subgraphs) { + if (subgraph.size() <= (size_t)min_subgraph_size_) continue; + std::unordered_set subgraph_uniq(subgraph.begin(), subgraph.end()); + // replace this sub-graph with the first node. Two steps: 1. Create a Block + // Node that contains this subgraph 2. Mark the nodes inside the sub-graph + // as deleted. 3. Replace the deleted node with the new Block Node. + framework::OpDesc empty_desc; + empty_desc.SetType("anakin_engine"); + auto *block_node = graph_->CreateOpNode(&empty_desc); + Agent(block_node).set_subgraph({}); + auto io = ExtractInputAndOutputOfSubGraph(subgraph); + block_node->inputs = std::move(io.first); + block_node->outputs = std::move(io.second); + + RemoveIntermediateOutputInSubgraph(subgraph, graph_, &block_node->outputs); + + for (auto *node : subgraph) { + // TODO(Superjomn) need a unified mechanism to treat deleted node in each + // pass. + Agent(node).set_deleted(true); + Agent(block_node).subgraph()->push_back(node); + } + + // Change all the sub-graph's inputs and outputs corresponding inlink and + // outlink to this sub-graph node. + auto inlink_or_outlink_cleaner = [&](std::vector &nodes) { + for (auto *&n : nodes) { + if (subgraph_uniq.count(n)) { + n = block_node; + } + } + std::unordered_set uniq(nodes.begin(), nodes.end()); + nodes.assign(uniq.begin(), uniq.end()); + }; + for (auto *i : block_node->inputs) { + inlink_or_outlink_cleaner(i->outputs); + } + for (auto *&o : block_node->outputs) { + inlink_or_outlink_cleaner(o->inputs); + } + } + // DetachDeletedNodes(graph_); + FilterRedundantOutputOfSubGraph(graph_); +} + +inline bool CheckNodeIndegreeEquals(const Node &node, size_t n) { + return node.inputs.size() == n; +} + +} // namespace analysis +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/analysis/ir_passes/subgraph_detector.h b/paddle/fluid/inference/analysis/ir_passes/subgraph_detector.h new file mode 100644 index 0000000000000000000000000000000000000000..5d11c217b69f11d45c6fb6d552dc404fa8313daf --- /dev/null +++ b/paddle/fluid/inference/analysis/ir_passes/subgraph_detector.h @@ -0,0 +1,157 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +/* + * This file defines the the class to partition a graph. + */ + +#pragma once + +#include +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_traits.h" +#include "paddle/fluid/framework/ir/node.h" +#include "paddle/fluid/inference/analysis/argument.h" +#include "paddle/fluid/inference/analysis/helper.h" + +namespace paddle { +namespace inference { +namespace analysis { + +using framework::ir::Graph; +using framework::ir::NodesTSIterator; + +const char kIsFunctionNode[] = "__is_function_node__"; +const char kFunctionNodeSubGraph[] = "__function_node_sub_graph__"; +const char kSubgraphSplitterMarkerAttrName[] = + "_sub_graph_splitter_inside_sub_graph"; + +/* + * Detect the nodes in a sub-graph that meet some conditions. This class doesn't + * modify the graph. + */ +class SubgraphDetector { + public: + // Tell whether a node is inside a sub-graph. + using NodeInsideSubgraphTeller = + std::function; + + SubgraphDetector(Graph *graph, const NodeInsideSubgraphTeller &teller) + : graph_(graph), node_inside_subgraph_teller_(teller) {} + + std::vector> operator()(); + + protected: + // Mark the nodes inside the accepted sub-graph using + // node_inside_subgraph_teller. + void MarkNodesInsideSubGraph(); + + // Merge the marked nodes into sub-graphs and return the sub-graphs. + std::vector> ExtractSubGraphs(); + + private: + Graph *graph_; + NodeInsideSubgraphTeller node_inside_subgraph_teller_; +}; + +/* + * SubGraphFuser - Replace some nodes with the sub-graph node they are inside. + * To some extent, the TensorRT engine is just a fusion op for a model. + */ +class SubGraphFuser { + public: + using NodeInsideSubgraphTeller = SubgraphDetector::NodeInsideSubgraphTeller; + + SubGraphFuser(Graph *graph, const NodeInsideSubgraphTeller &teller, + int min_subgraph_size) + : graph_(graph), + node_inside_subgraph_teller_(teller), + min_subgraph_size_{min_subgraph_size} {} + + // The main method which run all the logic. + void operator()(); + + protected: + // Remove the nodes inside sub-graphs and replace with the SubGraphNode. + void ReplaceNodesWithSubGraphs(); + + private: + Graph *graph_; + NodeInsideSubgraphTeller node_inside_subgraph_teller_; + int min_subgraph_size_; +}; + +struct NodeWrapper { + bool deleted{false}; + bool marked{false}; + int union_find_parent{-1}; + std::vector subgraph; +}; + +/* + * ir::Node agent for subgraph detector. + */ +struct Agent { + explicit Agent(framework::ir::Node *x) : x_(x) {} + + NodeWrapper &wrapper() { + if (!x_->IsWrappedBy()) { + x_->WrappedBy(new NodeWrapper); + } + return x_->template Wrapper(); + } + + bool deleted() { return wrapper().deleted; } + void set_deleted(bool x) { wrapper().deleted = x; } + + bool marked() { return wrapper().marked; } + void set_marked(bool x) { wrapper().marked = x; } + + void set_subgraph(const std::vector &x) { + wrapper().subgraph = x; + } + + int union_find_parent() { return wrapper().union_find_parent; } + void set_union_find_parent(int v) { wrapper().union_find_parent = v; } + + std::vector *subgraph() { return &wrapper().subgraph; } + std::vector &inputs() { return x_->inputs; } + std::vector &outputs() { return x_->outputs; } + + private: + framework::ir::Node *x_; +}; + +// The nodes those have no input will be treated as start points. +static std::vector ExtractStartPoints(const Graph &g) { + std::vector result; + for (auto *node : g.Nodes()) { + if (node->inputs.empty()) { + result.push_back(node); + } + } + return result; +} + +static iterator_range TopologicalSort(const Graph &g) { + auto start_points = ExtractStartPoints(g); + PADDLE_ENFORCE(!start_points.empty()); + NodesTSIterator x(start_points); + return iterator_range(NodesTSIterator(start_points), + NodesTSIterator()); +} + +} // namespace analysis +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/analysis/ir_passes/subgraph_util.cc b/paddle/fluid/inference/analysis/ir_passes/subgraph_util.cc new file mode 100644 index 0000000000000000000000000000000000000000..7c4aab06a1d2b3fadc76b46c7e95cea7818c56e2 --- /dev/null +++ b/paddle/fluid/inference/analysis/ir_passes/subgraph_util.cc @@ -0,0 +1,171 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +/* + * This file defines the the class to partition a graph. + */ + +#include "paddle/fluid/inference/analysis/ir_passes/subgraph_util.h" +#include +#include + +namespace paddle { +namespace inference { +namespace analysis { +using framework::ir::Node; + +std::vector ExtractParameters( + const std::unordered_set &nodes) { + // We can judge whether a variable is a parameter by + // its presistable property, but sometimes the presistable + // of the feed op output is true, so we have to identify it. + std::vector feed_outputs; + for (const auto &node : nodes) { + if (!node->IsOp()) continue; + std::string op_type = node->Op()->Type(); + if (op_type == "feed" || op_type == "fetch") { + std::vector output_names = node->Op()->OutputArgumentNames(); + std::copy(output_names.begin(), output_names.end(), + std::back_inserter(feed_outputs)); + } + } + + std::vector parameters; + for (const auto &node : nodes) { + if (!node->IsVar()) continue; + if (node->Var()->Persistable() && + std::find(feed_outputs.begin(), feed_outputs.end(), node->Name()) == + feed_outputs.end()) { + parameters.push_back(node->Name()); + } + } + return parameters; +} + +void RenameAndGetOutputs( + const std::vector &subgraph_nodes, + framework::BlockDesc *block_desc, + const std::set &input_names_with_id, + std::set *output_names_with_id, + std::set *output_names, + std::unordered_map *output_name_map, + const std::unordered_map &graph_var_map, + bool is_trt) { + //// In the normal case, the paddle-trt exists bug when runing the googlenet. + // When there are more than two convolutions of 1 * 1 with the same input, the + // paddle-tensorrt will do the merging optimization, which fuse those conv + // into one conv, and then trigger bug. So, We should use strategy to avoid + // this optimization for the time being. This bug will be fixed in the future. + std::unordered_map + same_hierarchy_conv2d_num_map; + + auto add_block_var = [&](const std::string &graph_arg, + const std::string &block_arg) { + auto arg_var_node = graph_var_map.find(graph_arg); + PADDLE_ENFORCE(arg_var_node != graph_var_map.end()); + auto *var_t = block_desc->Var(block_arg); + var_t->SetShape(arg_var_node->second->Var()->GetShape()); + var_t->SetDataType(arg_var_node->second->Var()->GetDataType()); + }; + + for (size_t index = 0; index < block_desc->OpSize(); ++index) { + framework::proto::OpDesc *op = block_desc->Op(index)->Proto(); + framework::OpDesc op_desc(*op, nullptr); + auto correspond_node = subgraph_nodes[index]; + PADDLE_ENFORCE_EQ(correspond_node->Name(), op->type()); + + std::unordered_map var2id; + std::unordered_map in_vars; + for (auto *in_var : correspond_node->inputs) { + var2id[in_var->Name()] = in_var->id(); + in_vars[in_var->Name()] = in_var; + } + // rename for the input variables of op inside subgraph + for (int i = 0; i < op->inputs_size(); i++) { + // one input + auto *in_var = op->mutable_inputs(i); + std::vector replaced_names; + for (int k = 0; k < in_var->arguments_size(); k++) { // all the arguments + const std::string arg_value = in_var->arguments(k); + const std::string arg_value_with_id = + arg_value + std::to_string(var2id[arg_value]); + + if (input_names_with_id.count(arg_value_with_id)) { + replaced_names.push_back(arg_value); + if (graph_var_map.count(arg_value)) { + add_block_var(arg_value, arg_value); + } + } else { + replaced_names.push_back(arg_value_with_id); + if (graph_var_map.count(arg_value)) { + add_block_var(arg_value, arg_value_with_id); + } + } + } + in_var->clear_arguments(); + for (size_t k = 0; k < replaced_names.size(); k++) { + in_var->add_arguments(replaced_names[k]); + } + } + var2id.clear(); + for (auto out_var : correspond_node->outputs) { + var2id[out_var->Name()] = out_var->id(); + } + if (op_desc.Type() == "conv2d" && is_trt) { + auto input_var_name = op_desc.Input("Input").front(); + auto filter_var_name = op_desc.Input("Filter").front(); + auto out_var_name = op_desc.Output("Output").front(); + auto filter_shape = in_vars[filter_var_name]->Var()->GetShape(); + const std::vector strides = + boost::get>(op_desc.GetAttr("strides")); + const std::vector paddings = + boost::get>(op_desc.GetAttr("paddings")); + if (same_hierarchy_conv2d_num_map[input_var_name] > 0) { + (*output_names_with_id) + .insert(out_var_name + std::to_string(var2id[out_var_name])); + (*output_names).insert(out_var_name); + } else if (filter_shape[2] == 1 && filter_shape[3] == 1 && + strides[0] == 1 && strides[1] == 1 && paddings[0] == 0 && + paddings[1] == 0) { + same_hierarchy_conv2d_num_map[input_var_name] += 1; + } + } + // rename for the output variables of op inside subgraph + for (int i = 0; i < op->outputs_size(); i++) { + framework::proto::OpDesc_Var *out_var = op->mutable_outputs(i); + std::vector replaced_names; + for (int k = 0; k < out_var->arguments_size(); k++) { + const std::string arg_value = out_var->arguments(k); + const std::string arg_value_with_id = + arg_value + std::to_string(var2id[arg_value]); + + if (graph_var_map.count(arg_value)) { + add_block_var(arg_value, arg_value_with_id); + } + if (output_names_with_id->count(arg_value_with_id)) { + (*output_name_map)[arg_value] = arg_value_with_id; + } + replaced_names.push_back(arg_value_with_id); + } + out_var->clear_arguments(); + for (size_t k = 0; k < replaced_names.size(); k++) { + out_var->add_arguments(replaced_names[k]); + } + } + } +} + +} // namespace analysis +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/analysis/ir_passes/subgraph_util.h b/paddle/fluid/inference/analysis/ir_passes/subgraph_util.h new file mode 100644 index 0000000000000000000000000000000000000000..bb445027821096689965096c69b8183dd9da403c --- /dev/null +++ b/paddle/fluid/inference/analysis/ir_passes/subgraph_util.h @@ -0,0 +1,50 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +/* + * This file defines the the class to partition a graph. + */ + +#pragma once +#include +#include +#include +#include +#include + +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_traits.h" +#include "paddle/fluid/framework/ir/node.h" + +namespace paddle { +namespace inference { +namespace analysis { +using framework::ir::Node; + +std::vector ExtractParameters( + const std::unordered_set &nodes); + +void RenameAndGetOutputs( + const std::vector &subgraph_nodes, + framework::BlockDesc *block_desc, + const std::set &input_names_with_id, + std::set *output_names_with_id, + std::set *output_names, + std::unordered_map *output_name_map, + const std::unordered_map &graph_var_map, + bool is_trt = true); + +} // namespace analysis +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/analysis/ir_passes/tensorrt_subgraph_pass.cc b/paddle/fluid/inference/analysis/ir_passes/tensorrt_subgraph_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..67650a352d8b8239da228462c21877ff440147b8 --- /dev/null +++ b/paddle/fluid/inference/analysis/ir_passes/tensorrt_subgraph_pass.cc @@ -0,0 +1,289 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include +#include + +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" +#include "paddle/fluid/inference/analysis/helper.h" +#include "paddle/fluid/inference/analysis/ir_passes/subgraph_detector.h" +#include "paddle/fluid/inference/analysis/ir_passes/tensorrt_subgraph_pass.h" +#include "paddle/fluid/inference/tensorrt/convert/op_converter.h" +#include "paddle/fluid/inference/tensorrt/engine.h" +#include "paddle/fluid/inference/tensorrt/op_teller.h" +#include "paddle/fluid/string/pretty_log.h" + +namespace paddle { +namespace inference { +namespace analysis { + +using framework::ir::Node; + +void analysis::TensorRtSubgraphPass::ApplyImpl( + framework::ir::Graph *graph) const { + framework::ir::FusePassBase::Init("tensorrt_subgraph_pass", graph); + + auto teller = [](const framework::ir::Node *node) { + if (!node->IsOp() || !node->Op()) return false; + return tensorrt::OpTeller::Global().Tell(node->Op()->Type(), *node->Op()); + }; + + SubGraphFuser fuser(graph, teller, + Get("min_subgraph_size") /*min subgraph size*/); + fuser(); + + std::vector graph_param_names = + ExtractParameters(graph->Nodes()); + // those parameter already exist in trt, and should not have another copy in + // fluid. + std::vector repetitive_params; + + for (auto *node : graph->Nodes()) { + if (node->IsOp() && !Agent(node).subgraph()->empty()) { + CreateTensorRTOp(node, graph, graph_param_names, &repetitive_params); + + std::unordered_set nodes2remove( + Agent(node).subgraph()->begin(), Agent(node).subgraph()->end()); + framework::ir::GraphSafeRemoveNodes(graph, nodes2remove); + } + } + + std::unordered_set nodes2remove; + for (auto *node : graph->Nodes()) { + if (node->IsOp() && Agent(node).deleted()) { + nodes2remove.insert(node); + } + } + framework::ir::GraphSafeRemoveNodes(graph, nodes2remove); + graph->Set(framework::ir::kRepetitiveParamAttr, + new std::vector(repetitive_params)); +} + +std::string GenerateEngineKey(const std::set &engine_inputs, + const std::set &engine_outputs, + const std::string &predictor_id) { + std::string engine_hash_key = ""; + for (auto name : engine_inputs) { + engine_hash_key += name; + } + for (auto name : engine_outputs) { + engine_hash_key += name; + } + engine_hash_key += predictor_id; + auto engine_key = std::to_string(std::hash()(engine_hash_key)); + return engine_key; +} + +void TensorRtSubgraphPass::CreateTensorRTOp( + framework::ir::Node *node, Graph *graph, + const std::vector &graph_params, + std::vector *repetitive_params) const { + auto *op_desc = node->Op(); + auto &subgraph = *Agent(node).subgraph(); + PADDLE_ENFORCE(!subgraph.empty()); + + framework::ProgramDesc *program_desc = + Get("program"); + // Add new block for TensorRTEngineOP + const framework::BlockDesc &main_block = + program_desc->Block(framework::kRootBlockIndex); + // const framework::BlockDesc& main_block = program_desc->Block(0); + framework::BlockDesc *new_block = program_desc->AppendBlock(main_block); + + // An fake block desc. + framework::proto::BlockDesc block_proto; + framework::BlockDesc block_desc(nullptr, &block_proto); + block_desc.Proto()->set_parent_idx(-1); + block_desc.Proto()->set_idx(0); + string::PrettyLogDetail("--- detect a sub-graph with %d nodes", + subgraph.size()); + + for (auto *node : subgraph) { + auto *new_block_op = new_block->AppendOp(); + auto *op = block_desc.AppendOp(); + *new_block_op->Proto() = *node->Op()->Proto(); + *op->Proto() = *node->Op()->Proto(); + } + + // Then, we will use the input_names_with_id and output_names_with_id to + // generate the eigine key. + // So, We use set instead of unordered_set here to ensure that the engine key + // is unique. + std::set input_names; + std::set input_names_with_id; + std::vector params; + + // The node->inputs containes input tensors and parameters. + for (auto *x : node->inputs) { + input_names.insert(x->Name()); + input_names_with_id.insert(x->Name() + std::to_string(x->id())); + if (std::count(graph_params.begin(), graph_params.end(), x->Name()) > 0) { + params.push_back(x->Name()); + } + } + + std::set output_names; + std::set output_names_with_id; + for (auto *x : node->outputs) { + output_names.insert(x->Name()); + output_names_with_id.insert(x->Name() + std::to_string(x->id())); + } + + std::unordered_map output_name_map; + std::unordered_map graph_var_map; + + for (framework::ir::Node *node : graph->Nodes()) { + if (node->IsVar() && node->Var()) { + graph_var_map[node->Name()] = node; + } + } + auto &subgraph_nodes = *Agent(node).subgraph(); + + // The following procedure is used to rename all the intermediate + // variables and the output variables of the subgraph. + // Why we do this? + // During the transition from fluid OP to tensorrt OP, we map + // the input and output Tensor(fluid data structure) of fluid OP + // to the corresponding ITensor (trt data structure) through the + // Tensor name. When we set up ITensor for an variable, we must + // ensure that it has not been set before. + // If there is variable in the fluid graph, which is not only the + // input of a OP, but also the output of a Op, there will be problems. + // So we have to rename the variable in the subgraph to make sure + // it is either an OP's input or an OP's output. + RenameAndGetOutputs(subgraph_nodes, &block_desc, input_names_with_id, + &output_names_with_id, &output_names, &output_name_map, + graph_var_map); + + // When tensorrt engine runs at the end of the operation, + // output_mapping help us copy the data from the renamed ITensor + // to Tensor. + std::vector output_mapping; + for (auto name : output_names) { + PADDLE_ENFORCE(output_name_map.count(name) != 0); + output_mapping.push_back(output_name_map[name]); + } + PADDLE_ENFORCE(!output_mapping.empty()); + PADDLE_ENFORCE(!block_desc.Proto()->vars().empty(), + "the block has no var-desc"); + + // Set attrs + op_desc->SetType("tensorrt_engine"); + op_desc->SetInput( + "Xs", std::vector(input_names.begin(), input_names.end())); + + op_desc->SetOutput( + "Ys", std::vector(output_names.begin(), output_names.end())); + + op_desc->SetBlockAttr("sub_block", new_block); + SetAttr(op_desc->Proto(), "subgraph", + block_desc.Proto()->SerializeAsString()); + SetAttr(op_desc->Proto(), "max_batch_size", Get("max_batch_size")); + SetAttr(op_desc->Proto(), "workspace_size", Get("workspace_size")); + SetAttr(op_desc->Proto(), "gpu_id", Get("gpu_device_id")); + SetAttr(op_desc->Proto(), "output_name_mapping", output_mapping); + SetAttr(op_desc->Proto(), "parameters", params); + + auto enable_int8 = Get("enable_int8"); + auto use_static_engine = Get("use_static_engine"); + auto engine_key = GenerateEngineKey(input_names_with_id, output_names_with_id, + std::to_string(0)); + + // Get "" when there is no cached calibration table data. + bool load_from_memory = Get("model_from_memory"); + std::string calibration_data = ""; + if (enable_int8) { + calibration_data = GetTrtCalibTableData( + Get("model_opt_cache_dir"), engine_key, enable_int8); + } + SetAttr(op_desc->Proto(), "calibration_data", calibration_data); + + SetAttr(op_desc->Proto(), "enable_int8", enable_int8); + SetAttr(op_desc->Proto(), "engine_key", engine_key); + std::string trt_engine_serialized_data = ""; + SetAttr(op_desc->Proto(), "engine_serialized_data", + trt_engine_serialized_data); + + std::unique_ptr calibrator; + if (enable_int8 && calibration_data.size() != 0) { + calibrator.reset(new tensorrt::TRTInt8Calibrator(calibration_data)); + } + // When in int8 mode and calibration_mode, the program just produce the + // calibration table data. + bool calibration_mode = (enable_int8 && calibration_data.size() == 0); + if (calibration_mode) { + // calibraion mode means generate int8 calibration table data process. + return; + } + + std::copy(params.begin(), params.end(), + std::back_inserter(*repetitive_params)); + bool need_serialize = (use_static_engine && !load_from_memory); + + if (need_serialize) { + trt_engine_serialized_data = GetTrtEngineSerializedData( + Get("model_opt_cache_dir"), engine_key); + // we can load the engine info serialized before from the disk. + if (!trt_engine_serialized_data.empty()) { + SetAttr(op_desc->Proto(), "engine_serialized_data", + trt_engine_serialized_data); + LOG(INFO) << "Load TRT Optimized Info from " + << GetTrtEngineSerializedPath( + Get("model_opt_cache_dir"), engine_key); + return; + } + } + + // the following code will NOT run in following situation: + // 1. calibraion mode (generate trt int8 calibraiton table data) + // 2. already load serialized trt engine info. + LOG(INFO) << "Prepare TRT engine (Optimize model structure, Select OP " + "kernel etc). This process may cost a lot of time."; + std::unique_ptr trt_engine( + new tensorrt::TensorRTEngine( + Get("max_batch_size"), Get("workspace_size"), enable_int8, + calibrator.get(), Get("gpu_device_id"))); + auto *scope = param_scope(); + framework::BlockDesc block_desc_temp(nullptr, block_desc.Proto()); + std::unordered_set param_set(params.begin(), params.end()); + inference::Singleton::Global() + .ConvertBlockToTRTEngine( + &block_desc_temp, *scope, + std::vector(input_names.begin(), input_names.end()), + param_set, output_mapping, trt_engine.get()); + nvinfer1::IHostMemory *serialized_engine_data = trt_engine->Serialize(); + trt_engine_serialized_data = + std::string((const char *)serialized_engine_data->data(), + serialized_engine_data->size()); + + if (need_serialize) { + SaveTrtEngineSerializedDataToFile( + GetTrtEngineSerializedPath(Get("model_opt_cache_dir"), + engine_key), + trt_engine_serialized_data); + } + SetAttr(op_desc->Proto(), "engine_serialized_data", + trt_engine_serialized_data); +} + +} // namespace analysis +} // namespace inference +} // namespace paddle + +REGISTER_PASS(tensorrt_subgraph_pass, + paddle::inference::analysis::TensorRtSubgraphPass) + .RequirePassAttr("max_batch_size") + .RequirePassAttr("workspace_size") + .RequirePassAttr("min_subgraph_size"); diff --git a/paddle/fluid/inference/analysis/ir_passes/tensorrt_subgraph_pass.h b/paddle/fluid/inference/analysis/ir_passes/tensorrt_subgraph_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..f530a5a0b337666ba6c470fbf63247cc62041d82 --- /dev/null +++ b/paddle/fluid/inference/analysis/ir_passes/tensorrt_subgraph_pass.h @@ -0,0 +1,42 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include +#include +#include +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/pass.h" +#include "paddle/fluid/inference/analysis/ir_passes/subgraph_util.h" + +namespace paddle { +namespace inference { +namespace analysis { + +class TensorRtSubgraphPass : public framework::ir::FusePassBase { + public: + void ApplyImpl(framework::ir::Graph *graph) const override; + + private: + void CreateTensorRTOp(framework::ir::Node *x, framework::ir::Graph *graph, + const std::vector &graph_params, + std::vector *repetitive_params) const; + void CleanIntermediateOutputs(framework::ir::Node *node); +}; + +} // namespace analysis +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/analysis/model_store_pass.cc b/paddle/fluid/inference/analysis/model_store_pass.cc deleted file mode 100644 index c313db08875669010ddcca13aa66b383ee6d26f8..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/model_store_pass.cc +++ /dev/null @@ -1,67 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include -#include -#include - -#include "paddle/fluid/inference/analysis/analyzer.h" -#include "paddle/fluid/inference/analysis/argument.h" -#include "paddle/fluid/inference/analysis/model_store_pass.h" - -namespace paddle { -namespace inference { -namespace analysis { - -void ModelStorePass::Run(DataFlowGraph *x) { - if (!argument_->fluid_model_param_path) { - PADDLE_ENFORCE_NOT_NULL(argument_->fluid_model_dir); - argument_->fluid_model_param_path.reset( - new std::string(*argument_->fluid_model_dir + "param")); - } - PADDLE_ENFORCE_NOT_NULL(argument_->model_output_store_path); - // Directly copy param file to destination. - std::stringstream ss; - // NOTE these commands only works on linux. - ss << "mkdir -p " << *argument_->model_output_store_path; - VLOG(3) << "run command: " << ss.str(); - PADDLE_ENFORCE_EQ(system(ss.str().c_str()), 0); - ss.str(""); - - ss << "cp " << *argument_->fluid_model_dir << "/*" - << " " << *argument_->model_output_store_path; - VLOG(3) << "run command: " << ss.str(); - PADDLE_ENFORCE_EQ(system(ss.str().c_str()), 0); - - // Store program - PADDLE_ENFORCE_NOT_NULL(argument_->transformed_program_desc, - "program desc is not transformed, should call " - "DataFlowGraphToFluidPass first."); - VLOG(3) << "store analyzed program to " - << *argument_->model_output_store_path; - const std::string program_output_path = - *argument_->model_output_store_path + "/__model__"; - std::ofstream file(program_output_path, std::ios::binary); - PADDLE_ENFORCE(file.is_open(), "failed to open %s to write.", - program_output_path); - const std::string serialized_message = - argument_->transformed_program_desc->SerializeAsString(); - file.write(serialized_message.c_str(), serialized_message.size()); -} - -bool ModelStorePass::Finalize() { return true; } - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/model_store_pass.h b/paddle/fluid/inference/analysis/model_store_pass.h deleted file mode 100644 index f14b49e09c2f8e79c6fc4accdbf17f4f7a9bb1a3..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/model_store_pass.h +++ /dev/null @@ -1,55 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -/* - * This file defines ModelStorePass, which store the runtime DFG to a Paddle - * model in the disk, and that model can be reloaded for prediction. - */ - -#pragma once -#include -#include "paddle/fluid/inference/analysis/analysis_pass.h" - -namespace paddle { -namespace inference { -namespace analysis { - -class ModelStorePass : public DataFlowGraphPass { - public: - bool Initialize(Argument* argument) override { - if (!argument) { - LOG(ERROR) << "invalid argument"; - return false; - } - argument_ = argument; - return true; - } - - void Run(DataFlowGraph* x) override; - - std::string repr() const override { return "DFG-store-pass"; } - std::string description() const override { - return R"DD(This file defines ModelStorePass, which store the runtime DFG to a Paddle - model in the disk, and that model can be reloaded for prediction again.)DD"; - } - - bool Finalize() override; - - private: - Argument* argument_{nullptr}; -}; - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/model_store_pass_tester.cc b/paddle/fluid/inference/analysis/model_store_pass_tester.cc deleted file mode 100644 index d6493fc25edf25003504542f1b01c4105754c8df..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/model_store_pass_tester.cc +++ /dev/null @@ -1,43 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include "paddle/fluid/inference/analysis/model_store_pass.h" - -#include -#include -#include "paddle/fluid/inference/analysis/analyzer.h" - -namespace paddle { -namespace inference { -namespace analysis { - -DEFINE_string(inference_model_dir, "", "Model path"); - -TEST(DFG_StorePass, test) { - Analyzer analyzer; - Argument argument(FLAGS_inference_model_dir); - argument.model_output_store_path.reset( - new std::string("./_dfg_store_pass_tmp")); - // disable storage in alalyzer - FLAGS_IA_output_storage_path = ""; - analyzer.Run(&argument); - - ModelStorePass pass; - pass.Initialize(&argument); - pass.Run(argument.main_dfg.get()); -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/node.cc b/paddle/fluid/inference/analysis/node.cc deleted file mode 100644 index 3339b5044df0cf91d00aa9ddad310d4bf263bc3c..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/node.cc +++ /dev/null @@ -1,70 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/inference/analysis/node.h" -#include "glog/logging.h" -#include "paddle/fluid/platform/enforce.h" - -namespace paddle { -namespace inference { -namespace analysis { - -std::vector Value::dot_attrs() const { - return std::vector({Dot::Attr("style", "filled,rounded"), - Dot::Attr("shape", "box"), - Dot::Attr("fillcolor", "red")}); -} - -std::vector Function::dot_attrs() const { - return std::vector({Dot::Attr("style", "filled,rounded"), - Dot::Attr("shape", "diamond"), - Dot::Attr("fillcolor", "yellow")}); -} - -Node *NodeMap::Create(Node::Type type) { - switch (type) { - case Node::Type::kFunction: - nodes_.emplace_back(new Function); - break; - case Node::Type::kValue: - nodes_.emplace_back(new Value); - break; - case Node::Type::kFunctionBlock: - nodes_.emplace_back(new FunctionBlock); - break; - default: - PADDLE_THROW("Not supported node type."); - } - nodes_.back()->id_ = size() - 1; - return nodes_.back().get(); -} - -Node *NodeMap::GetMutable(size_t id) { - PADDLE_ENFORCE_GT(size(), id); - return nodes_[id].get(); -} - -const Node &NodeMap::Get(size_t id) const { - PADDLE_ENFORCE_GT(size(), id); - return *nodes_[id].get(); -} - -void NodeMap::Delete(size_t id) { - PADDLE_ENFORCE_LT(id, size()); - nodes_[id]->SetDeleted(); -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/node.h b/paddle/fluid/inference/analysis/node.h deleted file mode 100644 index af34156bc2f101465d87cb10e2155745022eb521..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/node.h +++ /dev/null @@ -1,244 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -/* - * This file defines the Node class and its subclasses. A Node is the basis - * analysis element in a computation graph. - * There are basically two kinds of nodes, the function node and value node. - */ -#pragma once - -#include -#include -#include -#include -#include - -#include "paddle/fluid/framework/var_type.h" -#include "paddle/fluid/inference/analysis/device.h" -#include "paddle/fluid/inference/analysis/dot.h" -#include "paddle/fluid/inference/analysis/helper.h" -#include "paddle/fluid/platform/variant.h" - -namespace paddle { -namespace inference { -namespace analysis { - -class NodeMap; - -// A helper class to maintain the status from Pass. -struct AnyAttr { - using any_t = - boost::variant; - // NOTE T should be a primary type or a struct combined by several primary - // types. - // NOTE the STL containers should not use here. - // Some usages - // Attr attr; - // attr.Bool() = true; - bool &Bool() { return As(); } - float &Float() { return As(); } - int32_t &Int32() { return As(); } - int64_t &Int64() { return As(); } - void *&Pointer() { return As(); } - std::string &String() { return As(); } - - template - T &As() { - if (type_index_ == typeid(AnyAttr)) { - type_index_ = typeid(T); - any_data_ = T(); - } else { - PADDLE_ENFORCE(type_index_ == typeid(T), "fetch error type"); - } - return boost::get(any_data_); - } - - private: - any_t any_data_; - std::type_index type_index_{typeid(AnyAttr)}; -}; - -/* - * Node Representation. - * - * This is a very important class for analysis. It is the base class of all - * nodes computed by a program that may be used as operands to other nodes. - * Node is the super class of other important classes such as Function and - * Value, some nodes can have a name. - */ -class Node { - public: - // Node type. NOTE the new node types should add here. - enum class Type { kNone = -1, kFunction, kValue, kFunctionBlock }; - - Node() = default; - - // Cast to a subclass type, Function for example. - template - Subclass &As() { - return *dynamic_cast(this); - } - - // Formatted representation of this Node. - virtual std::string repr() const { - return name() + "(" + std::to_string(id()) + ")"; - } - - // DOT node representation. One Node type can customize its own node - // representation. - virtual std::vector dot_attrs() const { - return std::vector({Dot::Attr("style", "filled")}); - } - - // Get an additional attribute and convert it to T data type. NOTE this will - // silently create a new attribute if not exists. - AnyAttr &attr(const std::string &name) const { return attrs_[name]; } - - int id() const { return id_; } - - // The Protobuf description is set/get with a void* to decouple Node interface - // from a specific kind of Protobuf message. - void SetPbDesc(void *pb) { attr("pb_desc").Pointer() = pb; } - void *pb_desc() const { return attr("pb_desc").Pointer(); } - - void SetPbMsg(const std::string &s) { attr("pb_msg").String() = s; } - const std::string &pb_msg() const { return attr("pb_msg").String(); } - - void SetDeleted() { deleted_ = true; } - bool deleted() const { return deleted_; } - - void SetName(const std::string &name) { name_ = name; } - const std::string &name() const { return name_; } - - void SetType(Type type) { type_ = type; } - Type type() const { return type_; } - - // Input links. - std::vector inlinks; - // Output links. - std::vector outlinks; - - // Type checks. - bool IsFunction() const { return type_ == Node::Type::kFunction; } - bool IsValue() const { return type_ == Node::Type::kValue; } - bool IsFunctionBlock() const { return type_ == Node::Type::kFunctionBlock; } - - virtual ~Node() {} - - friend class NodeMap; - - PADDLE_DISALLOW_COPY_AND_ASSIGN(Node); - - protected: - // The id number not the name is a node's unique identifier in the computation - // graph. - int id_{-1}; - std::string name_; - Type type_{Type::kNone}; - // Mark this node is deleted by some pass. - bool deleted_{false}; - mutable std::unordered_map attrs_; -}; - -class Function; -/* - * Value represents a value node, it has some attributes including dims, data - * type and so on. - */ -class Value : public Node { - public: - enum class DataType { kInt32, kInt64, kFloat32, kFloat64 }; - using Dims = std::vector; - - void SetDataType(DataType data_type) { data_type_ = data_type; } - DataType data_type() const { return data_type_; } - - void SetDims(const Dims &dims) { dims_ = dims; } - const Dims &dims() const { return dims_; } - - Device device() const { return device_; } - void SetDevice(Device device) { device_ = device; } - - std::vector dot_attrs() const override; - - PADDLE_DISALLOW_COPY_AND_ASSIGN(Value); - - protected: - Value() { SetType(Node::Type::kValue); } - friend class NodeMap; - - private: - DataType data_type_; - Dims dims_; - Device device_; -}; - -/* - * Function represents any kind of executable concepts that takes several Values - * as input, and outputs several Values. - */ -class Function : public Node { - public: - std::vector dot_attrs() const override; - - // Get the operator's type from Desc. - const std::string &func_type() const { return func_type_; } - // Set the operator's type. - void SetFuncType(const std::string &func_type) { func_type_ = func_type; } - - PADDLE_DISALLOW_COPY_AND_ASSIGN(Function); - - protected: - std::string func_type_; - Function() { SetType(Node::Type::kFunction); } - friend class NodeMap; -}; - -/* - * FunctionBlock is a Node that contains a sub-graph multiple Node. - */ -struct FunctionBlock : public Node { - std::string repr() const override { return "block-" + std::to_string(id()); } - std::vector subgraph; - - protected: - FunctionBlock() { SetType(Node::Type::kFunctionBlock); } - friend class NodeMap; -}; - -class NodeMap { - public: - // Create a new node with type. - Node *Create(Node::Type type); - - // Get a node by its id. - Node *GetMutable(size_t id); - - const Node &Get(size_t id) const; - - void Delete(size_t id); - - const std::vector> &nodes() const { return nodes_; } - - size_t size() const { return nodes_.size(); } - - private: - std::vector> nodes_; - std::unordered_map map_; -}; - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/node_attr_flags.h b/paddle/fluid/inference/analysis/node_attr_flags.h deleted file mode 100644 index a3f70e5419a66969e8fb20152a8a8ace39316f57..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/node_attr_flags.h +++ /dev/null @@ -1,32 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -/* - * This file contains all the flags that declared in Node::Attr. - * - * The Node::Attr is designed to share information between different passes, one - * can get other's attributes in a Node by the flags in this file. - */ -#pragma once -namespace paddle { -namespace inference { -namespace analysis { - -#define DECLARE_NODE_ATTR(flag__) const char ATTR_##flag__[] = #flag__; - -DECLARE_NODE_ATTR(supported_by_tensorrt) // bool - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/node_tester.cc b/paddle/fluid/inference/analysis/node_tester.cc deleted file mode 100644 index 9207c15373fb4264ff0e738e93ae88e1c08b554c..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/node_tester.cc +++ /dev/null @@ -1,55 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/inference/analysis/node.h" - -#include - -namespace paddle { -namespace inference { -namespace analysis { - -TEST(NodeAttr, bool) { - AnyAttr x; - x.Bool() = true; - ASSERT_EQ(x.Bool(), true); -} - -TEST(NodeAttr, int32) { - AnyAttr x; - x.Int32() = 32; - ASSERT_EQ(x.Int32(), 32); -} - -TEST(NodeAttr, string) { - AnyAttr x; - x.String() = "Hello"; - ASSERT_EQ(x.String(), "Hello"); -} - -TEST(Node, Attr) { - // Node is an abstract class, use Value instead for they share the same Attr - // logic. - NodeMap nodes; - auto* node = nodes.Create(Node::Type::kValue); - node->attr("v0").Int32() = 2008; - ASSERT_EQ(node->attr("v0").Int32(), 2008); - - node->attr("str").String() = "hello world"; - ASSERT_EQ(node->attr("str").String(), "hello world"); -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/pass_manager.cc b/paddle/fluid/inference/analysis/pass_manager.cc deleted file mode 100644 index a6ac0ee49f8f408faa7a17bf5ef5d2799a9a6238..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/pass_manager.cc +++ /dev/null @@ -1,47 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/inference/analysis/pass_manager.h" -#include "paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.h" -#include "paddle/fluid/string/pretty_log.h" - -namespace paddle { -namespace inference { -namespace analysis { - -bool PassManager::Initialize(Argument* argument) { - argument_ = argument; - for (auto& pass : data_) { - VLOG(3) << "Initializing pass [" << pass->repr() << "]"; - if (!pass->Initialize(argument)) { - LOG(ERROR) << "Failed to initialize pass [" << pass->repr() << "]"; - return false; - } - } - return true; -} - -void DfgPassManager::RunAll() { - PADDLE_ENFORCE(argument_); - VLOG(3) << "Total " << data_.size() << " Analysys passes"; - for (auto& pass : data_) { - string::PrettyLogEndl(string::Style::H1(), "* Running Analysis pass [%s]", - pass->repr()); - pass->Run(argument_->main_dfg.get()); - } -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/pass_manager.h b/paddle/fluid/inference/analysis/pass_manager.h deleted file mode 100644 index 412747c4fcce73303703f586f7a04edf4cc5ee76..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/pass_manager.h +++ /dev/null @@ -1,94 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -/* - * This file defines the logic of pass management. The analysis for inference is - * a pipeline of Passes, a PassManager is a agency that helps to manage the - * executation of the Passes. - * - * There are two modes of Passes, the first one is called NodePass and takes - * an Node as input and output; the second one is called DFGPass and takes a - * DFG(Data Flow Graph) as input and output. It is hard to put all the passes in - * the same pipeline, there are two kinds of PassManagers, both takes a DFG as - * input and output a DFG, but the Passes inside are different: - * - * 1. NodePassManager: the passes inside are all NodePasses, it can have - * different graph trivial algorithm, for example, DFS_NodePassManager will - * trigger the passes in depth first order; - * 2. DfgPassManager: the passes inside are all DfgPasses. - */ - -#pragma once - -#include -#include "paddle/fluid/framework/program_desc.h" -#include "paddle/fluid/inference/analysis/analysis_pass.h" - -namespace paddle { -namespace inference { -namespace analysis { - -/* - * PassManager is the base class for all pass managers, a pass manager has - * several Pass-es registered, and execute them in the linear order. - */ -class PassManager : public OrderedRegistry { - public: - PassManager() = default; - // Call all the passes' Initialize methods. The desc and data_flow_graph are - // globally shared, so pass them as the arguemnts for all the pass managers. - virtual bool Initialize(const Argument& argument) { return false; } - - virtual bool Initialize(Argument* argument); - - // Call all the passes' Finalize methods. - virtual bool Finalize() { - for (auto& pass : data_) { - if (!pass->Finalize()) { - LOG(ERROR) << "Failed to finalize pass [" << pass->repr() << "]"; - return false; - } - } - return true; - } - - // Run all the passes. - virtual void RunAll() = 0; - - // Short identifier. - virtual std::string repr() const = 0; - // Long description. - virtual std::string description() const = 0; - - virtual ~PassManager() = default; - - protected: - Argument* argument_{nullptr}; -}; - -/* - * A pass manager that process a DFG. - */ -class DfgPassManager : public PassManager { - public: - DfgPassManager() = default; - - void RunAll() override; - - virtual ~DfgPassManager() = default; -}; - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/pass_manager_tester.cc b/paddle/fluid/inference/analysis/pass_manager_tester.cc deleted file mode 100644 index 72b0fbf7e571ec97a0ea093d01449c1d5ddb9b91..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/pass_manager_tester.cc +++ /dev/null @@ -1,54 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include - -#include "paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass.h" -#include "paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.h" -#include "paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.h" -#include "paddle/fluid/inference/analysis/pass_manager.h" -#include "paddle/fluid/inference/analysis/ut_helper.h" - -namespace paddle { -namespace inference { -namespace analysis { - -class TestDfgPassManager final : public DfgPassManager { - public: - TestDfgPassManager() = default; - virtual ~TestDfgPassManager() = default; - // Short identifier. - std::string repr() const override { return "test-pass-manager"; } - // Long description. - std::string description() const override { return "test doc"; } -}; - -TEST(PassManager, DFG_pass_manager) { - TestDfgPassManager manager; - DFG_GraphvizDrawPass::Config config("./", "dfg.dot"); - - manager.Register("fluid-to-flow-graph", new FluidToDataFlowGraphPass); - manager.Register("graphviz", new DFG_GraphvizDrawPass(config)); - manager.Register("dfg-to-fluid", new DataFlowGraphToFluidPass); - - Argument argument(FLAGS_inference_model_dir); - - ASSERT_TRUE(&argument); - ASSERT_TRUE(manager.Initialize(&argument)); - manager.RunAll(); -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/passes/CMakeLists.txt b/paddle/fluid/inference/analysis/passes/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..9d74dc6c211e4fcb6d1e7de5369eee847f49fc78 --- /dev/null +++ b/paddle/fluid/inference/analysis/passes/CMakeLists.txt @@ -0,0 +1,18 @@ +cc_library(ir_graph_build_pass SRCS ir_graph_build_pass.cc DEPS analysis_pass argument ir_pass_manager) +cc_library(ir_analysis_pass SRCS ir_analysis_pass.cc DEPS analysis_pass argument ir_pass_manager) +cc_library(memory_optim_pass SRCS memory_optimize_pass.cc DEPS analysis_pass zero_copy_tensor) +cc_library(ir_params_sync_among_devices_pass SRCS ir_params_sync_among_devices_pass.cc DEPS analysis_pass argument ir_pass_manager) +cc_library(ir_graph_to_program_pass SRCS ir_graph_to_program_pass.cc DEPS analysis_pass graph_to_program_pass) + +cc_library(analysis_passes SRCS passes.cc DEPS + ir_graph_build_pass + ir_analysis_pass + ir_params_sync_among_devices_pass + memory_optim_pass + ir_graph_to_program_pass +) + +set(analysis_deps ${analysis_deps} + analysis_passes + subgraph_detector + CACHE INTERNAL "") diff --git a/paddle/fluid/inference/analysis/passes/ir_analysis_pass.cc b/paddle/fluid/inference/analysis/passes/ir_analysis_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..d986811a827b6ed477b30bc43d26f52a71e8f178 --- /dev/null +++ b/paddle/fluid/inference/analysis/passes/ir_analysis_pass.cc @@ -0,0 +1,53 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/analysis/passes/ir_analysis_pass.h" +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/inference/analysis/ir_pass_manager.h" + +namespace paddle { +namespace inference { +namespace analysis { + +void IrAnalysisPass::RunImpl(Argument* argument) { + ARGUMENT_CHECK_FIELD(argument, ir_analysis_passes); + ARGUMENT_CHECK_FIELD(argument, main_program); + ARGUMENT_CHECK_FIELD(argument, scope); + + auto* the_graph = argument->ReleaseMainGraph(); + auto graph = std::unique_ptr(the_graph); + + // Apply passes. + IRPassManager the_ir_manager(argument); + graph = the_ir_manager.Apply(std::move(graph)); + PADDLE_ENFORCE_GT(graph->Nodes().size(), 0); + argument->SetMainGraph(graph.release()); + CollectFusionStatis(argument); +} + +void IrAnalysisPass::CollectFusionStatis(Argument* argument) { + if (!argument->main_graph().Has(framework::ir::kFuseStatisAttr)) { + LOG(INFO) << "argument has no fuse statis"; + return; + } + argument->SetFusionStatis( + argument->main_graph().Get( + framework::ir::kFuseStatisAttr)); +} + +std::string IrAnalysisPass::repr() const { return "ir-analysis-pass"; } + +} // namespace analysis +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/analysis/passes/ir_analysis_pass.h b/paddle/fluid/inference/analysis/passes/ir_analysis_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..2c2113c06d917b5473e68935889e4a7b16b6cfc1 --- /dev/null +++ b/paddle/fluid/inference/analysis/passes/ir_analysis_pass.h @@ -0,0 +1,40 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include "paddle/fluid/inference/analysis/analysis_pass.h" + +namespace paddle { +namespace inference { +namespace analysis { + +/* + * Perform IR analysis passes. + * + * It is used to fuse some + */ +class IrAnalysisPass : public AnalysisPass { + public: + void RunImpl(Argument* argument) override; + + void CollectFusionStatis(Argument* argument); + + std::string repr() const override; +}; + +} // namespace analysis +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/analysis/passes/ir_graph_build_pass.cc b/paddle/fluid/inference/analysis/passes/ir_graph_build_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..c6e923c00484f01f17550ae2926dabcadc0c3ac6 --- /dev/null +++ b/paddle/fluid/inference/analysis/passes/ir_graph_build_pass.cc @@ -0,0 +1,86 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/analysis/passes/ir_graph_build_pass.h" +#include +#include +#include "paddle/fluid/framework/executor.h" +#include "paddle/fluid/inference/io.h" +#include "paddle/fluid/platform/enforce.h" + +namespace paddle { +namespace inference { + +extern void ReadBinaryFile(const std::string &filename, std::string *contents); + +namespace analysis { + +void IrGraphBuildPass::RunImpl(Argument *argument) { + if (!argument->scope_valid()) { + argument->SetScope(new framework::Scope); + } + PADDLE_ENFORCE(argument->use_gpu_valid()); + + // The load program should run on the same device with the inference program, + // so that the parameters will on the same device, or they will keep copying + // between difference devices. + platform::Place place; + place = platform::CPUPlace(); + + if (argument->model_dir_valid()) { + auto program = + LoadModel(argument->model_dir(), argument->scope_ptr(), place); + argument->SetMainProgram(program.release()); + } else if (argument->model_program_path_valid() && + argument->model_params_path_valid()) { + auto program = LoadModel( + argument->model_program_path(), argument->model_params_path(), + argument->scope_ptr(), place, + argument->model_from_memory_valid() && argument->model_from_memory()); + argument->SetMainProgram(program.release()); + } else { + PADDLE_THROW( + "either model_dir or (program path and parameter path) should be set."); + } + + auto graph = std::unique_ptr(new Graph(argument->main_program())); + argument->SetMainGraph(graph.release()); + argument->main_graph().Set(framework::ir::kParamScopeAttr, + new framework::Scope *(argument->scope_ptr())); +} + +std::unique_ptr IrGraphBuildPass::LoadModel( + const std::string &path, framework::Scope *scope, + const platform::Place &place) { + framework::Executor exe(place); + return Load(&exe, scope, path); +} + +std::unique_ptr IrGraphBuildPass::LoadModel( + const std::string &program_path, const std::string ¶ms_path, + framework::Scope *scope, const platform::Place &place, + bool model_from_memory) { + framework::Executor exe(place); + if (!model_from_memory) { + return Load(&exe, scope, program_path, params_path); + } else { + return LoadFromMemory(&exe, scope, program_path, params_path); + } +} + +std::string IrGraphBuildPass::repr() const { return "ir-graph-build-pass"; } + +} // namespace analysis +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/analysis/passes/ir_graph_build_pass.h b/paddle/fluid/inference/analysis/passes/ir_graph_build_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..adbde0433fad28b006b18b47c8fd0a8946d21a98 --- /dev/null +++ b/paddle/fluid/inference/analysis/passes/ir_graph_build_pass.h @@ -0,0 +1,49 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/inference/analysis/analysis_pass.h" +#include "paddle/fluid/platform/place.h" + +namespace paddle { +namespace inference { +namespace analysis { + +/* + * Load program and parameter to memory from the disk or directly from memory. + */ +class IrGraphBuildPass : public AnalysisPass { + public: + void RunImpl(Argument *argument) override; + + std::string repr() const override; + + private: + std::unique_ptr LoadModel( + const std::string &path, framework::Scope *scope, + const platform::Place &place); + std::unique_ptr LoadModel( + const std::string &program_path, const std::string ¶ms_path, + framework::Scope *scope, const platform::Place &place, + bool model_from_memory); + + std::string model_binary_str_; +}; + +} // namespace analysis +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/analysis/passes/ir_graph_to_program_pass.cc b/paddle/fluid/inference/analysis/passes/ir_graph_to_program_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..35df396fe89eb23317b8f086c668396fdb3a4559 --- /dev/null +++ b/paddle/fluid/inference/analysis/passes/ir_graph_to_program_pass.cc @@ -0,0 +1,49 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/analysis/passes/ir_graph_to_program_pass.h" +#include +#include "paddle/fluid/framework/ir/graph_to_program_pass.h" +#include "paddle/fluid/framework/ir/pass.h" +#include "paddle/fluid/framework/program_desc.h" + +namespace paddle { +namespace inference { +namespace analysis { + +void IrGraphToProgramPass::RunImpl(Argument *argument) { + auto pass = + framework::ir::PassRegistry::Instance().Get("graph_to_program_pass"); + + if (argument->memory_optim_sort_kind_valid()) { + pass->Set(framework::ir::kGraphToProgramSortKind, + new int(argument->memory_optim_sort_kind())); + } + + std::unique_ptr graph(argument->main_graph_ptr()); + + // Direct using ProgramDesc desc(argument->main_program()) may cause + // incomplete copies of information. + framework::ProgramDesc desc; + desc.CopyFrom(*argument->main_program().Proto()); + pass->SetNotOwned("program", &desc); + pass->Apply(graph.release()); // the argument still own the graph. + + argument->SetIrAnalyzedProgram( + new framework::proto::ProgramDesc(*desc.Proto())); +} + +} // namespace analysis +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/analysis/passes/ir_graph_to_program_pass.h b/paddle/fluid/inference/analysis/passes/ir_graph_to_program_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..838ebdbc9d71eb3a73882e4c4c3e4bf6078150e4 --- /dev/null +++ b/paddle/fluid/inference/analysis/passes/ir_graph_to_program_pass.h @@ -0,0 +1,32 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/inference/analysis/analysis_pass.h" + +namespace paddle { +namespace inference { +namespace analysis { + +class IrGraphToProgramPass : public AnalysisPass { + public: + void RunImpl(Argument *argument) override; + + std::string repr() const override { return "ir-graph-to-param-pass"; } +}; + +} // namespace analysis +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/analysis/passes/ir_params_sync_among_devices_pass.cc b/paddle/fluid/inference/analysis/passes/ir_params_sync_among_devices_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..1f27e80cf49f49863cf000d71369512242afb7b4 --- /dev/null +++ b/paddle/fluid/inference/analysis/passes/ir_params_sync_among_devices_pass.cc @@ -0,0 +1,86 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/analysis/passes/ir_params_sync_among_devices_pass.h" +#include "paddle/fluid/framework/data_layout.h" +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/tensor_util.h" +#include "paddle/fluid/platform/enforce.h" + +namespace paddle { +namespace inference { +namespace analysis { + +void IrParamsSyncAmongDevicesPass::RunImpl(Argument *argument) { + PADDLE_ENFORCE(argument->scope_valid()); + PADDLE_ENFORCE(argument->use_gpu_valid()); + + platform::Place place; + + // The parameters are on the cpu, therefore, synchronization is not necessary. + if (!argument->use_gpu()) return; + + auto &graph = argument->main_graph(); + std::vector repetitive_params; + + if (graph.Has(framework::ir::kRepetitiveParamAttr)) + repetitive_params = graph.Get>( + framework::ir::kRepetitiveParamAttr); + + LOG(INFO) << "Sync params from CPU to GPU"; + + PADDLE_ENFORCE(argument->gpu_device_id_valid()); + place = platform::CUDAPlace(argument->gpu_device_id()); + + auto *scope = argument->scope_ptr(); + std::vector all_vars = scope->LocalVarNames(); + + // We get all the vars from local_scope instead of the ProgramDesc. + // Because there exists the case that new parameter variables are not added to + // the program in the analysis pass. + for (auto &var_name : all_vars) { + if (std::count(repetitive_params.begin(), repetitive_params.end(), + var_name)) { + scope->EraseVars({var_name}); + continue; + } + auto *var = scope->FindLocalVar(var_name); + PADDLE_ENFORCE(var != nullptr); + if (var->IsType() || + var->IsType()) { + auto *t = var->GetMutable(); + + platform::CPUPlace cpu_place; + framework::LoDTensor temp_tensor; + temp_tensor.Resize(t->dims()); + temp_tensor.mutable_data(cpu_place); + + // Copy the parameter data to a tmp tensor. + TensorCopySync(*t, cpu_place, &temp_tensor); + // Reallocation the space on GPU + t->mutable_data(place); + + // Copy parameter data to newly allocated GPU space. + TensorCopySync(temp_tensor, place, t); + } + } +} + +std::string IrParamsSyncAmongDevicesPass::repr() const { + return "ir-params-sync-among-devices-pass"; +} + +} // namespace analysis +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/analysis/passes/ir_params_sync_among_devices_pass.h b/paddle/fluid/inference/analysis/passes/ir_params_sync_among_devices_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..61990150a30db147418c4301359428cf3c6db541 --- /dev/null +++ b/paddle/fluid/inference/analysis/passes/ir_params_sync_among_devices_pass.h @@ -0,0 +1,40 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include + +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/inference/analysis/analysis_pass.h" +#include "paddle/fluid/platform/place.h" + +namespace paddle { +namespace inference { +namespace analysis { + +/* + * Sync parameter from CPU to GPU. + */ +class IrParamsSyncAmongDevicesPass : public AnalysisPass { + public: + void RunImpl(Argument *argument) override; + std::string repr() const override; +}; + +} // namespace analysis +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/analysis/passes/memory_optimize_pass.cc b/paddle/fluid/inference/analysis/passes/memory_optimize_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..4b0a9d9b1c48fcb0d5e44ec1b977c817f3c70b2e --- /dev/null +++ b/paddle/fluid/inference/analysis/passes/memory_optimize_pass.cc @@ -0,0 +1,698 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/analysis/passes/memory_optimize_pass.h" +#include +#include +#include +#include +#include +#include +#include +#include +#include "paddle/fluid/framework/ir/graph_helper.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" +#include "paddle/fluid/framework/ir/graph_to_program_pass.h" +#include "paddle/fluid/framework/ir/graph_traits.h" +#include "paddle/fluid/inference/analysis/helper.h" +#include "paddle/fluid/inference/api/helper.h" +#include "paddle/fluid/string/pretty_log.h" + +namespace paddle { +namespace inference { +namespace analysis { + +using framework::ir::Graph; +using framework::ir::Node; +using framework::ir::TopologyVarientSort; +using space_table_t = MemoryOptimizePass::space_table_t; + +// Collect the lifecycles of the tensors. +// Traverse the graph in topological order. +// The traversal order also affect the lifecycles, so different sort_kind is +// used. +void MemoryOptimizePass::CollectLifeCycle( + std::unordered_map* lifecycles, + int sort_kind) const { + max_lifecycle_ = 0; + for (auto* op_node : framework::ir::TopologyVarientSort( + *graph_, static_cast(sort_kind))) { + if (!op_node->IsOp()) continue; + auto reads = op_node->inputs; + auto writes = op_node->outputs; + + std::vector requires(reads.begin(), reads.end()); + requires.insert(requires.end(), writes.begin(), writes.end()); + + // Disable reuse of feed variables. + if (op_node->Name() == "feed") { + for (auto* node : op_node->outputs) { + auto var = node->Name(); + lifecycles->emplace(var, + std::make_pair(0, std::numeric_limits::max())); + } + } else { + // Normal operators. + for (const Node* node : requires) { + if (node->Var()->Persistable()) continue; + std::string var = node->Name(); + if (!lifecycles->count(var)) { + (*lifecycles)[var] = std::make_pair(max_lifecycle_, max_lifecycle_); + } else { + (*lifecycles)[var].second = + std::max(max_lifecycle_, lifecycles->at(var).second); // max() + } + } + } + + ++max_lifecycle_; + } +} + +// TODO(Superjomn) Make this a general help method. +int DataTypeToSpace(framework::proto::VarType_Type type) { + switch (type) { + case framework::proto::VarType_Type_BOOL: + return sizeof(bool); + case framework::proto::VarType_Type_FP32: + return sizeof(float); + case framework::proto::VarType_Type_INT32: + return sizeof(int32_t); + case framework::proto::VarType_Type_INT64: + return sizeof(int64_t); + default: + PADDLE_THROW("Unknown data type"); + } +} + +// Collect the memory size of the tensors. +void MemoryOptimizePass::CollectVarMemorySize( + const std::unordered_map& batch_var_ave_dim, + std::unordered_map* tensor_nodes, + space_table_t* space_table) const { + // Collect tensors from graph. + for (auto* node : graph_->Nodes()) { + if (node->IsVar() && + node->Var()->GetType() == + framework::proto::VarType::Type::VarType_Type_LOD_TENSOR) { + // Parameters will not be reused. + if (node->Var()->Persistable()) continue; + (*tensor_nodes)[node->Name()] = node; + (*space_table)[node->Name()] = + DataTypeToSpace(node->Var()->GetDataType()) * + batch_var_ave_dim.at(node->Name()); + } + } +} + +// Find a sutable (big enough but smallest to avoid memory waste). +// +// Args: +// @tensor_nodes: the tensor nodes in the ir::Graph. +// @free_existing_tensors: the allocated tensor and are free. +// @space_table: the memory space of tensors. +// @tensor2use: the tensor that requires memory. +// +// Returns: +// true if found some existing tensor to reuse. +// false if no sutable tensor to reuse, one need to allocate a new tensor for +// this requirement. +// The suitable tensor for reuse is one that is approximately equal to the +// memory demand. +bool FindSuitableTensorToReuse( + const std::string& tensor, int space_required, + const std::unordered_map& tensor_nodes, + std::unordered_set* free_existing_tensors, + const space_table_t& space_table, + const std::vector>& var_clusters, + std::string* tensor2use) __SHOULD_USE_RESULT__; + +bool FindSuitableTensorToReuse( + const std::string& tensor, int space_required, + const std::unordered_map& tensor_nodes, + std::unordered_set* free_existing_tensors, + const space_table_t& space_table, + const std::vector>& var_clusters, + std::string* tensor2use) { + std::pair best_fit; + best_fit.second = std::numeric_limits::max(); + VLOG(5) << "Split Tensors to " << var_clusters.size() << " clusters"; + + // find the cluster this var belongs to. + const std::unordered_set* cluster = nullptr; + for (const auto& c : var_clusters) { + if (c.count(tensor)) { + cluster = &c; + break; + } + } + PADDLE_ENFORCE_NOT_NULL(cluster, + "something wrong in memory optimization, the " + "variable %s not in the clusters.", + tensor); + + for (auto& candidate : *free_existing_tensors) { + // This is not a temporary tensor. + if (!space_table.count(candidate)) continue; + // Not in the same cluster. + if (!cluster->count(candidate)) continue; + + size_t space = space_table.at(candidate); + PADDLE_ENFORCE( + space <= std::numeric_limits::type>::max(), + "space overload"); + size_t space_diff = + std::abs((std::make_signed::type)space - space_required); + if (space_diff < best_fit.second) { + best_fit.first = candidate; + best_fit.second = space_diff; + } + } + + if (best_fit.second < std::numeric_limits::max()) { + *tensor2use = best_fit.first; + return true; + } + return false; +} + +// Allocate new tensor instead of reusing the existing one. +void AllocateNewTensor( + const std::string& name, size_t space_required, + const std::unordered_map& tensor_nodes, + std::unordered_set* free_existing_tensors, + space_table_t* space_table, + std::unordered_map* reuse_table) { + // The newly born tensor is free to be used. + free_existing_tensors->insert(name); + // Register the space it has. + PADDLE_ENFORCE(space_table->count(name)); + space_table->at(name) = std::max(space_table->at(name), space_required); + // The allocated new tensor use the memory of itself. + (*reuse_table)[name] = name; +} + +// Free a tensor and make it resuable. +// @tensor: the tensor to free. +// @free_existing_tensors: the free and allocated tensors. +// @reuse_table: a map from a fake tensor to the existing allocated tensor. +void FreeATensor(const std::string& tensor, + std::unordered_set* free_existing_tensors, + std::unordered_map* reuse_table) { + if (tensor == "feed" || tensor == "fetch") return; + // the really allocated tensor. + const auto& free_tensor = reuse_table->at(tensor); + + free_existing_tensors->insert(free_tensor); +} + +// Reuse a free existing tensor. +void ReuseATensor(const std::string& tensor, const std::string& tensor2reuse, + size_t memory_size, + std::unordered_set* free_existing_tensors, + std::unordered_map* reuse_table, + space_table_t* reused_space_table) { + auto it = free_existing_tensors->find(tensor2reuse); + PADDLE_ENFORCE(it != free_existing_tensors->end()); + free_existing_tensors->erase(it); + (*reuse_table)[tensor] = tensor2reuse; + // Update the memory size of a reused tensor, the memory will grow if the + // required memory is larger. + (*reused_space_table)[tensor2reuse] = + std::max(reused_space_table->at(tensor2reuse), memory_size); +} + +// Calculate the memory usage. +void EvaluateMemoryUsage( + const std::unordered_map& reuse_table, + const space_table_t& space_table, + const std::unordered_map& var_batch_ave_size, + size_t* allocated, size_t* saved) { + *allocated = 0; + *saved = 0; + + for (auto elem : reuse_table) { + if (elem.first == elem.second) { + *allocated += space_table.at(elem.first); + VLOG(4) << elem.first << " <-> " << elem.second << " " + << space_table.at(elem.first) << " " + << space_table.at(elem.second); + } else { + *saved += space_table.at(elem.first); + VLOG(4) << "reuse " << elem.first << " -> " << elem.second; + } + } + VLOG(4) << "allocated " << *allocated; + VLOG(4) << "saved " << *saved; +} + +// Return saved ratio. +void MemoryOptimizePass::MakeReusePlan( + const std::vector>& var_clusters, + const std::unordered_map& var_batch_ave_size, + const space_table_t& space_table, + std::unordered_map* reuse_table, int sort_kind, + MemoryAllocation* memory_allocation) const { + // Clear the existing plan. + reuse_table->clear(); + + // The `space_table` stores the real memory size for each tensor. + // The `reused_space_table` stores the maximum memory size required by a + // tensor during the memory reusing, the small tensor might be reused by a + // larger tensor, and the memory size of the small one will grow. + auto reused_space_table = space_table; + + std::unordered_map life_cycles; + std::unordered_map tensor_nodes; + // The allocated tensors whose memory can be reused, they will live across the + // program execution. + std::unordered_set existing_tensors; + // The existing tensor that has been allocated, and is also free to reuse. + std::unordered_set free_existing_tensors; + + CollectLifeCycle(&life_cycles, sort_kind); + + for (int age = 0; age < max_lifecycle_; ++age) { + std::unordered_set born_tensors; + std::unordered_set dead_tensors; + // Gather the dead and born tensors. + for (auto elem_it = life_cycles.begin(); elem_it != life_cycles.end(); + elem_it++) { + if (elem_it->second.first == -1) { + continue; + } + const auto& tensor = elem_it->first; + const auto& lifecycle = elem_it->second; + VLOG(4) << "process " << tensor << " reuse " << lifecycle.first << "->" + << lifecycle.second; + + // Collect newly born tensors. + if (lifecycle.first == age) { + born_tensors.insert(tensor); + } + // Collect dead tensors whose memory can be reused. + else if (lifecycle.second < age) { // NOLINT + dead_tensors.insert(tensor); + // remove to avoid duplicate process. + elem_it->second.first = -1; // avoid duplicate search + } + } + + // Reuse the dead tensors for born tensors + for (const auto& tensor : born_tensors) { + // Skip the feed and fetch tensor for that they share data with others. + std::string tensor2reuse; + if (!space_table.count(tensor)) continue; + size_t space_required = space_table.at(tensor); + if (FindSuitableTensorToReuse(tensor, space_required, tensor_nodes, + &free_existing_tensors, reused_space_table, + var_clusters, &tensor2reuse)) { + if (tensor != tensor2reuse) { + VLOG(4) << tensor << " -> " << tensor2reuse; + } + ReuseATensor(tensor, tensor2reuse, space_required, + &free_existing_tensors, reuse_table, &reused_space_table); + } else { + VLOG(4) << "allocate " << tensor; + AllocateNewTensor(tensor, space_required, tensor_nodes, + &free_existing_tensors, &reused_space_table, + reuse_table); + ReuseATensor(tensor, tensor, space_required, &free_existing_tensors, + reuse_table, &reused_space_table); + } + } + + for (const auto& tensor : dead_tensors) { + // free its memory. + FreeATensor(tensor, &free_existing_tensors, reuse_table); + } + } + + EvaluateMemoryUsage(*reuse_table, reused_space_table, var_batch_ave_size, + &(memory_allocation->allocated), + &(memory_allocation->saved)); + memory_allocation->sort_kind = sort_kind; +} + +void BuildVarNodeTable(Graph* graph, + std::unordered_map* var_node_table) { + for (auto* node : graph->Nodes()) { + if (node->IsVar()) { + (*var_node_table)[node->Name()] = node; + } + } +} + +// NOTE The optimized opdesc doesn't match ir::Graph. +void UpdateOpDescsByReuse( + Graph* graph, + const std::unordered_map& reuse_table, + int sort_kind) { + // TODO(Superjomn) change here to be compatible with the runtime order. + for (auto* node : TopologyVarientSort( + *graph, static_cast(sort_kind))) { + if (node->IsOp()) { + // Replace the original inputs/outputs with the reused tensors. + std::unordered_map> in_args, + out_args; + for (auto argument : node->Op()->Inputs()) { + for (const auto& x : argument.second) { + auto name = x; + if (reuse_table.count(x) && reuse_table.at(x) != x) { + name = reuse_table.at(x); + } + in_args[argument.first].push_back(name); + VLOG(4) << node->Name() << " input " << x << " -> " << name; + } + } + + for (auto argument : node->Op()->Outputs()) { + for (const auto& x : argument.second) { + auto name = x; + if (reuse_table.count(x) && reuse_table.at(x) != x) { + name = reuse_table.at(x); + } + out_args[argument.first].push_back(name); + VLOG(4) << node->Name() << " output " << x << " -> " << name; + } + } + + // Update arguments. + for (auto& arg : in_args) { + node->Op()->SetInput(arg.first, arg.second); + } + for (auto& arg : out_args) { + node->Op()->SetOutput(arg.first, arg.second); + } + node->Op()->Flush(); + } + } +} + +void MemoryOptimizePass::PerformReusePlan( + const std::unordered_map& reuse_table, + int sort_kind, std::unordered_set* vars2remove) const { + std::unordered_map var_node_table; + BuildVarNodeTable(graph_, &var_node_table); + UpdateOpDescsByReuse(graph_, reuse_table, sort_kind); + + for (auto& item : reuse_table) { + if (item.first != item.second) { + vars2remove->insert(item.first); + } + } + VLOG(2) << "to remove vars " << vars2remove->size(); +} + +std::vector split(const std::string& line, char delim) { + std::vector res; + std::string field; + std::stringstream line_stream(line); + while (std::getline(line_stream, field, delim)) { + res.emplace_back(field); + } + return res; +} + +// Deserialize the batch var shapes from the cache file. +std::vector>> DeseralizeBatchVarShapes( + const std::string& path) { + std::ifstream file(path); + PADDLE_ENFORCE(file.is_open(), "failed to open %s to read cache", path); + std::string line; + std::vector>> batch_shapes; + + while (std::getline(file, line)) { + std::map> batch; + for (const auto& var_info : split(line, ';')) { + auto fields = split(var_info, ':'); + PADDLE_ENFORCE_EQ(fields.size(), 2UL); + auto var_name = fields.front(); + auto shape_str = split(fields[1], ','); + std::vector shape; + for (const auto& v : shape_str) shape.push_back(std::stoi(v)); + batch[var_name] = shape; + } + batch_shapes.push_back(batch); + } + return batch_shapes; +} + +// Replace the -1 in shape to a real number to fake the shape. +std::vector>> FakeBatchVarShapes( + const framework::ProgramDesc& program) { + std::vector>> res; + res.emplace_back(); + auto& record = res.front(); + const int fake_batch_size = 3; + for (auto* var : program.Block(0).AllVars()) { + if (var->GetType() == + framework::proto::VarType::Type::VarType_Type_LOD_TENSOR) { + auto shape = var->GetShape(); + for (auto& v : shape) { + if (v < 0) v = fake_batch_size; + } + record[var->Name()].assign(shape.begin(), shape.end()); + } + } + return res; +} + +// Calculate the average dim of each tensor from the batch shape cache. +std::unordered_map GetBatchAverageSize( + const std::vector>>& batches) { + std::unordered_map var2size; + // The average size of the batches for each variable. + int num_batch = 0; + for (const auto& batch : batches) { + num_batch++; + for (const auto& item : batch) { + int dim = std::accumulate(item.second.begin(), item.second.end(), 1, + [](int a, int b) { return a * b; }); + var2size[item.first] += dim; + } + } + + for (auto& item : var2size) { + item.second /= num_batch; + } + + return var2size; +} + +// Analysis the batch shapes loading from the cache file. +// By splitting the variables to different clusters by analyzing their batch +// size, we can pre-schedule the changes of difference LoDTensor when different +// length of input sequences is entered. +// This should works fine for the models operating on sentences. +std::vector> AnalysisBatchShapesByBatchSize( + const std::vector>>& batches) { + // collect the batch size of each shape and combine to a stringstream in + // converient to generate a hash. + std::unordered_map var_batchsize_hashes; + for (auto& batch : batches) { + for (auto& ele : batch) { + PADDLE_ENFORCE(!ele.second.empty()); + int batch_size = ele.second.front(); + // TODO(Superjomn) might consume large memory here, use combine hash. + var_batchsize_hashes[ele.first] << batch_size; + } + } + + // Split to sets by batch size sequences. + std::unordered_map> + shape_sets; + for (auto& ele : var_batchsize_hashes) { + auto hash = std::hash()(ele.second.str()); + shape_sets[hash].insert(ele.first); + } + std::vector> res; + for (auto& ele : shape_sets) { + res.emplace_back(std::move(ele.second)); + } + + VLOG(3) << "Cluster by batch_size and get " << res.size() << " clusters"; + return res; +} + +// Analysis the batch shapes loading from the cache file, and split them to +// different clusters by their size. +// This should works fine for the overall models. +std::vector> AnalysisBatchShapesBySimilarSize( + const space_table_t& space_table, + const std::vector>>& batches, + int interval = 200000) { + PADDLE_ENFORCE_GT(interval, 0); + // cluster to different clusters. + size_t max_size = 0; + for (auto& item : space_table) { + max_size = std::max(item.second, max_size); + } + VLOG(4) << "tensor max size " << max_size; + + std::vector> res; + + // cluster by intervals. + for (size_t interval_size = 0; interval_size <= max_size; + interval_size += interval) { + std::unordered_set cluster; + for (auto& item : space_table) { + if (interval_size <= item.second && + interval_size + interval > item.second) { + cluster.insert(item.first); + } + } + if (!cluster.empty()) { + res.push_back(cluster); + } + } + + VLOG(3) << "Cluster by interval and get " << res.size() << " cluster"; + return res; +} + +std::string MemoryOptimizePass::repr() const { return "memory optimize pass"; } + +std::pair GetRange( + const std::unordered_map& ave_size) { + auto res = std::make_pair(std::numeric_limits::max(), + std::numeric_limits::min()); + for (auto& item : ave_size) { + res.first = std::min(item.second, res.first); + res.second = std::max(item.second, res.second); + } + return res; +} + +void MemoryOptimizePass::RunImpl(Argument* argument) { + // When force update, should not optimize memory. + if (!argument->enable_memory_optim() || + argument->static_memory_optim_force_update()) + return; + graph_ = argument->main_graph_ptr(); + + auto path = GetMemoryCachePath( + argument->model_dir_valid() ? argument->model_dir() : "", + argument->model_program_path_valid() ? argument->model_program_path() + : ""); + VLOG(3) << "Load memory cache from " << path; + std::vector>> batches; + + if (argument->static_memory_optim() && inference::IsFileExists(path)) { + string::PrettyLogInfo("--- Performing static memory optimize"); + batches = DeseralizeBatchVarShapes(path); + } else { + string::PrettyLogInfo("--- Performing dynamic memory optimize"); + batches = FakeBatchVarShapes(argument->main_program()); + } + auto var_batch_ave_size = GetBatchAverageSize(batches); + + // Get min and max memory size. + const auto range = GetRange(var_batch_ave_size); + const int cluster_size = std::max( + static_cast((range.second - range.first) / 100 /*cluster num*/), + 1024); + const int cluster_size1 = std::max( + static_cast((range.second - range.first) / 1000 /*cluster num*/), + 1024); + + std::unordered_map tensor_nodes; + space_table_t space_table; + CollectVarMemorySize(var_batch_ave_size, &tensor_nodes, &space_table); + + std::unordered_map reuse_table; + double max_saving_ratio = 0.; + + std::vector> strategies; + + for (int sort_kind = 0; sort_kind < 2; sort_kind++) { + if (argument->static_memory_optim()) { + // This strategy only make scene in static memory optimize. + strategies.emplace_back([&, sort_kind] { + auto clustered_vars_by_batch_size = + AnalysisBatchShapesByBatchSize(batches); + MemoryAllocation allocation; + MakeReusePlan(clustered_vars_by_batch_size, var_batch_ave_size, + space_table, &reuse_table, sort_kind, &allocation); + return allocation; + }); + } + + strategies.emplace_back([&, sort_kind] { + auto clustered_vars_by_ave_size = + AnalysisBatchShapesBySimilarSize(space_table, batches, cluster_size); + MemoryAllocation allocation; + MakeReusePlan(clustered_vars_by_ave_size, var_batch_ave_size, space_table, + &reuse_table, sort_kind, &allocation); + return allocation; + }); + + strategies.emplace_back([&, sort_kind] { + auto clustered_vars_by_ave_size = + AnalysisBatchShapesBySimilarSize(space_table, batches, cluster_size1); + MemoryAllocation allocation; + MakeReusePlan(clustered_vars_by_ave_size, var_batch_ave_size, space_table, + &reuse_table, sort_kind, &allocation); + return allocation; + }); + + strategies.emplace_back([&, sort_kind] { + auto clustered_vars_by_ave_size = AnalysisBatchShapesBySimilarSize( + space_table, batches, + std::numeric_limits::max()); // no intervals + MemoryAllocation allocation; + MakeReusePlan(clustered_vars_by_ave_size, var_batch_ave_size, space_table, + &reuse_table, sort_kind, &allocation); + return allocation; + }); + } + + std::function* best_strategy{nullptr}; + + // Try all strategies to get the best result. + for (auto& strategy : strategies) { + auto allocation = strategy(); + string::PrettyLogDetail("--- get strategy saving %f memory for workspace", + allocation.GetSavingRatio()); + if (allocation.GetSavingRatio() > max_saving_ratio) { + max_saving_ratio = allocation.GetSavingRatio(); + best_strategy = &strategy; + } + } + if (!best_strategy) { + LOG(ERROR) << "This model makes poor memory optimize, skip memory optimize"; + return; + } + auto memory_allocation = (*best_strategy)(); + + string::PrettyLogInfo( + "--- Saved %.2f%s memory for workspace(temporary variables)", + memory_allocation.GetSavingRatio() * 100, "%"); + + argument->main_graph().Set(framework::ir::kGraphToProgramVarsToRemove, + new std::unordered_set); + auto& vars2remove = + argument->main_graph().Get>( + framework::ir::kGraphToProgramVarsToRemove); + + PerformReusePlan(reuse_table, memory_allocation.sort_kind, &vars2remove); + argument->SetMemoryOptimSortKind(memory_allocation.sort_kind); +} + +float MemoryOptimizePass::MemoryAllocation::GetSavingRatio() const { + return (saved / 1024.) / (allocated / 1024. + saved / 1024.); +} +} // namespace analysis +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/analysis/passes/memory_optimize_pass.h b/paddle/fluid/inference/analysis/passes/memory_optimize_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..2da565f2ae15a50a207173b10d4c350456086582 --- /dev/null +++ b/paddle/fluid/inference/analysis/passes/memory_optimize_pass.h @@ -0,0 +1,108 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include +#include "paddle/fluid/inference/analysis/analysis_pass.h" +#include "paddle/fluid/platform/port.h" + +namespace paddle { +namespace inference { +namespace analysis { + +/* + * Memory optimization pass for inference with pre-analysis of memory usage + * without GC. + * Different from training, the inference memory reuse strategies doesn't + * include GC for that overhead is too much when batch size equals one. + * + * The inference memory reuse tries to pre-determine the tensor reusing strategy + * without runtime overhead. + * + * To improve the strategy's performance, a warm-up running is introduced: + * - Before officially deploy the inference program, one should warm it up and + * generate some runtime cache, + * - Run the inference program with several batches of data, it will persist + * some runtime information about memory of tensors to disk, we call the + * information the memory reusing cache, + * - With the memory reusing cache, user can deploy the inference to a + * service, before running the model, the inference program will load the + * memory cache, analysis it and generate the best memory reusing strategy, + * and adjust the execution of the network. + * + * With the warm-up and memory reusing cache design, the memory reusing + * algorithm can analysis the real memory consume of the tensors, even with the + * flexible LoDTensor and special shape changing operators such as + * sequence-pooling. + */ +class MemoryOptimizePass : public AnalysisPass { + public: + using space_table_t = std::unordered_map; + using lifecycle_t = std::pair; + + struct MemoryAllocation { + size_t allocated; // allocated memory in byte. + size_t saved; // saved memory in byte. + int sort_kind; // the kind of the corresponding sorting algorithm. + + // Get the memory saving ratio of temporary variables. + float GetSavingRatio() const; + }; + + virtual ~MemoryOptimizePass() = default; + + protected: + void RunImpl(Argument *argument) override; + + private: + void CollectLifeCycle( + std::unordered_map *lifecycles, + int sort_kind) const; + + void CollectVarMemorySize( + const std::unordered_map &batch_var_ave_dim, + std::unordered_map *tensor_nodes, + space_table_t *space_table) const; + + // Returns percentage of saved memory. + void MakeReusePlan( + const std::vector> &var_clusters, + const std::unordered_map &var_batch_ave_size, + const space_table_t &space_table, + std::unordered_map *reuse_table, int sort_kind, + MemoryAllocation *memory_allocation) const; + + void PerformReusePlan( + const std::unordered_map &reuse_table, + int sort_kind, std::unordered_set *vars2remove) const; + + public: + std::string repr() const override; + + private: + mutable framework::ir::Graph *graph_{nullptr}; + mutable int max_lifecycle_{-1}; +}; + +static std::string GetMemoryCachePath(const std::string &model_path, + const std::string &prog_path) { + auto path = model_path.empty() ? prog_path : model_path; + return path + ".memory_cache"; +} + +} // namespace analysis +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/analysis/passes/passes.cc b/paddle/fluid/inference/analysis/passes/passes.cc new file mode 100644 index 0000000000000000000000000000000000000000..161b127d6d5ceb3e8b9c1cf98c69eb0387bfb905 --- /dev/null +++ b/paddle/fluid/inference/analysis/passes/passes.cc @@ -0,0 +1,45 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/analysis/passes/passes.h" +#include "paddle/fluid/inference/analysis/passes/ir_analysis_pass.h" +#include "paddle/fluid/inference/analysis/passes/ir_graph_build_pass.h" +#include "paddle/fluid/inference/analysis/passes/ir_graph_to_program_pass.h" +#include "paddle/fluid/inference/analysis/passes/ir_params_sync_among_devices_pass.h" +#include "paddle/fluid/inference/analysis/passes/memory_optimize_pass.h" + +namespace paddle { +namespace inference { +namespace analysis { + +PassRegistry::PassRegistry() { + // Register manually to avoid the trivial `USE_OP` like macro for easier use + // and link. + passes_.emplace("ir_analysis_pass", + std::unique_ptr(new IrAnalysisPass)); + passes_.emplace("ir_graph_build_pass", + std::unique_ptr(new IrGraphBuildPass)); + passes_.emplace("memory_optimize_pass", + std::unique_ptr(new MemoryOptimizePass)); + passes_.emplace( + "ir_params_sync_among_devices_pass", + std::unique_ptr(new IrParamsSyncAmongDevicesPass)); + passes_.emplace( + "ir_graph_to_program_pass", + std::unique_ptr(new IrGraphToProgramPass)); +} + +} // namespace analysis +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/analysis/passes/passes.h b/paddle/fluid/inference/analysis/passes/passes.h new file mode 100644 index 0000000000000000000000000000000000000000..ea07e0dcbd992c9d10c6662909798ef79a01e3a7 --- /dev/null +++ b/paddle/fluid/inference/analysis/passes/passes.h @@ -0,0 +1,42 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include "paddle/fluid/inference/analysis/analysis_pass.h" + +namespace paddle { +namespace inference { +namespace analysis { + +struct PassRegistry { + PassRegistry(); + + AnalysisPass* Retreive(const std::string& pass_type) { + return passes_[pass_type].get(); + } + + static PassRegistry& Global() { + static auto* x = new PassRegistry; + return *x; + } + + private: + std::unordered_map> passes_; +}; + +} // namespace analysis +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/analysis/subgraph_splitter.cc b/paddle/fluid/inference/analysis/subgraph_splitter.cc deleted file mode 100644 index 526bbbadfe90c3064d7c620cc22e30f7fef99088..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/subgraph_splitter.cc +++ /dev/null @@ -1,354 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/inference/analysis/subgraph_splitter.h" - -namespace paddle { -namespace inference { -namespace analysis { - -const char *SubGraphSplitter::kMarkerAttrName = - "_sub_graph_splitter_inside_sub_graph"; - -std::vector> SubGraphSplitter::operator()() { - MarkNodesInsideSubGraph(); - return ExtractSubGraphs(); -} - -// Mark the output variables inside a subgraph with the func. -inline void MarkOutLinksInSubGraph(const Function *func) { - for (auto *var : func->outlinks) { - var->attr(SubGraphSplitter::kMarkerAttrName).Bool() = true; - } -} - -void SubGraphSplitter::MarkNodesInsideSubGraph() { - for (auto &node : GraphTraits(*graph_).nodes()) { - if (node_inside_subgraph_teller_(&node)) { - node.attr(kMarkerAttrName).Bool() = true; - if (node.type() == Node::Type::kFunction) { - // If a function is inside the sub-graph, mark all the output variables - // to be inside too, so that two marked functions will be inside a same - // sub-graph, lets take a example: A_function->var->B_function, if - // A_function is marked, var should also be marked, so that B_function - // will be in the same sub-graph with A_function if B_function is - // marked. - MarkOutLinksInSubGraph(static_cast(&node)); - } - } - } -} - -const char *kUnionFindParent = "_sub_graph_splitter_union_find_parent_"; - -// Use the Union Find(UF) algorithm to find fully connected sub-graphs, if node -// a's output is node b, that is a and b is in the same sub-graph. The UF -// algorithm will group them to the same cluster. -using node_map_t = std::unordered_map; -// Find the ancestor id of a node. -int UnionFindGetAncestor(const node_map_t &node_map, size_t id) { - int tmp = id; - do { - tmp = node_map.at(tmp)->attr(kUnionFindParent).Int32(); - } while (node_map.at(tmp)->attr(kUnionFindParent).Int32() != tmp); - return tmp; -} -// Make this two node share the same ancestor. -// TODO(Superjom) bad performance, make a balanced tree latter. -void UnionFindCombine(const node_map_t &node_map, size_t a, size_t b) { - int a_ancestor = UnionFindGetAncestor(node_map, a); - int b_ancestor = UnionFindGetAncestor(node_map, b); - node_map.at(b_ancestor)->attr(kUnionFindParent).Int32() = a_ancestor; - node_map.at(a)->attr(kUnionFindParent).Int32() = a_ancestor; - node_map.at(b)->attr(kUnionFindParent).Int32() = a_ancestor; -} - -// This is a simple representation of a graph. -// The BriefNode hold the pointer of the Node. -// This is to avoid changing the original graph -// in the process of trt graph analysis. -struct BriefNode { - explicit BriefNode(Node *n) { node = n; } - Node *node; - std::vector inlinks; - std::vector outlinks; -}; - -// Union two adjacent BriefNode. -// Suppose we have two adjacent nodes src and dst. -// We will perform the following operations: -// 1. add all inputs(except src) of dst to src inlinks. -// 2. add all outputs of dst to src outlinks. -// 3. change all the dst's inputs and outputs -// corresponding inlinks and outlinks to src node. -// 4. delete all dst's inlinks and outlinks. -void UnionContractedNodes(const std::unordered_map &node_map, - int src_id, int dst_id) { - // merge the two adjacent nodes into one node. - BriefNode *src_node = node_map.at(src_id); - BriefNode *dst_node = node_map.at(dst_id); - - std::unordered_set inputs(src_node->inlinks.begin(), - src_node->inlinks.end()); - std::unordered_set outputs; - - for (auto *n : src_node->outlinks) { - if (n != dst_node) outputs.insert(n); - } - - // Add the inlinks and outlinks of dst node to src node. - std::vector dst_in_nodes = dst_node->inlinks; - for (BriefNode *node : dst_in_nodes) { - if (node != src_node) { - inputs.insert(node); - } - } - - std::vector dst_out_nodes = dst_node->outlinks; - for (BriefNode *node : dst_out_nodes) { - outputs.insert(node); - } - -// update the dst and src node's inlinks and outlinks. -#ifdef __clang__ - src_node->inlinks = std::vector(inputs.begin(), inputs.end()); - src_node->outlinks = std::vector(outputs.begin(), outputs.end()); - dst_node->inlinks.clear(); - dst_node->outlinks.clear(); -#else - src_node->inlinks = - std::move(std::vector(inputs.begin(), inputs.end())); - src_node->outlinks = - std::move(std::vector(outputs.begin(), outputs.end())); - dst_node->inlinks.clear(); - dst_node->outlinks.clear(); -#endif - - auto inlink_or_outlink_cleaner = [&](std::vector &nodes) { - for (auto *&n : nodes) { - if (n == src_node || n == dst_node) { - n = src_node; - } - } - }; - // Change all the dst inputs and outputs corresponding inlink and - // outlink to the src node. - for (auto *node : src_node->inlinks) { - inlink_or_outlink_cleaner(node->outlinks); - } - - for (auto *node : src_node->outlinks) { - inlink_or_outlink_cleaner(node->inlinks); - } -} - -// FlexibleDFS -// If reverse is true, do reverse dfs. -// If enter func is not nullptr, calls enter(node) before visiting any children -// of node. -// If leave func not nullptr, calls leave(node) after visiting all parents of -// node. -void FlexibleDFS(const std::vector &source, bool reverse, - const std::function &enter, - const std::function &leave) { - typedef struct { - const BriefNode *node; - bool leave; - } FNode; - - std::vector stack; - for (auto &node : source) { - stack.push_back(FNode{node, false}); - } - std::unordered_set visited; - while (!stack.empty()) { - auto fnode = stack.back(); - stack.pop_back(); - - if (fnode.leave) { - if (leave && !leave(fnode.node)) return; - } - if (visited.count(fnode.node)) continue; - visited.insert(fnode.node); - - if (enter && !enter(fnode.node)) return; - - if (leave) stack.push_back(FNode{fnode.node, true}); - const std::vector iter_nodes = - reverse == true ? fnode.node->inlinks : fnode.node->outlinks; - for (const BriefNode *node : iter_nodes) { - if (!visited.count(node)) { - stack.push_back(FNode{node, false}); - } - } - } -} - -std::vector> SubGraphSplitter::ExtractSubGraphs() { - // Run the Extract algorithm to find all subgraphs. - std::vector marked_nodes; - // We use brief_node_map to represent the original graph in order to avoid - // changing the original graph. - std::unordered_map brief_node_map; - - for (auto &node : GraphTraits(*graph_).nodes_in_TS()) { - brief_node_map[node.id()] = new BriefNode(&node); - if (node.attr(kMarkerAttrName).Bool()) { - marked_nodes.push_back(&node); - } - } - - // extract sub-graphs in the marked node set, use Union Find algorithm. - node_map_t node_map; // id to ptr - for (auto *n : marked_nodes) { - // n's parent == n.id means it is the ancestor - n->attr(kUnionFindParent).Int32() = n->id(); - node_map[n->id()] = n; - } - - // create breif node map - for (auto &itr : brief_node_map) { - for (Node *node : itr.second->node->inlinks) { - itr.second->inlinks.push_back(brief_node_map[node->id()]); - } - - for (Node *node : itr.second->node->outlinks) { - itr.second->outlinks.push_back(brief_node_map[node->id()]); - } - } - - for (auto &itr : brief_node_map) { - BriefNode *brief_node = itr.second; - - if (!brief_node->node->attr(kMarkerAttrName).Bool()) { - VLOG(4) << brief_node->node->id() << " node not a trt candicate."; - continue; - } - - // Our algorithm must guarantee that: - // 1. The graph is always directed acyclic graph(DAG). - // 2. If there is a path in the subgraph from X to Y (X and Y are both - // nodes in the subgraph), then all paths from X to Y are in the - // subgraph. - // - // In order to achieve the above guarantee. - // For adjacent nodes src -> dst. - // 1. Get all dst input nodes except src. - // 2. Reverse DFS from those input nodes - // 3. If there is a path from input nodes to src, - // then the src and dst nodes can not be fused into one node, - // otherwise it can be done. - - while (true) { - std::unordered_set contract_nodes; - for (auto *out : brief_node->outlinks) { - // must be an trt candidate - if (!out->node->attr(kMarkerAttrName).Bool()) continue; - // get all dst input nodes except src. - std::vector source_nodes; - for (auto *n : out->inlinks) { - if (n != brief_node) { - source_nodes.push_back(n); - } - } - - // Reverse DFS from the source_nodes. - bool have_excess_path = false; - FlexibleDFS(source_nodes, true, nullptr, - [&have_excess_path, brief_node](const BriefNode *n) { - if (n == brief_node) { - have_excess_path = true; - return false; - } - return true; - }); - if (have_excess_path) continue; - contract_nodes.insert(out); - } - if (contract_nodes.empty()) break; - - for (auto dst_node : contract_nodes) { - UnionFindCombine(node_map, brief_node->node->id(), - dst_node->node->id()); - UnionContractedNodes(brief_node_map, brief_node->node->id(), - dst_node->node->id()); - } - } - } - - std::unordered_map> clusters; - for (auto *n : marked_nodes) { - if (n->type() == Node::Type::kFunction) { - clusters[UnionFindGetAncestor(node_map, - n->attr(kUnionFindParent).Int32())] - .push_back(n); - } - } - std::vector> result; - std::for_each(clusters.begin(), clusters.end(), - [&](const decltype(clusters)::value_type &it) { - result.push_back(it.second); - }); - - return result; -} - -void SubGraphFuse::operator()() { ReplaceNodesWithSubGraphs(); } - -void SubGraphFuse::ReplaceNodesWithSubGraphs() { - auto subgraphs = SubGraphSplitter(graph_, node_inside_subgraph_teller_)(); - for (auto &subgraph : subgraphs) { - if (subgraph.size() <= argument_->Get("minimum_subgraph_size")) - continue; - std::unordered_set subgraph_uniq(subgraph.begin(), subgraph.end()); - // replace this sub-graph with the first node. Two steps: 1. Create a Block - // Node that contains this subgraph 2. Mark the nodes inside the sub-graph - // as deleted. 3. Replace the deleted node with the new Block Node. - auto *block_node = static_cast( - graph_->nodes.Create(Node::Type::kFunctionBlock)); - auto io = ExtractInputAndOutputOfSubGraph(subgraph); - block_node->inlinks = std::move(io.first); - block_node->outlinks = std::move(io.second); - - for (auto *node : subgraph) { - // TODO(Superjomn) need a unified mechanism to treat deleted node in each - // pass. - node->SetDeleted(); - block_node->subgraph.push_back(node); - } - - // Change all the sub-graph's inputs and outputs corresponding inlink and - // outlink to this sub-graph node. - auto inlink_or_outlink_cleaner = [&](std::vector &nodes) { - for (auto *&n : nodes) { - if (subgraph_uniq.count(n)) { - n = block_node; - } - } - std::unordered_set uniq(nodes.begin(), nodes.end()); - nodes.assign(uniq.begin(), uniq.end()); - }; - for (auto *i : block_node->inlinks) { - inlink_or_outlink_cleaner(i->outlinks); - } - for (auto *&o : block_node->outlinks) { - inlink_or_outlink_cleaner(o->inlinks); - } - } - FilterRedundantOutputOfSubGraph(graph_); -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/subgraph_splitter.h b/paddle/fluid/inference/analysis/subgraph_splitter.h deleted file mode 100644 index 76e4fda0249e03c617d1b37c079dcd97f21387c1..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/subgraph_splitter.h +++ /dev/null @@ -1,88 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -/* - * This file defines the the class to partition a graph. - */ - -#pragma once - -#include - -#include "paddle/fluid/inference/analysis/argument.h" -#include "paddle/fluid/inference/analysis/data_flow_graph.h" -#include "paddle/fluid/inference/analysis/node.h" - -namespace paddle { -namespace inference { -namespace analysis { - -/* - * Detect the nodes in a sub-graph that meet some conditions. This class doesn't - * modify the graph. - */ -class SubGraphSplitter { - public: - static const char *kMarkerAttrName; - // Tell whether a node is inside a sub-graph. - using NodeInsideSubgraphTeller = std::function; - - SubGraphSplitter(DataFlowGraph *graph, const NodeInsideSubgraphTeller &teller) - : graph_(graph), node_inside_subgraph_teller_(teller) {} - - std::vector> operator()(); - - protected: - // Mark the nodes inside the accepted sub-graph using - // node_inside_subgraph_teller. - void MarkNodesInsideSubGraph(); - - // Merge the marked nodes into sub-graphs and return the sub-graphs. - std::vector> ExtractSubGraphs(); - - private: - DataFlowGraph *graph_; - NodeInsideSubgraphTeller node_inside_subgraph_teller_; -}; - -/* - * SubGraphFuse - Replace some nodes with the sub-graph node they are inside. To - * some extent, the TensorRT engine is just a fusion op for a model. - */ -class SubGraphFuse { - public: - using NodeInsideSubgraphTeller = SubGraphSplitter::NodeInsideSubgraphTeller; - - SubGraphFuse(DataFlowGraph *graph, const NodeInsideSubgraphTeller &teller, - Argument *argument) - : graph_(graph), - node_inside_subgraph_teller_(teller), - argument_(argument) {} - - // The main method which run all the logic. - void operator()(); - - protected: - // Remove the nodes inside sub-graphs and replace with the SubGraphNode. - void ReplaceNodesWithSubGraphs(); - - private: - DataFlowGraph *graph_; - NodeInsideSubgraphTeller node_inside_subgraph_teller_; - Argument *argument_; -}; - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/subgraph_splitter_tester.cc b/paddle/fluid/inference/analysis/subgraph_splitter_tester.cc deleted file mode 100644 index e1dc89fab5fb76d456b07c316ab1cabe6de23b26..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/subgraph_splitter_tester.cc +++ /dev/null @@ -1,92 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/inference/analysis/subgraph_splitter.h" -#include "paddle/fluid/inference/analysis/ut_helper.h" - -namespace paddle { -namespace inference { -namespace analysis { - -SubGraphSplitter::NodeInsideSubgraphTeller teller = [](const Node* node) { - if (node->type() != Node::Type::kFunction) return false; - const auto* func = static_cast(node); - if (func->func_type() == "elementwise_add" || func->func_type() == "relu" || - func->func_type() == "conv2d" || func->func_type() == "mul" || - func->func_type() == "sigmoid" || func->func_type() == "softmax") { - LOG(INFO) << "sub-graph marked " << node->repr(); - return true; - } - return false; -}; - -TEST(SubGraphSplitter, Split) { - auto desc = LoadProgramDesc(FLAGS_inference_model_dir + "/__model__"); - auto dfg = ProgramDescToDFG(desc); - LOG(INFO) << "spliter\n" << dfg.DotString(); - - ASSERT_GT(dfg.nodes.size(), 5UL); - - auto subgraphs = SubGraphSplitter(&dfg, teller)(); - - // Check the number of the marked nodes. - int marked_nodes = 0; - for (auto& node : dfg.nodes.nodes()) { - if (node->IsFunction() && - node->attr(SubGraphSplitter::kMarkerAttrName).Bool()) { - ++marked_nodes; - } - } - EXPECT_EQ(marked_nodes, 6); - - // For human debug. - for (auto& subgraph : subgraphs) { - LOG(INFO) << "subgraph size " << subgraph.size(); - for (auto* node : subgraph) { - LOG(INFO) << "node " << node->repr(); - } - } - - ASSERT_EQ(subgraphs.size(), 1UL); - // The last sub-graph has 5 Functions. - ASSERT_EQ(subgraphs.back().size(), 6UL); -} - -TEST(SubGraphSplitter, Fuse) { - auto desc = LoadProgramDesc(FLAGS_inference_model_dir + "/__model__"); - auto dfg = ProgramDescToDFG(desc); - Argument argument; - argument.Set("minimum_subgraph_size", new int(3)); - - size_t count0 = dfg.nodes.size(); - - SubGraphFuse fuse(&dfg, teller, &argument); - fuse(); - - int count1 = 0; - for (auto& node : dfg.nodes.nodes()) { - if (node->deleted()) { - LOG(INFO) << "deleted " << node->repr(); - } - count1 += node->deleted(); - } - - // At least one nodes should be deleted. - ASSERT_EQ(dfg.nodes.size(), count0 + 1); // added a new FunctionBlock - ASSERT_EQ(11, count1); -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/tensorrt_subgraph_node_mark_pass.cc b/paddle/fluid/inference/analysis/tensorrt_subgraph_node_mark_pass.cc deleted file mode 100644 index 174c8513f92cf869419f04cab5a54af65e9673b8..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/tensorrt_subgraph_node_mark_pass.cc +++ /dev/null @@ -1,80 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include - -#include "paddle/fluid/inference/analysis/analyzer.h" -#include "paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.h" -#include "paddle/fluid/inference/analysis/node_attr_flags.h" -#include "paddle/fluid/inference/analysis/tensorrt_subgraph_node_mark_pass.h" - -namespace paddle { -namespace inference { -namespace analysis { - -void TensorRTSubgraphNodeMarkPass::Run(DataFlowGraph *graph) { - for (auto &node : graph->nodes.nodes()) { - node->attr(ATTR_supported_by_tensorrt).Bool() = teller_(node.get()); - } -} - -class DfgDebuggerPass : public DFG_GraphvizDrawPass { - public: - explicit DfgDebuggerPass(const DFG_GraphvizDrawPass::Config &config) - : DFG_GraphvizDrawPass(config) {} - - std::string repr() const override { - return "tensorrt-subgraph-node-mark-debugger"; - } - - bool Finalize() override { return true; } - - protected: - std::string Draw(DataFlowGraph *graph) override { - Dot dot; - // Add nodes - for (size_t i = 0; i < graph->nodes.size(); i++) { - const Node &node = graph->nodes.Get(i); - if (config_.display_deleted_node || !node.deleted()) { - auto dot_attr = node.dot_attrs(); - if (node.attr(ATTR_supported_by_tensorrt).Bool()) { - dot_attr.assign( - {Dot::Attr{"color", "green"}, Dot::Attr{"style", "filled"}}); - } - dot.AddNode(node.repr(), dot_attr); - } - } - // Add edges - for (size_t i = 0; i < graph->nodes.size(); i++) { - const Node &node = graph->nodes.Get(i); - if (!config_.display_deleted_node && node.deleted()) continue; - for (auto &in : node.inlinks) { - if (!config_.display_deleted_node && in->deleted()) continue; - dot.AddEdge(in->repr(), node.repr(), {}); - } - } - return dot.Build(); - } -}; - -AnalysisPass *TensorRTSubgraphNodeMarkPass::CreateGraphvizDebugerPass() const { - DFG_GraphvizDrawPass::Config config(FLAGS_IA_graphviz_log_root, - "tensorrt_marked_node"); - return new DfgDebuggerPass(config); -} -bool TensorRTSubgraphNodeMarkPass::Finalize() { return true; } - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/tensorrt_subgraph_node_mark_pass.h b/paddle/fluid/inference/analysis/tensorrt_subgraph_node_mark_pass.h deleted file mode 100644 index c881a54c240538b68abdcb9060db69de3bf2b8bb..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/tensorrt_subgraph_node_mark_pass.h +++ /dev/null @@ -1,60 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -/* - * This file defines TensorRTSubgraphNodeMarkPass which helps to mark the ops - * that supported by TensorRT engine. - */ - -#pragma once - -#include -#include "paddle/fluid/inference/analysis/analysis_pass.h" -#include "paddle/fluid/inference/analysis/subgraph_splitter.h" - -namespace paddle { -namespace inference { -namespace analysis { - -/* - * Mark the operators that TensorRT engine supports. - */ -class TensorRTSubgraphNodeMarkPass : public DataFlowGraphPass { - public: - using teller_t = SubGraphSplitter::NodeInsideSubgraphTeller; - - explicit TensorRTSubgraphNodeMarkPass(const teller_t& teller) - : teller_(teller) {} - - bool Initialize(Argument* argument) override { return true; } - - // This class get a sub-graph as input and determine whether to transform this - // sub-graph into TensorRT. - void Run(DataFlowGraph* graph) override; - - std::string repr() const override { return "tensorrt-sub-subgraph-mark"; } - std::string description() const override { - return "tensorrt sub-graph mark pass"; - } - - AnalysisPass* CreateGraphvizDebugerPass() const override; - bool Finalize() override; - - private: - teller_t teller_; -}; - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/tensorrt_subgraph_node_mark_pass_tester.cc b/paddle/fluid/inference/analysis/tensorrt_subgraph_node_mark_pass_tester.cc deleted file mode 100644 index c1d932878e559180af987594535959afdf475587..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/tensorrt_subgraph_node_mark_pass_tester.cc +++ /dev/null @@ -1,50 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include "paddle/fluid/inference/analysis/tensorrt_subgraph_node_mark_pass.h" - -#include -#include "paddle/fluid/inference/analysis/node_attr_flags.h" -#include "paddle/fluid/inference/analysis/ut_helper.h" - -namespace paddle { -namespace inference { -namespace analysis { - -TEST(TensorRTSubgraphNodeMarkPass, test) { - // init - FluidToDataFlowGraphPass pass; - Argument argument(FLAGS_inference_model_dir); - ASSERT_TRUE(pass.Initialize(&argument)); - pass.Run(argument.main_dfg.get()); - - TensorRTSubgraphNodeMarkPass::teller_t teller = [](const Node* node) { - return node->IsFunction() && - static_cast(node)->func_type() == "mul"; - }; - TensorRTSubgraphNodeMarkPass pass1(teller); - ASSERT_TRUE(pass1.Initialize(&argument)); - pass1.Run(argument.main_dfg.get()); - - int counter{0}; - for (auto& node : argument.main_dfg->nodes.nodes()) { - counter += node->attr(ATTR_supported_by_tensorrt).Bool(); - } - ASSERT_EQ(counter, 2); - LOG(INFO) << counter << " nodes marked"; -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/tensorrt_subgraph_pass.cc b/paddle/fluid/inference/analysis/tensorrt_subgraph_pass.cc deleted file mode 100644 index cc1746ecb34c983d219693bcec17c8789c38fa9f..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/tensorrt_subgraph_pass.cc +++ /dev/null @@ -1,36 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include "paddle/fluid/inference/analysis/tensorrt_subgraph_pass.h" -#include "paddle/fluid/inference/analysis/subgraph_splitter.h" - -namespace paddle { -namespace inference { -namespace analysis { - -TensorRTSubGraphPass::TensorRTSubGraphPass( - const TensorRTSubGraphPass::NodeInsideSubgraphTeller &teller) - : node_inside_subgraph_teller_(teller) {} - -void TensorRTSubGraphPass::Run(DataFlowGraph *graph) { - SubGraphFuse(graph, node_inside_subgraph_teller_, argument_)(); - VLOG(4) << "debug info " - << graph->HumanReadableInfo(false /*show_values*/, - true /*show_functions*/); -} - -} // namespace analysis -} // namespace inference - -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/tensorrt_subgraph_pass.h b/paddle/fluid/inference/analysis/tensorrt_subgraph_pass.h deleted file mode 100644 index 3545da9109d79964f36c3d7e738620cc2e0f9a6c..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/tensorrt_subgraph_pass.h +++ /dev/null @@ -1,57 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#pragma once - -#include -#include "paddle/fluid/inference/analysis/analysis_pass.h" -#include "paddle/fluid/inference/analysis/node.h" -#include "paddle/fluid/inference/analysis/subgraph_splitter.h" - -namespace paddle { -namespace inference { -namespace analysis { - -/* - * Parse the graph and replace TensorRT supported nodes with SubGraphNode - */ -class TensorRTSubGraphPass : public DataFlowGraphPass { - public: - // Tell whether to transform a sub-graph into TensorRT. - using NodeInsideSubgraphTeller = SubGraphFuse::NodeInsideSubgraphTeller; - - explicit TensorRTSubGraphPass(const NodeInsideSubgraphTeller& teller); - - bool Initialize(Argument* argument) override { - argument_ = argument; - return true; - } - - // This class get a sub-graph as input and determine whether to transform this - // sub-graph into TensorRT. - void Run(DataFlowGraph* graph) override; - - bool Finalize() override { return true; } - - std::string repr() const override { return "tensorrt-sub-graph"; } - std::string description() const override { return "tensorrt sub graph pass"; } - - private: - NodeInsideSubgraphTeller node_inside_subgraph_teller_; - Argument* argument_; -}; - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/tensorrt_subgraph_pass_tester.cc b/paddle/fluid/inference/analysis/tensorrt_subgraph_pass_tester.cc deleted file mode 100644 index 9748e24b06295a4e7c2995429e6588cd0f225fe6..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/tensorrt_subgraph_pass_tester.cc +++ /dev/null @@ -1,73 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/inference/analysis/tensorrt_subgraph_pass.h" - -#include -#include -#include "paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.h" -#include "paddle/fluid/inference/analysis/ut_helper.h" - -namespace paddle { -namespace inference { -namespace analysis { - -DEFINE_string(dot_dir, "./", ""); - -TEST(TensorRTSubGraphPass, main) { - std::unordered_set teller_set( - {"elementwise_add", "mul", "sigmoid"}); - SubGraphSplitter::NodeInsideSubgraphTeller teller = [&](const Node* node) { - if (node->type() != Node::Type::kFunction) return false; - const auto* func = static_cast(node); - if (teller_set.count(func->func_type())) return true; - return false; - }; - - Argument argument(FLAGS_inference_model_dir); - argument.Set("minimum_subgraph_size", new int(0)); - argument.Set("max_batch_size", new int(3)); - argument.Set("workspace_size", new int(1 << 20)); - argument.Set("precision_mode", new std::string("FP32")); - - DFG_GraphvizDrawPass::Config config{FLAGS_dot_dir, "origin"}; - DFG_GraphvizDrawPass::Config config1{FLAGS_dot_dir, "fusion"}; - - DFG_GraphvizDrawPass dfg_pass(config); - DFG_GraphvizDrawPass dfg_pass1(config1); - FluidToDataFlowGraphPass pass0; - TensorRTSubGraphPass trt_pass(std::move(teller)); - - dfg_pass.Initialize(&argument); - dfg_pass1.Initialize(&argument); - pass0.Initialize(&argument); - trt_pass.Initialize(&argument); - - argument.main_dfg.reset(new DataFlowGraph); - pass0.Run(argument.main_dfg.get()); - dfg_pass.Run(argument.main_dfg.get()); - trt_pass.Run(argument.main_dfg.get()); - dfg_pass1.Run(argument.main_dfg.get()); - - // Check the TRT op's block desc - for (auto& node : argument.main_dfg->nodes.nodes()) { - if (node->IsFunctionBlock()) { - LOG(INFO) << "get function block"; - } - } -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/ut_helper.h b/paddle/fluid/inference/analysis/ut_helper.h index 1073a6f686eaeeaaae2d93ab044149b7df518085..d599099a8050eaeabb8e0544b1bfe3b6b46b17ec 100644 --- a/paddle/fluid/inference/analysis/ut_helper.h +++ b/paddle/fluid/inference/analysis/ut_helper.h @@ -18,8 +18,6 @@ limitations under the License. */ #include #include #include "paddle/fluid/framework/executor.h" -#include "paddle/fluid/inference/analysis/data_flow_graph.h" -#include "paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.h" #include "paddle/fluid/inference/analysis/helper.h" namespace paddle { @@ -32,29 +30,6 @@ namespace analysis { DEFINE_string(inference_model_dir, "", "inference test model dir"); -static DataFlowGraph ProgramDescToDFG( - const framework::proto::ProgramDesc& desc) { - DataFlowGraph graph; - FluidToDataFlowGraphPass pass; - Argument argument; - argument.fluid_model_dir.reset(new std::string(FLAGS_inference_model_dir)); - argument.origin_program_desc.reset(new framework::proto::ProgramDesc(desc)); - pass.Initialize(&argument); - pass.Run(&graph); - pass.Finalize(); - return graph; -} - -class DFG_Tester : public ::testing::Test { - protected: - void SetUp() override { - auto desc = LoadProgramDesc(FLAGS_inference_model_dir + "/__model__"); - argument.origin_program_desc.reset(new framework::proto::ProgramDesc(desc)); - } - - Argument argument; -}; - } // namespace analysis } // namespace inference } // namespace paddle diff --git a/paddle/fluid/inference/api/CMakeLists.txt b/paddle/fluid/inference/api/CMakeLists.txt index 0ddd5d53f836131fe37d412fc867cb38f11ee2b5..882bb3468388e794e975d87de73537ac41f17cf7 100644 --- a/paddle/fluid/inference/api/CMakeLists.txt +++ b/paddle/fluid/inference/api/CMakeLists.txt @@ -18,57 +18,52 @@ if(APPLE) endif(APPLE) -set(inference_deps paddle_inference_api paddle_fluid_api analysis pass ir_pass_manager naive_executor ${GLOB_PASS_LIB}) +set(inference_deps ${analysis_deps} + paddle_inference_api paddle_fluid_api + analysis pass naive_executor + ${GLOB_PASS_LIB}) if(WITH_GPU AND TENSORRT_FOUND) - set(inference_deps ${inference_deps} paddle_inference_tensorrt_subgraph_engine analysis_predictor) + set(inference_deps ${inference_deps} tensorrt_engine tensorrt_converter) endif() -function(inference_api_test TARGET_NAME) - if (WITH_TESTING) - set(options "") - set(oneValueArgs SRC) - set(multiValueArgs ARGS) - cmake_parse_arguments(inference_test "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) +if (ANAKIN_FOUND) + set(inference_deps ${inference_deps} anakin_op_converter anakin_engine) +endif() + +add_subdirectory(details) + +if(WITH_MKLDNN) + set(mkldnn_quantizer_src mkldnn_quantizer.cc) + set(mkldnn_quantizer_cfg mkldnn_quantizer_config) + cc_library(${mkldnn_quantizer_cfg} SRCS mkldnn_quantizer_config.cc DEPS lod_tensor paddle_pass_builder) +endif() - cc_test(${TARGET_NAME} - SRCS ${inference_test_SRC} - DEPS "${inference_deps}" - ARGS --dirname=${PYTHON_TESTS_DIR}/book/) - if(inference_test_ARGS) - set_tests_properties(${TARGET_NAME} - PROPERTIES DEPENDS "${inference_test_ARGS}") - endif() - endif(WITH_TESTING) -endfunction(inference_api_test) +cc_library(analysis_config SRCS analysis_config.cc DEPS ${mkldnn_quantizer_cfg} lod_tensor paddle_pass_builder) +cc_library(paddle_pass_builder SRCS paddle_pass_builder.cc) +cc_library(analysis_predictor SRCS analysis_predictor.cc ${mkldnn_quantizer_src} DEPS paddle_inference_api zero_copy_tensor + reset_tensor_array analysis_config paddle_pass_builder ir_pass_manager ${inference_deps}) +cc_library(paddle_inference_api SRCS api.cc api_impl.cc helper.cc DEPS + lod_tensor scope paddle_pass_builder reset_tensor_array analysis_config + paddle_pass_builder zero_copy_tensor + reset_tensor_array) -cc_library(paddle_inference_api SRCS api.cc api_impl.cc helper.cc DEPS lod_tensor scope) -cc_library(analysis_predictor SRCS analysis_predictor.cc DEPS paddle_inference_api analysis naive_executor zero_copy_tensor) -cc_library(zero_copy_tensor SRCS details/zero_copy_tensor.cc DEPS paddle_inference_api) -cc_library(zero_copy_tensor_dummy SRCS details/zero_copy_tensor_dummy.cc DEPS paddle_inference_api) cc_test(test_paddle_inference_api SRCS api_tester.cc DEPS paddle_inference_api) -inference_api_test(test_api_impl SRC api_impl_tester.cc - ARGS test_word2vec test_image_classification) - -set(PYTHON_TESTS_DIR ${PADDLE_BINARY_DIR}/python/paddle/fluid/tests) -cc_test(test_analysis_predictor SRCS analysis_predictor_tester.cc DEPS analysis_predictor ${inference_deps} paddle_inference_api - ARGS --dirname=${PYTHON_TESTS_DIR}/book) - -if(WITH_GPU AND TENSORRT_FOUND) -cc_library(paddle_inference_tensorrt_subgraph_engine - SRCS api_tensorrt_subgraph_engine.cc - DEPS paddle_inference_api analysis tensorrt_engine paddle_inference_api paddle_fluid_api tensorrt_converter zero_copy_tensor_dummy) - -inference_api_test(test_api_tensorrt_subgraph_engine SRC api_tensorrt_subgraph_engine_tester.cc ARGS test_word2vec) +if(WITH_TESTING) + inference_base_test(test_api_impl SRCS api_impl_tester.cc DEPS ${inference_deps} + ARGS --word2vec_dirname=${WORD2VEC_MODEL_DIR} --book_dirname=${PYTHON_TESTS_DIR}/book) + set_tests_properties(test_api_impl PROPERTIES DEPENDS test_image_classification) endif() +cc_test(test_analysis_predictor SRCS analysis_predictor_tester.cc DEPS analysis_predictor benchmark ${inference_deps} + ARGS --dirname=${WORD2VEC_MODEL_DIR}) if (WITH_ANAKIN AND WITH_MKL) # only needed in CI # compile the libinference_anakin_api.a and anakin.so. - cc_library(inference_anakin_api SRCS api.cc api_anakin_engine.cc DEPS anakin_shared anakin_saber mklml scope zero_copy_tensor_dummy) - cc_library(inference_anakin_api_shared SHARED SRCS api.cc api_anakin_engine.cc DEPS anakin_shared anakin_saber scope) + cc_library(inference_anakin_api SRCS api.cc api_anakin_engine.cc DEPS anakin_shared anakin_saber mklml zero_copy_tensor_dummy device_context) + cc_library(inference_anakin_api_shared SHARED SRCS api.cc api_anakin_engine.cc DEPS anakin_shared anakin_saber zero_copy_tensor_dummy device_context) function(anakin_target target_name) target_compile_options(${target_name} BEFORE PUBLIC ${ANAKIN_COMPILE_EXTRA_FLAGS}) endfunction() diff --git a/paddle/fluid/inference/api/README.md b/paddle/fluid/inference/api/README.md index 20969fac6c8f894ffb4a02b48f795e2a0dcbd096..a2d685d723bd9ab2b84969adb86e177a8754328d 100644 --- a/paddle/fluid/inference/api/README.md +++ b/paddle/fluid/inference/api/README.md @@ -2,25 +2,15 @@ Paddle inference offers the APIs in `C` and `C++` languages. -One can easily deploy a model trained by Paddle following the steps as below: +You can easily deploy a model trained by Paddle following the steps as below: 1. Optimize the native model; 2. Write some codes for deployment. +## The APIs -Let's explain the steps in detail. - -## Optimize the native Fluid Model - -The native model that get from the training phase needs to be optimized for that. - -- Clean the noise such as the cost operators that do not need inference; -- Prune unnecessary computation fork that has nothing to do with the output; -- Remove extraneous variables; -- Memory reuse for native Fluid executor; -- Translate the model storage format to some third-party engine's, so that the inference API can utilize the engine for acceleration; - -We have an official tool to do the optimization, call `paddle_inference_optimize --help` for more information. +All the released APIs are located in the `paddle_inference_api.h` header file. +The stable APIs are wrapped by `namespace paddle`, the unstable APIs are protected by `namespace paddle::contrib`. ## Write some codes diff --git a/paddle/fluid/inference/api/analysis_config.cc b/paddle/fluid/inference/api/analysis_config.cc new file mode 100644 index 0000000000000000000000000000000000000000..b54ea269ff250f02b6331807237e10ee65b0b0b4 --- /dev/null +++ b/paddle/fluid/inference/api/analysis_config.cc @@ -0,0 +1,400 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/inference/api/paddle_analysis_config.h" +#include "paddle/fluid/inference/api/paddle_inference_api.h" +#include "paddle/fluid/inference/api/paddle_pass_builder.h" +#include "paddle/fluid/platform/enforce.h" +#include "paddle/fluid/platform/gpu_info.h" + +namespace paddle { +extern const std::vector kAnakinSubgraphPasses; + +PassStrategy *AnalysisConfig::pass_builder() const { + if (!pass_builder_.get()) { + if (use_gpu_) { + LOG(INFO) << "Create GPU IR passes"; + pass_builder_.reset(new GpuPassStrategy); + } else { + LOG(INFO) << "Create CPU IR passes"; + pass_builder_.reset(new CpuPassStrategy); + } + } else if (pass_builder_->use_gpu() ^ use_gpu()) { + LOG(WARNING) << "The use_gpu flag is not compatible between Config and " + "PassBuilder, the flags are " + << use_gpu() << " " << pass_builder_->use_gpu(); + LOG(WARNING) << "Please make them compatible, still use the existing " + "PassBuilder."; + } + + return pass_builder_.get(); +} + +AnalysisConfig::AnalysisConfig(const std::string &model_dir) { + model_dir_ = model_dir; + + Update(); +} +AnalysisConfig::AnalysisConfig(const std::string &prog_file, + const std::string ¶ms_file) { + prog_file_ = prog_file; + params_file_ = params_file; + + Update(); +} +void AnalysisConfig::SetModel(const std::string &prog_file_path, + const std::string ¶ms_file_path) { + prog_file_ = prog_file_path; + params_file_ = params_file_path; + + Update(); +} +void AnalysisConfig::EnableUseGpu(uint64_t memory_pool_init_size_mb, + int device_id) { +#ifdef PADDLE_WITH_CUDA + use_gpu_ = true; + memory_pool_init_size_mb_ = memory_pool_init_size_mb; + device_id_ = device_id; +#else + LOG(ERROR) << "Please compile with gpu to EnableGpu()"; + use_gpu_ = false; +#endif + + Update(); +} +void AnalysisConfig::DisableGpu() { + use_gpu_ = false; + + Update(); +} + +AnalysisConfig::AnalysisConfig(const AnalysisConfig &other) { +#define CP_MEMBER(member__) member__ = other.member__; + + // Model related. + CP_MEMBER(model_dir_); + CP_MEMBER(prog_file_); + CP_MEMBER(params_file_); + CP_MEMBER(model_from_memory_); // the memory model reuses prog_file_ and + // params_file_ fields. + // Gpu related. + CP_MEMBER(use_gpu_); + CP_MEMBER(device_id_); + CP_MEMBER(memory_pool_init_size_mb_); + + CP_MEMBER(enable_memory_optim_); + CP_MEMBER(static_memory_optim_); + CP_MEMBER(static_memory_optim_force_update_); + // TensorRT related. + CP_MEMBER(use_tensorrt_); + CP_MEMBER(tensorrt_workspace_size_); + CP_MEMBER(tensorrt_max_batchsize_); + CP_MEMBER(tensorrt_min_subgraph_size_); + CP_MEMBER(tensorrt_precision_mode_); + CP_MEMBER(trt_use_static_engine_); + // MKLDNN related. + CP_MEMBER(use_mkldnn_); + CP_MEMBER(mkldnn_enabled_op_types_); + // Quantization related. + CP_MEMBER(use_mkldnn_quantizer_); + CP_MEMBER(mkldnn_quantizer_config_); + + CP_MEMBER(use_anakin_); + CP_MEMBER(anakin_max_batchsize_); + CP_MEMBER(anakin_max_input_shape_); + CP_MEMBER(anakin_min_subgraph_size_); + + // Ir related. + CP_MEMBER(enable_ir_optim_); + CP_MEMBER(use_feed_fetch_ops_); + CP_MEMBER(ir_debug_); + CP_MEMBER(specify_input_name_); + + CP_MEMBER(cpu_math_library_num_threads_); + + CP_MEMBER(serialized_info_cache_); + + if (use_gpu_) { + pass_builder_.reset(new GpuPassStrategy( + *static_cast(other.pass_builder()))); + } else { + pass_builder_.reset(new CpuPassStrategy( + *static_cast(other.pass_builder()))); + } + +#undef CP_MEMBER + + Update(); +} + +void AnalysisConfig::EnableMKLDNN() { +#ifdef PADDLE_WITH_MKLDNN + use_mkldnn_ = true; +#else + LOG(ERROR) << "Please compile with MKLDNN first to use MKLDNN"; + use_mkldnn_ = false; +#endif + + Update(); +} + +void AnalysisConfig::EnableMkldnnQuantizer() { +#ifdef PADDLE_WITH_MKLDNN + if (!mkldnn_quantizer_config_) + mkldnn_quantizer_config_.reset(new MkldnnQuantizerConfig()); + use_mkldnn_quantizer_ = true; +#else + LOG(ERROR) << "Please compile with MKLDNN first to use MkldnnQuantizer"; + use_mkldnn_quantizer_ = false; +#endif + + Update(); +} + +std::shared_ptr AnalysisConfig::mkldnn_quantizer_config() + const { + PADDLE_ENFORCE_NOT_NULL(mkldnn_quantizer_config_, + "MkldnnQuantizer was not enabled yet."); + return mkldnn_quantizer_config_; +} + +void AnalysisConfig::EnableTensorRtEngine( + int workspace_size, int max_batch_size, int min_subgraph_size, + AnalysisConfig::Precision precision_mode, bool use_static) { +#ifdef PADDLE_WITH_CUDA + if (!use_gpu()) { + LOG(ERROR) << "To use TensorRT engine, please call EnableGpu() first"; + return; + } + + use_tensorrt_ = true; + tensorrt_workspace_size_ = workspace_size; + tensorrt_max_batchsize_ = max_batch_size; + tensorrt_min_subgraph_size_ = min_subgraph_size; + tensorrt_precision_mode_ = precision_mode; + trt_use_static_engine_ = use_static; + + Update(); +#else + LOG(ERROR) + << "To use TensorRT engine, please compile inference lib with GPU first."; +#endif +} + +// TODO(Superjomn) refactor this, buggy. +void AnalysisConfig::Update() { + auto info = SerializeInfoCache(); + if (info == serialized_info_cache_) return; + + // Transfer pass_builder and copy the existing compatible passes. + if (!pass_builder_ || ((use_gpu() ^ pass_builder_->use_gpu()))) { + if (use_gpu()) { + pass_builder_.reset(new GpuPassStrategy); + + if (use_tensorrt_) { + // Append after the Affine_channel_conv_fuse pass. + pass_builder()->InsertPass(3, "tensorrt_subgraph_pass"); + } + } else { + pass_builder_.reset(new CpuPassStrategy); + } + + } else { + if (use_gpu()) { + pass_builder_.reset(new GpuPassStrategy( + *static_cast(pass_builder_.get()))); + + } else { + pass_builder_.reset(new CpuPassStrategy( + *static_cast(pass_builder_.get()))); + } + } + + if (use_tensorrt_) { + const auto &passes = pass_builder_->AllPasses(); + if (std::find(passes.begin(), passes.end(), "tensorrt_subgraph_pass") == + std::end(passes)) { + // Append after the Affine_channel_conv_fuse pass. + pass_builder()->InsertPass(3, "tensorrt_subgraph_pass"); + } + pass_builder()->DeletePass("runtime_context_cache_pass"); + pass_builder()->DeletePass("expected_kernel_cache_pass"); + } + + if (use_mkldnn_) { +#ifdef PADDLE_WITH_MKLDNN + if (!enable_ir_optim_) { + LOG(ERROR) + << "EnableMKLDNN() only works when IR optimization is enabled."; + } else { + pass_builder()->EnableMKLDNN(); + } +#endif + } + + // Quantization passes must come after all other optimization passes + if (use_mkldnn_quantizer_) { + if (!enable_ir_optim_) { + LOG(ERROR) << "EnableMkldnnQuantizer() only works when IR optimization " + "is enabled."; + } +#ifdef PADDLE_WITH_MKLDNN + pass_builder()->EnableMkldnnQuantizer(); +#endif + } + +#ifdef PADDLE_WITH_MKLDNN + // Do not optimize before quantization + if (enable_memory_optim_ && !use_mkldnn_quantizer_) { +#else + if (enable_memory_optim_) { +#endif + pass_builder()->AppendAnalysisPass("memory_optimize_pass"); + } + + if (use_anakin_) { + PADDLE_ENFORCE(!use_tensorrt_, + "Anakin sub-graph and TensorRT sub-graph are not allowed to " + "run at the same time!"); + PADDLE_ENFORCE( + use_gpu_, + "Anakin sub-graph engine need gpu, please use the EnableGpu API."); + + pass_builder()->ClearPasses(); + for (const auto &pass : kAnakinSubgraphPasses) { + pass_builder()->AppendPass(pass); + } + } + + if (ir_debug_) { + pass_builder()->TurnOnDebug(); + } +} + +std::string AnalysisConfig::SerializeInfoCache() { + std::stringstream ss; + ss << model_dir_; + ss << prog_file_; + ss << params_file_; + + ss << use_gpu_; + ss << device_id_; + ss << memory_pool_init_size_mb_; + + ss << use_tensorrt_; + ss << tensorrt_workspace_size_; + ss << tensorrt_max_batchsize_; + ss << tensorrt_min_subgraph_size_; + + ss << enable_memory_optim_; + ss << static_memory_optim_; + ss << static_memory_optim_force_update_; + + ss << use_mkldnn_; + for (auto &item : mkldnn_enabled_op_types_) ss << item; + ss << ";"; + + ss << use_mkldnn_quantizer_; + ss << model_from_memory_; + + ss << enable_ir_optim_; + ss << use_feed_fetch_ops_; + ss << ir_debug_; + + ss << specify_input_name_; + ss << cpu_math_library_num_threads_; + ss << use_anakin_; + ss << anakin_min_subgraph_size_; + return ss.str(); +} + +void AnalysisConfig::SetCpuMathLibraryNumThreads( + int cpu_math_library_num_threads) { + cpu_math_library_num_threads_ = cpu_math_library_num_threads; + + Update(); +} + +float AnalysisConfig::fraction_of_gpu_memory_for_pool() const { +#ifdef PADDLE_WITH_CUDA + // Get the GPU memory details and calculate the fraction of memory for the + // GPU memory pool. + size_t gpu_used, gpu_available; + platform::GpuMemoryUsage(&gpu_used, &gpu_available); + double total_gpu_memory = (gpu_used + gpu_available) / 1024. / 1024.; + float fraction_of_gpu_memory = + static_cast(memory_pool_init_size_mb()) / total_gpu_memory; + return fraction_of_gpu_memory; +#else + return 0.; +#endif +} + +void AnalysisConfig::EnableMemoryOptim(bool static_optim, + bool force_update_static_cache) { + enable_memory_optim_ = true; + static_memory_optim_ = static_optim; + static_memory_optim_force_update_ = force_update_static_cache; + + Update(); +} + +bool AnalysisConfig::enable_memory_optim() const { + return enable_memory_optim_; +} + +void AnalysisConfig::SetModelBuffer(const char *prog_buffer, + size_t prog_buffer_size, + const char *param_buffer, + size_t param_buffer_size) { + prog_file_ = std::string(prog_buffer, prog_buffer + prog_buffer_size); + params_file_ = std::string(param_buffer, param_buffer + param_buffer_size); + model_from_memory_ = true; + + Update(); +} + +void AnalysisConfig::SetEngineOptInfo( + std::map engine_opt_info) { + engine_opt_info_ = engine_opt_info; +} + +NativeConfig AnalysisConfig::ToNativeConfig() const { + NativeConfig config; + config.model_dir = model_dir_; + config.prog_file = prog_file_; + config.param_file = params_file_; + config.use_gpu = use_gpu_; + config.device = device_id_; + config.fraction_of_gpu_memory = fraction_of_gpu_memory_for_pool(); + config.specify_input_name = specify_input_name_; + return config; +} + +void AnalysisConfig::SwitchIrDebug(int x) { + ir_debug_ = x; + Update(); +} +void AnalysisConfig::EnableAnakinEngine( + int max_batch_size, std::map> max_input_shape, + int min_subgraph_size) { + anakin_max_batchsize_ = max_batch_size; + anakin_max_input_shape_ = max_input_shape; + anakin_min_subgraph_size_ = min_subgraph_size; + use_anakin_ = true; + Update(); +} +} // namespace paddle diff --git a/paddle/fluid/inference/api/analysis_predictor.cc b/paddle/fluid/inference/api/analysis_predictor.cc index 3095dee0f0106b2408663cd32bb4fb310111eda4..6942604b0723f8665f0e8b058d48a5356a1a01f4 100644 --- a/paddle/fluid/inference/api/analysis_predictor.cc +++ b/paddle/fluid/inference/api/analysis_predictor.cc @@ -13,90 +13,192 @@ // limitations under the License. #include "paddle/fluid/inference/api/analysis_predictor.h" +#include +#include +#include #include #include +#include #include #include "paddle/fluid/framework/feed_fetch_method.h" +#include "paddle/fluid/framework/feed_fetch_type.h" #include "paddle/fluid/framework/ir/fuse_pass_base.h" #include "paddle/fluid/framework/ir/pass.h" #include "paddle/fluid/framework/naive_executor.h" #include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/framework/var_type_traits.h" +#include "paddle/fluid/inference/analysis/helper.h" +#include "paddle/fluid/inference/analysis/passes/memory_optimize_pass.h" #include "paddle/fluid/inference/api/helper.h" #include "paddle/fluid/inference/api/paddle_inference_api.h" #include "paddle/fluid/inference/api/paddle_inference_pass.h" #include "paddle/fluid/inference/utils/singleton.h" +#include "paddle/fluid/memory/memcpy.h" #include "paddle/fluid/platform/cpu_helper.h" +#include "paddle/fluid/platform/gpu_info.h" +#include "paddle/fluid/platform/place.h" #include "paddle/fluid/platform/profiler.h" +#ifdef PADDLE_WITH_MKLDNN +#include "paddle/fluid/inference/api/mkldnn_quantizer.h" +#endif + +#if PADDLE_WITH_TENSORRT +#include "paddle/fluid/inference/tensorrt/convert/op_converter.h" +#include "paddle/fluid/inference/tensorrt/trt_int8_calibrator.h" +#endif + +#if PADDLE_WITH_ANAKIN +#include "paddle/fluid/inference/anakin/convert/op_converter.h" +#endif + DECLARE_bool(profile); -DECLARE_int32(paddle_num_threads); namespace paddle { -using contrib::AnalysisConfig; +using inference::Singleton; +#if PADDLE_WITH_TENSORRT +using inference::tensorrt::TRTInt8Calibrator; +using inference::tensorrt::TRTCalibratorEngine; +using inference::tensorrt::TRTCalibratorEngineManager; +#endif + +namespace { +bool IsPersistable(const framework::VarDesc *var) { + if (var->Persistable() && + var->GetType() != framework::proto::VarType::FEED_MINIBATCH && + var->GetType() != framework::proto::VarType::FETCH_LIST && + var->GetType() != framework::proto::VarType::RAW) { + return true; + } + return false; +} +} // namespace bool AnalysisPredictor::Init( const std::shared_ptr &parent_scope, const std::shared_ptr &program) { VLOG(3) << "Predictor::init()"; -#if !defined(_WIN32) if (FLAGS_profile) { LOG(WARNING) << "Profiler is actived, might affect the performance"; LOG(INFO) << "You can turn off by set gflags '-profile false'"; - auto tracking_device = config_.use_gpu ? platform::ProfilerState::kAll - : platform::ProfilerState::kCPU; + auto tracking_device = config_.use_gpu() ? platform::ProfilerState::kAll + : platform::ProfilerState::kCPU; platform::EnableProfiler(tracking_device); } -#endif // no matter with or without MKLDNN - paddle::platform::SetNumThreads(FLAGS_paddle_num_threads); + paddle::platform::SetNumThreads(config_.cpu_math_library_num_threads()); - if (config_.use_gpu) { - place_ = paddle::platform::CUDAPlace(config_.device); - LOG(WARNING) << "ir optimize only supports CPU currently, enable_ir_optim " - "is turned false."; - config_.enable_ir_optim = false; - } else { - place_ = paddle::platform::CPUPlace(); + if (!PrepareScope(parent_scope)) { + return false; + } + if (!CreateExecutor()) { + return false; + } + if (!PrepareProgram(program)) { + return false; + } + + // Prepare executor, create local variables. + if (!PrepareExecutor()) { + return true; } + + // Get the feed_target_names and fetch_target_names + PrepareFeedFetch(); + + return true; +} + +bool AnalysisPredictor::PrepareScope( + const std::shared_ptr &parent_scope) { if (parent_scope) { + PADDLE_ENFORCE_NOT_NULL( + parent_scope, + "Both program and parent_scope should be set in Clone mode."); scope_ = parent_scope; - sub_scope_ = &(parent_scope->NewScope()); + status_is_cloned_ = true; } else { paddle::framework::InitDevices(false); scope_.reset(new paddle::framework::Scope()); + status_is_cloned_ = false; } - - executor_.reset(new paddle::framework::NaiveExecutor(place_)); - + sub_scope_ = &scope_->NewScope(); + return true; +} +bool AnalysisPredictor::PrepareProgram( + const std::shared_ptr &program) { if (!program) { if (!LoadProgramDesc()) return false; - OptimizeInferenceProgram(); + + // If not cloned, the parameters should be loaded. + // If config_.ir_optim() is True, parameters is loaded in + // OptimizeInferenceProgram(), but other persistable variables + // (like RAW type var) are not created in scope. + // If config_.ir_optim() is False, parameters is loaded in LoadParameters(), + // still need to create other persistable variables. + // So in both case, create persistable variables at first. + executor_->CreateVariables(*inference_program_, 0, true, sub_scope_); + + // Optimize the program, and load parameters and modify them in the + // scope_. + // This will change the scope_ address. + if (config_.ir_optim()) { + status_ir_optim_enabled_ = true; + OptimizeInferenceProgram(); + } else { + // Load parameters + LOG(INFO) << "load parameters "; + LoadParameters(); + } } else { + // If the program is passed from external, no need to optimize it, this + // logic is used in the clone scenario. inference_program_ = program; } - if (config_._use_mkldnn) { - executor_->EnableMKLDNN(*inference_program_); + executor_->CreateVariables(*inference_program_, 0, false, sub_scope_); + + return true; +} +bool AnalysisPredictor::CreateExecutor() { + if (config_.use_gpu_) { + status_use_gpu_ = true; + place_ = paddle::platform::CUDAPlace(config_.device_id_); + } else { + place_ = paddle::platform::CPUPlace(); } + executor_.reset(new paddle::framework::NaiveExecutor(place_)); + return true; +} +bool AnalysisPredictor::PrepareExecutor() { + executor_->Prepare(sub_scope_, *inference_program_, 0, + config_.use_feed_fetch_ops_); - executor_->Prepare(scope_.get(), *inference_program_, 0, - config_.use_feed_fetch_ops); + PADDLE_ENFORCE_NOT_NULL(sub_scope_); - // Get the feed_target_names and fetch_target_names - PrepareFeedFetch(); return true; } +void AnalysisPredictor::SetMkldnnThreadID(int tid) { +#ifdef PADDLE_WITH_MKLDNN + platform::set_cur_thread_id(tid); +#else + LOG(ERROR) << "Please compile with MKLDNN first to use MKLDNN"; +#endif +} + bool AnalysisPredictor::Run(const std::vector &inputs, std::vector *output_data, int batch_size) { + if (UNLIKELY(config_.cpu_math_library_num_threads() > 1)) { + paddle::platform::SetNumThreads(config_.cpu_math_library_num_threads()); + } VLOG(3) << "Predictor::predict"; inference::Timer timer; timer.tic(); // set feed variable - std::vector feeds; framework::Scope *scope = sub_scope_ ? sub_scope_ : scope_.get(); if (!SetFeed(inputs, scope)) { LOG(ERROR) << "fail to set feed"; @@ -112,7 +214,21 @@ bool AnalysisPredictor::Run(const std::vector &inputs, LOG(ERROR) << "fail to get fetches"; return false; } + + // Collect variable shapes for memory optimization. + if (need_collect_var_shapes_for_memory_optim()) { + CollectVarShapes(); + } + VLOG(3) << "predict cost: " << timer.toc() << "ms"; + + // All the containers in the scope will be hold in inference, but the + // operators assume that the container will be reset after each batch. + // Here is a bugfix, collect all the container variables, and reset then to a + // bool; the next time, the operator will call MutableData and construct a new + // container again, so that the container will be empty for each batch. + tensor_array_batch_cleaner_.CollectNoTensorVars(sub_scope_); + tensor_array_batch_cleaner_.ResetNoTensorVars(); return true; } @@ -133,17 +249,34 @@ bool AnalysisPredictor::SetFeed(const std::vector &inputs, framework::DDim ddim = framework::make_ddim(inputs[i].shape); void *input_ptr; if (inputs[i].dtype == PaddleDType::INT64) { - input_ptr = input.mutable_data(ddim, platform::CPUPlace()); + input_ptr = input.mutable_data(ddim, place_); } else if (inputs[i].dtype == PaddleDType::FLOAT32) { - input_ptr = input.mutable_data(ddim, platform::CPUPlace()); + input_ptr = input.mutable_data(ddim, place_); + } else if (inputs[i].dtype == PaddleDType::INT32) { + input_ptr = input.mutable_data(ddim, place_); } else { LOG(ERROR) << "unsupported feed type " << inputs[i].dtype; return false; } - // TODO(panyx0718): Init LoDTensor from existing memcpy to save a copy. - std::memcpy(static_cast(input_ptr), inputs[i].data.data(), - inputs[i].data.length()); + if (platform::is_cpu_place(place_)) { + // TODO(panyx0718): Init LoDTensor from existing memcpy to save a copy. + std::memcpy(static_cast(input_ptr), inputs[i].data.data(), + inputs[i].data.length()); + } else { +#ifdef PADDLE_WITH_CUDA + platform::DeviceContextPool &pool = + platform::DeviceContextPool::Instance(); + auto *dev_ctx = + static_cast(pool.Get(place_)); + auto dst_gpu_place = boost::get(place_); + memory::Copy(dst_gpu_place, static_cast(input_ptr), + platform::CPUPlace(), inputs[i].data.data(), + inputs[i].data.length(), dev_ctx->stream()); +#else + PADDLE_THROW("Not compile with CUDA, should not reach here."); +#endif + } // TODO(Superjomn) Low performance, need optimization for heavy LoD copy. framework::LoD lod; for (auto &level : inputs[i].lod) { @@ -151,8 +284,13 @@ bool AnalysisPredictor::SetFeed(const std::vector &inputs, } input.set_lod(lod); int idx = -1; - if (config_.specify_input_name) { - idx = feed_names_[inputs[i].name]; + if (config_.specify_input_name_) { + auto name = inputs[i].name; + if (feed_names_.find(name) == feed_names_.end()) { + LOG(ERROR) << "feed names from program do not have name: [" << name + << "] from specified input"; + } + idx = feed_names_[name]; } else { idx = boost::get(feeds_[i]->GetAttr("col")); } @@ -184,61 +322,110 @@ void AnalysisPredictor::GetFetchOne(const framework::LoDTensor &fetch, bool AnalysisPredictor::GetFetch(std::vector *outputs, framework::Scope *scope) { VLOG(3) << "Predictor::get_fetch"; - outputs->resize(fetchs_.size()); - for (size_t i = 0; i < fetchs_.size(); ++i) { - int idx = boost::get(fetchs_[i]->GetAttr("col")); + outputs->resize(fetches_.size()); + for (size_t i = 0; i < fetches_.size(); ++i) { + int idx = boost::get(fetches_[i]->GetAttr("col")); PADDLE_ENFORCE((size_t)idx == i); framework::LoDTensor &fetch = framework::GetFetchVariable(*scope, "fetch", idx); auto type = fetch.type(); auto output = &(outputs->at(i)); - if (type == typeid(float)) { + output->name = fetches_[idx]->Input("X")[0]; + if (type == framework::proto::VarType::FP32) { GetFetchOne(fetch, output); output->dtype = PaddleDType::FLOAT32; - } else if (type == typeid(int64_t)) { + } else if (type == framework::proto::VarType::INT64) { GetFetchOne(fetch, output); output->dtype = PaddleDType::INT64; + } else if (type == framework::proto::VarType::INT32) { + GetFetchOne(fetch, output); + output->dtype = PaddleDType::INT32; } else { - LOG(ERROR) << "unknown type, only support float32 and int64 now."; + LOG(ERROR) << "unknown type, only support float32, int64 and int32 now."; } } return true; } -void AnalysisPredictor::OptimizeInferenceProgram() { - LOG(INFO) << "optimize begin"; - FLAGS_IA_enable_ir = config_.enable_ir_optim; - FLAGS_IA_enable_tensorrt_subgraph_engine = false; - FLAGS_IA_output_storage_path = ""; // Don't output the model. +void AnalysisPredictor::PrepareArgument() { + argument_.SetUseGPU(config_.use_gpu()); + argument_.SetGPUDeviceId(config_.gpu_device_id()); + argument_.SetEnableMemoryOptim(config_.enable_memory_optim()); + argument_.SetStaticMemoryOptim(config_.static_memory_optim_); + argument_.SetStaticMemoryOptimForceUpdate( + config_.static_memory_optim_force_update_); + argument_.SetModelFromMemory(config_.model_from_memory_); + argument_.SetEngineOptInfo(config_.engine_opt_info_); // Analyze inference_program - if (!config_.model_dir.empty()) { - argument_.fluid_model_dir.reset(new std::string(config_.model_dir)); + argument_.SetUseAnakin(config_.anakin_engine_enabled()); + argument_.SetPredictorID(predictor_id_); + if (!config_.model_dir().empty()) { + argument_.SetModelDir(config_.model_dir()); } else { PADDLE_ENFORCE( - !config_.param_file.empty(), + !config_.params_file().empty(), "Either model_dir or (param_file, prog_file) should be set."); - PADDLE_ENFORCE(!config_.prog_file.empty()); - argument_.fluid_model_program_path.reset( - new std::string(config_.prog_file)); - argument_.fluid_model_param_path.reset(new std::string(config_.param_file)); + PADDLE_ENFORCE(!config_.prog_file().empty()); + std::string dir = inference::analysis::GetDirRoot(config_.prog_file()); + + argument_.SetModelProgramPath(config_.prog_file()); + argument_.SetModelParamsPath(config_.params_file()); + } + + if (config_.use_gpu() && config_.tensorrt_engine_enabled()) { + LOG(INFO) << "TensorRT subgraph engine is enabled"; + argument_.SetUseTensorRT(true); + argument_.SetTensorRtWorkspaceSize(config_.tensorrt_workspace_size_); + argument_.SetTensorRtMaxBatchSize(config_.tensorrt_max_batchsize_); + argument_.SetTensorRtMinSubgraphSize(config_.tensorrt_min_subgraph_size_); + argument_.SetTensorRtPrecisionMode(config_.tensorrt_precision_mode_); + argument_.SetTensorRtUseStaticEngine(config_.trt_use_static_engine_); } - argument_.origin_program_desc.reset( - new ProgramDesc(*inference_program_->Proto())); - PADDLE_ENFORCE( - config_.ir_mode == contrib::AnalysisConfig::IrPassMode::kExclude, - "Only kExclude is supported yet."); - Analyzer().DisableIrPasses(config_.ir_passes).Run(&argument_); + if (config_.use_gpu() && config_.anakin_engine_enabled()) { + argument_.SetAnakinMaxBatchSize(config_.anakin_max_batchsize_); + argument_.SetAnakinMaxInputShape(config_.anakin_max_input_shape_); + argument_.SetAnakinMinSubgraphSize(config_.anakin_min_subgraph_size_); + LOG(INFO) << "Anakin subgraph engine is enabled"; + } + + if (config_.use_mkldnn_) { + LOG(INFO) << "MKLDNN is enabled"; + argument_.SetMKLDNNEnabledOpTypes(config_.mkldnn_enabled_op_types_); + } - CHECK(argument_.transformed_program_desc); +#ifdef PADDLE_WITH_MKLDNN + if (config_.mkldnn_quantizer_enabled()) { + LOG(INFO) << "Quantization is enabled"; + argument_.SetQuantizeEnabledOpTypes( + config_.mkldnn_quantizer_config()->enabled_op_types()); + argument_.SetQuantizeExcludedOpIds( + config_.mkldnn_quantizer_config()->excluded_op_ids()); + } +#endif + + auto passes = config_.pass_builder()->AllPasses(); + if (!config_.ir_optim()) { + passes.clear(); + LOG(INFO) << "ir_optim is turned off, no IR pass will be executed"; + } + argument_.SetIrAnalysisPasses(passes); + argument_.SetAnalysisPasses(config_.pass_builder()->AnalysisPasses()); + argument_.SetScopeNotOwned(scope_.get()); +} + +// NOTE All the members in AnalysisConfig should be copied to Argument. +void AnalysisPredictor::OptimizeInferenceProgram() { + status_program_optimized_ = true; + + PrepareArgument(); + Analyzer().Run(&argument_); + + PADDLE_ENFORCE(argument_.scope_valid()); VLOG(5) << "to prepare executor"; + ARGUMENT_CHECK_FIELD((&argument_), ir_analyzed_program); inference_program_.reset( - new framework::ProgramDesc(*argument_.transformed_program_desc)); - if (argument_.Has(framework::ir::kParamScopeAttr)) { - // Update scope. - scope_.reset( - argument_.Release(framework::ir::kParamScopeAttr)); - } + new framework::ProgramDesc(argument_.ir_analyzed_program())); LOG(INFO) << "== optimize end =="; } @@ -246,18 +433,26 @@ template <> std::unique_ptr CreatePaddlePredictor< AnalysisConfig, PaddleEngineKind::kAnalysis>(const AnalysisConfig &config) { VLOG(3) << "create AnalysisConfig"; - if (config.use_gpu) { - // 1. GPU memeroy - PADDLE_ENFORCE_GT( - config.fraction_of_gpu_memory, 0.f, - "fraction_of_gpu_memory in the config should be set to range (0., 1.]"); - PADDLE_ENFORCE_GE(config.device, 0, "Invalid device id %d", config.device); + if (config.use_gpu()) { + // 1. GPU memory + PADDLE_ENFORCE_GE(config.memory_pool_init_size_mb(), 0.f); + PADDLE_ENFORCE_GE(config.gpu_device_id(), 0, "Invalid device id %d", + config.gpu_device_id()); std::vector flags; - if (config.fraction_of_gpu_memory >= 0.0f || - config.fraction_of_gpu_memory <= 0.95f) { + + float fraction_of_gpu_memory = config.fraction_of_gpu_memory_for_pool(); + if (fraction_of_gpu_memory > 0.95f) { + LOG(ERROR) + << "Allocate too much memory for the GPU memory pool, assigned " + << config.memory_pool_init_size_mb() << " MB"; + LOG(ERROR) + << "Try to shink the value by setting AnalysisConfig::EnableGpu(...)"; + } + + if (fraction_of_gpu_memory >= 0.0f || fraction_of_gpu_memory <= 0.95f) { flags.push_back("dummpy"); std::string flag = "--fraction_of_gpu_memory_to_use=" + - std::to_string(config.fraction_of_gpu_memory); + std::to_string(fraction_of_gpu_memory); flags.push_back(flag); VLOG(3) << "set flag: " << flag; framework::InitGflags(flags); @@ -265,13 +460,34 @@ std::unique_ptr CreatePaddlePredictor< } std::unique_ptr predictor(new AnalysisPredictor(config)); - if (!dynamic_cast(predictor.get())->Init(nullptr)) { + auto predictor_p = dynamic_cast(predictor.get()); + + if (!predictor_p->Init(nullptr)) { + return nullptr; + } + + if (config.mkldnn_quantizer_enabled() && !predictor_p->MkldnnQuantize()) { return nullptr; } + return predictor; } +bool AnalysisPredictor::MkldnnQuantize() { +#if PADDLE_WITH_MKLDNN + if (!mkldnn_quantizer_) + mkldnn_quantizer_ = new AnalysisPredictor::MkldnnQuantizer( + *this, config_.mkldnn_quantizer_config()); + return mkldnn_quantizer_->Quantize(); +#else + LOG(ERROR) << "Please compile with MKLDNN first to use MkldnnQuantizer"; + return false; +#endif +} + void AnalysisPredictor::PrepareFeedFetch() { + PADDLE_ENFORCE_NOT_NULL(sub_scope_); + CreateFeedFetchVar(sub_scope_); for (auto *op : inference_program_->Block(0).AllOps()) { if (op->Type() == "feed") { int idx = boost::get(op->GetAttr("col")); @@ -280,16 +496,42 @@ void AnalysisPredictor::PrepareFeedFetch() { } feeds_[idx] = op; feed_names_[op->Output("Out")[0]] = idx; + idx2feeds_[idx] = op->Output("Out")[0]; } else if (op->Type() == "fetch") { int idx = boost::get(op->GetAttr("col")); - if (fetchs_.size() <= static_cast(idx)) { - fetchs_.resize(idx + 1); + if (fetches_.size() <= static_cast(idx)) { + fetches_.resize(idx + 1); } - fetchs_[idx] = op; + fetches_[idx] = op; + idx2fetches_[idx] = op->Input("X")[0]; } } } +void AnalysisPredictor::CreateFeedFetchVar(framework::Scope *scope) { + PADDLE_ENFORCE_NOT_NULL(scope); + auto *var = scope->Var("feed"); + var->GetMutable(); + var = scope->Var("fetch"); + var->GetMutable(); +} + +std::vector AnalysisPredictor::GetInputNames() { + std::vector input_names; + for (auto &item : idx2feeds_) { + input_names.push_back(item.second); + } + return input_names; +} + +std::vector AnalysisPredictor::GetOutputNames() { + std::vector output_names; + for (auto &item : idx2fetches_) { + output_names.push_back(item.second); + } + return output_names; +} + std::unique_ptr AnalysisPredictor::GetInputTensor( const std::string &name) { PADDLE_ENFORCE(executor_->scope()->FindVar(name), "no name called %s", name); @@ -297,6 +539,13 @@ std::unique_ptr AnalysisPredictor::GetInputTensor( new ZeroCopyTensor(static_cast(executor_->scope()))); res->input_or_output_ = true; res->SetName(name); + if (platform::is_cpu_place(place_)) { + res->SetPlace(PaddlePlace::kCPU); + } else { + auto gpu_place = boost::get(place_); + res->SetPlace(PaddlePlace::kGPU, gpu_place.GetDeviceId()); + } + return res; } @@ -307,63 +556,336 @@ std::unique_ptr AnalysisPredictor::GetOutputTensor( new ZeroCopyTensor(static_cast(executor_->scope()))); res->input_or_output_ = false; res->SetName(name); + if (platform::is_cpu_place(place_)) { + res->SetPlace(PaddlePlace::kCPU); + } else { + auto gpu_place = boost::get(place_); + res->SetPlace(PaddlePlace::kGPU, gpu_place.GetDeviceId()); + } return res; } bool AnalysisPredictor::ZeroCopyRun() { executor_->Run(); + // Fix TensorArray reuse not cleaned bug. + tensor_array_batch_cleaner_.CollectTensorArrays(sub_scope_); + tensor_array_batch_cleaner_.ResetTensorArray(); return true; } bool AnalysisPredictor::LoadProgramDesc() { // Initialize the inference program - std::unique_ptr tmp_exe( - new framework::Executor(platform::CPUPlace())); - if (!config_.model_dir.empty()) { - // Parameters are saved in separate files sited in - // the specified `dirname`. - inference_program_ = paddle::inference::Load( - static_cast(tmp_exe.get()), scope_.get(), - config_.model_dir); - } else if (!config_.prog_file.empty() && !config_.param_file.empty()) { + std::string filename; + if (!config_.model_dir().empty()) { + filename = config_.model_dir() + "/__model__"; + } else if (!config_.prog_file().empty() && !config_.params_file().empty()) { // All parameters are saved in a single file. // The file names should be consistent with that used // in Python API `fluid.io.save_inference_model`. - inference_program_ = paddle::inference::Load( - static_cast(tmp_exe.get()), scope_.get(), - config_.prog_file, config_.param_file); + filename = config_.prog_file(); } else { + if (config_.model_dir().empty() && config_.prog_file().empty()) { + LOG(ERROR) + << "Either model_dir or (prog_file, param_file) should be set."; + return false; + } LOG(ERROR) << string::Sprintf( - "not valid model path '%s' or program path '%s'.", config_.model_dir, - config_.param_file); + "not valid model path '%s' or program path '%s'.", config_.model_dir(), + config_.params_file()); return false; } + + // Create ProgramDesc + framework::proto::ProgramDesc proto; + if (!config_.model_from_memory()) { + std::string pb_content; + // Read binary + std::ifstream fin(filename, std::ios::in | std::ios::binary); + PADDLE_ENFORCE(static_cast(fin.is_open()), "Cannot open file %s", + filename); + fin.seekg(0, std::ios::end); + pb_content.resize(fin.tellg()); + fin.seekg(0, std::ios::beg); + fin.read(&(pb_content.at(0)), pb_content.size()); + fin.close(); + + proto.ParseFromString(pb_content); + } else { + proto.ParseFromString(config_.prog_file()); + } + inference_program_.reset(new framework::ProgramDesc(proto)); + return true; +} + +bool AnalysisPredictor::LoadParameters() { + PADDLE_ENFORCE_NOT_NULL(inference_program_.get(), + "The inference program should be loaded first."); + + const auto &global_block = inference_program_->MutableBlock(0); + + // create a temporary program to load parameters. + + std::unique_ptr load_program( + new framework::ProgramDesc()); + framework::BlockDesc *load_block = load_program->MutableBlock(0); + std::vector params; + + for (auto *var : global_block->AllVars()) { + if (IsPersistable(var)) { + VLOG(3) << "persistable variable's name: " << var->Name(); + + framework::VarDesc *new_var = load_block->Var(var->Name()); + new_var->SetShape(var->GetShape()); + new_var->SetDataType(var->GetDataType()); + new_var->SetType(var->GetType()); + new_var->SetLoDLevel(var->GetLoDLevel()); + new_var->SetPersistable(true); + + if (!config_.params_file().empty()) { + params.push_back(new_var->Name()); + } else { + // append_op + framework::OpDesc *op = load_block->AppendOp(); + op->SetType("load"); + op->SetOutput("Out", {new_var->Name()}); + op->SetAttr("file_path", {config_.model_dir() + "/" + new_var->Name()}); + op->CheckAttrs(); + } + } + } + + if (!config_.params_file().empty()) { + // sort paramlist to have consistent ordering + std::sort(params.begin(), params.end()); + // append just the load_combine op + framework::OpDesc *op = load_block->AppendOp(); + op->SetType("load_combine"); + op->SetOutput("Out", params); + op->SetAttr("file_path", {config_.params_file()}); + op->CheckAttrs(); + } + + // Use NaiveExecutor to Load parameters. + framework::NaiveExecutor e(place_); + e.Prepare(scope_.get(), *load_program, 0, false); + e.Run(); + VLOG(3) << "get " << scope_->LocalVarNames().size() << " vars after load"; + + return true; +} + +#if PADDLE_WITH_TENSORRT +bool AnalysisPredictor::SaveTrtCalibToDisk() { + PADDLE_ENFORCE(config_.tensorrt_engine_enabled(), + "This func can be invoked only in trt mode"); + auto &block = inference_program_->Block(0); + for (auto &op_desc : block.AllOps()) { + if (op_desc->Type() == "tensorrt_engine") { + std::string engine_name = + boost::get(op_desc->GetAttr("engine_key")); + if (!Singleton::Global().Has(engine_name)) { + LOG(ERROR) << "You should run the predictor(with trt) on the real data " + "to generate calibration info"; + return false; + } + TRTCalibratorEngine *calib_engine = + Singleton::Global().Get(engine_name); + LOG(INFO) << "Wait for calib threads done."; + calib_engine->calib_->waitAndSetDone(); + LOG(INFO) << "Generating TRT Calibration table data, this may cost a lot " + "of time..."; + calib_engine->thr_->join(); + std::string calibration_table_data = + calib_engine->calib_->getCalibrationTableAsString(); + + if (calibration_table_data.empty()) { + LOG(ERROR) << "the calibration table is empty."; + return false; + } + + std::string model_opt_cache_dir = + argument_.Has("model_dir") + ? argument_.model_dir() + : inference::analysis::GetDirRoot(argument_.model_program_path()); + + std::string calibration_table_data_path = + inference::analysis::GetTrtCalibPath( + inference::analysis::GetOrCreateModelOptCacheDir( + model_opt_cache_dir), + engine_name); + + std::ofstream ofile(calibration_table_data_path, std::ios::out); + LOG(INFO) << "Write Paddle-TRT INT8 calibration table data to file " + << calibration_table_data_path; + ofile << calibration_table_data; + ofile.close(); + } + } + // Free all calibrator resources. + Singleton::Global().DeleteALL(); return true; } +#endif AnalysisPredictor::~AnalysisPredictor() { -#if !defined(_WIN32) +#if PADDLE_WITH_TENSORRT + if (config_.tensorrt_engine_enabled() && + config_.tensorrt_precision_mode_ == AnalysisConfig::Precision::kInt8 && + Singleton::Global().Has()) { + SaveTrtCalibToDisk(); + } +#endif if (FLAGS_profile) { platform::DisableProfiler(platform::EventSortingKey::kTotal, "./profile.log"); } -#endif if (sub_scope_) { scope_->DeleteScope(sub_scope_); } + +#if PADDLE_WITH_MKLDNN + if (mkldnn_quantizer_) { + delete mkldnn_quantizer_; + mkldnn_quantizer_ = nullptr; + } +#endif + + // TODO(Superjomn) deduce the directory path. + std::string out_path = inference::analysis::GetMemoryCachePath( + config_.model_dir(), config_.prog_file()); + if (need_collect_var_shapes_for_memory_optim()) { + SerializeBatchVarShapes(out_path); + } } std::unique_ptr AnalysisPredictor::Clone() { + std::lock_guard lk(clone_mutex_); auto *x = new AnalysisPredictor(config_); x->Init(scope_, inference_program_); return std::unique_ptr(x); } +void AnalysisPredictor::CollectVarShapes() { + VLOG(4) << "Collecting var shapes"; + if (batch_var_shapes_.size() >= max_shape_collect_count_) return; + std::map> var_shapes; + for (auto var_name : inference_program_->Block(0).LocalVarNames()) { + auto *var = sub_scope_->FindVar(var_name); + PADDLE_ENFORCE_NOT_NULL(var); + if (var->Type() == framework::VarTypeTrait::kId || + var->Type() == framework::VarTypeTrait::kId) { + auto &tensor = var->Get(); + auto shape = framework::vectorize(tensor.dims()); + var_shapes[var_name].assign(shape.begin(), shape.end()); + } + } + batch_var_shapes_.push_back(var_shapes); + LOG_FIRST_N(INFO, 1) << "Collected " << batch_var_shapes_.size() + << " batch of var shapes for analysis"; +} + +void AnalysisPredictor::SerializeBatchVarShapes(const std::string &path) { + LOG(INFO) << "serialize batch var shapes to " << path; + std::ofstream file(path); + if (!file.is_open()) { + LOG(ERROR) << "failed to serialize the var shapes to " << path; + return; + } + + // The sirialized data format: + // :dim0,dim1,dim2,; + for (auto &batch : batch_var_shapes_) { + for (auto &ele : batch) { + file << ele.first << ":"; + for (size_t i = 0; i < ele.second.size() - 1; i++) { + file << ele.second[i] << ","; + } + file << ele.second.back() << ";"; + } + file << "\n"; + } +} + +bool AnalysisPredictor::need_collect_var_shapes_for_memory_optim() { + if (need_collect_var_shapes_ >= 0) return need_collect_var_shapes_; + bool need = false; + // check if the cache exists + if (!config_.enable_memory_optim()) { + need = false; + } else if (config_.static_memory_optim_ && + !inference::IsFileExists(inference::analysis::GetMemoryCachePath( + config_.model_dir(), config_.prog_file()))) { + need = true; + } else if (config_.static_memory_optim_ && + config_.static_memory_optim_force_update_) { + need = true; + } + + need_collect_var_shapes_ = need ? 1 : 0; + return need; +} + +std::string AnalysisPredictor::GetSerializedProgram() const { + return inference_program_->Proto()->SerializeAsString(); +} + template <> -std::unique_ptr CreatePaddlePredictor( - const contrib::AnalysisConfig &config) { - return CreatePaddlePredictor(config); +std::unique_ptr CreatePaddlePredictor( + const AnalysisConfig &config) { + return CreatePaddlePredictor( + config); } } // namespace paddle + +#if PADDLE_WITH_TENSORRT +USE_TRT_CONVERTER(elementwise_add_weight); +USE_TRT_CONVERTER(elementwise_add_tensor); +USE_TRT_CONVERTER(elementwise_sub_tensor); +USE_TRT_CONVERTER(elementwise_div_tensor); +USE_TRT_CONVERTER(elementwise_mul_tensor); +USE_TRT_CONVERTER(elementwise_max_tensor); +USE_TRT_CONVERTER(elementwise_min_tensor); +USE_TRT_CONVERTER(elementwise_pow_tensor); +USE_TRT_CONVERTER(mul); +USE_TRT_CONVERTER(conv2d); +USE_TRT_CONVERTER(relu); +USE_TRT_CONVERTER(sigmoid); +USE_TRT_CONVERTER(tanh); +USE_TRT_CONVERTER(fc); +USE_TRT_CONVERTER(pool2d); +USE_TRT_CONVERTER(softmax); +USE_TRT_CONVERTER(batch_norm); +USE_TRT_CONVERTER(concat); +USE_TRT_CONVERTER(dropout); +USE_TRT_CONVERTER(pad); +USE_TRT_CONVERTER(split); +USE_TRT_CONVERTER(prelu); +USE_TRT_CONVERTER(conv2d_transpose); +USE_TRT_CONVERTER(leaky_relu); +#endif + +#if PADDLE_WITH_ANAKIN +USE_ANAKIN_CONVERTER(mul); +USE_ANAKIN_CONVERTER(fc); +USE_ANAKIN_CONVERTER(conv2d); +USE_ANAKIN_CONVERTER(conv2d_fusion); +USE_ANAKIN_CONVERTER(concat); +USE_ANAKIN_CONVERTER(split); +USE_ANAKIN_CONVERTER(relu); +USE_ANAKIN_CONVERTER(sigmoid); +USE_ANAKIN_CONVERTER(tanh); +USE_ANAKIN_CONVERTER(pool2d); +USE_ANAKIN_CONVERTER(elementwise_add); +USE_ANAKIN_CONVERTER(elementwise_mul); +USE_ANAKIN_CONVERTER(batch_norm); +USE_ANAKIN_CONVERTER(flatten); +USE_ANAKIN_CONVERTER(reshape); +USE_ANAKIN_CONVERTER(transpose); +USE_ANAKIN_CONVERTER(softmax); +USE_ANAKIN_CONVERTER(detection_out); +USE_ANAKIN_CONVERTER(density_prior_box); +USE_ANAKIN_CONVERTER(dropout); +USE_ANAKIN_CONVERTER(sum); +USE_ANAKIN_CONVERTER(prior_box); +#endif diff --git a/paddle/fluid/inference/api/analysis_predictor.h b/paddle/fluid/inference/api/analysis_predictor.h index 5a9f4d36959d4ee7ca16dec769d9d1283b8787cb..e4c537f426650f16ced32d3cb61b944a78c35b43 100644 --- a/paddle/fluid/inference/api/analysis_predictor.h +++ b/paddle/fluid/inference/api/analysis_predictor.h @@ -13,29 +13,42 @@ // limitations under the License. #pragma once +#include +#include +#include #include #include #include "paddle/fluid/framework/naive_executor.h" #include "paddle/fluid/inference/analysis/analyzer.h" #include "paddle/fluid/inference/api/api_impl.h" +#include "paddle/fluid/inference/api/details/reset_tensor_array.h" +#include "paddle/fluid/inference/api/helper.h" #include "paddle/fluid/inference/api/paddle_inference_api.h" #include "paddle/fluid/string/printf.h" - +#ifdef PADDLE_WITH_TESTING +#include +#include +#endif namespace paddle { using inference::analysis::Argument; using inference::analysis::Analyzer; using framework::proto::ProgramDesc; using framework::NaiveExecutor; -using contrib::AnalysisConfig; -/* This predictor is based on the original native predictor with IR and Analysis - * support. It will optimize IR and Parameters in the runtime. +/** \brief This predictor is based on the original native predictor with IR and + * Analysis support. + * + * It will optimize IR and Parameters in the runtime. + * * TODO(Superjomn) Replace the Navive predictor? */ class AnalysisPredictor : public PaddlePredictor { public: - explicit AnalysisPredictor(const AnalysisConfig &config) : config_(config) {} + explicit AnalysisPredictor(const AnalysisConfig &config) : config_(config) { + predictor_id_ = inference::GetUniqueId(); + } + ~AnalysisPredictor(); bool Init(const std::shared_ptr &parent_scope, const std::shared_ptr &program = nullptr); @@ -44,6 +57,9 @@ class AnalysisPredictor : public PaddlePredictor { std::vector *output_data, int batch_size = -1) override; + std::vector GetInputNames(); + std::vector GetOutputNames(); + std::unique_ptr GetInputTensor( const std::string &name) override; std::unique_ptr GetOutputTensor( @@ -51,19 +67,38 @@ class AnalysisPredictor : public PaddlePredictor { bool ZeroCopyRun() override; + void CreateFeedFetchVar(framework::Scope *scope); void PrepareFeedFetch(); + void PrepareArgument(); void OptimizeInferenceProgram(); Argument &analysis_argument() { return argument_; } std::unique_ptr Clone() override; - framework::Scope *scope() { return executor_->scope(); } + framework::Scope *scope() { return scope_.get(); } framework::ProgramDesc &program() { return *inference_program_; } + void SetMkldnnThreadID(int tid); + + std::string GetSerializedProgram() const override; + + bool MkldnnQuantize(); + protected: + // For memory optimization. + bool need_collect_var_shapes_for_memory_optim(); + void CollectVarShapes(); + void SerializeBatchVarShapes(const std::string &path); + + bool PrepareProgram(const std::shared_ptr &program); + bool PrepareScope(const std::shared_ptr &parent_scope); + bool CreateExecutor(); + bool PrepareExecutor(); + bool LoadProgramDesc(); + bool LoadParameters(); bool SetFeed(const std::vector &input_datas, framework::Scope *scope); @@ -72,10 +107,32 @@ class AnalysisPredictor : public PaddlePredictor { template void GetFetchOne(const framework::LoDTensor &fetchs, PaddleTensor *output_data); - ~AnalysisPredictor(); + +#if PADDLE_WITH_TENSORRT + // When we use Paddle-TRT INT8 engine, we need to generate calibration table + // data first, + // the calibration table contains the range for each op's input and output, + // this whole process can be divided into several steps: + // + // 1. Builds a 32-bit engine, runs it on the calibration set, and records a + // histogram for each + // tensor of the distribution of activation values. + // 2. Builds a calibration table from the histograms. + // + // After step 2, we need to store the calibration table on disk + bool SaveTrtCalibToDisk(); +#endif + +// Some more detailed tests, they are made the friends of the predictor, so that +// the all the details can be tested. +#if PADDLE_WITH_TESTING + FRIEND_TEST(AnalysisPredictor, analysis_off); + FRIEND_TEST(AnalysisPredictor, analysis_on); + FRIEND_TEST(AnalysisPredictor, with_gpu); +#endif private: - contrib::AnalysisConfig config_; + AnalysisConfig config_; Argument argument_; std::unique_ptr executor_; platform::Place place_; @@ -84,10 +141,40 @@ class AnalysisPredictor : public PaddlePredictor { std::shared_ptr inference_program_; std::vector feeds_; std::map feed_names_; - std::vector fetchs_; + // Sorted according to the idx. + std::map idx2feeds_; + std::vector fetches_; + std::map idx2fetches_; + +#if PADDLE_WITH_MKLDNN + // Helper class to perform quantization + class MkldnnQuantizer; + MkldnnQuantizer *mkldnn_quantizer_{nullptr}; + +#if PADDLE_WITH_TESTING + friend class MkldnnQuantizerTest; +#endif +#endif + // Memory buffer for feed inputs. The temporary LoDTensor will cause serious - // concurrency problems, so cache them. + // concurrency problems, wrong results and memory leak, so cache them. std::vector feed_tensors_; + details::TensorArrayBatchCleaner tensor_array_batch_cleaner_; + // A mutex help to make Clone thread safe. + std::mutex clone_mutex_; + + // For memory optimization. + const size_t max_shape_collect_count_{1000}; + int need_collect_var_shapes_{-1}; // -1 for default, 0 for false, 1 for true. + std::vector>> batch_var_shapes_; + int predictor_id_; + + private: + // Some status here that help to determine the status inside the predictor. + bool status_program_optimized_{false}; + bool status_is_cloned_{false}; + bool status_use_gpu_{false}; + bool status_ir_optim_enabled_{false}; }; } // namespace paddle diff --git a/paddle/fluid/inference/api/analysis_predictor_tester.cc b/paddle/fluid/inference/api/analysis_predictor_tester.cc index 13c25da1b52742e6114b294847c21ce735b9fc21..0429a287c74f9db5257181151d90b77da86c694c 100644 --- a/paddle/fluid/inference/api/analysis_predictor_tester.cc +++ b/paddle/fluid/inference/api/analysis_predictor_tester.cc @@ -12,21 +12,100 @@ // See the License for the specific language governing permissions and // limitations under the License. +#include "paddle/fluid/inference/api/analysis_predictor.h" #include #include +#include // NOLINT +#include "paddle/fluid/framework/ir/pass.h" +#include "paddle/fluid/framework/tensor.h" +#include "paddle/fluid/inference/api/helper.h" #include "paddle/fluid/inference/api/paddle_inference_api.h" +#include "paddle/fluid/inference/tests/api/tester_helper.h" +#ifdef PADDLE_WITH_MKLDNN +#include "paddle/fluid/inference/api/mkldnn_quantizer.h" +#endif DEFINE_string(dirname, "", "dirname to tests."); namespace paddle { -namespace inference { -using contrib::AnalysisConfig; -TEST(AnalysisPredictor, ZeroCopy) { +TEST(AnalysisPredictor, analysis_off) { + AnalysisConfig config; + config.SetModel(FLAGS_dirname); + config.SwitchIrOptim(false); + + auto _predictor = CreatePaddlePredictor(config); + auto* predictor = static_cast(_predictor.get()); + + // Without analysis, the scope_ and sub_scope_ are created by predictor + // itself. + ASSERT_TRUE(predictor->scope_); + ASSERT_TRUE(predictor->sub_scope_); + ASSERT_EQ(predictor->scope_->parent(), nullptr); + ASSERT_EQ(predictor->sub_scope_->parent(), predictor->scope_.get()); + // ir is turned off, so program shouldn't be optimized. + ASSERT_FALSE(predictor->status_program_optimized_); + LOG(INFO) << "scope parameters " << predictor->scope_->LocalVarNames().size(); + + // 2. Dummy Input Data + int64_t data[4] = {1, 2, 3, 4}; + PaddleTensor tensor; + tensor.shape = std::vector({4, 1}); + tensor.data.Reset(data, sizeof(data)); + tensor.dtype = PaddleDType::INT64; + + std::vector inputs(4, tensor); + std::vector outputs; + ASSERT_TRUE(predictor->Run(inputs, &outputs)); +} + +TEST(AnalysisPredictor, analysis_on) { AnalysisConfig config; - config.model_dir = FLAGS_dirname + "/word2vec.inference.model"; - config.use_feed_fetch_ops = false; + config.SetModel(FLAGS_dirname); + config.SwitchIrOptim(true); +#ifdef PADDLE_WITH_CUDA + config.EnableUseGpu(100, 0); +#else + config.DisableGpu(); +#endif + + auto _predictor = CreatePaddlePredictor(config); + auto* predictor = static_cast(_predictor.get()); + + ASSERT_TRUE(predictor->scope_); + ASSERT_TRUE(predictor->sub_scope_); + ASSERT_EQ(predictor->scope_->parent(), nullptr); + ASSERT_EQ(predictor->sub_scope_->parent(), predictor->scope_.get()); + // ir is turned on, so program should be optimized. + ASSERT_TRUE(predictor->status_program_optimized_); + // 2. Dummy Input Data + int64_t data[4] = {1, 2, 3, 4}; + PaddleTensor tensor; + tensor.shape = std::vector({4, 1}); + tensor.data.Reset(data, sizeof(data)); + tensor.dtype = PaddleDType::INT64; + + std::vector inputs(4, tensor); + std::vector outputs; + ASSERT_TRUE(predictor->Run(inputs, &outputs)); + + for (auto& output : outputs) { + LOG(INFO) << inference::DescribeTensor(output); + } + + // compare with NativePredictor + auto naive_predictor = + CreatePaddlePredictor(config.ToNativeConfig()); + std::vector naive_outputs; + ASSERT_TRUE(naive_predictor->Run(inputs, &naive_outputs)); + ASSERT_EQ(naive_outputs.size(), 1UL); + inference::CompareTensor(outputs.front(), naive_outputs.front()); +} +TEST(AnalysisPredictor, ZeroCopy) { + AnalysisConfig config; + config.SetModel(FLAGS_dirname); + config.SwitchUseFeedFetchOps(false); auto predictor = CreatePaddlePredictor(config); auto w0 = predictor->GetInputTensor("firstw"); @@ -61,5 +140,348 @@ TEST(AnalysisPredictor, ZeroCopy) { LOG(INFO) << "output_data: " << out_data; } -} // namespace inference +TEST(AnalysisPredictor, Clone) { + AnalysisConfig config; + config.SetModel(FLAGS_dirname); + config.SwitchUseFeedFetchOps(true); + config.SwitchIrOptim(true); + + std::vector> predictors; + predictors.emplace_back(CreatePaddlePredictor(config)); + + LOG(INFO) << "************** to clone ************************"; + const int num_threads = 3; + for (int i = 1; i < num_threads; i++) { + predictors.emplace_back(predictors.front()->Clone()); + } + + auto* root_scope = + static_cast(predictors[0].get())->scope(); + ASSERT_FALSE(root_scope->kids().empty()); + LOG(INFO) << "***** scope ******\n" + << framework::GenScopeTreeDebugInfo(root_scope); + + // 2. Dummy Input Data + int64_t data[4] = {1, 2, 3, 4}; + PaddleTensor tensor; + tensor.shape = std::vector({4, 1}); + tensor.data.Reset(data, sizeof(data)); + tensor.dtype = PaddleDType::INT64; + + std::vector inputs(4, tensor); + std::vector outputs; + predictors[0]->Run(inputs, &outputs); + + LOG(INFO) << "Run with single thread"; + for (int i = 0; i < num_threads; i++) { + LOG(INFO) << "run predictor " << i; + ASSERT_TRUE(predictors[i]->Run(inputs, &outputs)); + } + + LOG(INFO) << "Run with multiple threads"; + std::vector threads; + for (int i = 0; i < num_threads; i++) { + threads.emplace_back([&predictors, &inputs, i] { + LOG(INFO) << "thread #" << i << " running"; + std::vector outputs; + auto predictor = predictors.front()->Clone(); + for (int j = 0; j < 10; j++) { + ASSERT_TRUE(predictor->Run(inputs, &outputs)); + } + }); + } + + for (auto& t : threads) { + t.join(); + } +} + +TEST(AnalysisPredictor, memory_optim) { + AnalysisConfig config(FLAGS_dirname); + config.DisableGpu(); + config.EnableMemoryOptim(true); + config.SwitchIrDebug(); + + auto native_predictor = + CreatePaddlePredictor(config.ToNativeConfig()); + + // 2. Dummy Input Data + int64_t data[4] = {1, 2, 3, 4}; + PaddleTensor tensor; + tensor.shape = std::vector({4, 1}); + tensor.data.Reset(data, sizeof(data)); + tensor.dtype = PaddleDType::INT64; + + std::vector inputs(4, tensor); + std::vector output, output1; + + { + // The first predictor help to cache the memory optimize strategy. + auto predictor = CreatePaddlePredictor(config); + LOG(INFO) << "serialized program: " << predictor->GetSerializedProgram(); + ASSERT_FALSE(predictor->GetSerializedProgram().empty()); + + // Run several times to check the parameters are not reused by mistake. + for (int i = 0; i < 5; i++) { + ASSERT_TRUE(predictor->Run(inputs, &output)); + } + } + + { + output.clear(); + // The second predictor to perform memory optimization. + config.EnableMemoryOptim(false); + auto predictor = CreatePaddlePredictor(config); + + // Run with memory optimization + ASSERT_TRUE(predictor->Run(inputs, &output)); + } + + // Run native + ASSERT_TRUE(native_predictor->Run(inputs, &output1)); + + LOG(INFO) << "the output " << inference::DescribeTensor(output.front()); + LOG(INFO) << "the native output " + << inference::DescribeTensor(output1.front()); + + inference::CompareResult(output, output1); +} + +#ifdef PADDLE_WITH_MKLDNN +class MkldnnQuantizerTest : public testing::Test { + public: + MkldnnQuantizerTest() { + AnalysisConfig config(FLAGS_dirname); + + predictor.reset(new AnalysisPredictor(config)); + auto* predictor_p = static_cast(predictor.get()); + + auto qconfig = std::make_shared(); + + mkldnn_quantizer.reset( + new AnalysisPredictor::MkldnnQuantizer(*predictor_p, qconfig)); + } + + std::pair, float> Histogram( + const framework::LoDTensor& var_tensor, float min_val, float max_val, + int num_bins) const { + return mkldnn_quantizer->Histogram(var_tensor, min_val, max_val, num_bins); + } + + std::pair GetMaxScalingFactor( + const framework::LoDTensor& var_tensor, bool is_unsigned) const { + return mkldnn_quantizer->GetMaxScalingFactor(var_tensor, is_unsigned); + } + + std::pair GetMaxChScalingFactor( + const framework::LoDTensor& var_tensor, bool is_unsigned) const { + return mkldnn_quantizer->GetMaxChScalingFactor(var_tensor, is_unsigned); + } + + std::pair GetKLScalingFactor( + const framework::LoDTensor& var_tensor, bool is_unsigned) const { + return mkldnn_quantizer->GetKLScalingFactor(var_tensor, is_unsigned); + } + + protected: + std::unique_ptr predictor; + std::unique_ptr mkldnn_quantizer; + float abs_error = 1e-6; + static const std::array non_negative_values; + static const std::array positive_and_negative_values; +}; + +const std::array MkldnnQuantizerTest::non_negative_values = { + 0.0158671, 0.026459, 0.0280772, 0.00962479, 0.0131628, + 0.016704, 0.00118407, 0.00765726, 0.0123213, 0.00944741}; +const std::array MkldnnQuantizerTest::positive_and_negative_values = + {-0.0482659, -0.0102493, -0.00794221, -0.00387115, -0.00674586, + -0.0495346, 0.0629528, -0.00531285, -0.0230353, 0.0269089}; + +TEST_F(MkldnnQuantizerTest, histogram_inverted_min_max) { + const auto& values = non_negative_values; + auto min_val = *std::min_element(values.begin(), values.end()); + auto max_val = *std::max_element(values.begin(), values.end()); + + framework::LoDTensor var_tensor; + var_tensor.Resize(framework::make_dim(values.size())); + std::copy(begin(values), end(values), + var_tensor.mutable_data(platform::CPUPlace())); + + ASSERT_THROW(Histogram(var_tensor, max_val, min_val, 3), + platform::EnforceNotMet); +} + +TEST_F(MkldnnQuantizerTest, histogram_non_negative_to_3) { + // all non-negative values + const auto& values = non_negative_values; + auto min_val = *std::min_element(values.begin(), values.end()); + auto max_val = *std::max_element(values.begin(), values.end()); + + framework::LoDTensor var_tensor; + var_tensor.Resize(framework::make_dim(values.size())); + std::copy(begin(values), end(values), + var_tensor.mutable_data(platform::CPUPlace())); + + std::vector histogram; + float bin_width; + + std::tie(histogram, bin_width) = Histogram(var_tensor, min_val, max_val, 3); + + ASSERT_NEAR(bin_width, std::abs(max_val - min_val) / 3.f, abs_error) + << "Improperly calculated bin_width."; + + ASSERT_EQ(histogram[0], 4); + ASSERT_EQ(histogram[1], 4); + ASSERT_EQ(histogram[2], 2); +} + +TEST_F(MkldnnQuantizerTest, histogram_positive_and_negative_to_3) { + const auto& values = positive_and_negative_values; + auto min_val = *std::min_element(values.begin(), values.end()); + auto max_val = *std::max_element(values.begin(), values.end()); + + framework::LoDTensor var_tensor; + var_tensor.Resize(framework::make_dim(values.size())); + std::copy(begin(values), end(values), + var_tensor.mutable_data(platform::CPUPlace())); + + std::vector histogram; + float bin_width; + + std::tie(histogram, bin_width) = Histogram(var_tensor, min_val, max_val, 3); + + ASSERT_NEAR(bin_width, std::abs(max_val - min_val) / 3.0f, abs_error) + << "Improperly calculated bin_width."; + + ASSERT_EQ(histogram[0], 3); + ASSERT_EQ(histogram[1], 5); + ASSERT_EQ(histogram[2], 2); +} + +TEST_F(MkldnnQuantizerTest, histogram_zero_bins) { + const auto& values = non_negative_values; + auto min_val = *std::min_element(values.begin(), values.end()); + auto max_val = *std::max_element(values.begin(), values.end()); + + framework::LoDTensor var_tensor; + var_tensor.Resize(framework::make_dim(values.size())); + std::copy(begin(values), end(values), + var_tensor.mutable_data(platform::CPUPlace())); + + ASSERT_THROW(Histogram(var_tensor, min_val, max_val, 0), + platform::EnforceNotMet); +} + +TEST_F(MkldnnQuantizerTest, histogram_empty) { + // empty tensor + ASSERT_THROW(Histogram({}, -1, 1, 1), platform::EnforceNotMet); + + // zero tensor + framework::LoDTensor var_tensor; + var_tensor.Resize({0}); + ASSERT_TRUE(var_tensor.mutable_data(platform::CPUPlace())); + + ASSERT_THROW(Histogram(var_tensor, -1, 1, 1), platform::EnforceNotMet); +} + +TEST_F(MkldnnQuantizerTest, kl_scaling_factor_signed) { + const auto& values = positive_and_negative_values; + + framework::LoDTensor var_tensor; + var_tensor.Resize(framework::make_dim(values.size())); + std::copy(begin(values), end(values), + var_tensor.mutable_data(platform::CPUPlace())); + + bool is_unsigned; + framework::LoDTensor lod_tensor; + + std::tie(is_unsigned, lod_tensor) = GetKLScalingFactor(var_tensor, false); + + ASSERT_EQ(is_unsigned, false); + ASSERT_EQ(lod_tensor.numel(), 1); + ASSERT_NEAR(lod_tensor.data()[0], 1.0 / 0.0899106152344, abs_error); +} + +TEST_F(MkldnnQuantizerTest, max_scaling_factor_signed) { + const auto& values = positive_and_negative_values; + auto max_val = *std::max_element(values.begin(), values.end()); + + framework::LoDTensor var_tensor; + var_tensor.Resize(framework::make_dim(values.size())); + std::copy(begin(values), end(values), + var_tensor.mutable_data(platform::CPUPlace())); + + bool is_unsigned; + framework::LoDTensor lod_tensor; + + std::tie(is_unsigned, lod_tensor) = GetMaxScalingFactor(var_tensor, false); + + ASSERT_EQ(is_unsigned, false); + ASSERT_EQ(lod_tensor.numel(), 1); + ASSERT_NEAR(lod_tensor.data()[0], 1.0 / max_val, abs_error); +} + +TEST_F(MkldnnQuantizerTest, max_scaling_factor_unsigned) { + const auto& values = non_negative_values; + auto max_val = *std::max_element(values.begin(), values.end()); + + framework::LoDTensor var_tensor; + var_tensor.Resize(framework::make_dim(values.size())); + std::copy(begin(values), end(values), + var_tensor.mutable_data(platform::CPUPlace())); + + bool is_unsigned; + framework::LoDTensor lod_tensor; + + std::tie(is_unsigned, lod_tensor) = GetMaxScalingFactor(var_tensor, true); + + ASSERT_EQ(is_unsigned, true); + ASSERT_EQ(lod_tensor.numel(), 1); + ASSERT_NEAR(lod_tensor.data()[0], 1.0 / max_val, abs_error); +} + +TEST_F(MkldnnQuantizerTest, max_scaling_factor_chwise_unsigned) { + const auto& values = non_negative_values; + auto max_val = *std::max_element(values.begin(), values.end()); + int channels = 3; + + framework::LoDTensor var_tensor; + var_tensor.Resize(framework::make_dim(channels, 1, 1, values.size())); + for (int i = 0; i < channels; i++) + std::copy(begin(values), end(values), + var_tensor.mutable_data(platform::CPUPlace()) + + i * values.size()); + + bool is_unsigned; + framework::LoDTensor lod_tensor; + + std::tie(is_unsigned, lod_tensor) = GetMaxChScalingFactor(var_tensor, true); + + ASSERT_EQ(is_unsigned, true); + ASSERT_EQ(lod_tensor.numel(), channels); + for (int i = 0; i < channels; i++) { + ASSERT_NEAR(lod_tensor.data()[i], 1.0 / max_val, abs_error); + } +} + +TEST_F(MkldnnQuantizerTest, kl_scaling_factor_unsigned) { + const auto& values = non_negative_values; + + framework::LoDTensor var_tensor; + var_tensor.Resize(framework::make_dim(values.size())); + std::copy(begin(values), end(values), + var_tensor.mutable_data(platform::CPUPlace())); + + bool is_unsigned; + framework::LoDTensor lod_tensor; + + std::tie(is_unsigned, lod_tensor) = GetKLScalingFactor(var_tensor, true); + + ASSERT_EQ(is_unsigned, true); + ASSERT_EQ(lod_tensor.numel(), 1); + ASSERT_NEAR(lod_tensor.data()[0], 1.0 / 0.0252845321362, abs_error); +} +#endif + } // namespace paddle diff --git a/paddle/fluid/inference/api/api.cc b/paddle/fluid/inference/api/api.cc index 01ea942d3c8d20180cfc9664b8601ba87a898e86..7d57b6ec74468dbdb0519f85140629a0ac01c18d 100644 --- a/paddle/fluid/inference/api/api.cc +++ b/paddle/fluid/inference/api/api.cc @@ -12,11 +12,13 @@ // See the License for the specific language governing permissions and // limitations under the License. +#include +#include "paddle/fluid/framework/commit.h" #include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/framework/scope.h" #include "paddle/fluid/inference/api/paddle_inference_api.h" +#include "paddle/fluid/inference/api/paddle_pass_builder.h" #include "paddle/fluid/platform/enforce.h" -#include "paddle_inference_api.h" namespace paddle { @@ -26,6 +28,8 @@ int PaddleDtypeSize(PaddleDType dtype) { return sizeof(float); case PaddleDType::INT64: return sizeof(int64_t); + case PaddleDType::INT32: + return sizeof(int32_t); default: assert(false); return -1; @@ -90,11 +94,19 @@ void PaddleBuf::Reset(void *data, size_t length) { void PaddleBuf::Free() { if (memory_owned_ && data_) { - PADDLE_ENFORCE_GT(length_, 0); + PADDLE_ENFORCE_GT(length_, 0UL); free(static_cast(data_)); data_ = nullptr; length_ = 0; } } +std::string get_version() { + std::stringstream ss; + ss << "version: " << framework::paddle_version() << "\n"; + ss << "commit: " << framework::paddle_commit() << "\n"; + ss << "branch: " << framework::paddle_compile_branch() << "\n"; + return ss.str(); +} + } // namespace paddle diff --git a/paddle/fluid/inference/api/api_anakin_engine.h b/paddle/fluid/inference/api/api_anakin_engine.h index 04536ea3a53bbbc9293d92e69a23567e4bfd84c0..e14d93de2c41f740bc175c8e59412d7b828dd381 100644 --- a/paddle/fluid/inference/api/api_anakin_engine.h +++ b/paddle/fluid/inference/api/api_anakin_engine.h @@ -23,7 +23,7 @@ limitations under the License. */ #include "framework/core/net/net.h" #include "framework/graph/graph.h" -#include "paddle/fluid/inference/api/paddle_inference_api.h" +#include "paddle/fluid/inference/api/paddle_anakin_config.h" #include "saber/core/shape.h" #include "saber/saber_types.h" diff --git a/paddle/fluid/inference/api/api_impl.cc b/paddle/fluid/inference/api/api_impl.cc index 7cda9c5d8a8366bd097491f37f5352a10e4fb16c..54f40563c3662af24e794422be4d3262d86c76a7 100644 --- a/paddle/fluid/inference/api/api_impl.cc +++ b/paddle/fluid/inference/api/api_impl.cc @@ -22,12 +22,13 @@ limitations under the License. */ #include "paddle/fluid/framework/feed_fetch_method.h" #include "paddle/fluid/inference/api/api_impl.h" +#include "paddle/fluid/inference/api/details/reset_tensor_array.h" #include "paddle/fluid/inference/api/helper.h" +#include "paddle/fluid/memory/memcpy.h" #include "paddle/fluid/platform/cpu_helper.h" #include "paddle/fluid/platform/profiler.h" DEFINE_bool(profile, false, "Turn on profiler for fluid"); -DECLARE_int32(paddle_num_threads); namespace paddle { namespace { @@ -63,7 +64,6 @@ void NativePaddlePredictor::PrepareFeedFetch() { bool NativePaddlePredictor::Init( std::shared_ptr parent_scope) { VLOG(3) << "Predictor::init()"; -#if !defined(_WIN32) if (FLAGS_profile) { LOG(WARNING) << "Profiler is actived, might affect the performance"; LOG(INFO) << "You can turn off by set gflags '-profile false'"; @@ -72,10 +72,9 @@ bool NativePaddlePredictor::Init( : platform::ProfilerState::kCPU; platform::EnableProfiler(tracking_device); } -#endif // no matter with or without MKLDNN - paddle::platform::SetNumThreads(FLAGS_paddle_num_threads); + paddle::platform::SetNumThreads(config_.cpu_math_library_num_threads()); if (config_.use_gpu) { place_ = paddle::platform::CUDAPlace(config_.device); @@ -120,12 +119,10 @@ bool NativePaddlePredictor::Init( } NativePaddlePredictor::~NativePaddlePredictor() { -#if !defined(_WIN32) if (FLAGS_profile) { platform::DisableProfiler(platform::EventSortingKey::kTotal, "./profile.log"); } -#endif if (sub_scope_) { scope_->DeleteScope(sub_scope_); } @@ -134,11 +131,13 @@ NativePaddlePredictor::~NativePaddlePredictor() { bool NativePaddlePredictor::Run(const std::vector &inputs, std::vector *output_data, int batch_size) { + if (UNLIKELY(config_.cpu_math_library_num_threads() > 1)) { + paddle::platform::SetNumThreads(config_.cpu_math_library_num_threads()); + } VLOG(3) << "Predictor::predict"; Timer timer; timer.tic(); // set feed variable - std::vector feeds; framework::Scope *scope = sub_scope_ != nullptr ? sub_scope_ : scope_.get(); if (!SetFeed(inputs, scope)) { LOG(ERROR) << "fail to set feed"; @@ -157,17 +156,24 @@ bool NativePaddlePredictor::Run(const std::vector &inputs, return false; } VLOG(3) << "predict cost: " << timer.toc() << "ms"; + + // For some other vector like containers not cleaned after each batch. + tensor_array_batch_cleaner_.CollectNoTensorVars(scope_.get()); + tensor_array_batch_cleaner_.ResetNoTensorVars(); return true; } std::unique_ptr NativePaddlePredictor::Clone() { + std::lock_guard lk(clone_mutex_); VLOG(3) << "Predictor::clone"; std::unique_ptr cls(new NativePaddlePredictor(config_)); - - if (!dynamic_cast(cls.get())->Init(scope_)) { + // Hot fix the bug that result diff in multi-thread. + // TODO(Superjomn) re-implement a real clone here. + if (!dynamic_cast(cls.get())->Init(nullptr)) { LOG(ERROR) << "fail to call Init"; return nullptr; } + #ifdef __clang__ // fix clang compile error return cls; @@ -185,22 +191,44 @@ bool NativePaddlePredictor::SetFeed(const std::vector &inputs, << inputs.size(); return false; } + + // Cache the inputs memory for better concurrency performance. + feed_tensors_.resize(inputs.size()); + for (size_t i = 0; i < inputs.size(); ++i) { - framework::LoDTensor input; + auto &input = feed_tensors_[i]; framework::DDim ddim = framework::make_ddim(inputs[i].shape); void *input_ptr; if (inputs[i].dtype == PaddleDType::INT64) { - input_ptr = input.mutable_data(ddim, platform::CPUPlace()); + input_ptr = input.mutable_data(ddim, place_); } else if (inputs[i].dtype == PaddleDType::FLOAT32) { - input_ptr = input.mutable_data(ddim, platform::CPUPlace()); + input_ptr = input.mutable_data(ddim, place_); + } else if (inputs[i].dtype == PaddleDType::INT32) { + input_ptr = input.mutable_data(ddim, place_); } else { LOG(ERROR) << "unsupported feed type " << inputs[i].dtype; return false; } - // TODO(panyx0718): Init LoDTensor from existing memcpy to save a copy. - std::memcpy(static_cast(input_ptr), inputs[i].data.data(), - inputs[i].data.length()); + if (platform::is_cpu_place(place_)) { + // TODO(panyx0718): Init LoDTensor from existing memcpy to save a copy. + std::memcpy(static_cast(input_ptr), inputs[i].data.data(), + inputs[i].data.length()); + } else { +#ifdef PADDLE_WITH_CUDA + platform::DeviceContextPool &pool = + platform::DeviceContextPool::Instance(); + auto *dev_ctx = + static_cast(pool.Get(place_)); + auto dst_gpu_place = boost::get(place_); + memory::Copy(dst_gpu_place, static_cast(input_ptr), + platform::CPUPlace(), inputs[i].data.data(), + inputs[i].data.length(), dev_ctx->stream()); +#else + PADDLE_THROW("Not compile with CUDA, should not reach here."); +#endif + } + // TODO(Superjomn) Low performance, need optimization for heavy LoD copy. framework::LoD lod; for (auto &level : inputs[i].lod) { @@ -248,14 +276,18 @@ bool NativePaddlePredictor::GetFetch(std::vector *outputs, framework::GetFetchVariable(*scope, "fetch", idx); auto type = fetch.type(); auto output = &(outputs->at(i)); - if (type == typeid(float)) { + output->name = fetchs_[idx]->Input("X")[0]; + if (type == framework::DataTypeTrait::DataType) { GetFetchOne(fetch, output); output->dtype = PaddleDType::FLOAT32; - } else if (type == typeid(int64_t)) { + } else if (type == framework::DataTypeTrait::DataType) { GetFetchOne(fetch, output); output->dtype = PaddleDType::INT64; + } else if (type == framework::DataTypeTrait::DataType) { + GetFetchOne(fetch, output); + output->dtype = PaddleDType::INT32; } else { - LOG(ERROR) << "unknown type, only support float32 and int64 now."; + LOG(ERROR) << "unknown type, only support float32, int64 and int32 now."; } } return true; @@ -266,8 +298,8 @@ std::unique_ptr CreatePaddlePredictor< NativeConfig, PaddleEngineKind::kNative>(const NativeConfig &config) { VLOG(3) << "create NativePaddlePredictor"; if (config.use_gpu) { - // 1. GPU memeroy - PADDLE_ENFORCE_GT( + // 1. GPU memory + PADDLE_ENFORCE_GE( config.fraction_of_gpu_memory, 0.f, "fraction_of_gpu_memory in the config should be set to range (0., 1.]"); PADDLE_ENFORCE_GE(config.device, 0, "Invalid device id %d", config.device); diff --git a/paddle/fluid/inference/api/api_impl.h b/paddle/fluid/inference/api/api_impl.h index 7882f6a53c7ce9a2486158ea9b50c018d1814091..96b94777304382a9d4be115a84f80ead69249863 100644 --- a/paddle/fluid/inference/api/api_impl.h +++ b/paddle/fluid/inference/api/api_impl.h @@ -14,23 +14,16 @@ limitations under the License. */ #pragma once -// logging.h and windows.h conflict -#define GLOG_NO_ABBREVIATED_SEVERITIES -// solve static linking error in windows -// https://github.com/google/glog/issues/301 -#define GOOGLE_GLOG_DLL_DECL - #include #include #include #include #include - -#include "paddle/fluid/inference/api/paddle_inference_api.h" - #include "paddle/fluid/framework/ddim.h" #include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/lod_tensor_array.h" #include "paddle/fluid/framework/naive_executor.h" +#include "paddle/fluid/inference/api/details/reset_tensor_array.h" #include "paddle/fluid/inference/api/paddle_inference_api.h" #include "paddle/fluid/inference/io.h" #include "paddle/fluid/platform/init.h" @@ -75,8 +68,14 @@ class NativePaddlePredictor : public PaddlePredictor { std::vector feeds_; std::map feed_names_; std::vector fetchs_; + // Memory buffer for feed inputs. The temporary LoDTensor will cause serious + // concurrency problems, wrong results and memory leak, so cache them. + std::vector feed_tensors_; // Do not use unique_ptr, use parent scope to delete framework::Scope *sub_scope_{nullptr}; + details::TensorArrayBatchCleaner tensor_array_batch_cleaner_; + // A mutex to make Clone thread safe. + std::mutex clone_mutex_; }; } // namespace paddle diff --git a/paddle/fluid/inference/api/api_impl_tester.cc b/paddle/fluid/inference/api/api_impl_tester.cc index b7b8ee6ea08fe907f3f052ae1118f782ac853ca7..2dc5dda34d02c6df9c0ccbc47a1ac960e1aca3f5 100644 --- a/paddle/fluid/inference/api/api_impl_tester.cc +++ b/paddle/fluid/inference/api/api_impl_tester.cc @@ -27,19 +27,24 @@ limitations under the License. */ #define ACC_DIFF 1e-3 #endif -DEFINE_string(dirname, "", "Directory of the inference model."); +DEFINE_string(word2vec_dirname, "", + "Directory of the word2vec inference model."); +DEFINE_string(book_dirname, "", "Directory of the book inference model."); namespace paddle { PaddleTensor LodTensorToPaddleTensor(framework::LoDTensor* t) { PaddleTensor pt; - if (t->type() == typeid(int64_t)) { + if (t->type() == framework::proto::VarType::INT64) { pt.data.Reset(t->data(), t->numel() * sizeof(int64_t)); pt.dtype = PaddleDType::INT64; - } else if (t->type() == typeid(float)) { + } else if (t->type() == framework::proto::VarType::FP32) { pt.data.Reset(t->data(), t->numel() * sizeof(float)); pt.dtype = PaddleDType::FLOAT32; + } else if (t->type() == framework::proto::VarType::INT32) { + pt.data.Reset(t->data(), t->numel() * sizeof(int32_t)); + pt.dtype = PaddleDType::INT32; } else { LOG(FATAL) << "unsupported type."; } @@ -49,7 +54,7 @@ PaddleTensor LodTensorToPaddleTensor(framework::LoDTensor* t) { NativeConfig GetConfig() { NativeConfig config; - config.model_dir = FLAGS_dirname + "/word2vec.inference.model"; + config.model_dir = FLAGS_word2vec_dirname; LOG(INFO) << "dirname " << config.model_dir; config.fraction_of_gpu_memory = 0.15; #ifdef PADDLE_WITH_CUDA @@ -116,7 +121,7 @@ void MainImageClassification(bool use_gpu) { NativeConfig config = GetConfig(); config.use_gpu = use_gpu; config.model_dir = - FLAGS_dirname + "/image_classification_resnet.inference.model"; + FLAGS_book_dirname + "/image_classification_resnet.inference.model"; const bool is_combined = false; std::vector> feed_target_shapes = @@ -187,7 +192,7 @@ void MainThreadsWord2Vec(bool use_gpu) { std::vector threads; for (int tid = 0; tid < num_jobs; ++tid) { threads.emplace_back([&, tid]() { - auto predictor = main_predictor->Clone(); + auto predictor = CreatePaddlePredictor(config); auto& local_inputs = paddle_tensor_feeds[tid]; std::vector local_outputs; ASSERT_TRUE(predictor->Run(local_inputs, &local_outputs)); @@ -220,7 +225,7 @@ void MainThreadsImageClassification(bool use_gpu) { NativeConfig config = GetConfig(); config.use_gpu = use_gpu; config.model_dir = - FLAGS_dirname + "/image_classification_resnet.inference.model"; + FLAGS_book_dirname + "/image_classification_resnet.inference.model"; auto main_predictor = CreatePaddlePredictor(config); std::vector jobs(num_jobs); @@ -245,7 +250,7 @@ void MainThreadsImageClassification(bool use_gpu) { std::vector threads; for (int tid = 0; tid < num_jobs; ++tid) { threads.emplace_back([&, tid]() { - auto predictor = main_predictor->Clone(); + auto predictor = CreatePaddlePredictor(config); auto& local_inputs = paddle_tensor_feeds[tid]; std::vector local_outputs; ASSERT_TRUE(predictor->Run(local_inputs, &local_outputs)); @@ -271,7 +276,7 @@ TEST(inference_api_native, word2vec_cpu_threads) { MainThreadsWord2Vec(false /*use_gpu*/); } TEST(inference_api_native, image_classification_cpu) { - MainThreadsImageClassification(false /*use_gpu*/); + MainImageClassification(false /*use_gpu*/); } TEST(inference_api_native, image_classification_cpu_threads) { MainThreadsImageClassification(false /*use_gpu*/); @@ -279,16 +284,26 @@ TEST(inference_api_native, image_classification_cpu_threads) { #ifdef PADDLE_WITH_CUDA TEST(inference_api_native, word2vec_gpu) { MainWord2Vec(true /*use_gpu*/); } -TEST(inference_api_native, word2vec_gpu_threads) { - MainThreadsWord2Vec(true /*use_gpu*/); -} +// Turn off temporarily for the unstable result. +// TEST(inference_api_native, word2vec_gpu_threads) { +// MainThreadsWord2Vec(true /*use_gpu*/); +// } TEST(inference_api_native, image_classification_gpu) { - MainThreadsImageClassification(true /*use_gpu*/); -} -TEST(inference_api_native, image_classification_gpu_threads) { - MainThreadsImageClassification(true /*use_gpu*/); + MainImageClassification(true /*use_gpu*/); } - +// Turn off temporarily for the unstable result. +// TEST(inference_api_native, image_classification_gpu_threads) { +// MainThreadsImageClassification(true /*use_gpu*/); +// } #endif +TEST(PassBuilder, Delete) { + AnalysisConfig config; + config.DisableGpu(); + config.pass_builder()->DeletePass("attention_lstm_fuse_pass"); + const auto& passes = config.pass_builder()->AllPasses(); + auto it = std::find(passes.begin(), passes.end(), "attention_lstm_fuse_pass"); + ASSERT_EQ(it, passes.end()); +} + } // namespace paddle diff --git a/paddle/fluid/inference/api/api_tensorrt_subgraph_engine.cc b/paddle/fluid/inference/api/api_tensorrt_subgraph_engine.cc deleted file mode 100644 index 7ac468ee4d33f49bba20a07c976055a083743cbc..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/api/api_tensorrt_subgraph_engine.cc +++ /dev/null @@ -1,188 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include "paddle/fluid/inference/analysis/analyzer.h" -#include "paddle/fluid/inference/api/api_impl.h" -#include "paddle/fluid/inference/api/paddle_inference_api.h" -#include "paddle/fluid/inference/tensorrt/convert/op_converter.h" -#include "paddle/fluid/inference/utils/singleton.h" -#include "paddle/fluid/operators/tensorrt_engine_op.h" - -namespace paddle { - -using inference::analysis::Argument; -using inference::Singleton; -using inference::analysis::Analyzer; -using framework::proto::ProgramDesc; -using paddle::contrib::MixedRTConfig; - -class TensorRTSubgraphPredictor : public NativePaddlePredictor { - public: - explicit TensorRTSubgraphPredictor(const MixedRTConfig& config) - : NativePaddlePredictor(config), config_(config) {} - - bool Init(const std::shared_ptr& parent_scope) { - FLAGS_IA_enable_tensorrt_subgraph_engine = true; - VLOG(3) << "Predictor::init()"; - if (config_.use_gpu) { - place_ = paddle::platform::CUDAPlace(config_.device); - } else { - place_ = paddle::platform::CPUPlace(); - } - if (parent_scope) { - scope_ = parent_scope; - sub_scope_ = &(parent_scope->NewScope()); - } else { - paddle::framework::InitDevices(false); - scope_.reset(new paddle::framework::Scope()); - } - - executor_.reset(new paddle::framework::Executor(place_)); - - // Initialize the inference program - if (!config_.model_dir.empty()) { - // Parameters are saved in separate files sited in - // the specified `dirname`. - inference_program_ = paddle::inference::Load( - executor_.get(), scope_.get(), config_.model_dir); - } else if (!config_.prog_file.empty() && !config_.param_file.empty()) { - // All parameters are saved in a single file. - // The file names should be consistent with that used - // in Python API `fluid.io.save_inference_model`. - inference_program_ = paddle::inference::Load( - executor_.get(), scope_.get(), config_.prog_file, config_.param_file); - } else { - LOG(ERROR) << "fail to load inference model."; - return false; - } - - OptimizeInferenceProgram(); - ctx_ = executor_->Prepare(*inference_program_, 0); - - VLOG(5) << "to create variables"; - executor_->CreateVariables(*inference_program_, - sub_scope_ ? sub_scope_ : scope_.get(), 0); - // Get the feed_target_names and fetch_target_names - PrepareFeedFetch(); - return true; - } - - bool Run(const std::vector& inputs, - std::vector* output_data, - int batch_size = -1) override { - PADDLE_ENFORCE_GT(batch_size, 0, - "TensorRT engine needs the argument batch_size set"); - FLAGS_tensorrt_engine_batch_size = batch_size; - return NativePaddlePredictor::Run(inputs, output_data, batch_size); - } - - void OptimizeInferenceProgram() { - // Analyze inference_program - Argument argument; - - argument.Set("minimum_subgraph_size", - new int(config_.minimum_subgraph_size)); - argument.Set("max_batch_size", new int(config_.max_batch_size)); - argument.Set("workspace_size", new int(config_.workspace_size)); - argument.Set("precision_mode", - new std::string(config_.precision_mode)); - - if (!config_.model_dir.empty()) { - argument.fluid_model_dir.reset(new std::string(config_.model_dir)); - } else { - PADDLE_ENFORCE( - !config_.param_file.empty(), - "Either model_dir or (param_file, prog_file) should be set."); - PADDLE_ENFORCE(!config_.prog_file.empty()); - argument.fluid_model_program_path.reset( - new std::string(config_.prog_file)); - argument.fluid_model_param_path.reset( - new std::string(config_.param_file)); - } - argument.origin_program_desc.reset( - new ProgramDesc(*inference_program_->Proto())); - Singleton::Global().Run(&argument); - CHECK(argument.transformed_program_desc); - VLOG(5) << "transformed program:\n" - << argument.transformed_program_desc->SerializeAsString(); - VLOG(5) << "to prepare executor"; - inference_program_.reset( - new framework::ProgramDesc(*argument.transformed_program_desc)); - } - - private: - MixedRTConfig config_; -}; - -template <> -std::unique_ptr -CreatePaddlePredictor( - const MixedRTConfig& config) { - VLOG(3) << "create TensorRTSubgraphPredictor"; - if (config.use_gpu) { - // 1. GPU memeroy - PADDLE_ENFORCE_GT( - config.fraction_of_gpu_memory, 0.f, - "fraction_of_gpu_memory in the config should be set to range (0., 1.]"); - PADDLE_ENFORCE_GE(config.device, 0, "Invalid device id %d", config.device); - std::vector flags; - if (config.fraction_of_gpu_memory >= 0.0f || - config.fraction_of_gpu_memory <= 0.95f) { - flags.push_back("dummpy"); - std::string flag = "--fraction_of_gpu_memory_to_use=" + - std::to_string(config.fraction_of_gpu_memory); - flags.push_back(flag); - VLOG(3) << "set flag: " << flag; - framework::InitGflags(flags); - } - } - - std::unique_ptr predictor( - new TensorRTSubgraphPredictor(config)); - if (!dynamic_cast(predictor.get()) - ->Init(nullptr)) { - return nullptr; - } - return std::move(predictor); -} - -template <> -std::unique_ptr CreatePaddlePredictor( - const MixedRTConfig& config) { - return CreatePaddlePredictor(config); -} - -} // namespace paddle - -USE_TRT_CONVERTER(elementwise_add_weight); -USE_TRT_CONVERTER(elementwise_add_tensor); -USE_TRT_CONVERTER(elementwise_sub_tensor); -USE_TRT_CONVERTER(elementwise_div_tensor); -USE_TRT_CONVERTER(elementwise_mul_tensor); -USE_TRT_CONVERTER(elementwise_max_tensor); -USE_TRT_CONVERTER(elementwise_min_tensor); -USE_TRT_CONVERTER(elementwise_pow_tensor); -USE_TRT_CONVERTER(mul); -USE_TRT_CONVERTER(conv2d); -USE_TRT_CONVERTER(relu); -USE_TRT_CONVERTER(sigmoid); -USE_TRT_CONVERTER(tanh); -USE_TRT_CONVERTER(fc); -USE_TRT_CONVERTER(pool2d); -USE_TRT_CONVERTER(softmax); -USE_TRT_CONVERTER(batch_norm); -USE_TRT_CONVERTER(concat); -USE_TRT_CONVERTER(dropout); -USE_TRT_CONVERTER(pad); diff --git a/paddle/fluid/inference/api/api_tensorrt_subgraph_engine_tester.cc b/paddle/fluid/inference/api/api_tensorrt_subgraph_engine_tester.cc deleted file mode 100644 index 702158ea3bcab854eece3ccd40724d92efcbae67..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/api/api_tensorrt_subgraph_engine_tester.cc +++ /dev/null @@ -1,92 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include -#include -#include -#include "paddle/fluid/inference/analysis/analyzer.h" -#include "paddle/fluid/inference/api/paddle_inference_api.h" - -namespace paddle { - -using contrib::MixedRTConfig; - -DEFINE_string(dirname, "", "Directory of the inference model."); - -void CompareTensorRTWithFluid(bool enable_tensorrt) { - FLAGS_IA_enable_tensorrt_subgraph_engine = enable_tensorrt; - - //# 1. Create PaddlePredictor with a config. - NativeConfig config0; - config0.model_dir = FLAGS_dirname + "word2vec.inference.model"; - config0.use_gpu = true; - config0.fraction_of_gpu_memory = 0.3; - config0.device = 0; - - MixedRTConfig config1; - config1.model_dir = FLAGS_dirname + "word2vec.inference.model"; - config1.use_gpu = true; - config1.fraction_of_gpu_memory = 0.3; - config1.device = 0; - config1.max_batch_size = 10; - - auto predictor0 = CreatePaddlePredictor(config0); - auto predictor1 = CreatePaddlePredictor(config1); - - for (int batch_id = 0; batch_id < 1; batch_id++) { - //# 2. Prepare input. - std::vector data(20); - for (int i = 0; i < 20; i++) data[i] = i; - - PaddleTensor tensor; - tensor.shape = std::vector({10, 1}); - tensor.data = PaddleBuf(data.data(), data.size() * sizeof(int64_t)); - tensor.dtype = PaddleDType::INT64; - - // For simplicity, we set all the slots with the same data. - std::vector slots(4, tensor); - - //# 3. Run - std::vector outputs0; - std::vector outputs1; - CHECK(predictor0->Run(slots, &outputs0)); - CHECK(predictor1->Run(slots, &outputs1, 10)); - - //# 4. Get output. - ASSERT_EQ(outputs0.size(), 1UL); - ASSERT_EQ(outputs1.size(), 1UL); - - const size_t num_elements = outputs0.front().data.length() / sizeof(float); - const size_t num_elements1 = outputs1.front().data.length() / sizeof(float); - EXPECT_EQ(num_elements, num_elements1); - - auto *data0 = static_cast(outputs0.front().data.data()); - auto *data1 = static_cast(outputs1.front().data.data()); - - ASSERT_GT(num_elements, 0UL); - for (size_t i = 0; i < std::min(num_elements, num_elements1); i++) { - EXPECT_NEAR(data0[i], data1[i], 1e-3); - } - } -} - -TEST(paddle_inference_api_tensorrt_subgraph_engine, without_tensorrt) { - CompareTensorRTWithFluid(false); -} - -TEST(paddle_inference_api_tensorrt_subgraph_engine, with_tensorrt) { - CompareTensorRTWithFluid(true); -} - -} // namespace paddle diff --git a/paddle/fluid/inference/api/api_tester.cc b/paddle/fluid/inference/api/api_tester.cc index 7a579610eefda24c911edd28b5f3a178aa10ab1e..2c450ef7cead4d5c3870d5e9186eb221e5dc19a0 100644 --- a/paddle/fluid/inference/api/api_tester.cc +++ b/paddle/fluid/inference/api/api_tester.cc @@ -61,4 +61,10 @@ TEST(paddle_inference_api, demo) { predictor->Run({}, &outputs); } +TEST(paddle_inference_api, get_version) { + LOG(INFO) << "paddle version:\n" << get_version(); + auto version = get_version(); + ASSERT_FALSE(version.empty()); +} + } // namespace paddle diff --git a/paddle/fluid/inference/api/demo_ci/CMakeLists.txt b/paddle/fluid/inference/api/demo_ci/CMakeLists.txt index 03f0f726eb61c2619c7719a865383090f86b5b7f..19ef402d6fd78d6a65bdb0bbd22198f36b872a27 100644 --- a/paddle/fluid/inference/api/demo_ci/CMakeLists.txt +++ b/paddle/fluid/inference/api/demo_ci/CMakeLists.txt @@ -15,12 +15,43 @@ macro(safe_set_static_flag) endforeach(flag_var) endmacro() +if(NOT DEFINED PADDLE_LIB) + message(FATAL_ERROR "please set PADDLE_LIB with -DPADDLE_LIB=/path/paddle/lib") +endif() +if(NOT DEFINED DEMO_NAME) + message(FATAL_ERROR "please set DEMO_NAME with -DDEMO_NAME=demo_name") +endif() + +include_directories("${PADDLE_LIB}/") +include_directories("${PADDLE_LIB}/fluid_inference_install_dir/") +include_directories("${PADDLE_LIB}/third_party/install/protobuf/include") +include_directories("${PADDLE_LIB}/third_party/install/glog/include") +include_directories("${PADDLE_LIB}/third_party/install/gflags/include") +include_directories("${PADDLE_LIB}/third_party/install/xxhash/include") +include_directories("${PADDLE_LIB}/third_party/install/snappy/include") +include_directories("${PADDLE_LIB}/third_party/install/snappystream/include") +include_directories("${PADDLE_LIB}/third_party/install/zlib/include") +include_directories("${PADDLE_LIB}/third_party/boost") +include_directories("${PADDLE_LIB}/third_party/eigen3") + +link_directories("${PADDLE_LIB}/third_party/install/snappy/lib") +link_directories("${PADDLE_LIB}/third_party/install/snappystream/lib") +link_directories("${PADDLE_LIB}/third_party/install/zlib/lib") +link_directories("${PADDLE_LIB}/third_party/install/protobuf/lib") +link_directories("${PADDLE_LIB}/third_party/install/glog/lib") +link_directories("${PADDLE_LIB}/third_party/install/gflags/lib") +link_directories("${PADDLE_LIB}/third_party/install/xxhash/lib") +link_directories("${PADDLE_LIB}/paddle/lib") + if (WIN32) + add_definitions("/DGOOGLE_GLOG_DLL_DECL=") + set(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} /bigobj /MTd") + set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} /bigobj /MT") + set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} /bigobj /MTd") + set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} /bigobj /MT") if (WITH_STATIC_LIB) safe_set_static_flag() add_definitions(-DSTATIC_LIB) - set(CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS} "/w") - set(CMAKE_CXX_FLAGS_RELEASE ${CMAKE_CXX_FLAGS_RELEASE} "/w") endif() set(CMAKE_STATIC_LIBRARY_PREFIX "lib") else() @@ -29,66 +60,50 @@ else() endif() message("flags" ${CMAKE_CXX_FLAGS}) -if(NOT DEFINED PADDLE_LIB) - message(FATAL_ERROR "please set PADDLE_LIB with -DPADDLE_LIB=/path/paddle/lib") -endif() -if(NOT DEFINED DEMO_NAME) - message(FATAL_ERROR "please set DEMO_NAME with -DDEMO_NAME=demo_name") -endif() - - if(WITH_GPU) if(NOT WIN32) set(CUDA_LIB "/usr/local/cuda/lib64/" CACHE STRING "CUDA Library") else() if(CUDA_LIB STREQUAL "") - set(CUDA_LIB "C:\\Program\ Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v8.0\\lib\\x64") + set(CUDA_LIB "C:\\Program\ Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v8.0\\lib\\x64") endif() endif(NOT WIN32) endif() -include_directories("D:/Paddle/") -include_directories("${PADDLE_LIB}") -include_directories("${PADDLE_LIB}/third_party/install/protobuf/include") -include_directories("${PADDLE_LIB}/third_party/install/glog/include") -include_directories("${PADDLE_LIB}/third_party/install/gflags/include") if (NOT WIN32) -include_directories("${PADDLE_LIB}/third_party/install/snappy/include") -include_directories("${PADDLE_LIB}/third_party/install/snappystream/include") -include_directories("${PADDLE_LIB}/third_party/install/zlib/include") -endif(NOT WIN32) - -include_directories("${PADDLE_LIB}/third_party/boost") -include_directories("${PADDLE_LIB}/third_party/eigen3") - -if (NOT WIN32) - if (USE_TENSORRT AND WITH_GPU) + if (USE_TENSORRT AND WITH_GPU) include_directories("${TENSORRT_INCLUDE_DIR}") link_directories("${TENSORRT_LIB_DIR}") endif() endif(NOT WIN32) if (NOT WIN32) -link_directories("${PADDLE_LIB}/third_party/install/snappy/lib") -link_directories("${PADDLE_LIB}/third_party/install/snappystream/lib") -link_directories("${PADDLE_LIB}/third_party/install/zlib/lib") -endif(NOT WIN32) - -link_directories("${PADDLE_LIB}/third_party/install/protobuf/lib") -link_directories("${PADDLE_LIB}/third_party/install/glog/lib") -link_directories("${PADDLE_LIB}/third_party/install/gflags/lib") -link_directories("${PADDLE_LIB}/paddle/lib") - -add_executable(${DEMO_NAME} ${DEMO_NAME}.cc) + set(NGRAPH_PATH "${PADDLE_LIB}/third_party/install/ngraph") + if(EXISTS ${NGRAPH_PATH}) + include(GNUInstallDirs) + include_directories("${NGRAPH_PATH}/include") + link_directories("${NGRAPH_PATH}/${CMAKE_INSTALL_LIBDIR}") + set(NGRAPH_LIB ${NGRAPH_PATH}/${CMAKE_INSTALL_LIBDIR}/libngraph${CMAKE_SHARED_LIBRARY_SUFFIX}) + endif() +endif() if(WITH_MKL) include_directories("${PADDLE_LIB}/third_party/install/mklml/include") - set(MATH_LIB ${PADDLE_LIB}/third_party/install/mklml/lib/libmklml_intel${CMAKE_SHARED_LIBRARY_SUFFIX} - ${PADDLE_LIB}/third_party/install/mklml/lib/libiomp5${CMAKE_SHARED_LIBRARY_SUFFIX}) + if(NOT WIN32) + set(MATH_LIB ${PADDLE_LIB}/third_party/install/mklml/lib/libmklml_intel${CMAKE_SHARED_LIBRARY_SUFFIX} + ${PADDLE_LIB}/third_party/install/mklml/lib/libiomp5${CMAKE_SHARED_LIBRARY_SUFFIX}) + else() + set(MATH_LIB ${PADDLE_LIB}/third_party/install/mklml/lib/libmklml${CMAKE_SHARED_LIBRARY_SUFFIX} + ${PADDLE_LIB}/third_party/install/mklml/lib/libiomp5md${CMAKE_SHARED_LIBRARY_SUFFIX}) + endif() set(MKLDNN_PATH "${PADDLE_LIB}/third_party/install/mkldnn") if(EXISTS ${MKLDNN_PATH}) include_directories("${MKLDNN_PATH}/include") - set(MKLDNN_LIB ${MKLDNN_PATH}/lib/libmkldnn.so.0) + if(WIN32) + set(MKLDNN_LIB ${MKLDNN_PATH}/lib/mkldnn.lib) + else(WIN32) + set(MKLDNN_LIB ${MKLDNN_PATH}/lib/libmkldnn.so.0) + endif(WIN32) endif() else() set(MATH_LIB ${PADDLE_LIB}/third_party/install/openblas/lib/libopenblas${CMAKE_STATIC_LIBRARY_SUFFIX}) @@ -96,40 +111,40 @@ endif() # Note: libpaddle_inference_api.so/a must put before libpaddle_fluid.so/a if(WITH_STATIC_LIB) - set(DEPS - ${PADDLE_LIB}/paddle/lib/libpaddle_fluid${CMAKE_STATIC_LIBRARY_SUFFIX}) + set(DEPS ${PADDLE_LIB}/paddle/lib/libpaddle_fluid${CMAKE_STATIC_LIBRARY_SUFFIX}) else() - set(DEPS - ${PADDLE_LIB}/paddle/lib/libpaddle_fluid${CMAKE_SHARED_LIBRARY_SUFFIX}) + set(DEPS ${PADDLE_LIB}/paddle/lib/libpaddle_fluid${CMAKE_SHARED_LIBRARY_SUFFIX}) endif() if (NOT WIN32) -set(EXTERNAL_LIB "-lrt -ldl -lpthread") -set(DEPS ${DEPS} - ${MATH_LIB} ${MKLDNN_LIB} - glog gflags protobuf snappystream snappy z - ${EXTERNAL_LIB}) + set(EXTERNAL_LIB "-lrt -ldl -lpthread") + set(DEPS ${DEPS} + ${MATH_LIB} ${MKLDNN_LIB} ${NGRAPH_LIB} + glog gflags protobuf snappystream snappy z xxhash + ${EXTERNAL_LIB}) else() -set(DEPS ${DEPS} - ${MATH_LIB} ${MKLDNN_LIB} - ${CMAKE_STATIC_LIBRARY_PREFIX}glog ${CMAKE_STATIC_LIBRARY_PREFIX}gflags ${CMAKE_STATIC_LIBRARY_PREFIX}protobuf - ${EXTERNAL_LIB}) -# NOTE(dzhwinter) shlwapi is deprecated. -set(DEPS ${DEPS} libcmt shlwapi) + set(DEPS ${DEPS} + ${MATH_LIB} ${MKLDNN_LIB} + ${CMAKE_STATIC_LIBRARY_PREFIX}glog ${CMAKE_STATIC_LIBRARY_PREFIX}gflags ${CMAKE_STATIC_LIBRARY_PREFIX}protobuf + ${CMAKE_STATIC_LIBRARY_PREFIX}snappy ${CMAKE_STATIC_LIBRARY_PREFIX}z ${CMAKE_STATIC_LIBRARY_PREFIX}xxhash + snappystream ${EXTERNAL_LIB}) + get_property(os_dependency_modules GLOBAL PROPERTY OS_DEPENDENCY_MODULES) + set(DEPS ${DEPS} libcmt ${os_dependency_modules}) endif(NOT WIN32) if(WITH_GPU) if(NOT WIN32) - if (USE_TENSORRT) + if (USE_TENSORRT) set(DEPS ${DEPS} ${TENSORRT_LIB_DIR}/libnvinfer${CMAKE_STATIC_LIBRARY_SUFFIX}) set(DEPS ${DEPS} ${TENSORRT_LIB_DIR}/libnvinfer_plugin${CMAKE_STATIC_LIBRARY_SUFFIX}) endif() set(DEPS ${DEPS} ${CUDA_LIB}/libcudart${CMAKE_SHARED_LIBRARY_SUFFIX}) else() set(DEPS ${DEPS} ${CUDA_LIB}/cudart${CMAKE_STATIC_LIBRARY_SUFFIX} ) - set(DEPS ${DEPS} ${CUDA_LIB}/cublas${CMAKE_STATIC_LIBRARY_SUFFIX} ) - set(DEPS ${DEPS} ${CUDA_LIB}/cudnn${CMAKE_STATIC_LIBRARY_SUFFIX} ) + set(DEPS ${DEPS} ${CUDA_LIB}/cublas${CMAKE_STATIC_LIBRARY_SUFFIX} ) + set(DEPS ${DEPS} ${CUDA_LIB}/cudnn${CMAKE_STATIC_LIBRARY_SUFFIX} ) endif() endif() +add_executable(${DEMO_NAME} ${DEMO_NAME}.cc) target_link_libraries(${DEMO_NAME} ${DEPS}) diff --git a/paddle/fluid/inference/api/demo_ci/run.sh b/paddle/fluid/inference/api/demo_ci/run.sh index 67994aad70a40c0e0c8a311914d4ea40b96eaf1e..bf2e3593c2beadaea2cb08aa3dcc2370c3e06bf4 100755 --- a/paddle/fluid/inference/api/demo_ci/run.sh +++ b/paddle/fluid/inference/api/demo_ci/run.sh @@ -1,3 +1,4 @@ +#!/bin/bash set -x PADDLE_ROOT=$1 TURN_ON_MKL=$2 # use MKL or Openblas @@ -16,17 +17,17 @@ if [ $2 == ON ]; then fi if [ $3 == ON ]; then use_gpu_list='true false' -else +else use_gpu_list='false' fi USE_TENSORRT=OFF -if [ [-d"$TENSORRT_INCLUDE_DIR"] -a [-d"$TENSORRT_LIB_DIR"] ]; then +if [ -d "$TENSORRT_INCLUDE_DIR" -a -d "$TENSORRT_LIB_DIR" ]; then USE_TENSORRT=ON fi PREFIX=inference-vis-demos%2F -URL_ROOT=http://paddlemodels.cdn.bcebos.com/${PREFIX} +URL_ROOT=http://paddlemodels.bj.bcebos.com/${PREFIX} # download vis_demo data function download() { @@ -54,6 +55,9 @@ mkdir -p build cd build for WITH_STATIC_LIB in ON OFF; do +# TODO(Superjomn) reopen this +# something wrong with the TensorArray reset. +:< #include //NOLINT -#include "paddle/include/paddle_inference_api.h" +#include "utils.h" // NOLINT DEFINE_string(dirname, "", "Directory of the inference model."); DEFINE_bool(use_gpu, false, "Whether use gpu."); diff --git a/paddle/fluid/inference/api/demo_ci/trt_mobilenet_demo.cc b/paddle/fluid/inference/api/demo_ci/trt_mobilenet_demo.cc index 4a8404f21c6ec6a1647e964ac3538b4b49151009..f7da55c9ae368763786c1b1fd3e86d942c5e9fe8 100644 --- a/paddle/fluid/inference/api/demo_ci/trt_mobilenet_demo.cc +++ b/paddle/fluid/inference/api/demo_ci/trt_mobilenet_demo.cc @@ -4,7 +4,7 @@ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at -http://www.apache.org/licenses/LICENSE-2.0 + http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, @@ -36,14 +36,12 @@ namespace demo { */ void Main() { std::unique_ptr predictor; - paddle::contrib::MixedRTConfig config; - config.param_file = FLAGS_modeldir + "/__params__"; - config.prog_file = FLAGS_modeldir + "/__model__"; - config.use_gpu = true; - config.device = 0; - config.max_batch_size = 1; - config.fraction_of_gpu_memory = 0.1; // set by yourself - predictor = CreatePaddlePredictor(config); + paddle::AnalysisConfig config; + config.EnableUseGpu(100, 0); + config.SetModel(FLAGS_modeldir + "/__model__", + FLAGS_modeldir + "/__params__"); + config.EnableTensorRtEngine(); + predictor = CreatePaddlePredictor(config); VLOG(3) << "begin to process data"; // Just a single batch of data. diff --git a/paddle/fluid/inference/api/demo_ci/utils.h b/paddle/fluid/inference/api/demo_ci/utils.h index d70c6aea791219a40c3164b51499f9d5e562be71..1505a898c5bba285b377203c1503b8615666b196 100644 --- a/paddle/fluid/inference/api/demo_ci/utils.h +++ b/paddle/fluid/inference/api/demo_ci/utils.h @@ -88,13 +88,20 @@ void CheckOutput(const std::string& referfile, const PaddleTensor& output) { } break; } - case PaddleDType::FLOAT32: + case PaddleDType::FLOAT32: { for (size_t i = 0; i < numel; ++i) { CHECK_LT( fabs(static_cast(output.data.data())[i] - refer.data[i]), 1e-5); } break; + } + case PaddleDType::INT32: { + for (size_t i = 0; i < numel; ++i) { + CHECK_EQ(static_cast(output.data.data())[i], refer.data[i]); + } + break; + } } } @@ -113,11 +120,18 @@ static std::string SummaryTensor(const PaddleTensor& tensor) { } break; } - case PaddleDType::FLOAT32: + case PaddleDType::FLOAT32: { for (int i = 0; i < std::min(num_elems, 10); i++) { ss << static_cast(tensor.data.data())[i] << " "; } break; + } + case PaddleDType::INT32: { + for (int i = 0; i < std::min(num_elems, 10); i++) { + ss << static_cast(tensor.data.data())[i] << " "; + } + break; + } } return ss.str(); } diff --git a/paddle/fluid/inference/api/demo_ci/vis_demo.cc b/paddle/fluid/inference/api/demo_ci/vis_demo.cc index 8d546e3e9c740c10bcf2984e073c956e3612625c..0d2c418c56db620c71d99b64ee79b18be427cc34 100644 --- a/paddle/fluid/inference/api/demo_ci/vis_demo.cc +++ b/paddle/fluid/inference/api/demo_ci/vis_demo.cc @@ -17,7 +17,7 @@ limitations under the License. */ */ #include -#include // use glog instead of CHECK to avoid importing other paddle header files. +#include #include "utils.h" // NOLINT #ifdef PADDLE_WITH_CUDA @@ -34,26 +34,21 @@ DEFINE_bool(use_gpu, false, "Whether use gpu."); namespace paddle { namespace demo { -using contrib::AnalysisConfig; /* * Use the native and analysis fluid engine to inference the demo. */ void Main(bool use_gpu) { std::unique_ptr predictor, analysis_predictor; AnalysisConfig config; - config.param_file = FLAGS_modeldir + "/__params__"; - config.prog_file = FLAGS_modeldir + "/__model__"; - config.use_gpu = use_gpu; - config.device = 0; - if (FLAGS_use_gpu) { - config.fraction_of_gpu_memory = 0.1; // set by yourself + if (use_gpu) { + config.EnableUseGpu(100, 0); } + config.SetModel(FLAGS_modeldir + "/__model__", + FLAGS_modeldir + "/__params__"); - VLOG(3) << "init predictor"; - predictor = CreatePaddlePredictor(config); - analysis_predictor = CreatePaddlePredictor(config); + predictor = CreatePaddlePredictor(config.ToNativeConfig()); + analysis_predictor = CreatePaddlePredictor(config); - VLOG(3) << "begin to process data"; // Just a single batch of data. std::string line; std::ifstream file(FLAGS_data); @@ -68,13 +63,10 @@ void Main(bool use_gpu) { PaddleBuf(record.data.data(), record.data.size() * sizeof(float)); input.dtype = PaddleDType::FLOAT32; - VLOG(3) << "run executor"; std::vector output, analysis_output; predictor->Run({input}, &output, 1); - VLOG(3) << "output.size " << output.size(); auto& tensor = output.front(); - VLOG(3) << "output: " << SummaryTensor(tensor); // compare with reference result CheckOutput(FLAGS_refer, tensor); diff --git a/paddle/fluid/inference/api/details/CMakeLists.txt b/paddle/fluid/inference/api/details/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..80b53b32a8607b4e67f42ba30bd1a283c93ebed1 --- /dev/null +++ b/paddle/fluid/inference/api/details/CMakeLists.txt @@ -0,0 +1,18 @@ +# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +cc_library(reset_tensor_array SRCS reset_tensor_array.cc DEPS lod_tensor scope) +cc_library(zero_copy_tensor SRCS zero_copy_tensor.cc DEPS scope lod_tensor enforce) +cc_library(zero_copy_tensor_dummy SRCS zero_copy_tensor_dummy.cc) diff --git a/paddle/fluid/inference/api/details/reset_tensor_array.cc b/paddle/fluid/inference/api/details/reset_tensor_array.cc new file mode 100644 index 0000000000000000000000000000000000000000..03c2aa3fb8094ce2996f513b90589de0ef903ae8 --- /dev/null +++ b/paddle/fluid/inference/api/details/reset_tensor_array.cc @@ -0,0 +1,73 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/api/details/reset_tensor_array.h" + +namespace paddle { +namespace details { + +// Should be called after the parameters are loaded. +void TensorArrayBatchCleaner::CollectTensorArrays(framework::Scope *scope) { + if (flag_) { + for (auto &var_name : scope->LocalVarNames()) { + auto *var = scope->FindVar(var_name); + // TODO(Superjomn) should avoid the case when a TensorArray is a + // parameter. + if (var_name == "feed" || var_name == "fetch") continue; + if (var->IsType()) { + VLOG(4) << "collect " << var_name; + arrays_.push_back(var->GetMutable()); + } + } + for (auto *kid : scope->kids()) { + CollectTensorArrays(kid); + } + + VLOG(3) << "Collect " << arrays_.size() << " arrays"; + flag_ = false; + } +} + +// Should be called when `Run` finished. +void TensorArrayBatchCleaner::ResetTensorArray() { + for (auto *arr : arrays_) { + arr->clear(); + } +} + +void TensorArrayBatchCleaner::CollectNoTensorVars(framework::Scope *scope) { + if (no_tensor_flag_) { + for (auto &var_name : scope->LocalVarNames()) { + auto *var = scope->FindVar(var_name); + if (!var->IsInitialized()) continue; + if (!valid_types_.count(var->Type())) { + no_tensor_vars_.insert(var); + } + } + + for (auto *kid : scope->kids()) { + CollectTensorArrays(kid); + } + no_tensor_flag_ = false; // Only collect one time. + } +} + +void TensorArrayBatchCleaner::ResetNoTensorVars() { + for (auto *var : no_tensor_vars_) { + var->Clear(); + } +} + +} // namespace details +} // namespace paddle diff --git a/paddle/fluid/inference/api/details/reset_tensor_array.h b/paddle/fluid/inference/api/details/reset_tensor_array.h new file mode 100644 index 0000000000000000000000000000000000000000..213c6891d0e2320689c8c69266d40611f295edc8 --- /dev/null +++ b/paddle/fluid/inference/api/details/reset_tensor_array.h @@ -0,0 +1,57 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include "paddle/fluid/framework/lod_tensor_array.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/framework/variable.h" + +namespace paddle { +namespace details { + +// Clean the TensorArray each batch to make the behavior the same with the +// training phase. +struct TensorArrayBatchCleaner { + TensorArrayBatchCleaner() { + constexpr auto kTensorId = framework::VarTypeTrait::kId; + constexpr auto kLoDTensorId = + framework::VarTypeTrait::kId; + valid_types_.insert(kTensorId); + valid_types_.insert(kLoDTensorId); + } + // Collect the variables that are not Tensor or LoDTensor, and reset them to a + // bool(trick), because some of them are containers, and some operators just + // keep inserting new items without clearing the containers first; So the + // memory grow larger and larger in inference service deployed online. + void CollectNoTensorVars(framework::Scope *scope); + void ResetNoTensorVars(); + + // Fix the tensor array not clear in the inference scenarios. + void CollectTensorArrays(framework::Scope *scope); + void ResetTensorArray(); + + private: + bool flag_{true}; + bool no_tensor_flag_{true}; + std::vector arrays_; + + std::unordered_set valid_types_; + std::unordered_set no_tensor_vars_; +}; + +} // namespace details +} // namespace paddle diff --git a/paddle/fluid/inference/api/details/zero_copy_tensor.cc b/paddle/fluid/inference/api/details/zero_copy_tensor.cc index 14698f6dfc8885ec1d35f1912bad10a9caa13db4..937b6398f8131a6cf4e8b0002e38f4513f0f884f 100644 --- a/paddle/fluid/inference/api/details/zero_copy_tensor.cc +++ b/paddle/fluid/inference/api/details/zero_copy_tensor.cc @@ -15,6 +15,7 @@ #include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/framework/scope.h" #include "paddle/fluid/inference/api/paddle_inference_api.h" +#include "paddle/fluid/memory/memcpy.h" #include "paddle/fluid/platform/enforce.h" namespace paddle { @@ -33,9 +34,15 @@ void ZeroCopyTensor::Reshape(const std::vector &shape) { tensor->Resize(framework::make_ddim(shape)); } +#define EAGER_GET_TENSOR \ + if (!tensor_) { \ + tensor_ = FindTensor(); \ + } \ + auto *tensor = static_cast(tensor_); + template T *ZeroCopyTensor::mutable_data(PaddlePlace place) { - auto *tensor = static_cast(FindTensor()); + EAGER_GET_TENSOR; switch (static_cast(place)) { case static_cast(PaddlePlace::kCPU): { return tensor->mutable_data(platform::CPUPlace()); @@ -51,8 +58,8 @@ T *ZeroCopyTensor::mutable_data(PaddlePlace place) { } template -T *ZeroCopyTensor::data(PaddlePlace *place, int *size) { - auto *tensor = static_cast(FindTensor()); +T *ZeroCopyTensor::data(PaddlePlace *place, int *size) const { + EAGER_GET_TENSOR; auto *res = tensor->data(); if (platform::is_cpu_place(tensor->place())) { @@ -67,10 +74,88 @@ T *ZeroCopyTensor::data(PaddlePlace *place, int *size) { return res; } -template float *ZeroCopyTensor::data(PaddlePlace *place, int *size); -template int64_t *ZeroCopyTensor::data(PaddlePlace *place, int *size); +PaddleDType ZeroCopyTensor::type() const { + EAGER_GET_TENSOR; + auto type = tensor->type(); + if (type == framework::proto::VarType::FP32) { + return PaddleDType::FLOAT32; + } else if (type == framework::proto::VarType::INT64) { + return PaddleDType::INT64; + } else if (type == framework::proto::VarType::INT32) { + return PaddleDType::INT32; + } else { + LOG(ERROR) << "unknown type, only support float32 and int64 now."; + } + return PaddleDType::FLOAT32; +} + +template +void ZeroCopyTensor::copy_from_cpu(const T *data) { + EAGER_GET_TENSOR; + PADDLE_ENFORCE_GE( + tensor->numel(), 0, + "You should call ZeroCopyTensor::Reshape(const std::vector &shape)" + "function before copy data from cpu."); + size_t ele_size = tensor->numel() * sizeof(T); + + if (place_ == PaddlePlace::kCPU) { + auto *t_data = tensor->mutable_data(platform::CPUPlace()); + std::memcpy(static_cast(t_data), data, ele_size); + } else { +#ifdef PADDLE_WITH_CUDA + platform::DeviceContextPool &pool = platform::DeviceContextPool::Instance(); + platform::CUDAPlace gpu_place(device_); + auto *t_data = tensor->mutable_data(gpu_place); + auto *dev_ctx = + static_cast(pool.Get(gpu_place)); + + memory::Copy(gpu_place, static_cast(t_data), platform::CPUPlace(), + data, ele_size, dev_ctx->stream()); +#else + PADDLE_THROW("Not compile with CUDA, should not reach here."); +#endif + } +} + +template +void ZeroCopyTensor::copy_to_cpu(T *data) { + EAGER_GET_TENSOR; + auto ele_num = tensor->numel(); + auto *t_data = tensor->data(); + auto t_place = tensor->place(); + + if (platform::is_cpu_place(t_place)) { + std::memcpy(static_cast(data), t_data, ele_num * sizeof(T)); + } else { +#ifdef PADDLE_WITH_CUDA + platform::DeviceContextPool &pool = platform::DeviceContextPool::Instance(); + auto gpu_place = boost::get(t_place); + auto *dev_ctx = + static_cast(pool.Get(gpu_place)); + memory::Copy(platform::CPUPlace(), static_cast(data), gpu_place, + t_data, ele_num * sizeof(T), dev_ctx->stream()); + cudaDeviceSynchronize(); +#else + PADDLE_THROW("Not compile with CUDA, should not reach here."); +#endif + } +} +template void ZeroCopyTensor::copy_from_cpu(const float *data); +template void ZeroCopyTensor::copy_from_cpu(const int64_t *data); +template void ZeroCopyTensor::copy_from_cpu(const int32_t *data); +template void ZeroCopyTensor::copy_to_cpu(float *data); +template void ZeroCopyTensor::copy_to_cpu(int64_t *data); +template void ZeroCopyTensor::copy_to_cpu(int32_t *data); + +template float *ZeroCopyTensor::data(PaddlePlace *place, + int *size) const; +template int64_t *ZeroCopyTensor::data(PaddlePlace *place, + int *size) const; +template int32_t *ZeroCopyTensor::data(PaddlePlace *place, + int *size) const; template float *ZeroCopyTensor::mutable_data(PaddlePlace place); template int64_t *ZeroCopyTensor::mutable_data(PaddlePlace place); +template int32_t *ZeroCopyTensor::mutable_data(PaddlePlace place); void *ZeroCopyTensor::FindTensor() const { PADDLE_ENFORCE(!name_.empty(), @@ -84,14 +169,14 @@ void *ZeroCopyTensor::FindTensor() const { return tensor; } -std::vector ZeroCopyTensor::shape() { - auto *tensor = static_cast(FindTensor()); - PADDLE_ENFORCE(tensor, "not found tensor called %s in the scope", name_); - return framework::vectorize(tensor->dims()); +std::vector ZeroCopyTensor::shape() const { + EAGER_GET_TENSOR; + PADDLE_ENFORCE(tensor_, "not found tensor called %s in the scope", name_); + return framework::vectorize2int(tensor->dims()); } void ZeroCopyTensor::SetLoD(const std::vector> &x) { - auto *tensor = static_cast(FindTensor()); + EAGER_GET_TENSOR; framework::LoD lod; for (auto &level : x) { lod.emplace_back(level); @@ -100,8 +185,8 @@ void ZeroCopyTensor::SetLoD(const std::vector> &x) { } std::vector> ZeroCopyTensor::lod() const { + EAGER_GET_TENSOR; std::vector> res; - auto *tensor = static_cast(FindTensor()); for (auto &level : tensor->lod()) { res.emplace_back(level); } diff --git a/paddle/fluid/inference/api/details/zero_copy_tensor_dummy.cc b/paddle/fluid/inference/api/details/zero_copy_tensor_dummy.cc index 2d5b561d801cd9e734cab13b28e7285493e30f94..cbbb3ea2d1395acdf4c460bea4b7868c31a20e53 100644 --- a/paddle/fluid/inference/api/details/zero_copy_tensor_dummy.cc +++ b/paddle/fluid/inference/api/details/zero_copy_tensor_dummy.cc @@ -24,18 +24,20 @@ T *ZeroCopyTensor::mutable_data(PaddlePlace place) { } template -T *ZeroCopyTensor::data(PaddlePlace *place, int *size) { +T *ZeroCopyTensor::data(PaddlePlace *place, int *size) const { return nullptr; } -template float *ZeroCopyTensor::data(PaddlePlace *place, int *size); -template int64_t *ZeroCopyTensor::data(PaddlePlace *place, int *size); +template float *ZeroCopyTensor::data(PaddlePlace *place, + int *size) const; +template int64_t *ZeroCopyTensor::data(PaddlePlace *place, + int *size) const; template float *ZeroCopyTensor::mutable_data(PaddlePlace place); template int64_t *ZeroCopyTensor::mutable_data(PaddlePlace place); void *ZeroCopyTensor::FindTensor() const { return nullptr; } -std::vector ZeroCopyTensor::shape() { return {}; } +std::vector ZeroCopyTensor::shape() const { return {}; } void ZeroCopyTensor::SetLoD(const std::vector> &x) {} diff --git a/paddle/fluid/inference/api/helper.h b/paddle/fluid/inference/api/helper.h index 24f59cf43a9700ff1732e1ef6ad82e1a6294eede..c89dd41e0a6283e0723e2925f28c0372cda6a2b2 100644 --- a/paddle/fluid/inference/api/helper.h +++ b/paddle/fluid/inference/api/helper.h @@ -15,14 +15,21 @@ #pragma once #include +#include +#if !defined(_WIN32) #include +#endif +#include #include // NOLINT +#include #include #include #include #include +#include "paddle/fluid/inference/api/paddle_inference_api.h" +#include "paddle/fluid/platform/enforce.h" +#include "paddle/fluid/platform/port.h" #include "paddle/fluid/string/printf.h" -#include "paddle_inference_api.h" namespace paddle { namespace inference { @@ -44,6 +51,11 @@ class Timer { } }; +static int GetUniqueId() { + static int id = 0; + return id++; +} + static void split(const std::string &str, char sep, std::vector *pieces) { pieces->clear(); @@ -75,6 +87,13 @@ static void split_to_int64(const std::string &str, char sep, std::transform(pieces.begin(), pieces.end(), std::back_inserter(*is), [](const std::string &v) { return std::stoi(v); }); } +static void split_to_int(const std::string &str, char sep, + std::vector *is) { + std::vector pieces; + split(str, sep, &pieces); + std::transform(pieces.begin(), pieces.end(), std::back_inserter(*is), + [](const std::string &v) { return std::stoi(v); }); +} template std::string to_string(const std::vector &vec) { std::stringstream ss; @@ -111,9 +130,18 @@ static void TensorAssignData(PaddleTensor *tensor, } template -static int ZeroCopyTensorAssignData(ZeroCopyTensor *tensor, - const std::vector> &data) { - int size{0}; +static void TensorAssignData(PaddleTensor *tensor, + const std::vector> &data, + const std::vector &lod) { + int size = lod[lod.size() - 1]; + tensor->shape.assign({size, 1}); + tensor->lod.assign({lod}); + TensorAssignData(tensor, data); +} + +template +static void ZeroCopyTensorAssignData(ZeroCopyTensor *tensor, + const std::vector> &data) { auto *ptr = tensor->mutable_data(PaddlePlace::kCPU); int c = 0; for (const auto &f : data) { @@ -121,10 +149,64 @@ static int ZeroCopyTensorAssignData(ZeroCopyTensor *tensor, ptr[c++] = v; } } - return size; } -static std::string DescribeTensor(const PaddleTensor &tensor) { +template +static void ZeroCopyTensorAssignData(ZeroCopyTensor *tensor, + const PaddleBuf &data) { + auto *ptr = tensor->mutable_data(PaddlePlace::kCPU); + for (size_t i = 0; i < data.length() / sizeof(T); i++) { + ptr[i] = *(reinterpret_cast(data.data()) + i); + } +} + +static bool CompareTensor(const PaddleTensor &a, const PaddleTensor &b) { + if (a.dtype != b.dtype) { + LOG(ERROR) << "dtype not match"; + return false; + } + + if (a.lod.size() != b.lod.size()) { + LOG(ERROR) << "lod not match"; + return false; + } + for (size_t i = 0; i < a.lod.size(); i++) { + if (a.lod[i].size() != b.lod[i].size()) { + LOG(ERROR) << "lod not match"; + return false; + } + for (size_t j = 0; j < a.lod[i].size(); j++) { + if (a.lod[i][j] != b.lod[i][j]) { + LOG(ERROR) << "lod not match"; + return false; + } + } + } + + if (a.shape.size() != b.shape.size()) { + LOG(INFO) << "shape not match"; + return false; + } + for (size_t i = 0; i < a.shape.size(); i++) { + if (a.shape[i] != b.shape[i]) { + LOG(ERROR) << "shape not match"; + return false; + } + } + + auto *adata = static_cast(a.data.data()); + auto *bdata = static_cast(b.data.data()); + for (int i = 0; i < VecReduceToInt(a.shape); i++) { + if (adata[i] != bdata[i]) { + LOG(ERROR) << "data not match"; + return false; + } + } + return true; +} + +static std::string DescribeTensor(const PaddleTensor &tensor, + int max_num_of_data = 15) { std::stringstream os; os << "Tensor [" << tensor.name << "]\n"; os << " - type: "; @@ -135,6 +217,9 @@ static std::string DescribeTensor(const PaddleTensor &tensor) { case PaddleDType::INT64: os << "int64"; break; + case PaddleDType::INT32: + os << "int32"; + break; default: os << "unset"; } @@ -146,27 +231,60 @@ static std::string DescribeTensor(const PaddleTensor &tensor) { os << to_string(l) << "; "; } os << "\n"; - os << " - data: "; + os << " - memory length: " << tensor.data.length(); + os << "\n"; + os << " - data: "; int dim = VecReduceToInt(tensor.shape); + float *pdata = static_cast(tensor.data.data()); for (int i = 0; i < dim; i++) { - os << static_cast(tensor.data.data())[i] << " "; + os << pdata[i] << " "; } os << '\n'; return os.str(); } -static void PrintTime(int batch_size, int repeat, int num_threads, int tid, - double latency, int epoch = 1) { - LOG(INFO) << "====== batch_size: " << batch_size << ", repeat: " << repeat - << ", threads: " << num_threads << ", thread id: " << tid - << ", latency: " << latency << "ms ======"; - if (epoch > 1) { - int samples = batch_size * epoch; - LOG(INFO) << "====== sample number: " << samples - << ", average latency of each sample: " << latency / samples - << "ms ======"; +static std::string DescribeZeroCopyTensor(const ZeroCopyTensor &tensor) { + std::stringstream os; + os << "Tensor [" << tensor.name() << "]\n"; + + os << " - shape: " << to_string(tensor.shape()) << '\n'; + os << " - lod: "; + for (auto &l : tensor.lod()) { + os << to_string(l) << "; "; + } + os << "\n"; + PaddlePlace place; + int size; + const auto *data = tensor.data(&place, &size); + os << " - numel: " << size; + os << "\n"; + os << " - data: "; + for (int i = 0; i < size; i++) { + os << data[i] << " "; } + return os.str(); +} + +static void PrintTime(int batch_size, int repeat, int num_threads, int tid, + double batch_latency, int epoch = 1) { + PADDLE_ENFORCE(batch_size > 0, "Non-positive batch size."); + double sample_latency = batch_latency / batch_size; + LOG(INFO) << "====== threads: " << num_threads << ", thread id: " << tid + << " ======"; + LOG(INFO) << "====== batch_size: " << batch_size << ", iterations: " << epoch + << ", repetitions: " << repeat << " ======"; + LOG(INFO) << "====== batch latency: " << batch_latency + << "ms, number of samples: " << batch_size * epoch + << ", sample latency: " << sample_latency + << "ms, fps: " << 1000.f / sample_latency << " ======"; +} + +static bool IsFileExists(const std::string &path) { + std::ifstream file(path); + bool exists = file.is_open(); + file.close(); + return exists; } } // namespace inference diff --git a/paddle/fluid/inference/api/mkldnn_quantizer.cc b/paddle/fluid/inference/api/mkldnn_quantizer.cc new file mode 100644 index 0000000000000000000000000000000000000000..de75e884f53143d9026636ad8663d89a36a30f69 --- /dev/null +++ b/paddle/fluid/inference/api/mkldnn_quantizer.cc @@ -0,0 +1,437 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/api/mkldnn_quantizer.h" +#include +#include +#include +#include +#include +#include "paddle/fluid/framework/eigen.h" +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/pass.h" +#include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/framework/type_defs.h" +#include "paddle/fluid/inference/analysis/analyzer.h" +#include "paddle/fluid/inference/api/analysis_predictor.h" +#include "paddle/fluid/platform/place.h" +#include "paddle/fluid/string/pretty_log.h" + +namespace paddle { + +using platform::CPUPlace; +using framework::LoDTensor; +using framework::ir::Graph; +using ConstEigenVectorArrayMap = + Eigen::Map>; +using string::PrettyLogH1; + +bool AnalysisPredictor::MkldnnQuantizer::CalculateScales() { + PrettyLogH1("--- Calculating scales for quantization"); + using VariableNameMap = std::map>; + std::map> gathered_data; + for (const auto* op : predictor_.inference_program_->Block(0).AllOps()) { + if (op->HasAttr("use_quantizer") && + boost::get(op->GetAttr("use_quantizer"))) { + const VariableNameMap& connections_in = op->Inputs(); + const VariableNameMap& connections_out = op->Outputs(); + + auto glambda = [&](const VariableNameMap& connections, bool is_output) { + for (auto const& conn : connections) { + if (conn.second.size() == 0) continue; + auto& var_name = conn.second[0]; + + // skip if scale already computed + if (scales_.find(var_name) != scales_.end()) return; + + auto* var = predictor_.sub_scope_->FindVar(var_name); + PADDLE_ENFORCE(var, "%s is not in the scope", var_name); + PADDLE_ENFORCE(var->IsType(), + "Only support lod tensor now."); + LoDTensor* var_tensor = var->GetMutable(); + + // force unsigned type if already know it + bool is_unsigned = false; + if (is_output && op->Type() == "conv2d") { + // output of conv2d with relu must be unsigned + is_unsigned = op->HasAttr("fuse_relu") && + boost::get(op->GetAttr("fuse_relu")); + } else if (is_output && op->Type() == "pool2d") { + // output of pool2d with unsigned input must be unsigned + auto input_var_name = op->Input("X")[0]; + if (scales_.find(input_var_name) != scales_.end()) { + is_unsigned = scales_[input_var_name].first; + } + } + + CalculateSingleScale(op->Type(), conn.first, var_name, *var_tensor, + is_unsigned); + } + }; + + // handle outputs first so unsigned outputs could be inferred + glambda(connections_out, true /* is_output */); + glambda(connections_in, false /* is_output */); + } + } + + return true; +} + +void AnalysisPredictor::MkldnnQuantizer::CalculateSingleScale( + const std::string& op_type_name, const std::string& conn_name, + const std::string& var_name, const LoDTensor& var_tensor, + bool is_unsigned) { + auto rule = qconfig_->scale_algo(op_type_name, conn_name); + if (rule == ScaleAlgo::NONE) return; + + PADDLE_ENFORCE( + var_tensor.numel() > 0, + "MkldnnQuantizer: LoDTensor of variable %s for quantization of op " + "%s of connection %s should not be empty.", + var_name, op_type_name, conn_name); + + switch (rule) { + case ScaleAlgo::MAX: + scales_[var_name] = GetMaxScalingFactor(var_tensor, is_unsigned); + break; + case ScaleAlgo::MAX_CH: + scales_[var_name] = GetMaxChScalingFactor(var_tensor, is_unsigned); + break; + case ScaleAlgo::KL: + scales_[var_name] = GetKLScalingFactor(var_tensor, is_unsigned); + break; + default: + throw std::runtime_error( + "MkldnnQuantizer: Unexpected ScaleAlgo specified."); + } +} + +std::vector AnalysisPredictor::MkldnnQuantizer::ExpandQuantizedBins( + std::vector quantized_bins, std::vector reference_bins) const { + std::vector expanded_quantized_bins(reference_bins.size(), 0); + int num_merged_bins = reference_bins.size() / quantized_bins.size(); + int j_start = 0; + int j_end = num_merged_bins; + for (size_t idx = 0; idx < quantized_bins.size(); idx++) { + int zero_count = + std::count(&reference_bins[j_start], &reference_bins[j_end], 0); + num_merged_bins = j_end - j_start; + int avg_bin_ele; + if (zero_count == num_merged_bins) { + avg_bin_ele = 0; + } else { + avg_bin_ele = quantized_bins[idx] / (num_merged_bins - zero_count + 0.0); + } + for (int idx1 = j_start; idx1 < j_end; idx1++) { + expanded_quantized_bins[idx1] = + (reference_bins[idx1] == 0) ? 0 : avg_bin_ele; + } + j_start += num_merged_bins; + j_end += num_merged_bins; + if ((idx + 1) == quantized_bins.size() - 1) { + j_end = reference_bins.size(); + } + } + return expanded_quantized_bins; +} + +std::pair +AnalysisPredictor::MkldnnQuantizer::GetKLScalingFactor( + const LoDTensor& var_tensor, bool is_unsigned) const { + ConstEigenVectorArrayMap eigen_tensor{var_tensor.data(), + var_tensor.numel(), 1}; + int precision_hist_num_bins = 2048; + float max_val = eigen_tensor.maxCoeff(); + float min_val = eigen_tensor.minCoeff(); + bool is_positive = min_val >= 0.0f; + if (is_unsigned) + PADDLE_ENFORCE( + is_positive, + "Tensor is claimed to be unsigned, but its min value (%f) is < 0.0", + min_val); + + int num_quantized_bins = 255; + + std::vector hist; + float bin_width; + int starting_iter; + int ending_iter = precision_hist_num_bins - 1; + if (is_positive) { + std::tie(hist, bin_width) = + Histogram(var_tensor, min_val, max_val, precision_hist_num_bins); + starting_iter = static_cast(ending_iter * 0.7); + } else { + float th = std::max(std::abs(max_val), std::abs(min_val)); + std::tie(hist, bin_width) = + Histogram(var_tensor, -th, th, precision_hist_num_bins); + starting_iter = 0; + if (std::abs(max_val) > std::abs(min_val)) { + while (starting_iter < ending_iter) { + if (hist[starting_iter] == 0) { + ++starting_iter; + continue; + } else { + break; + } + } + starting_iter += static_cast((ending_iter - starting_iter) * 0.6); + } else { + while (ending_iter > 0) { + if (hist[ending_iter] == 0) { + --ending_iter; + continue; + } else { + break; + } + } + starting_iter = static_cast(0.6 * ending_iter); + } + } + auto P_sum = eigen_tensor.size(); + int min_kl_divergence = 0; + int min_kl_index = 0; + bool kl_inited = false; + for (int i = starting_iter; i <= ending_iter; i++) { + std::vector reference_distr_P(&hist[0], &hist[i]); + auto outliers_count = + std::accumulate(&hist[i], &hist[precision_hist_num_bins], 0); + if (reference_distr_P[i - 1] == 0) { + continue; + } + reference_distr_P[i - 1] += outliers_count; + auto reference_distr_bins = reference_distr_P; + std::vector candidate_distr_Q(&hist[0], &hist[i]); + int num_merged_bins = i / num_quantized_bins; + std::vector candidate_distr_Q_quantized(num_quantized_bins, 0); + int j_start = 0; + int j_end = num_merged_bins; + for (int idx = 0; idx < num_quantized_bins; idx++) { + candidate_distr_Q_quantized[idx] = std::accumulate( + &candidate_distr_Q[j_start], &candidate_distr_Q[j_end], 0); + j_start += num_merged_bins; + j_end += num_merged_bins; + if ((idx + 1) == num_quantized_bins - 1) { + j_end = i; + } + } + candidate_distr_Q = + ExpandQuantizedBins(candidate_distr_Q_quantized, reference_distr_bins); + int Q_sum = + std::accumulate(candidate_distr_Q.begin(), candidate_distr_Q.end(), 0); + auto kl_divergence = + SafeEntropy(reference_distr_P, P_sum, candidate_distr_Q, Q_sum); + if (!kl_inited) { + min_kl_divergence = kl_divergence; + min_kl_index = i; + kl_inited = true; + } else if (kl_divergence < min_kl_divergence) { + min_kl_divergence = kl_divergence; + min_kl_index = i; + } else { + } + } + if (min_kl_index == 0) { + while (starting_iter > 0) { + if (hist[starting_iter] == 0) { + starting_iter -= 1; + continue; + } else { + break; + } + } + min_kl_index = starting_iter; + } + + LoDTensor scale_tensor; + scale_tensor.Resize({1}); + auto* scale_ptr = scale_tensor.mutable_data(CPUPlace()); + + scale_ptr[0] = 1.0 / ((min_kl_index + 0.5) * bin_width); + + return std::make_pair(is_unsigned, scale_tensor); +} + +std::pair +AnalysisPredictor::MkldnnQuantizer::GetMaxScalingFactor( + const LoDTensor& var_tensor, bool is_unsigned) const { + ConstEigenVectorArrayMap eigen_tensor{var_tensor.data(), + var_tensor.numel(), 1}; + float max_abs = eigen_tensor.abs().maxCoeff(); + float min_val = eigen_tensor.minCoeff(); + if (is_unsigned) + PADDLE_ENFORCE( + min_val >= 0.0f, + "Tensor is claimed to be unsigned, but its min value (%f) is < 0.0", + min_val); + + LoDTensor scale_tensor; + scale_tensor.Resize({1}); + auto* scale_ptr = scale_tensor.mutable_data(CPUPlace()); + scale_ptr[0] = 1.0 / max_abs; + + return std::make_pair(is_unsigned, scale_tensor); +} + +std::pair +AnalysisPredictor::MkldnnQuantizer::GetMaxChScalingFactor( + const LoDTensor& var_tensor, bool is_unsigned) const { + PADDLE_ENFORCE(var_tensor.dims().size() > 0, "Tensor dimension is empty."); + + ConstEigenVectorArrayMap eigen_tensor{var_tensor.data(), + var_tensor.numel(), 1}; + float min_val = eigen_tensor.minCoeff(); + if (is_unsigned) + PADDLE_ENFORCE( + min_val >= 0.0f, + "Tensor is claimed to be unsigned, but its min value (%f) is < 0.0", + min_val); + + int channels = var_tensor.dims()[0]; + LoDTensor scale_tensor; + scale_tensor.Resize({channels}); + auto* scale_ptr = scale_tensor.mutable_data(CPUPlace()); + + for (int i = 0; i < channels; ++i) { + const auto tensor = var_tensor.Slice(i, i + 1); + + ConstEigenVectorArrayMap eigen_tensor{tensor.data(), tensor.numel(), + 1}; + float max_abs = eigen_tensor.abs().maxCoeff(); + scale_ptr[i] = 1.0 / max_abs; + } + + return std::make_pair(is_unsigned, scale_tensor); +} + +std::pair, float> +AnalysisPredictor::MkldnnQuantizer::Histogram( + const framework::LoDTensor& var_tensor, float min_val, float max_val, + size_t num_bins) const { + PADDLE_ENFORCE_GT(num_bins, 0, + "MkldnnQuantizer: To calculate Histogram, num_bins (" + + std::to_string(num_bins) + ") must be positive."); + PADDLE_ENFORCE_GT( + var_tensor.numel(), 0, + "MkldnnQuantizer: To calculate Histogram, the tensor must not be empty."); + PADDLE_ENFORCE(max_val >= min_val, + "MkldnnQuantizer: To calculate Histogram, max_val (" + + std::to_string(max_val) + + ") must be greater or equal" + "to min_val (" + + std::to_string(min_val) + ")."); + ConstEigenVectorArrayMap eigen_tensor{var_tensor.data(), + var_tensor.numel(), 1}; + auto bin_width = std::abs(max_val - min_val) / num_bins; + std::vector hist(num_bins); + + for (int i = 0; i < eigen_tensor.size(); i++) { + int bin = std::min( + num_bins - 1, + static_cast(floor((eigen_tensor[i] - min_val) / bin_width))); + ++hist[bin]; + } + + return std::make_pair(std::move(hist), std::move(bin_width)); +} + +void AnalysisPredictor::MkldnnQuantizer::PrepareArgument() const { + auto& arg = predictor_.argument_; + if (!arg.scope_valid()) arg.SetScope(new framework::Scope); + arg.SetMainProgramNotOwned(predictor_.inference_program_.get()); + auto graph = std::unique_ptr(new Graph(arg.main_program())); + arg.SetMainGraph(graph.release()); + arg.main_graph().Set(framework::ir::kParamScopeAttr, + new framework::Scope*(arg.scope_ptr())); + + auto* builder = predictor_.config_.pass_builder(); + builder->SetPasses({ + "infer_clean_graph_pass", "cpu_quantize_pass", "cpu_quantize_squash_pass", + }); + if (predictor_.config_.ir_debug_) builder->TurnOnDebug(); + auto passes = builder->AllPasses(); + predictor_.argument_.SetIrAnalysisPasses(passes); + predictor_.argument_.SetAnalysisPasses( + {"ir_analysis_pass", "memory_optimize_pass", "ir_graph_to_program_pass"}); + predictor_.argument_.SetQuantVarScales(scales_); +} + +bool AnalysisPredictor::MkldnnQuantizer::Quantize() { + if (!RunWarmup()) return false; + if (!CalculateScales()) return false; + predictor_.PrepareScope(predictor_.scope_); + predictor_.CreateExecutor(); + if (!RunQuantizePasses()) return false; + predictor_.PrepareExecutor(); + predictor_.PrepareFeedFetch(); + return true; +} + +bool AnalysisPredictor::MkldnnQuantizer::RunQuantizePasses() const { + predictor_.executor_->CreateVariables(*predictor_.inference_program_, 0, true, + predictor_.sub_scope_); + PrepareArgument(); + auto& arg = predictor_.argument_; + Analyzer().Run(&arg); + PADDLE_ENFORCE(arg.scope_valid()); + VLOG(5) << "to prepare executor"; + ARGUMENT_CHECK_FIELD((&arg), ir_analyzed_program); + predictor_.inference_program_.reset( + new framework::ProgramDesc(arg.ir_analyzed_program())); + LOG(INFO) << "== optimize 2 end =="; + predictor_.executor_->CreateVariables(*predictor_.inference_program_, 0, + false, predictor_.sub_scope_); + return true; +} + +bool AnalysisPredictor::MkldnnQuantizer::RunWarmup() const { + VLOG(3) << "Predictor: run a quantization warmup iteration"; + auto warmup_data = qconfig_->warmup_data(); + PADDLE_ENFORCE_NOT_NULL(warmup_data, + "Warmup data cannot be NULL in the config."); + PrettyLogH1("--- Running warmup iteration for quantization"); + + // Run the inference program + std::vector output_slots; + predictor_.Run(*warmup_data, &output_slots, qconfig_->warmup_batch_size()); + + return true; +} + +float AnalysisPredictor::MkldnnQuantizer::SafeEntropy( + std::vector reference_distr_P, int P_sum, + std::vector candidate_distr_Q, int Q_sum) const { + PADDLE_ENFORCE_EQ(reference_distr_P.size(), candidate_distr_Q.size()); + float tmp_sum1 = 0; + float tmp_sum2 = 0; + for (size_t idx = 0; idx < reference_distr_P.size(); idx++) { + int p_idx = reference_distr_P[idx]; + int q_idx = candidate_distr_Q[idx]; + if (p_idx == 0) { + tmp_sum1 += 0; + tmp_sum2 += 0; + } else { + PADDLE_ENFORCE(q_idx != 0, "MkldnnQuantizer: Fatal error!, idx = " + + std::to_string(idx) + + " qindex = 0! p_idx = " + + std::to_string(p_idx)); + } + tmp_sum1 += p_idx * (log(Q_sum * p_idx)); + tmp_sum2 += p_idx * (log(P_sum * q_idx)); + } + return (tmp_sum1 - tmp_sum2) / P_sum; +} + +} // namespace paddle diff --git a/paddle/fluid/inference/api/mkldnn_quantizer.h b/paddle/fluid/inference/api/mkldnn_quantizer.h new file mode 100644 index 0000000000000000000000000000000000000000..f4b0df5d742ed12f856fc7982d955e89288a1888 --- /dev/null +++ b/paddle/fluid/inference/api/mkldnn_quantizer.h @@ -0,0 +1,104 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include +#include +#include +#include +#include +#include "paddle/fluid/framework/naive_executor.h" +#include "paddle/fluid/inference/analysis/analyzer.h" +#include "paddle/fluid/inference/api/analysis_predictor.h" +#include "paddle/fluid/inference/api/api_impl.h" +#include "paddle/fluid/inference/api/details/reset_tensor_array.h" +#include "paddle/fluid/inference/api/helper.h" +#include "paddle/fluid/inference/api/paddle_inference_api.h" +#include "paddle/fluid/string/printf.h" +#ifdef PADDLE_WITH_TESTING +#include +#include +#endif + +namespace paddle { + +/* + * Map variable name to tensor of scaling factors scaling it to MAX=1.0. + * bool denotes whether quantization of the variable should be done to unsigned + * type. + */ +using VarQuantScale = + std::unordered_map>; + +class AnalysisPredictor::MkldnnQuantizer { + public: + explicit MkldnnQuantizer( + AnalysisPredictor& predictor, // NOLINT + const std::shared_ptr& qconfig) + : predictor_(predictor), qconfig_(qconfig) {} + + // Execute full quantization procedure. + bool Quantize(); + +#if PADDLE_WITH_TESTING + friend class MkldnnQuantizerTest; +#endif + + private: + // Run single warmup iteration + bool RunWarmup() const; + // Gather data from variables and calculate scales for them. + bool CalculateScales(); + // Calculate a scale for tensor based on ScaleAlgo rules. + void CalculateSingleScale(const std::string& op_name, + const std::string& conn_name, + const std::string& var_name, + const framework::LoDTensor& var_tensor, + bool is_unsigned); + void PrepareArgument() const; + bool RunQuantizePasses() const; + + std::vector ExpandQuantizedBins(std::vector quantized_bins, + std::vector reference_bins) const; + + // Using the KL-divergence method get the most precise scaling factor. + std::pair GetKLScalingFactor( + const framework::LoDTensor& var_tensor, bool is_unsigned) const; + + std::pair GetMaxChScalingFactor( + const framework::LoDTensor& var_tensor, bool is_unsigned) const; + + std::pair GetMaxScalingFactor( + const framework::LoDTensor& var_tensor, bool is_unsigned) const; + + // Returns histogram and bin width + std::pair, float> Histogram( + const framework::LoDTensor& var_tensor, float min_val, float max_val, + size_t num_bins = 2048) const; + + // Calculate the entropy. + float SafeEntropy(std::vector reference_distr_P, int P_sum, + std::vector candidate_distr_Q, int Q_sum) const; + + private: + AnalysisPredictor& predictor_; + const std::shared_ptr qconfig_; + + // A map: variable name -> scale + VarQuantScale scales_; +}; + +} // namespace paddle diff --git a/paddle/fluid/inference/api/mkldnn_quantizer_config.cc b/paddle/fluid/inference/api/mkldnn_quantizer_config.cc new file mode 100644 index 0000000000000000000000000000000000000000..f9ff542d86d2a7a3ac2e7f004e11eddfea3598d5 --- /dev/null +++ b/paddle/fluid/inference/api/mkldnn_quantizer_config.cc @@ -0,0 +1,40 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/api/paddle_mkldnn_quantizer_config.h" + +namespace paddle { + +MkldnnQuantizerConfig::MkldnnQuantizerConfig() { + // The default configuration of scale computing algorightms + rules_["conv2d"]["Input"] = ScaleAlgo::KL; + rules_["conv2d"]["Filter"] = ScaleAlgo::MAX_CH; + rules_["conv2d"]["Bias"] = ScaleAlgo::NONE; // do not compute scale + rules_["conv2d"]["ResidualData"] = ScaleAlgo::KL; + rules_["conv2d"]["Output"] = ScaleAlgo::KL; // do not compute scale + + rules_["pool2d"]["X"] = ScaleAlgo::KL; + rules_["pool2d"]["Out"] = ScaleAlgo::KL; // do not compute scale +} + +ScaleAlgo MkldnnQuantizerConfig::scale_algo( + const std::string& op_type_name, const std::string& conn_name) const { + if (rules_.find(op_type_name) != rules_.end()) { + auto op_rule = rules_.at(op_type_name); + if (op_rule.find(conn_name) != op_rule.end()) return op_rule.at(conn_name); + } + return default_scale_algo_; +} + +} // namespace paddle diff --git a/paddle/fluid/inference/api/paddle_anakin_config.h b/paddle/fluid/inference/api/paddle_anakin_config.h new file mode 100644 index 0000000000000000000000000000000000000000..0e91c2624bed4459b936ac4477d73ae954e55bcc --- /dev/null +++ b/paddle/fluid/inference/api/paddle_anakin_config.h @@ -0,0 +1,35 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. +#pragma once + +#include +#include +#include +#include + +#include "paddle_api.h" // NOLINT + +namespace paddle { +namespace contrib { +// Configurations for Anakin engine. +struct AnakinConfig : public PaddlePredictor::Config { + enum TargetType { NVGPU = 0, X86 }; + int device; + std::string model_file; + int max_batch_size{-1}; + TargetType target_type; +}; + +} // namespace contrib +} // namespace paddle diff --git a/paddle/fluid/inference/api/paddle_analysis_config.h b/paddle/fluid/inference/api/paddle_analysis_config.h new file mode 100644 index 0000000000000000000000000000000000000000..c67c4b5bd0bfeea6d022f9e821f6d0b877c71d7a --- /dev/null +++ b/paddle/fluid/inference/api/paddle_analysis_config.h @@ -0,0 +1,300 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. +#pragma once + +#include +#include +#include +#include +#include +#include +#include + +/*! \file */ + +// Here we include some header files with relative paths, for that in deploy, +// the abstract path of this header file will be changed. +#include "paddle_api.h" // NOLINT +#include "paddle_pass_builder.h" // NOLINT +#ifdef PADDLE_WITH_MKLDNN +#include "paddle_mkldnn_quantizer_config.h" // NOLINT +#endif + +namespace paddle { + +class AnalysisPredictor; +struct MkldnnQuantizerConfig; + +// NOTE WIP, not stable yet. +struct AnalysisConfig { + AnalysisConfig() = default; + explicit AnalysisConfig(const AnalysisConfig& other); + explicit AnalysisConfig(const std::string& model_dir); + explicit AnalysisConfig(const std::string& prog_file, + const std::string& params_file); + enum class Precision { + kFloat32 = 0, + kInt8, + }; + + /** Set model with a directory. + */ + void SetModel(const std::string& model_dir) { model_dir_ = model_dir; } + /** Set model with two specific pathes for program and parameters. + */ + void SetModel(const std::string& prog_file_path, + const std::string& params_file_path); + /** Set program file path. + */ + void SetProgFile(const std::string& x) { prog_file_ = x; } + /** Set parameter composed file path. + */ + void SetParamsFile(const std::string& x) { params_file_ = x; } + /** Get the model directory path. + */ + const std::string& model_dir() const { return model_dir_; } + /** Get the program file path. + */ + const std::string& prog_file() const { return prog_file_; } + /** Get the composed parameters file. + */ + const std::string& params_file() const { return params_file_; } + + // GPU related. + + /** + * \brief Turn on GPU. + * @param memory_pool_init_size_mb initial size of the GPU memory pool in MB. + * @param device_id the GPU card to use (default is 0). + */ + void EnableUseGpu(uint64_t memory_pool_init_size_mb, int device_id = 0); + /** Turn off the GPU. + */ + void DisableGpu(); + /** A bool state telling whether the GPU is turned on. + */ + bool use_gpu() const { return use_gpu_; } + /** Get the GPU device id. + */ + int gpu_device_id() const { return device_id_; } + /** Get the initial size in MB of the GPU memory pool. + */ + int memory_pool_init_size_mb() const { return memory_pool_init_size_mb_; } + /** Get the proportion of the initial memory pool size compared to the device. + */ + float fraction_of_gpu_memory_for_pool() const; + + /** \brief Control whether to perform IR graph optimization. + * + * If turned off, the AnalysisConfig will act just like a NativeConfig. + */ + void SwitchIrOptim(int x = true) { enable_ir_optim_ = x; } + /** A boolean state tell whether the ir graph optimization is actived. + */ + bool ir_optim() const { return enable_ir_optim_; } + + /** \brief INTERNAL Determine whether to use the feed and fetch operators. + * Just for internal development, not stable yet. + * When ZeroCopyTensor is used, this should turned off. + */ + void SwitchUseFeedFetchOps(int x = true) { use_feed_fetch_ops_ = x; } + /** A boolean state telling whether to use the feed and fetch operators. + */ + bool use_feed_fetch_ops_enabled() const { return use_feed_fetch_ops_; } + + /** \brief Control whether to specify the inputs' names. + * + * The PaddleTensor type has a `name` member, assign it with the corresponding + * variable name. This is used only when the input PaddleTensors passed to the + * `PaddlePredictor.Run(...)` cannot follow the order in the training phase. + */ + void SwitchSpecifyInputNames(bool x = true) { specify_input_name_ = x; } + + /** A boolean state tell whether the input PaddleTensor names specified should + * be used to reorder the inputs in `PaddlePredictor.Run(...)`. + */ + bool specify_input_name() const { return specify_input_name_; } + + /** + * \brief Turn on the TensorRT engine. + * + * The TensorRT engine will accelerate some subgraphes in the original Fluid + * computation graph. In some models such as TensorRT50, GoogleNet and so on, + * it gains significant performance acceleration. + * + * @param workspace_size the memory size(in byte) used for TensorRT workspace. + * @param max_batch_size the maximum batch size of this prediction task, + * better set as small as possible, or performance loss. + * @param min_subgrpah_size the minimum TensorRT subgraph size needed, if a + * subgraph is less than this, it will not transfer to TensorRT engine. + */ + void EnableTensorRtEngine(int workspace_size = 1 << 20, + int max_batch_size = 1, int min_subgraph_size = 3, + Precision precision = Precision::kFloat32, + bool use_static = false); + /** A boolean state telling whether the TensorRT engine is used. + */ + bool tensorrt_engine_enabled() const { return use_tensorrt_; } + /** + * \brief Turn on the usage of Anakin sub-graph engine. + */ + void EnableAnakinEngine( + int max_batch_size = 1, + std::map> max_input_shape = {}, + int min_subgraph_size = 6); + + /** A boolean state indicating whether the Anakin sub-graph engine is used. + */ + bool anakin_engine_enabled() const { return use_anakin_; } + + /** \brief Control whether to debug IR graph analysis phase. + * + * This will generate DOT files for visualizing the computation graph after + * each analysis pass applied. + */ + void SwitchIrDebug(int x = true); + + /** Turn on MKLDNN. + */ + void EnableMKLDNN(); + /** A boolean state telling whether to use the MKLDNN. + */ + bool mkldnn_enabled() const { return use_mkldnn_; } + + /** Set and get the number of cpu math library threads. + */ + void SetCpuMathLibraryNumThreads(int cpu_math_library_num_threads); + /** An int state telling how many threads are used in the CPU math library. + */ + int cpu_math_library_num_threads() const { + return cpu_math_library_num_threads_; + } + + /** Transform the AnalysisConfig to NativeConfig. + */ + NativeConfig ToNativeConfig() const; + /** Specify the operator type list to use MKLDNN acceleration. + * @param op_list the operator type list. + */ + void SetMKLDNNOp(std::unordered_set op_list) { + mkldnn_enabled_op_types_ = op_list; + } + + /** Turn on quantization. + */ + void EnableMkldnnQuantizer(); + + /** A boolean state telling whether the quantization is enabled. + */ + bool mkldnn_quantizer_enabled() const { return use_mkldnn_quantizer_; } + + std::shared_ptr mkldnn_quantizer_config() const; + + /** Specify the memory buffer of program and parameter + * @param prog_buffer the memory buffer of program. + * @param prog_buffer_size the size of the data. + * @param params_buffer the memory buffer of the composed parameters file. + * @param params_buffer_size the size of the commposed parameters data. + */ + void SetModelBuffer(const char* prog_buffer, size_t prog_buffer_size, + const char* params_buffer, size_t params_buffer_size); + /** A boolean state telling whether the model is set from the CPU memory. + */ + bool model_from_memory() const { return model_from_memory_; } + void SetEngineOptInfo(std::map engine_opt_info); + + /** Turn on memory optimize + * NOTE still in development, will release latter. + */ + void EnableMemoryOptim(bool static_optim = false, + bool force_update_static_cache = false); + /** Tell whether the memory optimization is activated. */ + bool enable_memory_optim() const; + + friend class ::paddle::AnalysisPredictor; + + /** NOTE just for developer, not an official API, easily to be broken. + * Get a pass builder for customize the passes in IR analysis phase. + */ + PassStrategy* pass_builder() const; + + protected: + // Update the config. + void Update(); + + std::string SerializeInfoCache(); + + protected: + // Model pathes. + std::string model_dir_; + std::string prog_file_; + std::string params_file_; + + // GPU related. + bool use_gpu_{false}; + int device_id_{0}; + uint64_t memory_pool_init_size_mb_{100}; // initial size is 100MB. + + // TensorRT related. + bool use_tensorrt_{false}; + // For workspace_size, refer it from here: + // https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#troubleshooting + int tensorrt_workspace_size_; + // While TensorRT allows an engine optimized for a given max batch size + // to run at any smaller size, the performance for those smaller + // sizes may not be as well-optimized. Therefore, Max batch is best + // equivalent to the runtime batch size. + int tensorrt_max_batchsize_; + // We transform the Ops that can be converted into TRT layer in the model, + // and aggregate these Ops into subgraphs for TRT execution. + // We set this variable to control the minimum number of nodes in the + // subgraph, 3 as default value. + int tensorrt_min_subgraph_size_{3}; + Precision tensorrt_precision_mode_; + bool trt_use_static_engine_; + + // memory reuse related. + bool enable_memory_optim_{false}; + bool static_memory_optim_{false}; + bool static_memory_optim_force_update_{false}; + + bool use_mkldnn_{false}; + std::unordered_set mkldnn_enabled_op_types_; + + bool model_from_memory_{false}; + + bool enable_ir_optim_{true}; + bool use_feed_fetch_ops_{true}; + bool ir_debug_{false}; + + bool specify_input_name_{false}; + + int cpu_math_library_num_threads_{1}; + + // A runtime cache, shouldn't be transferred to others. + std::string serialized_info_cache_; + + mutable std::unique_ptr pass_builder_; + + bool use_anakin_{false}; + int anakin_max_batchsize_; + int anakin_min_subgraph_size_{6}; + std::map> anakin_max_input_shape_; + std::map engine_opt_info_; + + bool use_mkldnn_quantizer_{false}; + std::shared_ptr mkldnn_quantizer_config_; +}; + +} // namespace paddle diff --git a/paddle/fluid/inference/api/paddle_api.h b/paddle/fluid/inference/api/paddle_api.h new file mode 100644 index 0000000000000000000000000000000000000000..87f40f09eb9bb552bd246cb39bbbd41abac1c9ac --- /dev/null +++ b/paddle/fluid/inference/api/paddle_api.h @@ -0,0 +1,357 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. +#pragma once + +/*! \file paddle_api.h + */ + +/*! \mainpage Paddle Inference APIs + * \section intro_sec Introduction + * The Paddle inference library aims to offer an high performance inference SDK + * for Paddle users. + */ + +#include +#include +#include +#include + +/*! \namespace paddle + */ +namespace paddle { + +/** paddle data type. + */ +enum PaddleDType { + FLOAT32, + INT64, + INT32, + // TODO(Superjomn) support more data types if needed. +}; + +/** + * \brief Memory manager for `PaddleTensor`. + * + * The PaddleBuf holds a buffer for data input or output. The memory can be + * allocated by user or by PaddleBuf itself, but in any case, the PaddleBuf + * should be reused for better performance. + * + * For user allocated memory, the following API can be used: + * - PaddleBuf(void* data, size_t length) to set an external memory by + * specifying the memory address and length. + * - Reset(void* data, size_t length) to reset the PaddleBuf with an external + *memory. + * ATTENTION, for user allocated memory, deallocation should be done by users + *externally after the program finished. The PaddleBuf won't do any allocation + *or deallocation. + * + * To have the PaddleBuf allocate and manage the memory: + * - PaddleBuf(size_t length) will allocate a memory of size `length`. + * - Resize(size_t length) resize the memory to no less than `length`, ATTENTION + * if the allocated memory is larger than `length`, nothing will done. + * + * Usage: + * + * Let PaddleBuf manage the memory internally. + * \code{cpp} + * const int num_elements = 128; + * PaddleBuf buf(num_elements * sizeof(float)); + * \endcode + * + * Or + * \code{cpp} + * PaddleBuf buf; + * buf.Resize(num_elements * sizeof(float)); + * \endcode + * Works the exactly the same. + * + * One can also make the `PaddleBuf` use the external memory. + * \code{cpp} + * PaddleBuf buf; + * void* external_memory = new float[num_elements]; + * buf.Reset(external_memory, num_elements*sizeof(float)); + * ... + * delete[] external_memory; // manage the memory lifetime outside. + * \endcode + */ +class PaddleBuf { + public: + /** PaddleBuf allocate memory internally, and manage it. + */ + explicit PaddleBuf(size_t length) + : data_(new char[length]), length_(length), memory_owned_(true) {} + /** Set external memory, the PaddleBuf won't manage it. + */ + PaddleBuf(void* data, size_t length) + : data_(data), length_(length), memory_owned_{false} {} + /** Copy only available when memory is managed externally. + */ + explicit PaddleBuf(const PaddleBuf&); + + /** Resize the memory. + */ + void Resize(size_t length); + /** Reset to external memory, with address and length set. + */ + void Reset(void* data, size_t length); + /** Tell whether the buffer is empty. + */ + bool empty() const { return length_ == 0; } + /** Get the data's memory address. + */ + void* data() const { return data_; } + /** Get the memory length. + */ + size_t length() const { return length_; } + + ~PaddleBuf() { Free(); } + PaddleBuf& operator=(const PaddleBuf&); + PaddleBuf& operator=(PaddleBuf&&); + PaddleBuf() = default; + PaddleBuf(PaddleBuf&& other); + + private: + void Free(); + void* data_{nullptr}; // pointer to the data memory. + size_t length_{0}; // number of memory bytes. + bool memory_owned_{true}; +}; + +/** Basic input and output data structure for PaddlePredictor. + */ +struct PaddleTensor { + PaddleTensor() = default; + std::string name; // variable name. + std::vector shape; + PaddleBuf data; // blob of data. + PaddleDType dtype; + std::vector> lod; // Tensor+LoD equals LoDTensor +}; + +enum class PaddlePlace { kUNK = -1, kCPU, kGPU }; + +/** Tensor without copy, currently only supports `AnalysisPredictor`. + */ +class ZeroCopyTensor { + public: + void Reshape(const std::vector& shape); + + /** Get the memory in CPU or GPU with specific data type, should Reshape first + * to tell the data size. + * Once can directly call this data to feed the data. + * This is for write the input tensor. + */ + template + T* mutable_data(PaddlePlace place); + /** Get the memory directly, will return the place and element size by + * pointer. + * This is for reading the output tensor. + */ + template + T* data(PaddlePlace* place, int* size) const; + + template + void copy_from_cpu(const T* data); + + template + void copy_to_cpu(T* data); + + std::vector shape() const; + + void SetLoD(const std::vector>& x); + std::vector> lod() const; + const std::string& name() const { return name_; } + void SetPlace(PaddlePlace place, int device = -1) { + place_ = place; + device_ = device; + } + + PaddleDType type() const; + + protected: + explicit ZeroCopyTensor(void* scope) : scope_{scope} {} + void SetName(const std::string& name) { name_ = name; } + void* FindTensor() const; + + private: + std::string name_; + bool input_or_output_; + friend class AnalysisPredictor; + void* scope_{nullptr}; + // The corresponding tensor pointer inside Paddle workspace is cached for + // performance. + mutable void* tensor_{nullptr}; + PaddlePlace place_; + PaddleDType dtype_; + int device_; +}; + +/** A simple Inference API for Paddle. + */ +class PaddlePredictor { + public: + struct Config; + PaddlePredictor() = default; + PaddlePredictor(const PaddlePredictor&) = delete; + PaddlePredictor& operator=(const PaddlePredictor&) = delete; + + /** Predict an record. + * The caller should be responsible for allocating and releasing the memory of + * `inputs`. `inputs` should be available until Run returns. Caller should be + * responsible for the output tensor's buffer, either allocated or passed from + * outside. + */ + virtual bool Run(const std::vector& inputs, + std::vector* output_data, + int batch_size = -1) = 0; + + /** \brief Get input names of the model + */ + virtual std::vector GetInputNames() { return {}; } + + /** \brief Get output names of the model + */ + virtual std::vector GetOutputNames() { return {}; } + + /** \brief Get a mutable tensor directly. + * + * NOTE Only works in AnalysisPredictor. + * + * One can also use this to modify any temporary variable related tensors in + * the predictor. + * + */ + virtual std::unique_ptr GetInputTensor( + const std::string& name) { + return nullptr; + } + /** + * \brief Get an immutable tensor without copy. + * + * NOTE Only works in AnalysisPredictor. + * One can use this API to get any temporary tensors in the predictor and + * read it. + */ + virtual std::unique_ptr GetOutputTensor( + const std::string& name) { + return nullptr; + } + /** + * \brief Run the predictor with zero-copied inputs and outputs. + * + * NOTE Only works in AnalysisPredictor. + * + * This will save the IO copy for transfering inputs and outputs to predictor + * workspace and get some performance improvement. + * To use it, one should call the `AnalysisConfig.SwitchUseFeedFetchOp(true)` + * and then use the `GetInputTensor` and `GetOutputTensor` to directly write + * or read the input/output tensors. + */ + virtual bool ZeroCopyRun() { return false; } + + /** Clone a predictor that share the model weights, the Cloned predictor + * should be thread-safe. + */ + virtual std::unique_ptr Clone() = 0; + + /** Destroy the Predictor. + */ + virtual ~PaddlePredictor() = default; + + /** \brief Get the serialized model program that executes in inference phase. + * Its data type is ProgramDesc, which is a protobuf message. + */ + virtual std::string GetSerializedProgram() const { + assert(false); // Force raise error. + return "NotImplemented"; + } + + /** The common configs for all the predictors. + */ + struct Config { + std::string model_dir; /*!< path to the model directory. */ + }; +}; + +struct NativeConfig : public PaddlePredictor::Config { + // GPU related fields. + bool use_gpu{false}; + int device{0}; + float fraction_of_gpu_memory{ + -1.f}; /*!< Change to a float in (0,1] if needed. */ + + // Specify the exact path of program and parameter files. + std::string prog_file; + std::string param_file; + + /** Specify the variable's name of each input if input tensors don't follow + * the + * `feeds` and `fetches` of the phase `save_inference_model`. + */ + bool specify_input_name{false}; + + /** Set and get the number of cpu math library threads. + */ + void SetCpuMathLibraryNumThreads(int cpu_math_library_num_threads) { + cpu_math_library_num_threads_ = cpu_math_library_num_threads; + } + int cpu_math_library_num_threads() const { + return cpu_math_library_num_threads_; + } + + protected: + // number of cpu math library (such as MKL, OpenBlas) threads for each + // instance. + int cpu_math_library_num_threads_{1}; +}; + +/*! \fn std::unique_ptr CreatePaddlePredictor(const ConfigT& + * config); + * + * \brief A factory to help create different predictors. + * + * Usage: + * + * \code{.cpp} + * NativeConfig config; + * ... // change the configs. + * auto native_predictor = CreatePaddlePredictor(config); + * \endcode + * + * FOR EXTENSION DEVELOPER: + * Different predictors are designated by config type. Similar configs can be + * merged, but there shouldn't be a huge config containing different fields for + * more than one kind of predictors. + */ +template +std::unique_ptr CreatePaddlePredictor(const ConfigT& config); + +/** NOTE The following APIs are too trivial, we will discard it in the following + * versions. + */ +enum class PaddleEngineKind { + kNative = 0, /*!< Use the native Fluid facility. */ + kAutoMixedTensorRT, /*!< Automatically mix Fluid with TensorRT. */ + kAnalysis, /*!< More optimization. */ + kAnakin /*!< Use Anakin for inference, not mature yet. */ +}; + +template +std::unique_ptr CreatePaddlePredictor(const ConfigT& config); + +int PaddleDtypeSize(PaddleDType dtype); + +std::string get_version(); + +} // namespace paddle diff --git a/paddle/fluid/inference/api/paddle_inference_api.h b/paddle/fluid/inference/api/paddle_inference_api.h index d2876dc27c8826c2f27be21fa0b9fef92d03067a..1785bd520a17d5f5060d789b2e4e4f1eda26aa6a 100644 --- a/paddle/fluid/inference/api/paddle_inference_api.h +++ b/paddle/fluid/inference/api/paddle_inference_api.h @@ -26,264 +26,8 @@ limitations under the License. */ #include #include -namespace paddle { - -// Data type. -enum PaddleDType { - FLOAT32, - INT64, - // TODO(Superjomn) support more data types if needed. -}; - -/* - * Memory menage for PaddleTensor. - * The PaddleBuf holds a buffer for data input or output. The memory can be - * allocated by user or by PaddleBuf itself, but in any case, the PaddleBuf - * should be reused for better performance. - * - * For user allocated memory, the following API can be used: - * - PaddleBuf(void* data, size_t length) to set an external memory by - * specifying - * the memory address and length. - * - Reset(void* data, size_t length) to reset the PaddleBuf with an external - * memory. - * ATTENTION, for user allocated memory, deallocation should be done by users - * externally after the program finished. The PaddleBuf won't do any allocation - * or deallocation. - * - * To have the PaddleBuf allocate and manage the memory: - * - PaddleBuf(size_t length) will allocate a memory of size `length`. - * - Resize(size_t length) resize the memory to no less than `length`, ATTENTION - * if the allocated memory is larger than `length`, nothing will done. - */ -class PaddleBuf { - public: - // PaddleBuf allocate memory internally, and manage it. - explicit PaddleBuf(size_t length) - : data_(new char[length]), length_(length), memory_owned_(true) {} - // Set external memory, the PaddleBuf won't manage it. - PaddleBuf(void* data, size_t length) - : data_(data), length_(length), memory_owned_{false} {} - // Copy only available when memory is managed externally. - explicit PaddleBuf(const PaddleBuf&); - - // Resize the memory. - void Resize(size_t length); - // Reset to external memory, with address and length set. - void Reset(void* data, size_t length); - // Tell whether the buffer is empty. - bool empty() const { return length_ == 0; } - // Get the memory address. - void* data() const { return data_; } - // Get the memory length. - size_t length() const { return length_; } - - ~PaddleBuf() { Free(); } - PaddleBuf& operator=(const PaddleBuf&); - PaddleBuf& operator=(PaddleBuf&&); - PaddleBuf() = default; - PaddleBuf(PaddleBuf&& other); - - private: - void Free(); - void* data_{nullptr}; // pointer to the data memory. - size_t length_{0}; // number of memory bytes. - bool memory_owned_{true}; -}; - -// Basic input and output data structure for PaddlePredictor. -struct PaddleTensor { - PaddleTensor() = default; - std::string name; // variable name. - std::vector shape; - PaddleBuf data; // blob of data. - PaddleDType dtype; - std::vector> lod; // Tensor+LoD equals LoDTensor -}; - -enum class PaddlePlace { kUNK = -1, kCPU, kGPU }; -// Tensor without copy, currently only supports AnalysisPredictor. -class ZeroCopyTensor { - public: - void Reshape(const std::vector& shape); - - // Get the memory in CPU or GPU with specific data type, should Reshape first - // to tell the data size. - // Once can directly call this data to feed the data. - // This is for write the input tensor. - template - T* mutable_data(PaddlePlace place); - // Get the memory directly, will return the place and memory size by pointer. - // This is for reading the output tensor. - template - T* data(PaddlePlace* place, int* size); - - std::vector shape(); - - void SetLoD(const std::vector>& x); - std::vector> lod() const; - - protected: - ZeroCopyTensor(void* scope) : scope_{scope} {} - void SetName(const std::string& name) { name_ = name; } - void* FindTensor() const; - - private: - std::string name_; - bool input_or_output_; - friend class AnalysisPredictor; - void* scope_{nullptr}; -}; - -/* - * A simple Inference API for Paddle. - */ -class PaddlePredictor { - public: - struct Config; - PaddlePredictor() = default; - PaddlePredictor(const PaddlePredictor&) = delete; - PaddlePredictor& operator=(const PaddlePredictor&) = delete; - - // Predict an record. - // The caller should be responsible for allocating and releasing the memory of - // `inputs`. `inputs` should be available until Run returns. Caller should be - // responsible for the output tensor's buffer, either allocated or passed from - // outside. - virtual bool Run(const std::vector& inputs, - std::vector* output_data, - int batch_size = -1) = 0; - - // Zero copy input and output optimization. - // Get the input or output tensors, and operate on their memory directly, - // without copy. - virtual std::unique_ptr GetInputTensor( - const std::string& name) { - return nullptr; - } - virtual std::unique_ptr GetOutputTensor( - const std::string& name) { - return nullptr; - } - virtual bool ZeroCopyRun() { return false; } - - // Clone a predictor that share the model weights, the Cloned predictor should - // be thread-safe. - virtual std::unique_ptr Clone() = 0; - - // Destroy the Predictor. - virtual ~PaddlePredictor() = default; - - // The common configs for all the predictors. - struct Config { - std::string model_dir; // path to the model directory. - }; -}; - -struct NativeConfig : public PaddlePredictor::Config { - // GPU related fields. - bool use_gpu{false}; - int device{0}; - float fraction_of_gpu_memory{-1.f}; // Change to a float in (0,1] if needed. - - // Specify the exact path of program and parameter files. - std::string prog_file; - std::string param_file; - - // Specify the variable's name of each input if input tensors don't follow the - // `feeds` and `fetches` of the phase `save_inference_model`. - bool specify_input_name{false}; -}; - -// A factory to help create different predictors. -// -// Usage: -// -// NativeConfig config; -// ... // change the configs. -// auto native_predictor = CreatePaddlePredictor(config); -// -// FOR EXTENSION DEVELOPER: -// Different predictors are designated by config type. Similar configs can be -// merged, but there shouldn't be a huge config containing different fields for -// more than one kind of predictors. -template -std::unique_ptr CreatePaddlePredictor(const ConfigT& config); - -// NOTE The following APIs are too trivial, we will discard it in the following -// versions. -enum class PaddleEngineKind { - kNative = 0, // Use the native Fluid facility. - kAutoMixedTensorRT, // Automatically mix Fluid with TensorRT. - kAnalysis, // More optimization. - kAnakin // Use Anakin for inference, not mature yet. -}; - -template -std::unique_ptr CreatePaddlePredictor(const ConfigT& config); - -// == -// -// ----------------------------------------------------------------------------------- -// NOTE: The following APIs are not mature yet, we are still working on them. - -namespace contrib { - -// Accelerate GPU computation with TensorRT engine. -struct MixedRTConfig : public NativeConfig { - // Determine whether a subgraph will be executed by TRT. - int min_subgraph_size{1}; - // While TensorRT allows an engine optimized for a given max batch size - // to run at any smaller size, the performance for those smaller - // sizes may not be as well-optimized. Therefore, Max batch is best - // equivalent to the runtime batch size. - int max_batch_size{1}; - // For workspace_size, refer it from here: - // https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#troubleshooting - int workspace_size{1 << 30}; - // We transform the Ops that can be converted into TRT layer in the model, - // and aggregate these Ops into subgraphs for TRT execution. - // We set this variable to control the minimum number of nodes in the - // subgraph, 3 as default value. - int minimum_subgraph_size = 3; - // Reserved configuration - // We just support "FP32" now, "FP16" and "INT8" will be supported. - std::string precision_mode = "FP32"; -}; - -// NOTE WIP, not stable yet. -struct AnalysisConfig : public NativeConfig { - enum class IrPassMode { - kSystem, // Use system default passes, not customize. - kInclude, // Specify the passes in `ir_passes`. - kExclude // Specify the disabled passes in `ir_passes`. - }; - - // Determine whether to perform graph optimization. - bool enable_ir_optim = true; - // Manually determine the IR passes to run. - IrPassMode ir_mode{IrPassMode::kExclude}; - std::vector ir_passes{"embedding_fc_lstm_fuse_pass"}; - - // NOT stable yet. - bool use_feed_fetch_ops{true}; - - // NOTE this is just for internal development, please not use it. - // NOT stable yet. - bool _use_mkldnn{false}; -}; - -// Configurations for Anakin engine. -struct AnakinConfig : public PaddlePredictor::Config { - enum TargetType { NVGPU = 0, X86 }; - int device; - std::string model_file; - int max_batch_size{-1}; - TargetType target_type; -}; - -} // namespace contrib - -int PaddleDtypeSize(PaddleDType dtype); - -} // namespace paddle +#include "paddle_analysis_config.h" // NOLINT +#include "paddle_api.h" // NOLINT +#ifdef WITH_ANAKIN +#include "paddle_anakin_config.h" // NOLINT +#endif diff --git a/paddle/fluid/inference/api/paddle_mkldnn_quantizer_config.h b/paddle/fluid/inference/api/paddle_mkldnn_quantizer_config.h new file mode 100644 index 0000000000000000000000000000000000000000..d46f842de7a2277ee5d00672386b12af7ba28deb --- /dev/null +++ b/paddle/fluid/inference/api/paddle_mkldnn_quantizer_config.h @@ -0,0 +1,105 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. +#pragma once + +#include +#include +#include +#include +#include +#include + +#include "paddle_api.h" // NOLINT + +namespace paddle { + +// Algorithms for finding scale of quantized Tensors. +enum class ScaleAlgo { + NONE, // Do not compute scale + MAX, // Find scale based on the maximum absolute value + MAX_CH, // Find scale based on the maximum absolute value per channel + KL, // Find scale based on KL Divergence +}; + +struct MkldnnQuantizerConfig { + MkldnnQuantizerConfig(); + + /** Specify a quantization algorithm for a connection (input/output) of the + * operator type. + * @param op_type_name the operator's name. + * @param conn_name name of the connection (input/output) of the operator. + * @param algo the algorithm for computing scale. + */ + void SetScaleAlgo(std::string op_type_name, std::string conn_name, + ScaleAlgo algo) { + rules_[op_type_name][conn_name] = algo; + } + + /** Get the quantization algorithm for a connection (input/output) of the + * operator type. + * @param op_type_name the operator's name. + * @param conn_name name of the connection (input/output) of the operator. + * @return the algorithm for computing scale. + */ + ScaleAlgo scale_algo(const std::string& op_type_name, + const std::string& conn_name) const; + + /** Set the batch of data to be used for warm-up iteration. + * @param data batch of data. + */ + void SetWarmupData(std::shared_ptr> data) { + warmup_data_ = data; + } + + /** Get the batch of data used for warm-up iteration. + * @return batch of data. + */ + std::shared_ptr> warmup_data() const { + return warmup_data_; + } + + void SetWarmupBatchSize(int batch_size) { warmup_bs_ = batch_size; } + + int warmup_batch_size() const { return warmup_bs_; } + + void SetEnabledOpTypes(std::unordered_set op_list) { + enabled_op_types_ = op_list; + } + + const std::unordered_set& enabled_op_types() const { + return enabled_op_types_; + } + + void SetExcludedOpIds(std::unordered_set op_ids_list) { + excluded_op_ids_ = op_ids_list; + } + + const std::unordered_set& excluded_op_ids() const { + return excluded_op_ids_; + } + + void SetDefaultScaleAlgo(ScaleAlgo algo) { default_scale_algo_ = algo; } + + ScaleAlgo default_scale_algo() const { return default_scale_algo_; } + + protected: + std::map> rules_; + std::unordered_set enabled_op_types_; + std::unordered_set excluded_op_ids_; + std::shared_ptr> warmup_data_; + int warmup_bs_{1}; + ScaleAlgo default_scale_algo_{ScaleAlgo::MAX}; +}; + +} // namespace paddle diff --git a/paddle/fluid/inference/api/paddle_pass_builder.cc b/paddle/fluid/inference/api/paddle_pass_builder.cc new file mode 100644 index 0000000000000000000000000000000000000000..9b0873aecb545067180723c363a38bed1552fb2a --- /dev/null +++ b/paddle/fluid/inference/api/paddle_pass_builder.cc @@ -0,0 +1,180 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/api/paddle_pass_builder.h" +#ifdef PADDLE_WITH_CUDA +#include +#endif +#include + +namespace paddle { + +void PaddlePassBuilder::AppendPass(const std::string &pass_type) { + passes_.push_back(pass_type); +} + +void PaddlePassBuilder::TurnOnDebug() { + std::vector passes; + auto it = std::begin(passes_); + while (it != std::end(passes_)) { + if (*it != "graph_viz_pass") { + it = passes_.insert(it + 1, "graph_viz_pass"); + } else { + ++it; + } + } +} + +std::string PaddlePassBuilder::DebugString() { + std::stringstream ss; + ss << "Passes to apply:\n"; + for (auto &pass : passes_) { + ss << " - " << pass << '\n'; + } + return ss.str(); +} + +void PaddlePassBuilder::DeletePass(const std::string &pass_type) { + auto it = std::begin(passes_); + while (it != std::end(passes_)) { + if (*it == pass_type) { + it = passes_.erase(it); + } else { + ++it; + } + } +} + +void PaddlePassBuilder::InsertPass(size_t idx, const std::string &pass_type) { + passes_.insert(std::begin(passes_) + idx, pass_type); +} + +void PaddlePassBuilder::DeletePass(size_t idx) { + passes_.erase(std::begin(passes_) + idx); +} + +void PaddlePassBuilder::AppendAnalysisPass(const std::string &pass) { + analysis_passes_.push_back(pass); +} + +void PaddlePassBuilder::ClearPasses() { passes_.clear(); } + +// The following passes works for Anakin sub-graph engine. +const std::vector kAnakinSubgraphPasses({ + "infer_clean_graph_pass", // + "simplify_anakin_priorbox_detection_out_pass", // + "fillconstant_elementwisemul_fuse", // + "fc_fuse_pass", // + "conv_elementwise_add_fuse_pass", // + "conv_bn_fuse_pass", // + "conv_elementwise_add_fuse_pass", // + "fc_gru_fuse_pass", // + "quant_conv2d_dequant_fuse_pass", // + "anakin_subgraph_pass", +}); + +GpuPassStrategy::GpuPassStrategy() : PassStrategy({}) { + passes_.assign({ + "infer_clean_graph_pass", // + "runtime_context_cache_pass", // + // "identity_scale_op_clean_pass", // + "conv_affine_channel_fuse_pass", // + "conv_eltwiseadd_affine_channel_fuse_pass", // + "conv_bn_fuse_pass", // +#if CUDNN_VERSION >= 7100 // To run conv_fusion, the version of cudnn must be + // guaranteed at least v7 + "conv_elementwise_add_act_fuse_pass", // + "conv_elementwise_add2_act_fuse_pass", // + "conv_elementwise_add_fuse_pass", // +#endif // + "transpose_flatten_concat_fuse_pass", + "expected_kernel_cache_pass", // + }); + + use_gpu_ = true; +} + +void GpuPassStrategy::EnableMKLDNN() { + LOG(ERROR) << "GPU not support MKLDNN yet"; +} + +void GpuPassStrategy::EnableMkldnnQuantizer() { + LOG(ERROR) << "GPU not support MKL-DNN quantization"; +} + +CpuPassStrategy::CpuPassStrategy() : PassStrategy({}) { + // NOTE the large fusions should be located in the front, so that they will + // not be damaged by smaller ones. + passes_.assign({ + "infer_clean_graph_pass", // + // TODO(luotao): runtime_context_cache_pass should be located in the + // front, see https://github.com/PaddlePaddle/Paddle/issues/16609, + // will enhance this pass later. + "runtime_context_cache_pass", // + "attention_lstm_fuse_pass", // + "seqconv_eltadd_relu_fuse_pass", // + // "seqpool_concat_fuse_pass", // + // "embedding_fc_lstm_fuse_pass", // + "fc_lstm_fuse_pass", // + "mul_lstm_fuse_pass", // + "fc_gru_fuse_pass", // + "mul_gru_fuse_pass", // + "seq_concat_fc_fuse_pass", // + "fc_fuse_pass", // + "repeated_fc_relu_fuse_pass", // + "squared_mat_sub_fuse_pass", // + "conv_bn_fuse_pass", // + "conv_eltwiseadd_bn_fuse_pass", // + "is_test_pass", // + "expected_kernel_cache_pass", // + }); + + use_gpu_ = false; +} + +void CpuPassStrategy::EnableMKLDNN() { +// TODO(Superjomn) Consider the way to mix CPU with GPU. +#ifdef PADDLE_WITH_MKLDNN + if (!use_mkldnn_) { + passes_.insert(passes_.begin(), "mkldnn_placement_pass"); + + for (auto &pass : std::vector( + {"depthwise_conv_mkldnn_pass", // + "conv_bn_fuse_pass", // Execute BN passes again to + "conv_eltwiseadd_bn_fuse_pass", // preserve correct pass order + "conv_bias_mkldnn_fuse_pass", // + "conv3d_bias_mkldnn_fuse_pass", // + "conv_elementwise_add_mkldnn_fuse_pass", + "conv_relu_mkldnn_fuse_pass"})) { + passes_.push_back(pass); + } + } + use_mkldnn_ = true; +#else + use_mkldnn_ = false; +#endif +} + +void CpuPassStrategy::EnableMkldnnQuantizer() { +#ifdef PADDLE_WITH_MKLDNN + if (!use_mkldnn_quantizer_) { + passes_.push_back("cpu_quantize_placement_pass"); + } + use_mkldnn_quantizer_ = true; +#else + use_mkldnn_quantizer_ = false; +#endif +} + +} // namespace paddle diff --git a/paddle/fluid/inference/api/paddle_pass_builder.h b/paddle/fluid/inference/api/paddle_pass_builder.h new file mode 100644 index 0000000000000000000000000000000000000000..09ef195d5e66aff0cef17f1594de34c656187a35 --- /dev/null +++ b/paddle/fluid/inference/api/paddle_pass_builder.h @@ -0,0 +1,146 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include + +/*! \file */ + +/*! \namespace paddle */ +namespace paddle { + +/** This is a pass builder based on string. It is part of inference API. + */ +class PaddlePassBuilder { + public: + explicit PaddlePassBuilder(const std::vector &passes) + : passes_(passes) {} + + void SetPasses(std::initializer_list passes) { + passes_ = passes; + } + + /** Append a pass to the end of the passes. */ + void AppendPass(const std::string &pass_type); + + /** Insert a pass to a specific position. + * @param idx the position to insert. + * @param pass_type the pass key. + */ + void InsertPass(size_t idx, const std::string &pass_type); + + /** Delete the `idx`-th pass. */ + void DeletePass(size_t idx); + + /** Delete all the passes that has type `pass_type`. */ + void DeletePass(const std::string &pass_type); + + void ClearPasses(); + /** Append an analysis pass. */ + void AppendAnalysisPass(const std::string &pass); + + /** Visualize the computation graph after each pass by generating a DOT + * language file, one can draw them with the Graphviz toolkit. + */ + void TurnOnDebug(); + + /** Human-readible information. */ + std::string DebugString(); + + const std::vector &AllPasses() const { return passes_; } + std::vector AnalysisPasses() const { + auto passes = analysis_passes_; + // To make sure the ir_graph_to_program should be the last pass so any + // modication of IR will persist to the program. + passes.push_back("ir_graph_to_program_pass"); + return passes; + } + + protected: + std::vector analysis_passes_{ + {"ir_graph_build_pass", "ir_analysis_pass", + "ir_params_sync_among_devices_pass"}}; + std::vector passes_; +}; + +/**Pass strategy to help control the IR passes. + */ +class PassStrategy : public PaddlePassBuilder { + public: + explicit PassStrategy(const std::vector &passes) + : PaddlePassBuilder(passes) {} + + /** The MKLDNN control exists in both CPU and GPU mode, because there can be + * still some CPU kernels running in CPU mode. + */ + virtual void EnableMKLDNN() {} + + /** Enable MKLDNN quantize optimization + */ + virtual void EnableMkldnnQuantizer() {} + + bool use_gpu() const { return use_gpu_; } + + virtual ~PassStrategy() = default; + + protected: + bool use_gpu_{false}; + bool use_mkldnn_{false}; +}; + +/** The CPU passes controller, it is used in AnalysisPredictor with CPU mode. + */ +class CpuPassStrategy : public PassStrategy { + public: + CpuPassStrategy(); + + explicit CpuPassStrategy(const CpuPassStrategy &other) + : PassStrategy(other.AllPasses()) { + use_gpu_ = other.use_gpu_; + use_mkldnn_ = other.use_mkldnn_; + use_mkldnn_quantizer_ = other.use_mkldnn_quantizer_; + } + + virtual ~CpuPassStrategy() = default; + + void EnableMKLDNN() override; + void EnableMkldnnQuantizer() override; + + protected: + bool use_mkldnn_quantizer_{false}; +}; + +/** The GPU passes strategy, it is used in AnalysisPredictor with GPU mode. + */ +class GpuPassStrategy : public PassStrategy { + public: + GpuPassStrategy(); + + explicit GpuPassStrategy(const GpuPassStrategy &other) + : PassStrategy(other.AllPasses()) { + use_gpu_ = true; + } + + void EnableMKLDNN() override; + void EnableMkldnnQuantizer() override; + + virtual ~GpuPassStrategy() = default; +}; + +extern const std::vector kAnakinSubgraphPasses; + +} // namespace paddle diff --git a/paddle/fluid/inference/engine.h b/paddle/fluid/inference/engine.h index ce2b8161715a3fa2278ce950dbac82c6d0042bef..1a13ba510384c010e476bf0ba0ad5b0ba84d3240 100644 --- a/paddle/fluid/inference/engine.h +++ b/paddle/fluid/inference/engine.h @@ -49,11 +49,6 @@ class EngineBase { // Execute the engine, that will run the inference network. virtual void Execute(int batch_size) = 0; - // Return the IO buffer that allocated in engine. One can read/write directly - // on the buffer. If the buffer's buffer is nullptr, one can also allocate - // memory and maintain it outside the engine. - virtual Buffer& buffer(const std::string& name) = 0; - virtual ~EngineBase() {} }; // class EngineBase diff --git a/paddle/fluid/inference/io.cc b/paddle/fluid/inference/io.cc index e246a06fd079d837ac321197914c9f70b528f2c8..ae72a74acce826c3635d5d537540eaad79ff8199 100644 --- a/paddle/fluid/inference/io.cc +++ b/paddle/fluid/inference/io.cc @@ -59,7 +59,8 @@ void ReadBinaryFile(const std::string& filename, std::string* contents) { bool IsPersistable(const framework::VarDesc* var) { if (var->Persistable() && var->GetType() != framework::proto::VarType::FEED_MINIBATCH && - var->GetType() != framework::proto::VarType::FETCH_LIST) { + var->GetType() != framework::proto::VarType::FETCH_LIST && + var->GetType() != framework::proto::VarType::RAW) { return true; } return false; @@ -68,7 +69,8 @@ bool IsPersistable(const framework::VarDesc* var) { void LoadPersistables(framework::Executor* executor, framework::Scope* scope, const framework::ProgramDesc& main_program, const std::string& dirname, - const std::string& param_filename) { + const std::string& param_filename, + bool model_from_memory = false) { const framework::BlockDesc& global_block = main_program.Block(0); framework::ProgramDesc* load_program = new framework::ProgramDesc(); @@ -77,7 +79,7 @@ void LoadPersistables(framework::Executor* executor, framework::Scope* scope, for (auto* var : global_block.AllVars()) { if (IsPersistable(var)) { - VLOG(3) << "persistable variable's name: " << var->Name(); + VLOG(4) << "persistable variable's name: " << var->Name(); framework::VarDesc* new_var = load_block->Var(var->Name()); new_var->SetShape(var->GetShape()); @@ -107,6 +109,7 @@ void LoadPersistables(framework::Executor* executor, framework::Scope* scope, op->SetType("load_combine"); op->SetOutput("Out", paramlist); op->SetAttr("file_path", {param_filename}); + op->SetAttr("model_from_memory", {model_from_memory}); op->CheckAttrs(); } @@ -129,16 +132,17 @@ std::unique_ptr Load(framework::Executor* executor, "model version %ld is not supported.", main_program->Version()); - LoadPersistables(executor, scope, *main_program, dirname, ""); + // model_from_memory is false in seperate parameters. + LoadPersistables(executor, scope, *main_program, dirname, "", + false /* model_from_memory */); return main_program; } std::unique_ptr Load( framework::Executor* executor, framework::Scope* scope, const std::string& prog_filename, const std::string& param_filename) { - std::string model_filename = prog_filename; std::string program_desc_str; - ReadBinaryFile(model_filename, &program_desc_str); + ReadBinaryFile(prog_filename, &program_desc_str); std::unique_ptr main_program( new framework::ProgramDesc(program_desc_str)); @@ -146,7 +150,22 @@ std::unique_ptr Load( "model version %ld is not supported.", main_program->Version()); - LoadPersistables(executor, scope, *main_program, "", param_filename); + LoadPersistables(executor, scope, *main_program, "", param_filename, + false /* model_from_memory */); + return main_program; +} + +std::unique_ptr LoadFromMemory( + framework::Executor* executor, framework::Scope* scope, + const std::string& prog_buffer, const std::string& param_buffer) { + std::unique_ptr main_program( + new framework::ProgramDesc(prog_buffer)); + PADDLE_ENFORCE(framework::IsProgramVersionSupported(main_program->Version()), + "model version %ld is not supported.", + main_program->Version()); + + LoadPersistables(executor, scope, *main_program, "", param_buffer, + true /* model_filename */); return main_program; } diff --git a/paddle/fluid/inference/io.h b/paddle/fluid/inference/io.h index ab492577c1476abee30d6dd1c740394391e5a93a..317ef9d93acf3af174cb44da6099425fff1418eb 100644 --- a/paddle/fluid/inference/io.h +++ b/paddle/fluid/inference/io.h @@ -30,7 +30,8 @@ void Init(const std::vector argv); void LoadPersistables(framework::Executor* executor, framework::Scope* scope, const framework::ProgramDesc& main_program, const std::string& dirname, - const std::string& param_filename); + const std::string& param_filename, + bool model_from_memory); std::unique_ptr Load(framework::Executor* executor, framework::Scope* scope, @@ -41,6 +42,10 @@ std::unique_ptr Load(framework::Executor* executor, const std::string& prog_filename, const std::string& param_filename); +std::unique_ptr LoadFromMemory( + framework::Executor* executor, framework::Scope* scope, + const std::string& prog_buffer, const std::string& param_buffer); + // Save the variables from a scope to disk. void SaveVars(const framework::Scope& scope, const std::vector& vars, const std::string& dirname, diff --git a/paddle/fluid/inference/tensorrt/CMakeLists.txt b/paddle/fluid/inference/tensorrt/CMakeLists.txt index a610687a5b11999a7cb7426dbe961e5972ee1746..f4977d08c4d051b8a528e122c47948c3c81d153c 100644 --- a/paddle/fluid/inference/tensorrt/CMakeLists.txt +++ b/paddle/fluid/inference/tensorrt/CMakeLists.txt @@ -1,4 +1,6 @@ -nv_library(tensorrt_engine SRCS engine.cc DEPS framework_proto device_context) +nv_library(tensorrt_engine SRCS engine.cc trt_int8_calibrator.cc DEPS ${GLOB_OPERATOR_DEPS} framework_proto device_context) +nv_library(tensorrt_op_teller SRCS op_teller.cc DEPS framework_proto) nv_test(test_tensorrt SRCS test_tensorrt.cc DEPS dynload_cuda device_context dynamic_loader) nv_test(test_tensorrt_engine SRCS test_engine.cc DEPS dynload_cuda tensorrt_engine) +add_subdirectory(plugin) add_subdirectory(convert) diff --git a/paddle/fluid/inference/tensorrt/convert/CMakeLists.txt b/paddle/fluid/inference/tensorrt/convert/CMakeLists.txt index 0a35e10f6936313928ab21a6f17c40335e8fc882..840abd26a755c39bc9c17315aefdd0dec862e77c 100644 --- a/paddle/fluid/inference/tensorrt/convert/CMakeLists.txt +++ b/paddle/fluid/inference/tensorrt/convert/CMakeLists.txt @@ -1,33 +1,42 @@ # Add TRT tests nv_library(tensorrt_converter - SRCS mul_op.cc conv2d_op.cc fc_op.cc pool2d_op.cc elementwise_op.cc -batch_norm_op.cc activation_op.cc softmax_op.cc concat_op.cc dropout_op.cc pad_op.cc - DEPS tensorrt_engine operator scope framework_proto op_registry) + SRCS mul_op.cc conv2d_op.cc fc_op.cc pool2d_op.cc elementwise_op.cc + batch_norm_op.cc activation_op.cc softmax_op.cc concat_op.cc dropout_op.cc + pad_op.cc split_op.cc prelu_op.cc leaky_relu_op.cc + DEPS tensorrt_engine tensorrt_plugin operator scope framework_proto op_registry) nv_test(test_op_converter SRCS test_op_converter.cc DEPS - ${FLUID_CORE_MODULES} tensorrt_engine tensorrt_converter) + ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine tensorrt_converter) nv_test(test_io_converter SRCS test_io_converter.cc io_converter.cc DEPS dynload_cuda dynamic_loader lod_tensor) nv_test(test_trt_mul_op SRCS test_mul_op.cc mul_op.cc - DEPS ${FLUID_CORE_MODULES} tensorrt_engine mul_op SERIAL) + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine mul_op SERIAL) nv_test(test_trt_fc_op SRCS test_fc_op.cc fc_op.cc - DEPS ${FLUID_CORE_MODULES} tensorrt_engine mul_op SERIAL) + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine mul_op SERIAL) nv_test(test_trt_activation_op SRCS test_activation_op.cc activation_op.cc - DEPS ${FLUID_CORE_MODULES} tensorrt_engine activation_op SERIAL) + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine activation_op SERIAL) nv_test(test_trt_conv_op SRCS test_conv2d_op.cc conv2d_op.cc - DEPS ${FLUID_CORE_MODULES} tensorrt_engine conv_op SERIAL) + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine conv_op conv_transpose_op SERIAL) nv_test(test_trt_pool2d_op SRCS test_pool2d_op.cc pool2d_op.cc - DEPS ${FLUID_CORE_MODULES} tensorrt_engine pool_op SERIAL) + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine pool_op tensorrt_plugin SERIAL) nv_test(test_trt_elementwise_op SRCS test_elementwise_op.cc elementwise_op.cc - DEPS ${FLUID_CORE_MODULES} tensorrt_engine elementwise_add_op SERIAL) + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine tensorrt_plugin + elementwise_add_op elementwise_mul_op SERIAL) nv_test(test_trt_softmax_op SRCS test_softmax_op.cc softmax_op.cc - DEPS ${FLUID_CORE_MODULES} tensorrt_engine softmax_op SERIAL) + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine softmax_op SERIAL) nv_test(test_trt_batch_norm_op SRCS test_batch_norm_op.cc batch_norm_op.cc - DEPS ${FLUID_CORE_MODULES} tensorrt_engine batch_norm_op SERIAL) + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine batch_norm_op SERIAL) nv_test(test_trt_concat_op SRCS test_concat_op.cc concat_op.cc - DEPS ${FLUID_CORE_MODULES} tensorrt_engine concat_op SERIAL) + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine concat_op SERIAL) nv_test(test_trt_dropout_op SRCS test_dropout_op.cc dropout_op.cc - DEPS ${FLUID_CORE_MODULES} tensorrt_engine dropout_op SERIAL) - + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine dropout_op SERIAL) nv_test(test_trt_pad_op SRCS test_pad_op.cc pad_op.cc - DEPS ${FLUID_CORE_MODULES} tensorrt_engine pad_op SERIAL) + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine pad_op SERIAL) +nv_test(test_trt_split_op SRCS test_split_op.cc split_op.cc + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine tensorrt_plugin + split_op concat_op SERIAL) +nv_test(test_trt_prelu_op SRCS test_prelu_op.cc prelu_op.cc + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine tensorrt_plugin + prelu_op SERIAL) +nv_test(test_trt_leaky_relu_op SRCS test_leaky_relu_op.cc leaky_relu_op.cc + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine activation_op SERIAL) diff --git a/paddle/fluid/inference/tensorrt/convert/activation_op.cc b/paddle/fluid/inference/tensorrt/convert/activation_op.cc index e73c5bbf57501e4ff3c080a46d91685035652bfa..0b756534ec6fbf27a3e92bf39fb7544d9785ca48 100644 --- a/paddle/fluid/inference/tensorrt/convert/activation_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/activation_op.cc @@ -27,7 +27,7 @@ class ActivationOpConverter : public OpConverter { // Here the two nullptr looks strange, that's because the // framework::OpDesc's constructor is strange. framework::OpDesc op_desc(op, nullptr); - LOG(INFO) + VLOG(3) << "convert a fluid Activation op to tensorrt activation layer whose " "type is " << op_type_; diff --git a/paddle/fluid/inference/tensorrt/convert/batch_norm_op.cc b/paddle/fluid/inference/tensorrt/convert/batch_norm_op.cc index 3330af2da6c97ad153dcecd86be4b441eac62b5e..d017bac66dd99a4b54c44ec786de61d1e66b8981 100644 --- a/paddle/fluid/inference/tensorrt/convert/batch_norm_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/batch_norm_op.cc @@ -23,7 +23,7 @@ class BatchNormOpConverter : public OpConverter { public: void operator()(const framework::proto::OpDesc& op, const framework::Scope& scope, bool test_mode) override { - LOG(INFO) << "convert a fluid batch norm op to tensorrt batch_norm"; + VLOG(3) << "convert a fluid batch norm op to tensorrt batch_norm"; framework::OpDesc op_desc(op, nullptr); PADDLE_ENFORCE_EQ(op_desc.Input("X").size(), 1); diff --git a/paddle/fluid/inference/tensorrt/convert/concat_op.cc b/paddle/fluid/inference/tensorrt/convert/concat_op.cc index a11dfa1e8f2dacfad067d025678911200db500fb..525ba9dc341c8c1343553ac9523611f79ac3aa2d 100644 --- a/paddle/fluid/inference/tensorrt/convert/concat_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/concat_op.cc @@ -19,13 +19,13 @@ namespace inference { namespace tensorrt { /* - * MulOp, IMatrixMultiplyLayer in TRT. This Layer doesn't has weights. + * ConcatOp */ class ConcatOpConverter : public OpConverter { public: void operator()(const framework::proto::OpDesc& op, const framework::Scope& scope, bool test_mode) override { - VLOG(4) << "convert a fluid mul op to tensorrt mul layer without bias"; + VLOG(3) << "convert a fluid mul op to tensorrt mul layer without bias"; framework::OpDesc op_desc(op, nullptr); // Declare inputs diff --git a/paddle/fluid/inference/tensorrt/convert/conv2d_op.cc b/paddle/fluid/inference/tensorrt/convert/conv2d_op.cc index 0a37d3968c39d2c244bbd82161afddf6330e421d..39a99a21ea702032669ed4ed3016ab34128c9925 100644 --- a/paddle/fluid/inference/tensorrt/convert/conv2d_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/conv2d_op.cc @@ -18,74 +18,122 @@ namespace paddle { namespace inference { namespace tensorrt { +template +void ConvertConv2d(TensorRTEngine* engine, const framework::proto::OpDesc& op, + const framework::Scope& scope, bool test_mode, + RegistFunc fadd_layer, SetDilationFunc fset_dilation, + const std::string& name) { + VLOG(3) << "convert a fluid " << name << " op to tensorrt layer without bias"; + + framework::OpDesc op_desc(op, nullptr); + PADDLE_ENFORCE_EQ(op_desc.Input("Input").size(), 1); + PADDLE_ENFORCE_EQ(op_desc.Input("Filter").size(), 1); // Y is a weight + PADDLE_ENFORCE_EQ(op_desc.Output("Output").size(), 1); + + PADDLE_ENFORCE(engine != nullptr); + auto* X = engine->GetITensor(op_desc.Input("Input").front()); + + // Declare weights + auto* Y_v = scope.FindVar(op_desc.Input("Filter").front()); + PADDLE_ENFORCE_NOT_NULL(Y_v); + auto* Y_t = Y_v->GetMutable(); + + platform::CPUPlace cpu_place; + std::unique_ptr weight_tensor( + new framework::LoDTensor()); + weight_tensor->Resize(Y_t->dims()); + TensorCopySync((*Y_t), cpu_place, weight_tensor.get()); + + auto* weight_data = weight_tensor->mutable_data(cpu_place); + + PADDLE_ENFORCE_EQ(weight_tensor->dims().size(), 4UL); + const int n_output = weight_tensor->dims()[0]; + const int n_input = weight_tensor->dims()[1]; + const int filter_h = weight_tensor->dims()[2]; + const int filter_w = weight_tensor->dims()[3]; + const int groups = boost::get(op_desc.GetAttr("groups")); + const std::vector dilations = + boost::get>(op_desc.GetAttr("dilations")); + const std::vector strides = + boost::get>(op_desc.GetAttr("strides")); + const std::vector paddings = + boost::get>(op_desc.GetAttr("paddings")); + + nvinfer1::DimsHW nv_ksize(filter_h, filter_w); + nvinfer1::DimsHW nv_dilations(dilations[0], dilations[1]); + nvinfer1::DimsHW nv_strides(strides[0], strides[1]); + nvinfer1::DimsHW nv_paddings(paddings[0], paddings[1]); + + TensorRTEngine::Weight weight{nvinfer1::DataType::kFLOAT, + static_cast(weight_data), + static_cast(weight_tensor->numel())}; + + TensorRTEngine::Weight bias{nvinfer1::DataType::kFLOAT, nullptr, 0}; + auto* layer = fadd_layer(const_cast(X), n_output, n_input, + nv_ksize, weight, bias); + PADDLE_ENFORCE(layer != nullptr); + layer->setStride(nv_strides); + layer->setPadding(nv_paddings); + layer->setNbGroups(groups); + // set dilations + fset_dilation(layer, nv_dilations); + + auto output_name = op_desc.Output("Output").front(); + layer->setName((name + " (Output: " + output_name + ")").c_str()); + engine->weight_map[op_desc.Input("Filter").front()] = + std::move(weight_tensor); + layer->getOutput(0)->setName(output_name.c_str()); + engine->SetITensor(output_name, layer->getOutput(0)); + + if (test_mode) { + engine->DeclareOutput(output_name); + } +} + class Conv2dOpConverter : public OpConverter { public: void operator()(const framework::proto::OpDesc& op, const framework::Scope& scope, bool test_mode) override { - LOG(INFO) - << "convert a fluid conv2d op to tensorrt conv layer without bias"; - - framework::OpDesc op_desc(op, nullptr); - PADDLE_ENFORCE_EQ(op_desc.Input("Input").size(), 1); - PADDLE_ENFORCE_EQ(op_desc.Input("Filter").size(), 1); // Y is a weight - PADDLE_ENFORCE_EQ(op_desc.Output("Output").size(), 1); - - auto* X = engine_->GetITensor(op_desc.Input("Input").front()); - // Declare weights - auto* Y_v = scope.FindVar(op_desc.Input("Filter").front()); - PADDLE_ENFORCE_NOT_NULL(Y_v); - auto* Y_t = Y_v->GetMutable(); - - platform::CPUPlace cpu_place; - std::unique_ptr weight_tensor( - new framework::LoDTensor()); - weight_tensor->Resize(Y_t->dims()); - TensorCopySync((*Y_t), cpu_place, weight_tensor.get()); - - auto* weight_data = - weight_tensor->mutable_data(platform::CPUPlace()); - - PADDLE_ENFORCE_EQ(weight_tensor->dims().size(), 4UL); - const int n_output = weight_tensor->dims()[0]; - const int filter_h = weight_tensor->dims()[2]; - const int filter_w = weight_tensor->dims()[3]; - - const int groups = boost::get(op_desc.GetAttr("groups")); - const std::vector dilations = - boost::get>(op_desc.GetAttr("dilations")); - const std::vector strides = - boost::get>(op_desc.GetAttr("strides")); - const std::vector paddings = - boost::get>(op_desc.GetAttr("paddings")); - - nvinfer1::DimsHW nv_ksize(filter_h, filter_w); - nvinfer1::DimsHW nv_dilations(dilations[0], dilations[1]); - nvinfer1::DimsHW nv_strides(strides[0], strides[1]); - nvinfer1::DimsHW nv_paddings(paddings[0], paddings[1]); - - TensorRTEngine::Weight weight{nvinfer1::DataType::kFLOAT, - static_cast(weight_data), - weight_tensor->memory_size() / sizeof(float)}; - - TensorRTEngine::Weight bias{nvinfer1::DataType::kFLOAT, nullptr, 0}; - auto* layer = TRT_ENGINE_ADD_LAYER( - engine_, Convolution, *const_cast(X), n_output, - nv_ksize, weight.get(), bias.get()); - PADDLE_ENFORCE(layer != nullptr); - layer->setStride(nv_strides); - layer->setPadding(nv_paddings); - layer->setDilation(nv_dilations); - layer->setNbGroups(groups); - - auto output_name = op_desc.Output("Output").front(); - layer->setName(("conv2d (Output: " + output_name + ")").c_str()); - engine_->weight_map[op_desc.Input("Filter").front()] = - std::move(weight_tensor); - layer->getOutput(0)->setName(output_name.c_str()); - engine_->SetITensor(output_name, layer->getOutput(0)); - if (test_mode) { - engine_->DeclareOutput(output_name); - } + ConvertConv2d( + engine_, op, scope, test_mode, + [&](nvinfer1::ITensor* inputs, int n_output, /* Conv output maps */ + int n_input, /* Conv input maps */ + nvinfer1::DimsHW& ksize, TensorRTEngine::Weight& weight, + TensorRTEngine::Weight& bias) -> nvinfer1::IConvolutionLayer* { + auto* layer = + TRT_ENGINE_ADD_LAYER(engine_, Convolution, *inputs, n_output, + ksize, weight.get(), bias.get()); + return layer; + }, + [](nvinfer1::IConvolutionLayer* layer, nvinfer1::DimsHW& dilations) { + layer->setDilation(dilations); + }, + "conv2d"); + } +}; + +class Deconv2dOpConverter : public OpConverter { + public: + void operator()(const framework::proto::OpDesc& op, + const framework::Scope& scope, bool test_mode) override { + ConvertConv2d( + engine_, op, scope, test_mode, + [&](nvinfer1::ITensor* inputs, int n_output, /* Deconv input maps */ + int n_input, /* Deconv output maps */ + nvinfer1::DimsHW& ksize, TensorRTEngine::Weight& weight, + TensorRTEngine::Weight& bias) -> nvinfer1::IDeconvolutionLayer* { + auto* layer = + TRT_ENGINE_ADD_LAYER(engine_, Deconvolution, *inputs, n_input, + ksize, weight.get(), bias.get()); + return layer; + }, + [](nvinfer1::IDeconvolutionLayer* layer, nvinfer1::DimsHW& dilations) { + PADDLE_ENFORCE( + dilations.d[0] == 1 && dilations.d[1] == 1, + "Dilations must be (1, 1) for tensorRT, but given (%d, %d)", + dilations.d[0], dilations.d[1]); + }, + "conv2d_transpose"); } }; @@ -94,3 +142,4 @@ class Conv2dOpConverter : public OpConverter { } // namespace paddle REGISTER_TRT_OP_CONVERTER(conv2d, Conv2dOpConverter); +REGISTER_TRT_OP_CONVERTER(conv2d_transpose, Deconv2dOpConverter); diff --git a/paddle/fluid/inference/tensorrt/convert/dropout_op.cc b/paddle/fluid/inference/tensorrt/convert/dropout_op.cc index 9533ecbcfda4e2500fd201d8efc64fc5bd97169a..ddbc724e3b2a48b75df17f9bda691a1fd3883c32 100644 --- a/paddle/fluid/inference/tensorrt/convert/dropout_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/dropout_op.cc @@ -25,7 +25,7 @@ class DropoutOpConverter : public OpConverter { public: void operator()(const framework::proto::OpDesc& op, const framework::Scope& scope, bool test_mode) override { - VLOG(4) << "convert a fluid dropout op to tensorrt dropout layer"; + VLOG(3) << "convert a fluid dropout op to tensorrt dropout layer"; framework::OpDesc op_desc(op, nullptr); // Declare inputs auto* input1 = engine_->GetITensor(op_desc.Input("X")[0]); diff --git a/paddle/fluid/inference/tensorrt/convert/elementwise_op.cc b/paddle/fluid/inference/tensorrt/convert/elementwise_op.cc index 0a6ce568f194f03c7259e1ebf28dd6ce4df2d594..0c5a1a6ef16f05308df22452ed5e184e94e117d2 100644 --- a/paddle/fluid/inference/tensorrt/convert/elementwise_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/elementwise_op.cc @@ -4,7 +4,7 @@ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at -http://www.apache.org/licenses/LICENSE-2.0 + http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, @@ -13,11 +13,25 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/inference/tensorrt/convert/op_converter.h" +#include "paddle/fluid/inference/tensorrt/plugin/elementwise_op_plugin.h" namespace paddle { namespace inference { namespace tensorrt { +static bool CheckDims(const nvinfer1::Dims& dims_x, + const nvinfer1::Dims& dims_y) { + if (dims_x.nbDims != dims_y.nbDims) { + return false; + } + for (int i = 0; i < dims_x.nbDims; i++) { + if (dims_x.d[i] != dims_y.d[i]) { + return false; + } + } + return true; +} + class ElementwiseWeightOpConverter : public OpConverter { public: ElementwiseWeightOpConverter() {} @@ -25,8 +39,9 @@ class ElementwiseWeightOpConverter : public OpConverter { const framework::Scope& scope, bool test_mode) override { // Here the two nullptr looks strange, that's because the // framework::OpDesc's constructor is strange. + nvinfer1::ILayer* layer = nullptr; framework::OpDesc op_desc(op, nullptr); - LOG(INFO) << "convert a fluid elementwise op to tensorrt IScaleLayer"; + VLOG(3) << "Convert a fluid elementwise op to TensorRT IScaleLayer"; PADDLE_ENFORCE_EQ(op_desc.Input("X").size(), 1); PADDLE_ENFORCE_EQ(op_desc.Input("Y").size(), 1); // Y is a weight @@ -34,7 +49,8 @@ class ElementwiseWeightOpConverter : public OpConverter { auto* X = engine_->GetITensor(op_desc.Input("X").front()); nvinfer1::Dims dims_x = X->getDimensions(); - PADDLE_ENFORCE(dims_x.nbDims >= 3); + PADDLE_ENFORCE(dims_x.nbDims >= 3, "x dims experts 3, but %d is given.", + dims_x.nbDims); auto* Y_v = scope.FindVar(op_desc.Input("Y").front()); PADDLE_ENFORCE_NOT_NULL(Y_v); @@ -83,13 +99,21 @@ class ElementwiseWeightOpConverter : public OpConverter { 0}; TensorRTEngine::Weight power_weights{nvinfer1::DataType::kFLOAT, nullptr, 0}; + if (op_type_ == "add") { + nvinfer1::IScaleLayer* scale_layer = TRT_ENGINE_ADD_LAYER( + engine_, Scale, *X, scale_mode, shift_weights.get(), + scale_weights.get(), power_weights.get()); + layer = scale_layer; + } else if (op_type_ == "mul") { + nvinfer1::IScaleLayer* scale_layer = TRT_ENGINE_ADD_LAYER( + engine_, Scale, *X, scale_mode, scale_weights.get(), + shift_weights.get(), power_weights.get()); + layer = scale_layer; + } - nvinfer1::IScaleLayer* layer = TRT_ENGINE_ADD_LAYER( - engine_, Scale, *const_cast(X), scale_mode, - shift_weights.get(), scale_weights.get(), power_weights.get()); auto output_name = op_desc.Output("Out")[0]; - - layer->setName(("elementwise_add (Output: " + output_name + ")").c_str()); + layer->setName( + ("elementwise_" + op_type_ + "(Output: " + output_name + ")").c_str()); layer->getOutput(0)->setName(output_name.c_str()); engine_->weight_map[op_desc.Input("Y").front()] = std::move(weight_tensor); engine_->SetITensor(output_name, layer->getOutput(0)); @@ -98,6 +122,9 @@ class ElementwiseWeightOpConverter : public OpConverter { engine_->DeclareOutput(output_name); } } + + protected: + std::string op_type_; }; class ElementwiseTensorOpConverter : public OpConverter { @@ -105,10 +132,12 @@ class ElementwiseTensorOpConverter : public OpConverter { ElementwiseTensorOpConverter() {} void operator()(const framework::proto::OpDesc& op, const framework::Scope& scope, bool test_mode) override { + auto op_pair = ops.find(op_type_); + PADDLE_ENFORCE(op_pair != ops.end(), "Wrong elementwise op type!"); + // Here the two nullptr looks strange, that's because the // framework::OpDesc's constructor is strange. framework::OpDesc op_desc(op, nullptr); - LOG(INFO) << "convert a fluid elementwise op to tensorrt IScaleLayer"; PADDLE_ENFORCE_EQ(op_desc.Input("X").size(), 1); PADDLE_ENFORCE_EQ(op_desc.Input("Y").size(), 1); // Y is a weight @@ -119,29 +148,34 @@ class ElementwiseTensorOpConverter : public OpConverter { nvinfer1::Dims dims_x = X->getDimensions(); nvinfer1::Dims dims_y = Y->getDimensions(); - // The two input tensor should have the same dims - PADDLE_ENFORCE(dims_x.nbDims >= 3); - if (dims_x.nbDims == dims_y.nbDims) { - for (int i = 0; i < dims_x.nbDims; i++) { - if (dims_x.d[i] != dims_y.d[i]) - PADDLE_THROW("TensorRT unsupported tensor shape for Elementwise op!"); - } + int axis = boost::get(op_desc.GetAttr("axis")); + auto output_name = op_desc.Output("Out")[0]; + if (CheckDims(dims_x, dims_y)) { + // The two input tensor should have the same dims + VLOG(3) << "Convert a fluid elementwise op to TensorRT IElementWiseLayer"; + nvinfer1::IElementWiseLayer* layer = TRT_ENGINE_ADD_LAYER( + engine_, ElementWise, *const_cast(X), + *const_cast(Y), op_pair->second); + + layer->setName(("elementwise (Output: " + output_name + ")").c_str()); + layer->getOutput(0)->setName(output_name.c_str()); + engine_->SetITensor(output_name, layer->getOutput(0)); } else { - PADDLE_THROW("TensorRT unsupported tensor shape for Elementwise op!"); + VLOG(3) << "Convert a fluid elementwise op to TensorRT " + "ElementWisePluginLayer"; + + plugin::ElementWisePlugin* plugin = + new plugin::ElementWisePlugin(op_type_, dims_x, dims_y, axis); + plugin->AddInput(X); + plugin->AddInput(Y); + nvinfer1::IPluginLayer* layer = engine_->AddPlugin( + const_cast(plugin->GetInputs().data()), 2, + reinterpret_cast(plugin)); + + layer->setName(("elementwise (Output: " + output_name + ")").c_str()); + layer->getOutput(0)->setName(output_name.c_str()); + engine_->SetITensor(output_name, layer->getOutput(0)); } - - auto op_pair = ops.find(op_type_); - if (op_pair == ops.end()) { - PADDLE_THROW("Wrong elementwise op type!"); - } - nvinfer1::IElementWiseLayer* layer = TRT_ENGINE_ADD_LAYER( - engine_, ElementWise, *const_cast(X), - *const_cast(Y), op_pair->second); - - auto output_name = op_desc.Output("Out")[0]; - layer->setName(("elementwise (Output: " + output_name + ")").c_str()); - layer->getOutput(0)->setName(output_name.c_str()); - engine_->SetITensor(output_name, layer->getOutput(0)); if (test_mode) { // the test framework can not determine which is the // output, so place the declaration inside. engine_->DeclareOutput(output_name); @@ -165,6 +199,16 @@ const std::unordered_map {"max", nvinfer1::ElementWiseOperation::kMAX}, }; +class ElementwiseWeightAddOpConverter : public ElementwiseWeightOpConverter { + public: + ElementwiseWeightAddOpConverter() { op_type_ = "add"; } +}; + +class ElementwiseWeightMulOpConverter : public ElementwiseWeightOpConverter { + public: + ElementwiseWeightMulOpConverter() { op_type_ = "mul"; } +}; + class ElementwiseTensorAddOpConverter : public ElementwiseTensorOpConverter { public: ElementwiseTensorAddOpConverter() { op_type_ = "add"; } @@ -204,7 +248,10 @@ class ElementwiseTensorPowOpConverter : public ElementwiseTensorOpConverter { } // namespace inference } // namespace paddle -REGISTER_TRT_OP_CONVERTER(elementwise_add_weight, ElementwiseWeightOpConverter); +REGISTER_TRT_OP_CONVERTER(elementwise_add_weight, + ElementwiseWeightAddOpConverter); +REGISTER_TRT_OP_CONVERTER(elementwise_mul_weight, + ElementwiseWeightMulOpConverter); REGISTER_TRT_OP_CONVERTER(elementwise_add_tensor, ElementwiseTensorAddOpConverter); diff --git a/paddle/fluid/inference/tensorrt/convert/fc_op.cc b/paddle/fluid/inference/tensorrt/convert/fc_op.cc index 7c21ecd95da07b498eed2ab1bbdcc0e8cd184787..42dcd68e40e04e775961fd943070f3df2f28d99a 100644 --- a/paddle/fluid/inference/tensorrt/convert/fc_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/fc_op.cc @@ -52,7 +52,7 @@ class FcOpConverter : public OpConverter { public: void operator()(const framework::proto::OpDesc& op, const framework::Scope& scope, bool test_mode) override { - VLOG(4) << "convert a fluid fc op to tensorrt fc layer without bias"; + VLOG(3) << "convert a fluid fc op to tensorrt fc layer without bias"; framework::OpDesc op_desc(op, nullptr); PADDLE_ENFORCE_EQ(op_desc.Input("X").size(), 1); @@ -85,10 +85,10 @@ class FcOpConverter : public OpConverter { Y_t->dims()[0] * Y_t->dims()[1] * sizeof(float)); TensorRTEngine::Weight weight{nvinfer1::DataType::kFLOAT, static_cast(weight_data), - Y_t->memory_size() / sizeof(float)}; + static_cast(Y_t->numel())}; TensorRTEngine::Weight tmp_weight(nvinfer1::DataType::kFLOAT, static_cast(tmp->data()), - Y_t->memory_size() / sizeof(float)); + static_cast(Y_t->numel())); weight.dims.assign({Y_t->dims()[0], Y_t->dims()[1]}); tmp_weight.dims = weight.dims; diff --git a/paddle/fluid/inference/tensorrt/convert/io_converter.h b/paddle/fluid/inference/tensorrt/convert/io_converter.h index 71c48e085d25d2bc6720d93735f661f9e3af7b40..5daa242f6ab802a50fa6105f0102b817b700f461 100644 --- a/paddle/fluid/inference/tensorrt/convert/io_converter.h +++ b/paddle/fluid/inference/tensorrt/convert/io_converter.h @@ -45,7 +45,7 @@ class EngineIOConverter { static void ConvertInput(const std::string& op_type, const LoDTensor& in, void* out, size_t max_size, cudaStream_t* stream) { PADDLE_ENFORCE(stream != nullptr); - auto* converter = Registry::Lookup( + auto* converter = Registry::Global().Lookup( op_type, "default" /* default_type */); PADDLE_ENFORCE_NOT_NULL(converter); converter->SetStream(stream); @@ -56,7 +56,7 @@ class EngineIOConverter { LoDTensor* out, size_t max_size, cudaStream_t* stream) { PADDLE_ENFORCE(stream != nullptr); - auto* converter = Registry::Lookup( + auto* converter = Registry::Global().Lookup( op_type, "default" /* default_type */); PADDLE_ENFORCE_NOT_NULL(converter); converter->SetStream(stream); @@ -69,12 +69,12 @@ class EngineIOConverter { cudaStream_t* stream_{nullptr}; }; -#define REGISTER_TENSORRT_IO_CONVERTER(op_type__, Converter__) \ - struct trt_io_##op_type__##_converter { \ - trt_io_##op_type__##_converter() { \ - Registry::Register(#op_type__); \ - } \ - }; \ +#define REGISTER_TENSORRT_IO_CONVERTER(op_type__, Converter__) \ + struct trt_io_##op_type__##_converter { \ + trt_io_##op_type__##_converter() { \ + Registry::Global().Register(#op_type__); \ + } \ + }; \ trt_io_##op_type__##_converter trt_io_##op_type__##_converter__; } // namespace tensorrt diff --git a/paddle/fluid/inference/tensorrt/convert/leaky_relu_op.cc b/paddle/fluid/inference/tensorrt/convert/leaky_relu_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..3f6ed04c46d70b1ab68b4c01ef0c908a1a8d1a19 --- /dev/null +++ b/paddle/fluid/inference/tensorrt/convert/leaky_relu_op.cc @@ -0,0 +1,95 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/inference/tensorrt/convert/op_converter.h" + +namespace paddle { +namespace inference { +namespace tensorrt { + +// LeakyRelu converter from fluid to tensorRT +class LeakyReluOpConverter : public OpConverter { + public: + void operator()(const framework::proto::OpDesc& op, + const framework::Scope& scope, bool test_mode) override { + VLOG(4) << "convert fluid leaky_relu op to tensorrt layer"; + + framework::OpDesc op_desc(op, nullptr); + // Declare inputs + int input_num = op_desc.Input("X").size(); + PADDLE_ENFORCE(input_num == 1); + auto* input = engine_->GetITensor(op_desc.Input("X")[0]); + // Get output + size_t output_num = op_desc.Output("Out").size(); + PADDLE_ENFORCE(output_num == 1); + // Get attrs + float alpha = boost::get(op_desc.GetAttr("alpha")); + + platform::CPUPlace place; + std::unique_ptr alpha_tensor( + new framework::LoDTensor()); + alpha_tensor->Resize(framework::make_ddim({2})); + float* alpha_data = alpha_tensor->mutable_data(place); + alpha_data[0] = alpha; + alpha_data[1] = 1.f - alpha; + // the leaky relu formula y = (x > 0) ? x : alpha * x is equal to + // y = alpha * x + (x > 0) ? (1 - alpha) * x : 0 + TensorRTEngine::Weight scale{nvinfer1::DataType::kFLOAT, &alpha_data[0], 1}; + TensorRTEngine::Weight shift{nvinfer1::DataType::kFLOAT, nullptr, 0}; + TensorRTEngine::Weight power{nvinfer1::DataType::kFLOAT, nullptr, 0}; + // y_scale = alpha * x + auto* scale_layer = TRT_ENGINE_ADD_LAYER( + engine_, Scale, *input, nvinfer1::ScaleMode::kUNIFORM, shift.get(), + scale.get(), power.get()); + PADDLE_ENFORCE(nullptr != scale_layer); + // y_relu = (x > 0) : x : 0 + auto* relu_layer = TRT_ENGINE_ADD_LAYER(engine_, Activation, *input, + nvinfer1::ActivationType::kRELU); + PADDLE_ENFORCE(nullptr != relu_layer); + // + TensorRTEngine::Weight sub_scale{nvinfer1::DataType::kFLOAT, &alpha_data[1], + 1}; + auto* scale_relu_layer = + TRT_ENGINE_ADD_LAYER(engine_, Scale, *(relu_layer->getOutput(0)), + nvinfer1::ScaleMode::kUNIFORM, shift.get(), + sub_scale.get(), power.get()); + PADDLE_ENFORCE(nullptr != scale_relu_layer); + auto* output_layer = + TRT_ENGINE_ADD_LAYER(engine_, ElementWise, *(scale_layer->getOutput(0)), + *(scale_relu_layer->getOutput(0)), + nvinfer1::ElementWiseOperation::kSUM); + PADDLE_ENFORCE(nullptr != output_layer); + // keep alpha tensor to avoid release it's memory + std::string alpha_name = op_desc.Output("Out")[0] + "_alpha"; + PADDLE_ENFORCE(engine_->weight_map.find(alpha_name) == + engine_->weight_map.end()); + engine_->weight_map[alpha_name] = std::move(alpha_tensor); + + std::string layer_name = "leaky_relu (Output: "; + auto output_name = op_desc.Output("Out")[0]; + output_layer->getOutput(0)->setName(output_name.c_str()); + engine_->SetITensor(output_name, output_layer->getOutput(0)); + layer_name += output_name; + if (test_mode) { + engine_->DeclareOutput(output_name); + } + output_layer->setName((layer_name + ")").c_str()); + } +}; + +} // namespace tensorrt +} // namespace inference +} // namespace paddle + +REGISTER_TRT_OP_CONVERTER(leaky_relu, LeakyReluOpConverter); diff --git a/paddle/fluid/inference/tensorrt/convert/mul_op.cc b/paddle/fluid/inference/tensorrt/convert/mul_op.cc index 514eb659a8da73b6e56b5d17148ec0cb2aeaa135..5b6aaad49833cedbd8d1ee0ec5d24c7f983190e6 100644 --- a/paddle/fluid/inference/tensorrt/convert/mul_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/mul_op.cc @@ -25,7 +25,7 @@ class MulOpConverter : public OpConverter { public: void operator()(const framework::proto::OpDesc& op, const framework::Scope& scope, bool test_mode) override { - VLOG(4) << "convert a fluid mul op to tensorrt mul layer without bias"; + VLOG(3) << "convert a fluid mul op to tensorrt mul layer without bias"; framework::OpDesc op_desc(op, nullptr); // Declare inputs diff --git a/paddle/fluid/inference/tensorrt/convert/op_converter.h b/paddle/fluid/inference/tensorrt/convert/op_converter.h index d309d94c560f2b484fac6b6cd40cc2704d641069..55515569ead6e40c9b1b45fe31189dab7e2f2bb4 100644 --- a/paddle/fluid/inference/tensorrt/convert/op_converter.h +++ b/paddle/fluid/inference/tensorrt/convert/op_converter.h @@ -16,9 +16,12 @@ limitations under the License. */ #include #include +#include +#include #include "paddle/fluid/framework/block_desc.h" #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/inference/analysis/helper.h" #include "paddle/fluid/inference/tensorrt/engine.h" #include "paddle/fluid/inference/utils/singleton.h" @@ -26,6 +29,37 @@ namespace paddle { namespace inference { namespace tensorrt { +using FluidDT = framework::proto::VarType_Type; +using TRT_DT = nvinfer1::DataType; + +namespace { // NOLINT + +TRT_DT FluidDataType2TRT(FluidDT type) { + switch (type) { + case FluidDT::VarType_Type_FP32: + return TRT_DT::kFLOAT; + case FluidDT::VarType_Type_INT32: + return TRT_DT::kINT32; + default: + return TRT_DT::kINT32; + } + PADDLE_THROW("unkown type"); + return TRT_DT::kINT32; +} + +nvinfer1::Dims Vec2TRT_Dims(const std::vector& shape) { + PADDLE_ENFORCE_GT(shape.size(), 1UL, + "TensorRT' tensor input requires at least 2 dimensions"); + PADDLE_ENFORCE_LE(shape.size(), 4UL, + "TensorRT' tensor input requires at most 4 dimensions"); + PADDLE_ENFORCE(shape.size() == 4UL || shape.size() == 2UL); + if (shape.size() == 4UL) + return nvinfer1::DimsCHW(shape[1], shape[2], shape[3]); + return nvinfer1::DimsCHW(shape[1], 1, 1); +} + +} // namespace // NOLINT + /* * Convert Op from Fluid to TensorRT Engine. */ @@ -52,7 +86,7 @@ class OpConverter { PADDLE_ENFORCE_EQ(op_desc.Input("Y").size(), 1UL); std::string Y = op_desc.Input("Y")[0]; if (parameters.count(Y)) { - it = Registry::Lookup("fc"); + it = Registry::Global().Lookup("fc"); } } if (op_desc.Type().find("elementwise") != std::string::npos) { @@ -61,7 +95,7 @@ class OpConverter { // TODO(xingzhaolong): all mul, sub, div // static std::unordered_set add_weight_op_set {"add", "mul", // "sub", "div"}; - static std::unordered_set add_weight_op_set{"add"}; + static std::unordered_set add_weight_op_set{"add", "mul"}; PADDLE_ENFORCE_EQ(op_desc.Input("Y").size(), 1UL); int op_type_len = op_desc.Type().size(); std::string op_type = op_desc.Type().substr(op_type_len - 3, op_type_len); @@ -69,28 +103,28 @@ class OpConverter { if (parameters.count(Y)) { PADDLE_ENFORCE(add_weight_op_set.count(op_type) > 0, "Unsupported elementwise type" + op_type); - it = - Registry::Lookup("elementwise_" + op_type + "_weight"); + it = Registry::Global().Lookup("elementwise_" + op_type + + "_weight"); PADDLE_ENFORCE_NOT_NULL(it, "no OpConverter for optype [%s]", op_desc.Type()); } else { PADDLE_ENFORCE(add_tensor_op_set.count(op_type) > 0, "Unsupported elementwise type" + op_type); - it = - Registry::Lookup("elementwise_" + op_type + "_tensor"); + it = Registry::Global().Lookup("elementwise_" + op_type + + "_tensor"); } PADDLE_ENFORCE_NOT_NULL(it, "no OpConverter for optype [%s]", op_desc.Type()); } if (op_desc.Type() == "depthwise_conv2d") { - it = Registry::Lookup("conv2d"); + it = Registry::Global().Lookup("conv2d"); PADDLE_ENFORCE_NOT_NULL(it, "no OpConverter for optype [%s]", op_desc.Type()); } if (!it) { - it = Registry::Lookup(op_desc.Type()); + it = Registry::Global().Lookup(op_desc.Type()); } PADDLE_ENFORCE_NOT_NULL(it, "no OpConverter for optype [%s]", op_desc.Type()); @@ -103,12 +137,41 @@ class OpConverter { void ConvertBlock(const framework::proto::BlockDesc& block, const std::unordered_set& parameters, const framework::Scope& scope, TensorRTEngine* engine) { + std::unique_lock lk(mut_); for (int i = 0; i < block.ops_size(); i++) { const auto& op = block.ops(i); ConvertOp(op, parameters, scope, engine); } } + // The scope here should be inited with the parameter vars. + void ConvertBlockToTRTEngine( + framework::BlockDesc* block_desc, const framework::Scope& scope, + const std::vector& inputs, + const std::unordered_set& parameters, + const std::vector& outputs, TensorRTEngine* engine) { + engine->InitNetwork(); + for (auto& input : inputs) { + if (parameters.count(input)) continue; + auto* var = block_desc->FindVar(input); + PADDLE_ENFORCE(var, "no variable called %s", input); + PADDLE_ENFORCE_EQ(var->GetType(), FluidDT::VarType_Type_LOD_TENSOR, + "TensorRT engine only takes LoDTensor as input"); + auto var_shape = var->GetShape(); + + engine->DeclareInput( + input, FluidDataType2TRT( + var->Proto()->type().lod_tensor().tensor().data_type()), + Vec2TRT_Dims(var_shape)); + } + framework::proto::BlockDesc* block_proto = block_desc->Proto(); + ConvertBlock(*block_proto, parameters, scope, engine); + for (auto& output : outputs) { + engine->DeclareOutput(output); + } + engine->FreezeNetwork(); + } + void SetEngine(TensorRTEngine* engine) { engine_ = engine; } virtual ~OpConverter() {} @@ -125,6 +188,7 @@ class OpConverter { std::unordered_map converters_; // fluid inference scope framework::Scope* scope_{nullptr}; + std::mutex mut_; }; } // namespace tensorrt @@ -134,9 +198,9 @@ class OpConverter { #define REGISTER_TRT_OP_CONVERTER(op_type__, Converter__) \ struct trt_##op_type__##_converter : public ::paddle::framework::Registrar { \ trt_##op_type__##_converter() { \ - ::paddle::inference:: \ - Registry::Register< \ - ::paddle::inference::tensorrt::Converter__>(#op_type__); \ + ::paddle::inference::Registry< \ + paddle::inference::tensorrt::OpConverter>::Global() \ + .Register<::paddle::inference::tensorrt::Converter__>(#op_type__); \ } \ }; \ trt_##op_type__##_converter trt_##op_type__##_converter__; \ diff --git a/paddle/fluid/inference/tensorrt/convert/pad_op.cc b/paddle/fluid/inference/tensorrt/convert/pad_op.cc index 218030a591fcc7e533ef37062265449d4b6044bc..4afcb0aecec9d07b52d2fd701fae8750067a6041 100644 --- a/paddle/fluid/inference/tensorrt/convert/pad_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/pad_op.cc @@ -25,7 +25,7 @@ class PadOpConverter : public OpConverter { public: void operator()(const framework::proto::OpDesc& op, const framework::Scope& scope, bool test_mode) override { - VLOG(4) << "convert a fluid transpose op to tensorrt tranpose layer"; + VLOG(3) << "convert a fluid transpose op to tensorrt tranpose layer"; framework::OpDesc op_desc(op, nullptr); // Declare inputs diff --git a/paddle/fluid/inference/tensorrt/convert/pool2d_op.cc b/paddle/fluid/inference/tensorrt/convert/pool2d_op.cc index f9bb66a6e9f81a10368db7710108c319860e940a..1d0d83d1f368f879878a4df8b2eefae0bc89423d 100644 --- a/paddle/fluid/inference/tensorrt/convert/pool2d_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/pool2d_op.cc @@ -13,25 +13,57 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/inference/tensorrt/convert/op_converter.h" +#include "paddle/fluid/inference/tensorrt/plugin/avg_pool_op_plugin.h" namespace paddle { namespace inference { namespace tensorrt { +void DealCeilMode(const nvinfer1::Dims &input_shape, std::vector ksize, + std::vector strides, std::vector paddings, + nvinfer1::DimsHW *pre_pad, nvinfer1::DimsHW *post_pad, + int input_dims) { + int input_height = input_shape.d[input_dims - 2]; + int input_width = input_shape.d[input_dims - 1]; + int floor_h_output_size = + (input_height - ksize[0] + 2 * paddings[0]) / strides[0] + 1; + int ceil_h_output_size = + (input_height - ksize[0] + 2 * paddings[0] + strides[0] - 1) / + strides[0] + + 1; + + int floor_w_output_size = + (input_width - ksize[1] + 2 * paddings[1]) / strides[1] + 1; + int ceil_w_output_size = + (input_width - ksize[1] + 2 * paddings[1] + strides[1] - 1) / strides[1] + + 1; + if (floor_h_output_size != ceil_h_output_size) { + post_pad->h() = strides[0] - 1; + } + + if (floor_w_output_size != ceil_w_output_size) { + post_pad->w() = strides[1] - 1; + } +} + /* * Pool2dOp, IPoolingLayer in TRT. This Layer doesn't has weights. */ class Pool2dOpConverter : public OpConverter { public: - void operator()(const framework::proto::OpDesc& op, - const framework::Scope& scope, bool test_mode) override { + void operator()(const framework::proto::OpDesc &op, + const framework::Scope &scope, bool test_mode) override { VLOG(4) << "convert a fluid pool2d op to tensorrt pool2d layer without bias"; framework::OpDesc op_desc(op, nullptr); // Declare inputs PADDLE_ENFORCE_EQ(op_desc.Input("X").size(), 1); PADDLE_ENFORCE_EQ(op_desc.Output("Out").size(), 1); - auto* input1 = engine_->GetITensor(op_desc.Input("X")[0]); + auto *input1 = engine_->GetITensor(op_desc.Input("X")[0]); + nvinfer1::Dims input_shape = input1->getDimensions(); + int input_dims = input_shape.nbDims; + + PADDLE_ENFORCE_EQ(input_dims, 3UL); bool global_pooling = boost::get(op_desc.GetAttr("global_pooling")); std::string pool_type = @@ -42,18 +74,7 @@ class Pool2dOpConverter : public OpConverter { boost::get>(op_desc.GetAttr("strides")); std::vector paddings = boost::get>(op_desc.GetAttr("paddings")); - - nvinfer1::DimsHW nv_ksize(ksize[0], ksize[1]); - if (global_pooling == true) { - nvinfer1::Dims input_shape = input1->getDimensions(); - int nbDims = input_shape.nbDims; - nv_ksize.d[0] = input_shape.d[nbDims - 2]; - nv_ksize.d[1] = input_shape.d[nbDims - 1]; - } - const nvinfer1::DimsHW nv_strides(strides[0], strides[1]); - const nvinfer1::DimsHW nv_paddings(paddings[0], paddings[1]); - - PADDLE_ENFORCE_EQ(input1->getDimensions().nbDims, 3UL); + bool ceil_mode = boost::get(op_desc.GetAttr("ceil_mode")); nvinfer1::PoolingType nv_pool_type = nvinfer1::PoolingType::kMAX; if (pool_type == "max") { @@ -64,12 +85,67 @@ class Pool2dOpConverter : public OpConverter { PADDLE_THROW("TensorRT unsupported pooling type!"); } - auto* layer = TRT_ENGINE_ADD_LAYER(engine_, Pooling, - *const_cast(input1), - nv_pool_type, nv_ksize); - PADDLE_ENFORCE_NOT_NULL(layer, "pool layer could not be created."); - layer->setStride(nv_strides); - layer->setPadding(nv_paddings); + nvinfer1::DimsHW nv_ksize(ksize[0], ksize[1]); + nvinfer1::DimsHW nv_strides(strides[0], strides[1]); + nvinfer1::DimsHW nv_paddings(paddings[0], paddings[1]); + + nvinfer1::ILayer *layer = nullptr; + + if (global_pooling == true) { + nv_ksize.d[0] = input_shape.d[input_dims - 2]; + nv_ksize.d[1] = input_shape.d[input_dims - 1]; + auto *layer = TRT_ENGINE_ADD_LAYER( + engine_, Pooling, *const_cast(input1), + nv_pool_type, nv_ksize); + PADDLE_ENFORCE_NOT_NULL(layer, "pool layer could not be created."); + auto output_name = op_desc.Output("Out")[0]; + layer->setName(("pool2d (Output: " + output_name + ")").c_str()); + layer->getOutput(0)->setName(output_name.c_str()); + engine_->SetITensor(output_name, layer->getOutput(0)); + if (test_mode) { + engine_->DeclareOutput(output_name); + } + return; + } + + if (pool_type == "max") { + // Under ceil mode, the pre_pad and post_pad are used to + // record the the padding size. In some ceil mode cases, + // we do not need padding, so we initialize the two vars to 0. + + nvinfer1::DimsHW pre_pad(0, 0); + nvinfer1::DimsHW post_pad(0, 0); + if (ceil_mode) { + // If ceil mode is true, we will pad the appropriate size to the input. + DealCeilMode(input_shape, ksize, strides, paddings, &pre_pad, &post_pad, + input_dims); + auto *pad_layer = TRT_ENGINE_ADD_LAYER( + engine_, Padding, *const_cast(input1), pre_pad, + post_pad); + PADDLE_ENFORCE_NOT_NULL( + pad_layer, "pad layer in poolOp converter could not be created."); + input1 = pad_layer->getOutput(0); + } + auto *pool_layer = TRT_ENGINE_ADD_LAYER( + engine_, Pooling, *const_cast(input1), + nv_pool_type, nv_ksize); + PADDLE_ENFORCE_NOT_NULL(pool_layer, "pool layer could not be created."); + pool_layer->setStride(nv_strides); + pool_layer->setPadding(nv_paddings); + layer = pool_layer; + } else { + // Average pooling needs to exclude the padding pixels from the average + // mean. + // It is not supported well by TRT, we use a plugin here. + std::vector input_shape_v; + for (int i = 0; i < input_dims; i++) { + input_shape_v.push_back(input_shape.d[i]); + } + plugin::AvgPoolPlugin *plugin = new plugin::AvgPoolPlugin( + ceil_mode, ksize, strides, paddings, input_shape_v); + auto *avg_pool_layer = engine_->AddPlugin(&input1, 1, plugin); + layer = avg_pool_layer; + } auto output_name = op_desc.Output("Out")[0]; layer->setName(("pool2d (Output: " + output_name + ")").c_str()); diff --git a/paddle/fluid/inference/tensorrt/convert/prelu_op.cc b/paddle/fluid/inference/tensorrt/convert/prelu_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..2ae804106e5f7b51fc43e33cad986619e6a57d74 --- /dev/null +++ b/paddle/fluid/inference/tensorrt/convert/prelu_op.cc @@ -0,0 +1,77 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/inference/tensorrt/convert/op_converter.h" +#include "paddle/fluid/inference/tensorrt/plugin/prelu_op_plugin.h" + +namespace paddle { +namespace inference { +namespace tensorrt { + +/* + * PRelu converter from fluid to tensorRT. + */ +class PReluOpConverter : public OpConverter { + public: + void operator()(const framework::proto::OpDesc& op, + const framework::Scope& scope, bool test_mode) override { + VLOG(4) << "convert fluid prelu op to tensorrt prelu layer"; + + framework::OpDesc op_desc(op, nullptr); + // Declare inputs + int input_num = op_desc.Input("X").size(); + PADDLE_ENFORCE(input_num == 1); + auto* input = engine_->GetITensor(op_desc.Input("X")[0]); + // Get output + size_t output_num = op_desc.Output("Out").size(); + PADDLE_ENFORCE(output_num == 1); + // Get attrs + std::string mode = boost::get(op_desc.GetAttr("mode")); + // + auto* alpha_var = scope.FindVar(op_desc.Input("Alpha")[0]); + PADDLE_ENFORCE_NOT_NULL(alpha_var); + auto* alpha_tensor = alpha_var->GetMutable(); + + platform::CPUPlace cpu_place; + std::unique_ptr alpha_tensor_temp( + new framework::LoDTensor()); + alpha_tensor_temp->Resize(alpha_tensor->dims()); + TensorCopySync(*alpha_tensor, cpu_place, alpha_tensor_temp.get()); + float* alpha_data = alpha_tensor_temp->mutable_data(cpu_place); + + plugin::PReluPlugin* plugin = + new plugin::PReluPlugin(alpha_data, alpha_tensor_temp->numel(), mode); + nvinfer1::IPluginLayer* layer = + engine_->AddPlugin(&input, input_num, plugin); + // keep alpha tensor to avoid release it's memory + engine_->weight_map[op_desc.Input("Alpha")[0]] = + std::move(alpha_tensor_temp); + + std::string layer_name = "prelu (Output: "; + auto output_name = op_desc.Output("Out")[0]; + layer->getOutput(0)->setName(output_name.c_str()); + engine_->SetITensor(output_name, layer->getOutput(0)); + layer_name += output_name; + if (test_mode) { + engine_->DeclareOutput(output_name); + } + layer->setName((layer_name + ")").c_str()); + } +}; + +} // namespace tensorrt +} // namespace inference +} // namespace paddle + +REGISTER_TRT_OP_CONVERTER(prelu, PReluOpConverter); diff --git a/paddle/fluid/inference/tensorrt/convert/softmax_op.cc b/paddle/fluid/inference/tensorrt/convert/softmax_op.cc index 0064f90fd7944403c14d4d47616ea82f681ceb74..80bfb2d190a5637032e7c18fbac7f22b3a9e81e1 100644 --- a/paddle/fluid/inference/tensorrt/convert/softmax_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/softmax_op.cc @@ -25,7 +25,7 @@ class SoftMaxOpConverter : public OpConverter { public: void operator()(const framework::proto::OpDesc& op, const framework::Scope& scope, bool test_mode) override { - VLOG(4) + VLOG(3) << "convert a fluid softmax op to tensorrt softmax layer without bias"; framework::OpDesc op_desc(op, nullptr); // Declare inputs diff --git a/paddle/fluid/inference/tensorrt/convert/split_op.cc b/paddle/fluid/inference/tensorrt/convert/split_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..ae5b1b98060a4e73b2d1761d4edafb152f364070 --- /dev/null +++ b/paddle/fluid/inference/tensorrt/convert/split_op.cc @@ -0,0 +1,67 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/inference/tensorrt/convert/op_converter.h" +#include "paddle/fluid/inference/tensorrt/plugin/split_op_plugin.h" + +namespace paddle { +namespace inference { +namespace tensorrt { + +class SplitOpConverter : public OpConverter { + public: + void operator()(const framework::proto::OpDesc& op, + const framework::Scope& scope, bool test_mode) override { + VLOG(4) << "convert a fluid split op to tensorrt split layer"; + + framework::OpDesc op_desc(op, nullptr); + // Declare inputs + auto* input = engine_->GetITensor(op_desc.Input("X")[0]); + auto input_dims = input->getDimensions(); + int input_num = op_desc.Input("X").size(); + size_t output_num = op_desc.Output("Out").size(); + + // Get Attrs + PADDLE_ENFORCE(input_num == 1); + int axis = boost::get(op_desc.GetAttr("axis")); + std::vector output_lengths = + boost::get>(op_desc.GetAttr("sections")); + // split on batch is not supported in TensorRT + PADDLE_ENFORCE(axis != 0); + axis += (axis < 0) ? input_dims.nbDims : -1; + + PADDLE_ENFORCE(output_lengths.size() == output_num); + plugin::SplitPlugin* plugin = new plugin::SplitPlugin(axis, output_lengths); + nvinfer1::IPluginLayer* layer = + engine_->AddPlugin(&input, input_num, plugin); + + std::string layer_name = "split (Output: "; + for (size_t i = 0; i < output_num; i++) { + auto output_name = op_desc.Output("Out")[i]; + layer->getOutput(i)->setName(output_name.c_str()); + engine_->SetITensor(output_name, layer->getOutput(i)); + layer_name += output_name; + if (test_mode) { + engine_->DeclareOutput(output_name); + } + } + layer->setName((layer_name + ")").c_str()); + } +}; + +} // namespace tensorrt +} // namespace inference +} // namespace paddle + +REGISTER_TRT_OP_CONVERTER(split, SplitOpConverter); diff --git a/paddle/fluid/inference/tensorrt/convert/test_conv2d_op.cc b/paddle/fluid/inference/tensorrt/convert/test_conv2d_op.cc index f8711c6b60d74639529624c25429bc245de46479..95916746d6fcb528d26a8f8bb39980b55c4f3704 100644 --- a/paddle/fluid/inference/tensorrt/convert/test_conv2d_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/test_conv2d_op.cc @@ -16,6 +16,9 @@ limitations under the License. */ #include "paddle/fluid/inference/tensorrt/convert/op_converter.h" #include "paddle/fluid/inference/tensorrt/convert/ut_helper.h" +USE_OP(conv2d); +USE_OP(conv2d_transpose); + namespace paddle { namespace inference { namespace tensorrt { @@ -51,7 +54,37 @@ TEST(conv2d_op, test) { validator.Execute(3); } +TEST(conv2d_transpose_op, test) { + std::unordered_set parameters({"deconv2d-Y"}); + framework::Scope scope; + TRTConvertValidation validator(5, parameters, scope, 1 << 15); + + validator.DeclInputVar("deconv2d-X", nvinfer1::Dims3(3, 5, 5)); + validator.DeclParamVar("deconv2d-Y", nvinfer1::Dims4(3, 2, 3, 3)); + validator.DeclOutputVar("deconv2d-Out", nvinfer1::Dims3(2, 5, 5)); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("conv2d_transpose"); + desc.SetInput("Input", {"deconv2d-X"}); + desc.SetInput("Filter", {"deconv2d-Y"}); + desc.SetOutput("Output", {"deconv2d-Out"}); + + const std::vector strides({1, 1}); + const std::vector paddings({1, 1}); + const std::vector dilations({1, 1}); + const int groups = 1; + + desc.SetAttr("strides", strides); + desc.SetAttr("paddings", paddings); + desc.SetAttr("dilations", dilations); + desc.SetAttr("groups", groups); + + validator.SetOp(*desc.Proto()); + + validator.Execute(3); +} + } // namespace tensorrt } // namespace inference } // namespace paddle -USE_OP(conv2d); diff --git a/paddle/fluid/inference/tensorrt/convert/test_elementwise_op.cc b/paddle/fluid/inference/tensorrt/convert/test_elementwise_op.cc index 7537d02a35b66a41c158cd8eb1b1e5d4107e7d84..cc967464a5f29151a061e99cda6870f9f370ec1b 100644 --- a/paddle/fluid/inference/tensorrt/convert/test_elementwise_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/test_elementwise_op.cc @@ -20,13 +20,12 @@ namespace paddle { namespace inference { namespace tensorrt { -TEST(elementwise_op, add_weight_test) { +TEST(elementwise_op, add_weight) { std::unordered_set parameters({"elementwise_add-Y"}); framework::Scope scope; TRTConvertValidation validator(10, parameters, scope, 1 << 15); validator.DeclInputVar("elementwise_add-X", nvinfer1::DimsCHW(10, 3, 3)); validator.DeclParamVar("elementwise_add-Y", nvinfer1::Dims3(10, 1, 1)); - // validator.DeclParamVar("mul-Y", nvinfer1::Dims2(8, 2)); validator.DeclOutputVar("elementwise_add-Out", nvinfer1::DimsCHW(10, 3, 3)); // Prepare Op description @@ -44,30 +43,65 @@ TEST(elementwise_op, add_weight_test) { validator.Execute(8); } -TEST(elementwise_op, add_tensor_test) { - std::unordered_set parameters; - framework::Scope scope; - TRTConvertValidation validator(8, parameters, scope, 1 << 15); - validator.DeclInputVar("elementwise_add-X", nvinfer1::DimsCHW(10, 3, 3)); - validator.DeclInputVar("elementwise_add-Y", nvinfer1::Dims3(10, 3, 3)); - // validator.DeclParamVar("mul-Y", nvinfer1::Dims2(8, 2)); - validator.DeclOutputVar("elementwise_add-Out", nvinfer1::DimsCHW(10, 3, 3)); - - // Prepare Op description - framework::OpDesc desc; - desc.SetType("elementwise_add"); - desc.SetInput("X", {"elementwise_add-X"}); - desc.SetInput("Y", {"elementwise_add-Y"}); - desc.SetOutput("Out", {"elementwise_add-Out"}); - - // the defalut axis of elementwise op is -1 - - validator.SetOp(*desc.Proto()); +TEST(elementwise_op, native) { + for (std::string type : {"add", "mul"}) { + int batch_size = 8; + std::unordered_set parameters; + framework::Scope scope; + TRTConvertValidation validator(batch_size, parameters, scope, 1 << 15); + validator.DeclInputVar("elementwise_" + type + "-X", + nvinfer1::DimsCHW(10, 3, 3)); + validator.DeclInputVar("elementwise_" + type + "-Y", + nvinfer1::Dims3(10, 3, 3)); + validator.DeclOutputVar("elementwise_" + type + "-Out", + nvinfer1::DimsCHW(10, 3, 3)); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("elementwise_" + type); + desc.SetInput("X", {"elementwise_" + type + "-X"}); + desc.SetInput("Y", {"elementwise_" + type + "-Y"}); + desc.SetOutput("Out", {"elementwise_" + type + "-Out"}); + + int axis = -1; + desc.SetAttr("axis", axis); + + validator.SetOp(*desc.Proto()); + validator.Execute(batch_size); + } +} - validator.Execute(8); +TEST(elementwise_op, plugin) { + for (std::string type : {"add", "mul"}) { + int batch_size = 8; + std::unordered_set parameters; + framework::Scope scope; + TRTConvertValidation validator(batch_size, parameters, scope, 1 << 15); + validator.DeclInputVar("elementwise_" + type + "-X", + nvinfer1::DimsCHW(10, 3, 3)); + validator.DeclInputVar("elementwise_" + type + "-Y", + nvinfer1::Dims3(10, 1, 1)); + validator.DeclOutputVar("elementwise_" + type + "-Out", + nvinfer1::DimsCHW(10, 3, 3)); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("elementwise_" + type); + desc.SetInput("X", {"elementwise_" + type + "-X"}); + desc.SetInput("Y", {"elementwise_" + type + "-Y"}); + desc.SetOutput("Out", {"elementwise_" + type + "-Out"}); + + int axis = -1; + desc.SetAttr("axis", axis); + + validator.SetOp(*desc.Proto()); + validator.Execute(batch_size); + } } } // namespace tensorrt } // namespace inference } // namespace paddle + USE_OP(elementwise_add); +USE_OP(elementwise_mul); diff --git a/paddle/fluid/inference/tensorrt/convert/test_leaky_relu_op.cc b/paddle/fluid/inference/tensorrt/convert/test_leaky_relu_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..d00826af075159004d3727a7519e7c319dbddb02 --- /dev/null +++ b/paddle/fluid/inference/tensorrt/convert/test_leaky_relu_op.cc @@ -0,0 +1,48 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "paddle/fluid/inference/tensorrt/convert/op_converter.h" +#include "paddle/fluid/inference/tensorrt/convert/ut_helper.h" + +namespace paddle { +namespace inference { +namespace tensorrt { + +TEST(leaky_relu_op, test_leaky_relu) { + std::unordered_set parameters; + framework::Scope scope; + TRTConvertValidation validator(10, parameters, scope, 1000); + validator.DeclInputVar("leaky_relu_input", nvinfer1::DimsCHW(3, 2, 2)); + validator.DeclOutputVar("leaky_relu_out", nvinfer1::DimsCHW(3, 2, 2)); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("leaky_relu"); + desc.SetInput("X", {"leaky_relu_input"}); + desc.SetOutput("Out", {"leaky_relu_out"}); + + desc.SetAttr("alpha", 0.1f); + + validator.SetOp(*desc.Proto()); + + validator.Execute(1); +} + +} // namespace tensorrt +} // namespace inference +} // namespace paddle + +// USE_OP(leaky_relu); +USE_OP(leaky_relu); diff --git a/paddle/fluid/inference/tensorrt/convert/test_mul_op.cc b/paddle/fluid/inference/tensorrt/convert/test_mul_op.cc index 3d34cd7d5d0deca4d83a3f5b5ed0fb396c6acd56..282f53559aa75b2c7c252450e392e1996f9b1d81 100644 --- a/paddle/fluid/inference/tensorrt/convert/test_mul_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/test_mul_op.cc @@ -1,16 +1,16 @@ /* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at - http://www.apache.org/licenses/LICENSE-2.0 + http://www.apache.org/licenses/LICENSE-2.0 - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. */ +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ #include #include "paddle/fluid/framework/op_registry.h" diff --git a/paddle/fluid/inference/tensorrt/convert/test_op_converter.cc b/paddle/fluid/inference/tensorrt/convert/test_op_converter.cc index 01d7f700da9cc67d0ebbd3d9649e3823f58a8811..c5a413221ebff6b9be114151dbb93fd23a148440 100644 --- a/paddle/fluid/inference/tensorrt/convert/test_op_converter.cc +++ b/paddle/fluid/inference/tensorrt/convert/test_op_converter.cc @@ -29,9 +29,9 @@ TEST(OpConverter, ConvertBlock) { // init trt engine cudaStream_t stream_; std::unique_ptr engine_; - engine_.reset(new TensorRTEngine(5, 1 << 15, &stream_)); - engine_->InitNetwork(); PADDLE_ENFORCE_EQ(cudaStreamCreate(&stream_), 0); + engine_.reset(new TensorRTEngine(5, 1 << 15, stream_)); + engine_->InitNetwork(); engine_->DeclareInput("conv2d-X", nvinfer1::DataType::kFLOAT, nvinfer1::Dims3(2, 5, 5)); diff --git a/paddle/fluid/inference/tensorrt/convert/test_pool2d_op.cc b/paddle/fluid/inference/tensorrt/convert/test_pool2d_op.cc index aedd6b62df040eeee4e48f628128511cd8bf4439..bded833505cd25352adc4123de415613d1fc926d 100644 --- a/paddle/fluid/inference/tensorrt/convert/test_pool2d_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/test_pool2d_op.cc @@ -20,18 +20,21 @@ namespace paddle { namespace inference { namespace tensorrt { -void test_pool2d(bool global_pooling) { +void test_pool2d(bool global_pooling, bool ceil_mode, + std::string pool_type = "max") { framework::Scope scope; std::unordered_set parameters; TRTConvertValidation validator(5, parameters, scope, 1 << 15); // The ITensor's Dims should not contain the batch size. // So, the ITensor's Dims of input and output should be C * H * W. - validator.DeclInputVar("pool2d-X", nvinfer1::Dims3(3, 4, 4)); + validator.DeclInputVar("pool2d-X", nvinfer1::Dims3(3, 6, 7)); if (global_pooling) validator.DeclOutputVar("pool2d-Out", nvinfer1::Dims3(3, 1, 1)); + else if (ceil_mode) + validator.DeclOutputVar("pool2d-Out", nvinfer1::Dims3(3, 3, 4)); else - validator.DeclOutputVar("pool2d-Out", nvinfer1::Dims3(3, 2, 2)); + validator.DeclOutputVar("pool2d-Out", nvinfer1::Dims3(3, 3, 3)); // Prepare Op description framework::OpDesc desc; @@ -42,13 +45,14 @@ void test_pool2d(bool global_pooling) { std::vector ksize({2, 2}); std::vector strides({2, 2}); std::vector paddings({0, 0}); - std::string pooling_t = "max"; + std::string pooling_t = pool_type; desc.SetAttr("pooling_type", pooling_t); desc.SetAttr("ksize", ksize); desc.SetAttr("strides", strides); desc.SetAttr("paddings", paddings); desc.SetAttr("global_pooling", global_pooling); + desc.SetAttr("ceil_mode", ceil_mode); LOG(INFO) << "set OP"; validator.SetOp(*desc.Proto()); @@ -57,9 +61,11 @@ void test_pool2d(bool global_pooling) { validator.Execute(3); } -TEST(Pool2dOpConverter, normal) { test_pool2d(false); } +TEST(Pool2dOpConverter, normal) { test_pool2d(false, false); } +TEST(Pool2dOpConverter, test_global_pooling) { test_pool2d(true, false); } -TEST(Pool2dOpConverter, test_global_pooling) { test_pool2d(true); } +TEST(Pool2dOpConverter, max_ceil_test) { test_pool2d(false, true); } +TEST(Pool2dOpConverter, avg_ceil_test) { test_pool2d(false, true, "avg"); } } // namespace tensorrt } // namespace inference diff --git a/paddle/fluid/inference/tensorrt/convert/test_prelu_op.cc b/paddle/fluid/inference/tensorrt/convert/test_prelu_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..b086c910d38a243d98315f2d6eb82ecc0ec5c06d --- /dev/null +++ b/paddle/fluid/inference/tensorrt/convert/test_prelu_op.cc @@ -0,0 +1,93 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "paddle/fluid/inference/tensorrt/convert/op_converter.h" +#include "paddle/fluid/inference/tensorrt/convert/ut_helper.h" + +namespace paddle { +namespace inference { +namespace tensorrt { + +TEST(prelu_op, test_channel_wise) { + std::unordered_set parameters({"prelu_alpha"}); + framework::Scope scope; + TRTConvertValidation validator(10, parameters, scope, 1000); + validator.DeclInputVar("prelu_input", nvinfer1::DimsCHW(3, 2, 2)); + validator.DeclParamVar("prelu_alpha", nvinfer1::Dims3(3, 1, 1)); + validator.DeclOutputVar("prelu_out", nvinfer1::DimsCHW(3, 2, 2)); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("prelu"); + desc.SetInput("X", {"prelu_input"}); + desc.SetInput("Alpha", {"prelu_alpha"}); + desc.SetOutput("Out", {"prelu_out"}); + + desc.SetAttr("mode", std::string("channel")); + + validator.SetOp(*desc.Proto()); + + validator.Execute(1); +} + +TEST(prelu_op, test_element_wise) { + std::unordered_set parameters({"prelu_alpha"}); + framework::Scope scope; + TRTConvertValidation validator(10, parameters, scope, 1000); + validator.DeclInputVar("prelu_input", nvinfer1::DimsCHW(3, 2, 2)); + validator.DeclParamVar("prelu_alpha", nvinfer1::Dims4(10, 3, 2, 2)); + validator.DeclOutputVar("prelu_out", nvinfer1::DimsCHW(3, 2, 2)); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("prelu"); + desc.SetInput("X", {"prelu_input"}); + desc.SetInput("Alpha", {"prelu_alpha"}); + desc.SetOutput("Out", {"prelu_out"}); + + desc.SetAttr("mode", std::string("element")); + + validator.SetOp(*desc.Proto()); + + validator.Execute(1); +} + +TEST(prelu_op, test_scalar) { + std::unordered_set parameters({"prelu_alpha"}); + framework::Scope scope; + TRTConvertValidation validator(10, parameters, scope, 1000); + validator.DeclInputVar("prelu_input", nvinfer1::DimsCHW(3, 2, 2)); + validator.DeclParamVar("prelu_alpha", nvinfer1::Dims3(1, 1, 1)); + validator.DeclOutputVar("prelu_out", nvinfer1::DimsCHW(3, 2, 2)); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("prelu"); + desc.SetInput("X", {"prelu_input"}); + desc.SetInput("Alpha", {"prelu_alpha"}); + desc.SetOutput("Out", {"prelu_out"}); + + desc.SetAttr("mode", std::string("all")); + + validator.SetOp(*desc.Proto()); + + validator.Execute(1); +} + +} // namespace tensorrt +} // namespace inference +} // namespace paddle + +USE_OP(prelu); diff --git a/paddle/fluid/inference/tensorrt/convert/test_split_op.cc b/paddle/fluid/inference/tensorrt/convert/test_split_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..5aacc5c600dd1371e3865adc888bb8e24640e7d9 --- /dev/null +++ b/paddle/fluid/inference/tensorrt/convert/test_split_op.cc @@ -0,0 +1,115 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "paddle/fluid/inference/tensorrt/convert/op_converter.h" +#include "paddle/fluid/inference/tensorrt/convert/ut_helper.h" + +namespace paddle { +namespace inference { +namespace tensorrt { + +template +void TensorRTSplitTest(const std::vector &in_shape, + const std::vector §ions) { + std::unordered_set parameters({""}); + framework::Scope scope; + TRTConvertValidation validator(BatchSize + 1, parameters, scope, 10000); + + auto make_dim = [](const std::vector &shape) { + nvinfer1::DimsCHW dim; + dim.c() = shape[0]; + dim.h() = shape[1]; + dim.w() = shape[2]; + return dim; + }; + validator.DeclInputVar("split_input", make_dim(in_shape)); + std::vector output_vars; + for (size_t i = 0; i < sections.size(); ++i) { + auto out_shape = in_shape; + out_shape[Axis - 1] = sections[i]; + std::string output_name = "split_out" + std::to_string(i); + validator.DeclOutputVar(output_name, make_dim(out_shape)); + output_vars.push_back(output_name); + } + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("split"); + desc.SetInput("X", {"split_input"}); + desc.SetOutput("Out", output_vars); + + desc.SetAttr("axis", Axis); + desc.SetAttr("num", 0); + desc.SetAttr("sections", sections); + + validator.SetOp(*desc.Proto()); + + validator.Execute(BatchSize); +} + +// batch = 0, axis = 1, same shape +TEST(split_op, test_same_shape_axis1_batch1) { + TensorRTSplitTest<1, 1>({4, 2, 2}, {2, 2}); +} +// batch = 0, axis = 1, different shape +TEST(split_op, test_different_shape_axis1_batch1) { + TensorRTSplitTest<1, 1>({3, 2, 2}, {2, 1}); +} +// batch = 10, axis = 1, same shape +TEST(split_op, test_same_shape_axis1_batch10) { + TensorRTSplitTest<10, 1>({4, 2, 2}, {2, 2}); +} +// batch = 10, axis = 1, different shape +TEST(split_op, test_different_shape_axis1_batch10) { + TensorRTSplitTest<10, 1>({3, 2, 2}, {2, 1}); +} +// batch = 0, axis = 2, same shape +TEST(split_op, test_same_shape_axis2_batch1) { + TensorRTSplitTest<1, 2>({3, 4, 2}, {2, 2}); +} +// batch = 0, axis = 2, different shape +TEST(split_op, test_different_shape_axis2_batch1) { + TensorRTSplitTest<1, 2>({3, 3, 2}, {2, 1}); +} +// batch = 10, axis = 2, same shape +TEST(split_op, test_same_shape_axis2_batch10) { + TensorRTSplitTest<10, 2>({3, 4, 2}, {2, 2}); +} +// batch = 10, axis = 2, different shape +TEST(split_op, test_different_shape_axis2_batch10) { + TensorRTSplitTest<10, 2>({3, 3, 2}, {2, 1}); +} +// batch = 0, axis = 3, same shape +TEST(split_op, test_same_shape_axis3_batch1) { + TensorRTSplitTest<1, 3>({3, 2, 4}, {2, 2}); +} +// batch = 0, axis = 3, different shape +TEST(split_op, test_different_shape_axis3_batch1) { + TensorRTSplitTest<1, 3>({3, 2, 3}, {2, 1}); +} +// batch = 10, axis = 3, same shape +TEST(split_op, test_same_shape_axis3_batch10) { + TensorRTSplitTest<10, 3>({3, 2, 4}, {2, 2}); +} +// batch = 10, axis = 3, different shape +TEST(split_op, test_different_shape_axis3_batch10) { + TensorRTSplitTest<10, 3>({3, 2, 3}, {2, 1}); +} + +} // namespace tensorrt +} // namespace inference +} // namespace paddle + +USE_OP(split); diff --git a/paddle/fluid/inference/tensorrt/convert/ut_helper.h b/paddle/fluid/inference/tensorrt/convert/ut_helper.h index 0a6f171fc40a838fd81d6a51aca0430d5526f188..2571abbf69892dae626c7178609c2825775fdf2e 100644 --- a/paddle/fluid/inference/tensorrt/convert/ut_helper.h +++ b/paddle/fluid/inference/tensorrt/convert/ut_helper.h @@ -4,7 +4,7 @@ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at -http://www.apache.org/licenses/LICENSE-2.0 + http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, @@ -19,7 +19,9 @@ limitations under the License. */ #pragma once +#include #include +#include #include #include "paddle/fluid/framework/lod_tensor.h" @@ -78,11 +80,10 @@ class TRTConvertValidation { scope_(scope), if_add_batch_(if_add_batch), max_batch_size_(max_batch_size) { - // create engine. - engine_.reset(new TensorRTEngine(max_batch_size, workspace_size, &stream_)); - engine_->InitNetwork(); - PADDLE_ENFORCE_EQ(cudaStreamCreate(&stream_), 0); + engine_.reset( + new TensorRTEngine(max_batch_size, workspace_size, false, nullptr, 0)); + engine_->InitNetwork(); } // Declare a Variable as input with random initialization. @@ -116,13 +117,12 @@ class TRTConvertValidation { } void DeclVar(const std::string& name, const std::vector dim_vec) { - platform::CUDAPlace place; - platform::CUDADeviceContext ctx(place); + platform::CUDADeviceContext ctx(place_); auto* x = scope_.Var(name); auto* x_tensor = x->GetMutable(); x_tensor->Resize(framework::make_ddim(dim_vec)); - RandomizeTensor(x_tensor, place, ctx); + RandomizeTensor(x_tensor, place_, ctx); } // Declare a variable in a fluid Scope. void DeclVar(const std::string& name, const nvinfer1::Dims& dims, @@ -148,19 +148,6 @@ class TRTConvertValidation { // Declare outputs. op_desc_.reset(new framework::OpDesc(desc, nullptr)); - - // Set Inputs. - for (const auto& input : op_desc_->InputArgumentNames()) { - if (parameters_.count(input)) continue; - auto* var = scope_.FindVar(input); - PADDLE_ENFORCE(var); - auto tensor = var->GetMutable(); - - engine_->SetInputFromGPU( - input, static_cast(tensor->data()), - sizeof(float) * - analysis::AccuDims(tensor->dims(), tensor->dims().size())); - } } // We use the set 'neglected_output' here, because some Ops like batch norm, @@ -170,43 +157,71 @@ class TRTConvertValidation { std::unordered_set neglected_output = {}) { // Execute Fluid Op PADDLE_ENFORCE_LE(batch_size, max_batch_size_); - platform::CUDAPlace place; - platform::CUDADeviceContext ctx(place); - op_->Run(scope_, place); - // Execute TRT. - engine_->Execute(batch_size); - cudaStreamSynchronize(*engine_->stream()); + platform::CUDADeviceContext ctx(place_); + op_->Run(scope_, place_); - ASSERT_FALSE(op_desc_->OutputArgumentNames().empty()); - const size_t output_space_size = 3000; + std::vector input_output_names; + + // Note: we need filter the parameter + for (const auto& input : op_desc_->InputArgumentNames()) { + if (parameters_.count(input)) continue; + input_output_names.push_back(input); + } + + // Collect the fluid outputs. + std::vector> fluid_outs; for (const auto& output : op_desc_->OutputArgumentNames()) { if (neglected_output.count(output)) continue; + input_output_names.push_back(output); std::vector fluid_out; - std::vector trt_out(output_space_size); - engine_->GetOutputInCPU(output, &trt_out[0], output_space_size); - cudaStreamSynchronize(*engine_->stream()); - auto* var = scope_.FindVar(output); - auto tensor = var->GetMutable(); + auto* tensor = var->GetMutable(); framework::TensorToVector(*tensor, ctx, &fluid_out); + fluid_outs.push_back(fluid_out); + } + + // Bind input and output for TRT. + const int num_bindings = input_output_names.size(); + std::vector buffers(num_bindings); + + for (const std::string& name : input_output_names) { + auto* var = scope_.FindVar(name); + auto* tensor = var->GetMutable(); + const int bind_index = engine_->engine()->getBindingIndex(name.c_str()); + buffers[bind_index] = + static_cast(tensor->mutable_data(place_)); + } + + // Execute TRT. + engine_->Execute(batch_size, &buffers, stream_); - size_t fluid_out_size = fluid_out.size(); + ASSERT_FALSE(op_desc_->OutputArgumentNames().empty()); + int index = 0; + for (const auto& output : op_desc_->OutputArgumentNames()) { + if (neglected_output.count(output)) continue; + std::vector trt_out; + auto* var = scope_.FindVar(output); + auto* tensor = var->GetMutable(); + framework::TensorToVector(*tensor, ctx, &trt_out); + + size_t fluid_out_size = fluid_outs[index].size(); if (if_add_batch_ == true) { fluid_out_size = batch_size * (framework::product(tensor->dims()) / max_batch_size_); } - // Compare two output - ASSERT_FALSE(fluid_out.empty()); + for (size_t i = 0; i < fluid_out_size; i++) { // Loose the threshold for CI in different machine model. - EXPECT_LT(std::abs(fluid_out[i] - trt_out[i]), 2e-5); + EXPECT_LT(std::abs(fluid_outs[index][i] - trt_out[i]), 2e-5); } + index += 1; } } framework::Scope& scope() { return scope_; } private: + platform::CUDAPlace place_; std::unique_ptr engine_; cudaStream_t stream_; std::unique_ptr op_; diff --git a/paddle/fluid/inference/tensorrt/engine.cc b/paddle/fluid/inference/tensorrt/engine.cc index 14e9e14d33d637ee68e37593cc48721e5169499f..fddf5f11c285da4687b08d1962b6f1f51390e03e 100644 --- a/paddle/fluid/inference/tensorrt/engine.cc +++ b/paddle/fluid/inference/tensorrt/engine.cc @@ -32,36 +32,18 @@ void TensorRTEngine::Build(const DescType &paddle_model) { PADDLE_ENFORCE(false, "not implemented"); } -void TensorRTEngine::Execute(int batch_size) { +void TensorRTEngine::Execute(int batch_size, std::vector *buffers, + cudaStream_t stream) { freshDeviceId(); batch_size_ = batch_size; - std::vector buffers; - for (auto &buf : buffers_) { - PADDLE_ENFORCE_NOT_NULL(buf.buffer, "buffer should be allocated"); - PADDLE_ENFORCE_GT(buf.max_size, 0); - PADDLE_ENFORCE(buf.device == DeviceType::GPU); - buffers.push_back(buf.buffer); - } - PADDLE_ENFORCE_NOT_NULL(stream_); - infer_context_->enqueue(batch_size, buffers.data(), *stream_, nullptr); - cudaStreamSynchronize(*stream_); + infer_context_->enqueue(batch_size, buffers->data(), stream, nullptr); + cudaStreamSynchronize(stream); SetRuntimeBatch(batch_size); } -TensorRTEngine::~TensorRTEngine() { - cudaStreamSynchronize(*stream_); - // clean buffer - for (auto &buf : buffers_) { - if (buf.device == DeviceType::GPU && buf.buffer != nullptr) { - PADDLE_ENFORCE_EQ(0, cudaFree(buf.buffer)); - buf.buffer = nullptr; - buf.max_size = 0; - } - } -} - void TensorRTEngine::FreezeNetwork() { freshDeviceId(); + VLOG(3) << "TRT to freeze network"; PADDLE_ENFORCE(infer_builder_ != nullptr, "Call InitNetwork first to initialize network."); PADDLE_ENFORCE(infer_network_ != nullptr, @@ -69,35 +51,18 @@ void TensorRTEngine::FreezeNetwork() { // build engine. infer_builder_->setMaxBatchSize(max_batch_); infer_builder_->setMaxWorkspaceSize(max_workspace_); + if (enable_int8_) { + infer_builder_->setInt8Mode(true); + PADDLE_ENFORCE( + calibrator_ != nullptr, + "The precision mode is 'INT8', the calibrator should not be nullptr"); + infer_builder_->setInt8Calibrator(calibrator_); + } infer_engine_.reset(infer_builder_->buildCudaEngine(*infer_network_)); PADDLE_ENFORCE(infer_engine_ != nullptr, "build cuda engine failed!"); infer_context_.reset(infer_engine_->createExecutionContext()); - - // allocate GPU buffers. - buffers_.resize(buffer_sizes_.size()); - for (auto &item : buffer_sizes_) { - // The output buffers are not set in the network building phrase, need to - // infer from the TesorRT network. - if (item.second == 0) { - auto slot_offset = infer_engine_->getBindingIndex(item.first.c_str()); - auto dims = infer_engine_->getBindingDimensions(slot_offset); - item.second = kDataTypeSize[static_cast( - infer_engine_->getBindingDataType(slot_offset))] * - analysis::AccuDims(dims.d, dims.nbDims) * max_batch_; - PADDLE_ENFORCE_GT(item.second, 0); - } - - auto &buf = buffer(item.first); - buf.max_size = item.second * max_batch_; - CHECK(buf.buffer == nullptr); // buffer should be allocated only once. - - PADDLE_ENFORCE_EQ(0, cudaMalloc(&buf.buffer, item.second * max_batch_)); - buf.size = 0; - PADDLE_ENFORCE_LE(buf.max_size, 1 << 30); // 10G - buf.device = DeviceType::GPU; - } } nvinfer1::ITensor *TensorRTEngine::DeclareInput(const std::string &name, @@ -133,6 +98,10 @@ void TensorRTEngine::DeclareOutput(const nvinfer1::ILayer *layer, int offset, buffer_sizes_[name] = 0; } +bool TensorRTEngine::HasDeclared(const std::string &name) { + return buffer_sizes_.count(name) > 0; +} + void TensorRTEngine::DeclareOutput(const std::string &name) { PADDLE_ENFORCE_EQ(0, buffer_sizes_.count(name), "duplicate output name %s", name); @@ -147,83 +116,6 @@ void TensorRTEngine::DeclareOutput(const std::string &name) { buffer_sizes_[name] = 0; } -void *TensorRTEngine::GetOutputInGPU(const std::string &name) { - return buffer(name).buffer; -} - -void TensorRTEngine::GetOutputInGPU(const std::string &name, void *dst, - size_t max_size) { - // determine data size - auto *output = TensorRTEngine::GetITensor(name); - nvinfer1::Dims dims = output->getDimensions(); - auto dim_size = analysis::AccuDims(dims.d, dims.nbDims); - size_t dst_size = dim_size * runtime_batch_ * - kDataTypeSize[static_cast(output->getType())]; - - auto it = buffer_sizes_.find(name); - PADDLE_ENFORCE(it != buffer_sizes_.end()); - PADDLE_ENFORCE_GT(it->second, 0); - PADDLE_ENFORCE_LE(dst_size, it->second); - PADDLE_ENFORCE_GE(max_size, dst_size); - auto &buf = buffer(name); - PADDLE_ENFORCE_NOT_NULL(buf.buffer, "buffer should be allocated before"); - PADDLE_ENFORCE_EQ(cudaMemcpyAsync(dst, buf.buffer, dst_size, - cudaMemcpyDeviceToDevice, *stream_), - 0); -} - -void TensorRTEngine::GetOutputInCPU(const std::string &name, void *dst, - size_t max_size) { - // determine data size - - auto *output = TensorRTEngine::GetITensor(name); - nvinfer1::Dims dims = output->getDimensions(); - auto dim_size = analysis::AccuDims(dims.d, dims.nbDims); - size_t dst_size = dim_size * runtime_batch_ * - kDataTypeSize[static_cast(output->getType())]; - auto it = buffer_sizes_.find(name); - PADDLE_ENFORCE(it != buffer_sizes_.end()); - PADDLE_ENFORCE_GT(it->second, 0); - PADDLE_ENFORCE_LE(dst_size, it->second); - PADDLE_ENFORCE_GE(max_size, dst_size); - auto &buf = buffer(name); - PADDLE_ENFORCE_NOT_NULL(buf.buffer, "buffer should be allocated before"); - PADDLE_ENFORCE_EQ(0, cudaMemcpyAsync(dst, buf.buffer, dst_size, - cudaMemcpyDeviceToHost, *stream_)); -} - -Buffer &TensorRTEngine::buffer(const std::string &name) { - PADDLE_ENFORCE(infer_engine_ != nullptr, "call FreezeNetwork first."); - auto it = buffer_sizes_.find(name); - PADDLE_ENFORCE(it != buffer_sizes_.end()); - auto slot_offset = infer_engine_->getBindingIndex(name.c_str()); - return buffers_[slot_offset]; -} - -void TensorRTEngine::SetInputFromCPU(const std::string &name, const void *data, - size_t size) { - auto &buf = buffer(name); - PADDLE_ENFORCE_NOT_NULL(buf.buffer); - PADDLE_ENFORCE_NOT_NULL(data); - PADDLE_ENFORCE_NOT_NULL(stream_); - PADDLE_ENFORCE_LE(size, buf.max_size, "buffer is too small"); - PADDLE_ENFORCE(buf.device == DeviceType::GPU); - buf.size = size; - PADDLE_ENFORCE_EQ(0, cudaMemcpyAsync(buf.buffer, data, size, - cudaMemcpyHostToDevice, *stream_)); -} - -void TensorRTEngine::SetInputFromGPU(const std::string &name, const void *data, - size_t size) { - auto &buf = buffer(name); - buf.size = size; - PADDLE_ENFORCE_NOT_NULL(buf.buffer); - PADDLE_ENFORCE_LE(size, buf.max_size, "buffer is too small"); - PADDLE_ENFORCE(buf.device == DeviceType::GPU); - PADDLE_ENFORCE_EQ(0, cudaMemcpyAsync(buf.buffer, data, size, - cudaMemcpyDeviceToDevice, *stream_)); -} - void TensorRTEngine::SetITensor(const std::string &name, nvinfer1::ITensor *tensor) { PADDLE_ENFORCE(tensor != nullptr); @@ -243,11 +135,18 @@ void TensorRTEngine::SetRuntimeBatch(size_t batch_size) { int TensorRTEngine::GetRuntimeBatch() { return runtime_batch_; } +nvinfer1::IPluginLayer *TensorRTEngine::AddPlugin( + nvinfer1::ITensor *const *inputs, int num_inputs, + plugin::PluginTensorRT *plugin) { + owned_plugin_.emplace_back(plugin); + return infer_network_.get()->addPluginExt(inputs, num_inputs, *plugin); +} + void TensorRTEngine::freshDeviceId() { int count; cudaGetDeviceCount(&count); - PADDLE_ENFORCE_LT(device_, count); - cudaSetDevice(device_); + PADDLE_ENFORCE_LT(device_id_, count); + cudaSetDevice(device_id_); } } // namespace tensorrt diff --git a/paddle/fluid/inference/tensorrt/engine.h b/paddle/fluid/inference/tensorrt/engine.h index bd3ba4cea6551a7f6651e311e2649de191a6faa1..657dfd9355f9e3167a123b1f71655869d030a3df 100644 --- a/paddle/fluid/inference/tensorrt/engine.h +++ b/paddle/fluid/inference/tensorrt/engine.h @@ -22,23 +22,30 @@ limitations under the License. */ #include "paddle/fluid/framework/tensor.h" #include "paddle/fluid/inference/engine.h" #include "paddle/fluid/inference/tensorrt/helper.h" +#include "paddle/fluid/inference/tensorrt/plugin/trt_plugin.h" +#include "paddle/fluid/inference/tensorrt/plugin/trt_plugin_factory.h" +#include "paddle/fluid/inference/tensorrt/trt_int8_calibrator.h" #include "paddle/fluid/inference/utils/singleton.h" namespace paddle { namespace inference { namespace tensorrt { +class TRTInt8Calibrator; /* * TensorRT Engine. * * There are two alternative ways to use it, one is to build from a paddle * protobuf model, another way is to manully construct the network. */ -class TensorRTEngine : public EngineBase { +class TensorRTEngine { + using DescType = ::paddle::framework::proto::BlockDesc; + public: // Weight is model parameter. class Weight { public: + Weight() = default; Weight(nvinfer1::DataType dtype, void* value, size_t num_elem) { w_.type = dtype; w_.values = value; @@ -52,28 +59,28 @@ class TensorRTEngine : public EngineBase { nvinfer1::Weights w_; }; - TensorRTEngine(int max_batch, int max_workspace, - cudaStream_t* stream = nullptr, int device = 0, + TensorRTEngine(int max_batch, int max_workspace, bool enable_int8 = false, + TRTInt8Calibrator* calibrator = nullptr, int device_id = 0, nvinfer1::ILogger& logger = NaiveLogger::Global()) : max_batch_(max_batch), max_workspace_(max_workspace), - stream_(stream ? stream : &default_stream_), - logger_(logger), - device_(device) { - freshDeviceId(); - cudaStreamCreate(stream_); - } + enable_int8_(enable_int8), + calibrator_(calibrator), + device_id_(device_id), + logger_(logger) {} - virtual ~TensorRTEngine(); + ~TensorRTEngine() {} // TODO(Superjomn) implement it later when graph segmentation is supported. - void Build(const DescType& paddle_model) override; + void Build(const DescType& paddle_model); - void Execute(int batch_size) override; + void Execute(int batch_size, std::vector* buffers, + cudaStream_t stream); // Initialize the inference network, so that TensorRT layers can add to this // network. void InitNetwork() { + freshDeviceId(); infer_builder_.reset(createInferBuilder(&logger_)); infer_network_.reset(infer_builder_->createNetwork()); } @@ -91,38 +98,39 @@ class TensorRTEngine : public EngineBase { const std::string& name); // Set the itensor_map_[name] as the network's output, and set its name. void DeclareOutput(const std::string& name); + // Check if the ITensor has been declared + bool HasDeclared(const std::string& name); - // GPU memory address for an ITensor with specific name. One can operate on - // these memory directly for acceleration, for example, output the converted - // data directly to the buffer to save data copy overhead. - // NOTE this should be used after calling `FreezeNetwork`. - Buffer& buffer(const std::string& name) override; - - cudaStream_t* stream() { return stream_; } - - // Fill an input from CPU memory with name and size. - void SetInputFromCPU(const std::string& name, const void* data, size_t size); - // TODO(Superjomn) is this method necessary given that buffer(xxx) can be - // accessed directly. Fill an input from GPU memory with name and size. - void SetInputFromGPU(const std::string& name, const void* data, size_t size); - // Get an output called name, the output of tensorrt is in GPU, so this method - // Return the output's GPU memory address without copy. - void* GetOutputInGPU(const std::string& name); - // Copy data into dst inside the GPU device. - void GetOutputInGPU(const std::string& name, void* dst, size_t max_size); - // LOW EFFICENCY! Get output to CPU, this will trigger a memory copy from GPU - // to CPU. - void GetOutputInCPU(const std::string& name, void* dst, size_t max_size); - // Fill an ITensor into map itensor_map_. void SetITensor(const std::string& name, nvinfer1::ITensor* tensor); // Get an ITensor called name. nvinfer1::ITensor* GetITensor(const std::string& name); nvinfer1::ICudaEngine* engine() { return infer_engine_.get(); } nvinfer1::INetworkDefinition* network() { return infer_network_.get(); } + + nvinfer1::IHostMemory* Serialize() { + PADDLE_ENFORCE(infer_engine_ != nullptr, + "You should build engine first and then serialize"); + ihost_memory_.reset(infer_engine_->serialize()); + return ihost_memory_.get(); + } + + void Deserialize(const std::string& engine_serialized_data) { + freshDeviceId(); + infer_ptr runtime(createInferRuntime(&logger_)); + infer_engine_.reset(runtime->deserializeCudaEngine( + engine_serialized_data.c_str(), engine_serialized_data.size(), + &inference::Singleton::Global())); + PADDLE_ENFORCE(infer_engine_ != nullptr, + "build cuda engine failed when deserialize engine info.!"); + infer_context_.reset(infer_engine_->createExecutionContext()); + } + void SetRuntimeBatch(size_t batch_size); int GetRuntimeBatch(); - int GetDevice() { return device_; } + int GetDeviceId() { return device_id_; } + nvinfer1::IPluginLayer* AddPlugin(nvinfer1::ITensor* const* inputs, + int num_inputs, plugin::PluginTensorRT*); // A pointer to CPU memory is needed of the TRT weight. // Before TRT runs, fluid loads weight into GPU storage. @@ -133,6 +141,11 @@ class TensorRTEngine : public EngineBase { weight_map; private: + // Each ICudaEngine object is bound to a specific GPU when it is instantiated, + // ensure that the thread is associated with the correct device by calling + // freshDeviceId(). + void freshDeviceId(); + // the max batch size int max_batch_; // the runtime batch size @@ -140,20 +153,20 @@ class TensorRTEngine : public EngineBase { // the max memory size the engine uses int max_workspace_; + bool enable_int8_; + TRTInt8Calibrator* calibrator_; // batch size of the current data, will be updated each Executation. int batch_size_{-1}; - cudaStream_t* stream_; - // If stream_ is not set from outside, hold its own stream. - cudaStream_t default_stream_; + + int device_id_; nvinfer1::ILogger& logger_; - std::vector buffers_; // max data size for the buffers. std::unordered_map buffer_sizes_; std::unordered_map itensor_map_; - // The specific GPU id that the TensorRTEngine bounded to. - int device_; + + std::vector> owned_plugin_; // TensorRT related internal members template @@ -170,15 +183,11 @@ class TensorRTEngine : public EngineBase { infer_ptr infer_network_; infer_ptr infer_engine_; infer_ptr infer_context_; - // Each ICudaEngine object is bound to a specific GPU when it is instantiated, - // ensure that the thread is associated with the correct device by calling - // freshDeviceId(). - void freshDeviceId(); + infer_ptr ihost_memory_; }; // class TensorRTEngine // Add an layer__ into engine__ with args ARGS. // For example: -// TRT_ENGINE_ADD_LAYER(xxx, FullyConnected, input, dim, weights, bias) // // Reference // https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#charRNN_define_network @@ -190,38 +199,6 @@ class TensorRTEngine : public EngineBase { #define TRT_ENGINE_ADD_LAYER(engine__, layer__, ARGS...) \ engine__->network()->add##layer__(ARGS); -/* - * Helper to control the TensorRT engine's creation and deletion. - */ -class TRT_EngineManager { - public: - bool HasEngine(const std::string& name) const { - return engines_.count(name) != 0; - } - - // Get an engine called `name`. - TensorRTEngine* Get(const std::string& name) const { - return engines_.at(name).get(); - } - - // Create or get an engine called `name` - TensorRTEngine* Create(int max_batch, int max_workspace, cudaStream_t* stream, - const std::string& name, int gpu_device = 0) { - auto* p = new TensorRTEngine(max_batch, max_workspace, stream, gpu_device); - engines_[name].reset(p); - return p; - } - - void DeleteALl() { - for (auto& item : engines_) { - item.second.reset(nullptr); - } - } - - private: - std::unordered_map> engines_; -}; - } // namespace tensorrt } // namespace inference } // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/helper.h b/paddle/fluid/inference/tensorrt/helper.h index b6e7968108403c9c9c192759c44eac040d1c5073..010942a0678fe9a592d1a95ba9cdc6adc42cc2ec 100644 --- a/paddle/fluid/inference/tensorrt/helper.h +++ b/paddle/fluid/inference/tensorrt/helper.h @@ -17,6 +17,9 @@ #include #include #include +#include +#include +#include #include "paddle/fluid/platform/dynload/tensorrt.h" #include "paddle/fluid/platform/enforce.h" @@ -52,7 +55,7 @@ class NaiveLogger : public nvinfer1::ILogger { void log(nvinfer1::ILogger::Severity severity, const char* msg) override { switch (severity) { case Severity::kINFO: - LOG(INFO) << msg; + VLOG(3) << msg; break; case Severity::kWARNING: LOG(WARNING) << msg; @@ -74,6 +77,32 @@ class NaiveLogger : public nvinfer1::ILogger { ~NaiveLogger() override {} }; +class NaiveProfiler : public nvinfer1::IProfiler { + public: + typedef std::pair Record; + std::vector mProfile; + + virtual void reportLayerTime(const char* layerName, float ms) { + auto record = + std::find_if(mProfile.begin(), mProfile.end(), + [&](const Record& r) { return r.first == layerName; }); + if (record == mProfile.end()) + mProfile.push_back(std::make_pair(layerName, ms)); + else + record->second += ms; + } + + void printLayerTimes() { + float totalTime = 0; + for (size_t i = 0; i < mProfile.size(); i++) { + printf("%-40.40s %4.3fms\n", mProfile[i].first.c_str(), + mProfile[i].second); + totalTime += mProfile[i].second; + } + printf("Time over all layers: %4.3f\n", totalTime); + } +}; + } // namespace tensorrt } // namespace inference } // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/op_teller.cc b/paddle/fluid/inference/tensorrt/op_teller.cc new file mode 100644 index 0000000000000000000000000000000000000000..9fecad6eb3889f48f2e0012a718ed0d04f34ae66 --- /dev/null +++ b/paddle/fluid/inference/tensorrt/op_teller.cc @@ -0,0 +1,49 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/tensorrt/op_teller.h" + +namespace paddle { +namespace inference { +namespace tensorrt { + +// Just tell by the op_types. +struct SimpleOpTypeSetTeller : public Teller { + SimpleOpTypeSetTeller() {} + + bool operator()(const std::string& op_type, + const framework::OpDesc& desc) override { + return teller_set.count(op_type); + } + + private: + std::unordered_set teller_set{ + {"mul", "conv2d", "pool2d", "relu", "softmax", "sigmoid", + "depthwise_conv2d", "batch_norm", "concat", "tanh", "pad", + "elementwise_add", "elementwise_mul", "dropout", "split", "prelu", + "conv2d_transpose", "leaky_relu"}}; +}; + +bool OpTeller::Tell(const std::string& op_type, const framework::OpDesc& desc) { + for (auto& teller : tellers_) { + if ((*teller)(op_type, desc)) return true; + } + return false; +} + +OpTeller::OpTeller() { tellers_.emplace_back(new SimpleOpTypeSetTeller); } + +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/op_teller.h b/paddle/fluid/inference/tensorrt/op_teller.h new file mode 100644 index 0000000000000000000000000000000000000000..b98f052bf2478098d74f19858ec79823d5ab1e2d --- /dev/null +++ b/paddle/fluid/inference/tensorrt/op_teller.h @@ -0,0 +1,68 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include "paddle/fluid/framework/op_desc.h" + +namespace paddle { +namespace inference { +namespace tensorrt { + +/* + * Single Op teller definition. + * One can override this and define a more complex tell logic, considerring more + * issues such as op_desc. + */ +struct Teller { + virtual bool operator()(const std::string& op_type, + const framework::OpDesc& desc) = 0; + + virtual ~Teller() = default; +}; +/* + * A real example: + * + * struct SomeTeller : public Teller { + * bool operator()(const std::string& op_type, + * const framework::OpDesc& desc) override { + * return op_type == "fc" && desc.Inputs().size() == 2; + * } + *}; + */ + +/* + * class OpTeller helps to tell whether a fluid + * operator can be transformed to a TensorRT layer. + */ +class OpTeller { + public: + static OpTeller& Global() { + static std::unique_ptr x(new OpTeller); + return *x; + } + + bool Tell(const std::string& op_type, const framework::OpDesc& desc); + + private: + OpTeller(); + + private: + std::vector> tellers_; +}; + +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/plugin/CMakeLists.txt b/paddle/fluid/inference/tensorrt/plugin/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..709aa103d1b6681221328b180d65e90f08d3368e --- /dev/null +++ b/paddle/fluid/inference/tensorrt/plugin/CMakeLists.txt @@ -0,0 +1,5 @@ +nv_library(tensorrt_plugin + SRCS trt_plugin.cc split_op_plugin.cu elementwise_op_plugin.cu + prelu_op_plugin.cu trt_plugin_factory.cc + avg_pool_op_plugin.cu + DEPS enforce tensorrt_engine prelu) diff --git a/paddle/fluid/inference/tensorrt/plugin/avg_pool_op_plugin.cu b/paddle/fluid/inference/tensorrt/plugin/avg_pool_op_plugin.cu new file mode 100644 index 0000000000000000000000000000000000000000..f27a838162c89b6377a7ffd995608b3a5a49eeae --- /dev/null +++ b/paddle/fluid/inference/tensorrt/plugin/avg_pool_op_plugin.cu @@ -0,0 +1,71 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/tensorrt/plugin/avg_pool_op_plugin.h" +#include "paddle/fluid/inference/tensorrt/plugin/trt_plugin_factory.h" +#include "paddle/fluid/operators/math/pooling.h" + +namespace paddle { +namespace inference { +namespace tensorrt { +namespace plugin { + +AvgPoolPlugin* CreateAvgPoolPluginDeserialize(const void* buffer, + size_t length) { + return new AvgPoolPlugin(buffer, length); +} +REGISTER_TRT_PLUGIN("avg_pool_plugin", CreateAvgPoolPluginDeserialize); + +nvinfer1::Dims AvgPoolPlugin::getOutputDimensions( + int index, const nvinfer1::Dims* inputDims, int nbInputs) { + assert(nbInputs == 1); + assert(index == 0); + assert(inputDims[0].nbDims == 3); + nvinfer1::Dims const& input_dims = inputDims[0]; + + nvinfer1::Dims output_dims = input_dims; + + output_dims.d[1] = output_shape_[1]; + output_dims.d[2] = output_shape_[2]; + return output_dims; +} + +int AvgPoolPlugin::enqueue(int batchSize, const void* const* inputs, + void** outputs, void* workspace, + cudaStream_t stream) { + auto const& input_dims = this->getInputDims(0); + int input_size = 0; + float const* idata = reinterpret_cast(inputs[0]); + float** odatas = reinterpret_cast(outputs); + + paddle::operators::math::AvgPool pool_process; + paddle::operators::math::Pool2dDirectCUDAFunctor< + paddle::operators::math::AvgPool, float> + pool2d_forward; + + std::vector input_shape = input_shape_; + std::vector output_shape = output_shape_; + input_shape.insert(input_shape.begin(), batchSize); + output_shape.insert(output_shape.begin(), batchSize); + + pool2d_forward(idata, input_shape, output_shape, ksize_, strides_, paddings_, + pool_process, true, odatas[0], stream); + + return cudaGetLastError() != cudaSuccess; +} + +} // namespace plugin +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/plugin/avg_pool_op_plugin.h b/paddle/fluid/inference/tensorrt/plugin/avg_pool_op_plugin.h new file mode 100644 index 0000000000000000000000000000000000000000..a7c0aa5794e6bb131d012cb12d6d9fc12a73bd0d --- /dev/null +++ b/paddle/fluid/inference/tensorrt/plugin/avg_pool_op_plugin.h @@ -0,0 +1,115 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include "paddle/fluid/inference/tensorrt/plugin/trt_plugin.h" + +namespace paddle { +namespace inference { +namespace tensorrt { +namespace plugin { + +class AvgPoolPlugin : public PluginTensorRT { + private: + bool ceil_mode_; + std::vector ksize_; + std::vector strides_; + std::vector paddings_; + std::vector input_shape_; + std::vector output_shape_; + + protected: + size_t getSerializationSize() override { + return SerializedSize(getPluginType()) + SerializedSize(ceil_mode_) + + SerializedSize(ksize_) + SerializedSize(strides_) + + SerializedSize(paddings_) + SerializedSize(input_shape_) + + SerializedSize(output_shape_) + getBaseSerializationSize(); + } + + // TRT will call this func when we need to serialize the configuration of + // tensorrt. + void serialize(void *buffer) override { + SerializeValue(&buffer, getPluginType()); + serializeBase(buffer); + SerializeValue(&buffer, ceil_mode_); + SerializeValue(&buffer, ksize_); + SerializeValue(&buffer, strides_); + SerializeValue(&buffer, paddings_); + SerializeValue(&buffer, input_shape_); + SerializeValue(&buffer, output_shape_); + } + + public: + AvgPoolPlugin() {} + AvgPoolPlugin(bool ceil_mode, std::vector ksize, + std::vector strides, std::vector paddings, + std::vector input_shape) + : ceil_mode_(ceil_mode), + ksize_(ksize), + strides_(strides), + paddings_(paddings), + input_shape_(input_shape) { + int output_h, output_w; + output_shape_ = input_shape_; + if (!ceil_mode_) { + output_h = + (input_shape[1] - ksize_[0] + 2 * paddings_[0]) / strides_[0] + 1; + output_w = + (input_shape[2] - ksize_[1] + 2 * paddings_[1]) / strides_[1] + 1; + } else { + output_h = + (input_shape[1] - ksize_[0] + 2 * paddings_[0] + strides_[0] - 1) / + strides_[0] + + 1; + output_w = + (input_shape[2] - ksize_[1] + 2 * paddings_[1] + strides_[1] - 1) / + strides_[1] + + 1; + } + output_shape_[1] = output_h; + output_shape_[2] = output_w; + } + + // It was used for tensorrt deserialization. + // It should not be called by users. + AvgPoolPlugin(void const *serialData, size_t serialLength) { + deserializeBase(serialData, serialLength); + DeserializeValue(&serialData, &serialLength, &ceil_mode_); + DeserializeValue(&serialData, &serialLength, &ksize_); + DeserializeValue(&serialData, &serialLength, &strides_); + DeserializeValue(&serialData, &serialLength, &paddings_); + DeserializeValue(&serialData, &serialLength, &input_shape_); + DeserializeValue(&serialData, &serialLength, &output_shape_); + } + + AvgPoolPlugin *clone() const override { + return new AvgPoolPlugin(ceil_mode_, ksize_, strides_, paddings_, + input_shape_); + } + + const char *getPluginType() const override { return "avg_pool_plugin"; } + int getNbOutputs() const override { return 1; } + nvinfer1::Dims getOutputDimensions(int index, const nvinfer1::Dims *inputs, + int nbInputDims) override; + int initialize() override { return 0; } + int enqueue(int batchSize, const void *const *inputs, void **outputs, + void *workspace, cudaStream_t stream) override; +}; + +} // namespace plugin +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/plugin/elementwise_op_plugin.cu b/paddle/fluid/inference/tensorrt/plugin/elementwise_op_plugin.cu new file mode 100644 index 0000000000000000000000000000000000000000..9aed3ddab1448fde7cb6b0e13bcf0b05e23622e9 --- /dev/null +++ b/paddle/fluid/inference/tensorrt/plugin/elementwise_op_plugin.cu @@ -0,0 +1,145 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "paddle/fluid/inference/tensorrt/plugin/elementwise_op_plugin.h" +#include "paddle/fluid/inference/tensorrt/plugin/trt_plugin_factory.h" + +namespace paddle { +namespace inference { +namespace tensorrt { +namespace plugin { + +ElementWisePlugin* CreateElementWisePluginDeserialize(const void* buffer, + size_t length) { + return new ElementWisePlugin(buffer, length); +} +REGISTER_TRT_PLUGIN("elementwise_plugin", CreateElementWisePluginDeserialize); + +namespace details { + +template +struct Add { + __device__ T operator()(const T& a, const T& b) const { return a + b; } +}; + +template +struct Mul { + __device__ T operator()(const T& a, const T& b) const { return a * b; } +}; + +template +__global__ void ColumnWiseKernel(Operator op, const T* x, const T* y, T* out, + int batch_size, int num_rows, int num_cols) { + for (int batch_id = 0; batch_id < batch_size; ++batch_id) { + int row = blockIdx.x; + for (; row < num_rows; row += gridDim.x) { + T value_y = y[batch_id * num_rows + row]; + int col = threadIdx.x; + int offset = (batch_id * num_rows + row) * num_cols; + for (; col < num_cols; col += blockDim.x) { + T value_x = x[offset + col]; + out[offset + col] = op(value_x, value_y); + } + } + } +} + +template +static void ElementWise(Operator op, const T* x, const T* y, T* out, + int batch_size, int prev, int midd, int post, + cudaStream_t stream) { + const int kThreadsPerBlock = 1024; + const int kMaximumBlocks = 65535; + if (prev == 1) { + int num_threads = (post > kThreadsPerBlock) ? kThreadsPerBlock + : (((post + 31) >> 5) << 5); + int num_blocks = (midd < kMaximumBlocks) ? midd : kMaximumBlocks; + ColumnWiseKernel<<>>( + op, x, y, out, batch_size, midd, post); + } else if (post == 1) { + PADDLE_THROW("Not implemented."); + } else { + PADDLE_THROW("Not implemented."); + } +} + +} // namespace details + +nvinfer1::Dims ElementWisePlugin::getOutputDimensions( + int index, const nvinfer1::Dims* input_dims, int num_inputs) { + PADDLE_ENFORCE_EQ(index, 0); + PADDLE_ENFORCE_EQ(num_inputs, 2); + PADDLE_ENFORCE_NOT_NULL(input_dims); + return input_dims[0]; +} + +int ElementWisePlugin::initialize() { + PADDLE_ENFORCE_GT(dims_y_.nbDims, 0); + + axis_ = (axis_ == -1) ? dims_x_.nbDims - dims_y_.nbDims : axis_; + int trimed_nb_dims = dims_y_.nbDims; + for (; trimed_nb_dims > 0; --trimed_nb_dims) { + if (dims_y_.d[trimed_nb_dims - 1] != 1) { + break; + } + } + dims_y_.nbDims = trimed_nb_dims; + + PADDLE_ENFORCE_GE(dims_x_.nbDims, dims_y_.nbDims + axis_); + PADDLE_ENFORCE_LT(axis_, dims_x_.nbDims); + + prev_size_ = 1; + midd_size_ = 1; + post_size_ = 1; + for (int i = 0; i < axis_; ++i) { + prev_size_ *= dims_x_.d[i]; + } + + for (int i = 0; i < dims_y_.nbDims; ++i) { + PADDLE_ENFORCE_EQ(dims_x_.d[i + axis_], dims_y_.d[i], + "Broadcast dimension mismatch."); + midd_size_ *= dims_y_.d[i]; + } + + for (int i = axis_ + dims_y_.nbDims; i < dims_x_.nbDims; ++i) { + post_size_ *= dims_x_.d[i]; + } + return 0; +} + +int ElementWisePlugin::enqueue(int batch_size, const void* const* inputs, + void** outputs, void* workspace, + cudaStream_t stream) { + const float* x = reinterpret_cast(inputs[0]); + const float* y = reinterpret_cast(inputs[1]); + float* out = reinterpret_cast(outputs[0]); + + if (type_ == "add") { + details::ElementWise(details::Add(), x, y, out, batch_size, + prev_size_, midd_size_, post_size_, stream); + } else if (type_ == "mul") { + details::ElementWise(details::Mul(), x, y, out, batch_size, + prev_size_, midd_size_, post_size_, stream); + } else { + PADDLE_THROW("Not implemented."); + } + + return cudaGetLastError() != cudaSuccess; +} + +} // namespace plugin +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/plugin/elementwise_op_plugin.h b/paddle/fluid/inference/tensorrt/plugin/elementwise_op_plugin.h new file mode 100644 index 0000000000000000000000000000000000000000..3b040f14c531c540b8a855da85ecc3008224526c --- /dev/null +++ b/paddle/fluid/inference/tensorrt/plugin/elementwise_op_plugin.h @@ -0,0 +1,93 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include "paddle/fluid/inference/tensorrt/plugin/trt_plugin.h" + +namespace paddle { +namespace inference { +namespace tensorrt { +namespace plugin { + +class ElementWisePlugin : public PluginTensorRT { + public: + ElementWisePlugin(std::string type, nvinfer1::Dims const &dims_x, + nvinfer1::Dims const &dims_y, int axis) + : type_(type), + dims_x_(dims_x), + dims_y_(dims_y), + axis_(axis), + prev_size_(1), + midd_size_(1), + post_size_(1) {} + + ElementWisePlugin(void const *serial_data, size_t serial_length) { + deserializeBase(serial_data, serial_length); + const char *elementwise_type; + DeserializeValue(&serial_data, &serial_length, &elementwise_type); + type_ = std::string(elementwise_type); + DeserializeValue(&serial_data, &serial_length, &axis_); + DeserializeValue(&serial_data, &serial_length, &dims_x_); + DeserializeValue(&serial_data, &serial_length, &dims_y_); + } + + ElementWisePlugin *clone() const override { + // return new ElementWisePlugin(dims_x_, dims_y_, axis_); + return nullptr; + } + + const char *getPluginType() const override { return "elementwise_plugin"; } + + nvinfer1::Dims getOutputDimensions(int index, + const nvinfer1::Dims *input_dims, + int num_inputs) override; + + int initialize() override; + + // execute the layer + int enqueue(int batch_size, const void *const *inputs, void **outputs, + void *workspace, cudaStream_t stream); + + protected: + size_t getSerializationSize() override { + return SerializedSize(getPluginType()) + SerializedSize(axis_) + + SerializedSize(dims_x_) + SerializedSize(dims_y_) + + getBaseSerializationSize(); + } + + void serialize(void *buffer) override { + SerializeValue(&buffer, getPluginType()); + serializeBase(buffer); + SerializeValue(&buffer, type_.c_str()); + SerializeValue(&buffer, axis_); + SerializeValue(&buffer, dims_x_); + SerializeValue(&buffer, dims_y_); + } + + std::string type_; + nvinfer1::Dims dims_x_; + nvinfer1::Dims dims_y_; + int axis_; + int prev_size_; + int midd_size_; + int post_size_; +}; + +} // namespace plugin +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/plugin/prelu_op_plugin.cu b/paddle/fluid/inference/tensorrt/plugin/prelu_op_plugin.cu new file mode 100644 index 0000000000000000000000000000000000000000..b8a044fe99b91893c8c9ef661b4f46ebaa6db8c7 --- /dev/null +++ b/paddle/fluid/inference/tensorrt/plugin/prelu_op_plugin.cu @@ -0,0 +1,82 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include +#include +#include "glog/logging.h" +#include "paddle/fluid/inference/tensorrt/plugin/prelu_op_plugin.h" +#include "paddle/fluid/inference/tensorrt/plugin/trt_plugin_factory.h" +#include "paddle/fluid/operators/math/prelu.h" + +namespace paddle { +namespace inference { +namespace tensorrt { +namespace plugin { + +PReluPlugin *CreatePreluPluginDeserialize(const void *buffer, size_t length) { + return new PReluPlugin(buffer, length); +} +REGISTER_TRT_PLUGIN("prelu_plugin", CreatePreluPluginDeserialize); + +int PReluPlugin::initialize() { + cudaMalloc(&p_gpu_weight_, sizeof(float) * weight_.size()); + cudaMemcpy(p_gpu_weight_, weight_.data(), weight_.size() * sizeof(float), + cudaMemcpyHostToDevice); +} + +nvinfer1::Dims PReluPlugin::getOutputDimensions(int index, + const nvinfer1::Dims *inputDims, + int nbInputs) { + assert(nbInputs == 1); + assert(index < this->getNbOutputs()); + nvinfer1::Dims const &input_dims = inputDims[0]; + nvinfer1::Dims output_dims = input_dims; + return output_dims; +} + +int PReluPlugin::enqueue(int batch_size, const void *const *inputs, + void **outputs, void *workspace, cudaStream_t stream) { + // input dims is CHW. + const auto &input_dims = this->getInputDims(0); + const float *input = reinterpret_cast(inputs[0]); + // const float *alpha = reinterpret_cast(alpha_.get().values); + const float *alpha = p_gpu_weight_; + float *output = reinterpret_cast(outputs)[0]; + + std::vector input_shape; + input_shape.push_back(batch_size); + for (int i = 0; i < input_dims.nbDims; i++) { + input_shape.push_back(input_dims.d[i]); + } + + if (mode_ == "channel") { + operators::math::PreluChannelWiseDirectCUDAFunctor + prelu_channel_wise; + prelu_channel_wise(stream, input, alpha, output, input_shape); + } else if (mode_ == "element") { + operators::math::PreluElementWiseDirectCUDAFunctor + prelu_element_wise; + prelu_element_wise(stream, input, alpha, output, input_shape); + } else { + operators::math::PreluScalarDirectCUDAFunctor prelu_scalar; + prelu_scalar(stream, input, alpha, output, input_shape); + } + return cudaGetLastError() != cudaSuccess; +} + +} // namespace plugin +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/plugin/prelu_op_plugin.h b/paddle/fluid/inference/tensorrt/plugin/prelu_op_plugin.h new file mode 100644 index 0000000000000000000000000000000000000000..a96649503f1c764e07370cb2b47b10f3dae72be4 --- /dev/null +++ b/paddle/fluid/inference/tensorrt/plugin/prelu_op_plugin.h @@ -0,0 +1,87 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include "paddle/fluid/framework/tensor.h" +#include "paddle/fluid/framework/tensor_util.h" + +#include "paddle/fluid/inference/tensorrt/engine.h" +#include "paddle/fluid/inference/tensorrt/plugin/trt_plugin.h" + +namespace paddle { +namespace inference { +namespace tensorrt { +namespace plugin { + +class PReluPlugin : public PluginTensorRT { + std::vector weight_; + float *p_gpu_weight_; + std::string mode_; + + protected: + size_t getSerializationSize() override { + return getBaseSerializationSize() + SerializedSize(mode_.c_str()) + + SerializedSize(weight_) + SerializedSize(getPluginType()); + } + + // TRT will call this func when we need to serialize the configuration of + // tensorrt. + // It should not be called by users. + void serialize(void *buffer) override { + SerializeValue(&buffer, getPluginType()); + serializeBase(buffer); + SerializeValue(&buffer, weight_); + SerializeValue(&buffer, mode_.c_str()); + } + + public: + PReluPlugin(const float *weight, const int weight_num, + std::string const &mode) + : mode_(mode) { + weight_.resize(weight_num); + std::copy(weight, weight + weight_num, weight_.data()); + } + + // It was used for tensorrt deserialization. + // It should not be called by users. + PReluPlugin(void const *serialData, size_t serialLength) { + deserializeBase(serialData, serialLength); + DeserializeValue(&serialData, &serialLength, &weight_); + const char *prelu_mode; + DeserializeValue(&serialData, &serialLength, &prelu_mode); + mode_ = std::string(prelu_mode); + } + ~PReluPlugin() { cudaFree(p_gpu_weight_); } + int initialize() override; + + PReluPlugin *clone() const override { + return new PReluPlugin(weight_.data(), weight_.size(), mode_); + } + + const char *getPluginType() const override { return "prelu_plugin"; } + int getNbOutputs() const override { return 1; } + nvinfer1::Dims getOutputDimensions(int index, const nvinfer1::Dims *inputs, + int nbInputDims) override; + int enqueue(int batchSize, const void *const *inputs, void **outputs, + void *workspace, cudaStream_t stream) override; +}; + +} // namespace plugin +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/plugin/split_op_plugin.cu b/paddle/fluid/inference/tensorrt/plugin/split_op_plugin.cu new file mode 100644 index 0000000000000000000000000000000000000000..b5503c3b95ee2429dd865fd6de416a04aafbccf0 --- /dev/null +++ b/paddle/fluid/inference/tensorrt/plugin/split_op_plugin.cu @@ -0,0 +1,184 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include +#include "paddle/fluid/inference/tensorrt/plugin/split_op_plugin.h" +#include "paddle/fluid/inference/tensorrt/plugin/trt_plugin_factory.h" + +namespace paddle { +namespace inference { +namespace tensorrt { +namespace plugin { + +SplitPlugin* CreateSplitPluginDeserialize(const void* buffer, size_t length) { + return new SplitPlugin(buffer, length); +} +REGISTER_TRT_PLUGIN("split_plugin", CreateSplitPluginDeserialize); + +// copied from operators::math::SplitFunctor +template +__global__ void SplitKernel(const T* input_data, const int in_row, + const int in_col, const int* out_cols, + int out_cols_size, T** outputs_data) { + int tid_x = blockIdx.x * blockDim.x + threadIdx.x; + int curr_segment = 0; + int curr_offset = out_cols[0]; + for (; tid_x < in_col; tid_x += blockDim.x * gridDim.x) { + int curr_col_offset = out_cols[curr_segment + 1]; + while (curr_col_offset <= tid_x) { + curr_offset = curr_col_offset; + ++curr_segment; + curr_col_offset = out_cols[curr_segment + 1]; + } + + int local_col = tid_x - curr_offset; + int segment_width = curr_col_offset - curr_offset; + T* output_ptr = outputs_data[curr_segment]; + if (output_ptr != nullptr) { + int tid_y = blockIdx.y * blockDim.y + threadIdx.y; + for (; tid_y < in_row; tid_y += blockDim.y * gridDim.y) + output_ptr[tid_y * segment_width + local_col] = + input_data[tid_y * in_col + tid_x]; + } + } +} + +template +__global__ void SplitKernel(const T* input_data, const int in_row, + const int in_col, const int fixed_out_col, + T** outputs_data) { + int tid_x = blockIdx.x * blockDim.x + threadIdx.x; + for (; tid_x < in_col; tid_x += blockDim.x * gridDim.x) { + int split = tid_x / fixed_out_col; + int in_offset = tid_x - split * fixed_out_col; + T* output_ptr = outputs_data[split]; + if (output_ptr != nullptr) { + int tid_y = blockIdx.y * blockDim.y + threadIdx.y; + for (; tid_y < in_row; tid_y += blockDim.y * gridDim.y) + output_ptr[tid_y * fixed_out_col + in_offset] = + input_data[tid_y * in_col + tid_x]; + } + } +} + +nvinfer1::Dims SplitPlugin::getOutputDimensions( + int index, const nvinfer1::Dims* input_dims, int num_inputs) { + PADDLE_ENFORCE_EQ(num_inputs, 1); + PADDLE_ENFORCE_LT(index, this->getNbOutputs()); + + nvinfer1::Dims output_dims = input_dims[0]; + output_dims.d[axis_] = output_length_.at(index); + return output_dims; +} + +int SplitPlugin::initialize() { + PADDLE_ENFORCE_LE(axis_, nvinfer1::Dims::MAX_DIMS); + // notice input dims is [C, H, W] + nvinfer1::Dims dims = this->getInputDims(0); + outer_rows_ = 1; + inner_cols_ = 1; + for (int i = 0; i < axis_; ++i) { + outer_rows_ *= dims.d[i]; + } + for (int i = axis_ + 1; i < dims.nbDims; ++i) { + inner_cols_ *= dims.d[i]; + } + same_shape_ = true; + std::vector segment_offsets(1, 0); + for (int i = 0; i < this->getNbOutputs(); ++i) { + if (output_length_[i] != output_length_[0]) { + same_shape_ = false; + } + segment_offsets.push_back(segment_offsets.back() + + output_length_[i] * inner_cols_); + } + inner_cols_ *= dims.d[axis_]; + d_segment_offsets_ = segment_offsets; + segment_offsets_ = std::move(segment_offsets); + d_output_ptrs_.resize(this->getNbOutputs(), nullptr); + return 0; +} + +template +inline void Split(cudaStream_t stream, const bool same_shape, + const int outer_rows, const int inner_cols, + const std::vector& segment_offsets, + const int* d_segment_offsets, const T* input, T** outputs) { + const int kThreadsPerBlock = 1024; + const int kMaxBlocks = 65535; + int block_cols = kThreadsPerBlock; + if (inner_cols < kThreadsPerBlock) { // block_cols is aligned by 32. + block_cols = ((inner_cols + 31) >> 5) << 5; + } + int block_rows = kThreadsPerBlock / block_cols; + dim3 block_size = dim3(block_cols, block_rows, 1); + + int grid_cols = + std::min((inner_cols + block_cols - 1) / block_cols, kMaxBlocks); + int grid_rows = + std::min(kMaxBlocks / grid_cols, std::max(outer_rows / block_rows, 1)); + dim3 grid_size = dim3(grid_cols, grid_rows, 1); + + if (same_shape) { + SplitKernel<<>>( + input, outer_rows, inner_cols, segment_offsets[1], outputs); + } else { + SplitKernel<<>>( + input, outer_rows, inner_cols, d_segment_offsets, + static_cast(segment_offsets.size()), outputs); + } +} + +int SplitPlugin::enqueue(int batchSize, const void* const* inputs, + void** outputs, void* workspace, cudaStream_t stream) { + float const* input_ptr = reinterpret_cast(inputs[0]); + if (((batchSize == 1 && axis_ == 0) || axis_ == -1) && + this->getNbOutputs() < 10) { + float** output_ptrs = reinterpret_cast(outputs); + int data_type_size = (this->getDataType() == nvinfer1::DataType::kFLOAT) + ? sizeof(float) + : sizeof(__half); + for (int i = 0; i < this->getNbOutputs(); ++i) { + PADDLE_ENFORCE( + cudaMemcpyAsync( + output_ptrs[i], input_ptr + segment_offsets_[i], + (segment_offsets_[i + 1] - segment_offsets_[i]) * data_type_size, + cudaMemcpyDeviceToDevice, stream) == cudaSuccess); + } + } else { + outer_rows_ *= batchSize; + const int* d_segment_offsets_ptr = + thrust::raw_pointer_cast(&d_segment_offsets_[0]); + float** output_ptrs = thrust::raw_pointer_cast(&d_output_ptrs_[0]); + PADDLE_ENFORCE(cudaMemcpyAsync(output_ptrs, outputs, + this->getNbOutputs() * sizeof(float*), + cudaMemcpyHostToDevice, + stream) == cudaSuccess); + if (this->getDataType() == nvinfer1::DataType::kFLOAT) { + Split(stream, same_shape_, outer_rows_, inner_cols_, segment_offsets_, + d_segment_offsets_ptr, input_ptr, output_ptrs); + } else { + Split(stream, same_shape_, outer_rows_, inner_cols_, segment_offsets_, + d_segment_offsets_ptr, (__half*)input_ptr, // NOLINT + (__half**)output_ptrs); // NOLINT + } + } + return cudaGetLastError() != cudaSuccess; +} + +} // namespace plugin +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/plugin/split_op_plugin.h b/paddle/fluid/inference/tensorrt/plugin/split_op_plugin.h new file mode 100644 index 0000000000000000000000000000000000000000..cbb72590567a35bee29387d4c00518b437913508 --- /dev/null +++ b/paddle/fluid/inference/tensorrt/plugin/split_op_plugin.h @@ -0,0 +1,79 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include "paddle/fluid/inference/tensorrt/plugin/trt_plugin.h" + +namespace paddle { +namespace inference { +namespace tensorrt { +namespace plugin { + +class SplitPlugin : public PluginTensorRT { + public: + SplitPlugin() {} + SplitPlugin(int axis, std::vector const &output_lengths) + : axis_(axis), same_shape_(true), output_length_(output_lengths) {} + + SplitPlugin(void const *serial_data, size_t serial_length) { + deserializeBase(serial_data, serial_length); + DeserializeValue(&serial_data, &serial_length, &axis_); + DeserializeValue(&serial_data, &serial_length, &output_length_); + } + + SplitPlugin *clone() const override { + return new SplitPlugin(axis_, output_length_); + } + + const char *getPluginType() const override { return "split_plugin"; } + int getNbOutputs() const override { return output_length_.size(); } + nvinfer1::Dims getOutputDimensions(int index, + const nvinfer1::Dims *input_dims, + int num_inputs) override; + + int initialize() override; + int enqueue(int batchSize, const void *const *inputs, void **outputs, + void *workspace, cudaStream_t stream) override; + + protected: + size_t getSerializationSize() override { + return SerializedSize(getPluginType()) + SerializedSize(axis_) + + SerializedSize(output_length_) + getBaseSerializationSize(); + } + + void serialize(void *buffer) override { + SerializeValue(&buffer, getPluginType()); + serializeBase(buffer); + SerializeValue(&buffer, axis_); + SerializeValue(&buffer, output_length_); + } + + int axis_; + int outer_rows_; + int inner_cols_; + bool same_shape_; + std::vector output_length_; + std::vector segment_offsets_; + thrust::device_vector d_segment_offsets_; + thrust::device_vector d_output_ptrs_; +}; + +} // namespace plugin +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/plugin/trt_plugin.cc b/paddle/fluid/inference/tensorrt/plugin/trt_plugin.cc new file mode 100644 index 0000000000000000000000000000000000000000..b0f4cff3ac184beeed2ebd3a4b7531d570c87075 --- /dev/null +++ b/paddle/fluid/inference/tensorrt/plugin/trt_plugin.cc @@ -0,0 +1,61 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/tensorrt/plugin/trt_plugin.h" + +namespace paddle { +namespace inference { +namespace tensorrt { +namespace plugin { + +void PluginTensorRT::serializeBase(void*& buffer) { + SerializeValue(&buffer, input_dims_); + SerializeValue(&buffer, max_batch_size_); + SerializeValue(&buffer, data_type_); + SerializeValue(&buffer, data_format_); +} + +void PluginTensorRT::deserializeBase(void const*& serial_data, + size_t& serial_length) { + DeserializeValue(&serial_data, &serial_length, &input_dims_); + DeserializeValue(&serial_data, &serial_length, &max_batch_size_); + DeserializeValue(&serial_data, &serial_length, &data_type_); + DeserializeValue(&serial_data, &serial_length, &data_format_); +} + +size_t PluginTensorRT::getBaseSerializationSize() { + return (SerializedSize(input_dims_) + SerializedSize(max_batch_size_) + + SerializedSize(data_type_) + SerializedSize(data_format_)); +} + +bool PluginTensorRT::supportsFormat(nvinfer1::DataType type, + nvinfer1::PluginFormat format) const { + return ((type == nvinfer1::DataType::kFLOAT) && + (format == nvinfer1::PluginFormat::kNCHW)); +} + +void PluginTensorRT::configureWithFormat( + const nvinfer1::Dims* input_dims, int num_inputs, + const nvinfer1::Dims* output_dims, int num_outputs, nvinfer1::DataType type, + nvinfer1::PluginFormat format, int max_batch_size) { + data_type_ = type; + data_format_ = format; + input_dims_.assign(input_dims, input_dims + num_inputs); + max_batch_size_ = max_batch_size; +} + +} // namespace plugin +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/plugin/trt_plugin.h b/paddle/fluid/inference/tensorrt/plugin/trt_plugin.h new file mode 100644 index 0000000000000000000000000000000000000000..3b737bd726ad09637f8530a114362d98d1dac1b0 --- /dev/null +++ b/paddle/fluid/inference/tensorrt/plugin/trt_plugin.h @@ -0,0 +1,118 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include + +#include "paddle/fluid/inference/tensorrt/plugin/trt_plugin_utils.h" +#include "paddle/fluid/platform/enforce.h" +#include "paddle/fluid/platform/profiler.h" + +DECLARE_bool(profile); + +namespace paddle { +namespace inference { +namespace tensorrt { +namespace plugin { + +class PluginTensorRT; + +typedef std::function + PluginDeserializeFunc; + +typedef std::function PluginConstructFunc; + +class PluginTensorRT : public nvinfer1::IPluginExt { + public: + PluginTensorRT() {} + // It was used for TensorRT deserialization. + // It should not be called by users. + PluginTensorRT(const void* serialized_data, size_t length) {} + virtual ~PluginTensorRT() {} + + nvinfer1::Dims const& getInputDims(int index) const { + return input_dims_.at(index); + } + size_t getMaxBatchSize() const { return max_batch_size_; } + nvinfer1::DataType getDataType() const { return data_type_; } + nvinfer1::PluginFormat getDataFormat() const { return data_format_; } + virtual const char* getPluginVersion() const { return "1"; } + + void AddInput(nvinfer1::ITensor* input) { inputs_.push_back(input); } + std::vector& GetInputs() { return inputs_; } + + virtual nvinfer1::IPluginExt* clone() const = 0; + virtual const char* getPluginType() const = 0; + + // Following functions are inherit from nvinfer1::IPluginExt + // Get the number of outputs from the layer + int getNbOutputs() const { return 1; } + // Get the dimension of an output tensor + virtual nvinfer1::Dims getOutputDimensions(int index, + const nvinfer1::Dims* input_dims, + int num_inputs) = 0; + // Find the workspace size required by the layer + size_t getWorkspaceSize(int) const override { return 0; } + + // Initialize the layer for execution. + // This is called when the engine is created. + int initialize() override { return 0; } + // Shutdown the layer. This is called when the engine is destroyed + void terminate() override {} + // Execute the layer + virtual int enqueue(int batch_size, const void* const* inputs, void** outputs, + void* workspace, cudaStream_t stream) = 0; + + // Find the size of the serialization buffer required + virtual size_t getSerializationSize() = 0; + // Serialize the layer config to buffer. + // TensorRT will call this func to serialize the configuration of TensorRT + // engine. It should not be called by users. + virtual void serialize(void* buffer) = 0; + + // Check format support. The default is FLOAT32 and NCHW. + bool supportsFormat(nvinfer1::DataType type, + nvinfer1::PluginFormat format) const override; + // Configure the layer + void configureWithFormat(const nvinfer1::Dims* input_dims, int num_inputs, + const nvinfer1::Dims* output_dims, int num_outputs, + nvinfer1::DataType type, + nvinfer1::PluginFormat format, + int max_batch_size) override; + + protected: + // Deserialize input_dims, max_batch_size, data_type, data_format + void deserializeBase(void const*& serial_data, // NOLINT + size_t& serial_length); // NOLINT + size_t getBaseSerializationSize(); + // Serialize input_dims, max_batch_size, data_type, data_format + void serializeBase(void*& buffer); // NOLINT + + std::vector input_dims_; + size_t max_batch_size_; + nvinfer1::DataType data_type_; + nvinfer1::PluginFormat data_format_; + + std::vector inputs_; +}; + +} // namespace plugin +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/plugin/trt_plugin_factory.cc b/paddle/fluid/inference/tensorrt/plugin/trt_plugin_factory.cc new file mode 100644 index 0000000000000000000000000000000000000000..3c20b6d1e725273dbfdc20c01fb01deea4e8d88e --- /dev/null +++ b/paddle/fluid/inference/tensorrt/plugin/trt_plugin_factory.cc @@ -0,0 +1,48 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/tensorrt/plugin/trt_plugin_factory.h" + +namespace paddle { +namespace inference { +namespace tensorrt { +namespace plugin { + +PluginTensorRT* PluginFactoryTensorRT::createPlugin(const char* layer_name, + const void* serial_data, + size_t serial_length) { + const char* plugin_type; + DeserializeValue(&serial_data, &serial_length, &plugin_type); + + PADDLE_ENFORCE(Has(plugin_type), + "trt plugin type %s does not exists, check it.", plugin_type); + auto plugin = plugin_registry_[plugin_type](serial_data, serial_length); + owned_plugins_.emplace_back(plugin); + + return plugin; +} + +bool PluginFactoryTensorRT::RegisterPlugin( + const std::string& op_name, PluginDeserializeFunc deserialize_func) { + if (Has(op_name)) return false; + auto ret = plugin_registry_.emplace(op_name, deserialize_func); + return ret.second; +} + +void PluginFactoryTensorRT::DestroyPlugins() { owned_plugins_.clear(); } + +} // namespace plugin +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/plugin/trt_plugin_factory.h b/paddle/fluid/inference/tensorrt/plugin/trt_plugin_factory.h new file mode 100644 index 0000000000000000000000000000000000000000..139c75595f9f44cacf7d14cda6b1c8eb4ef3c0ee --- /dev/null +++ b/paddle/fluid/inference/tensorrt/plugin/trt_plugin_factory.h @@ -0,0 +1,78 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include +#include +#include + +#include "paddle/fluid/inference/tensorrt/plugin/trt_plugin.h" +#include "paddle/fluid/inference/tensorrt/plugin/trt_plugin_utils.h" +#include "paddle/fluid/inference/utils/singleton.h" +#include "paddle/fluid/platform/enforce.h" + +namespace paddle { +namespace inference { +namespace tensorrt { +namespace plugin { + +class PluginFactoryTensorRT : public nvinfer1::IPluginFactory, + public DeleteHelper { + public: + // Deserialization method + PluginTensorRT* createPlugin(const char* layer_name, const void* serial_data, + size_t serial_length) override; + + bool RegisterPlugin(const std::string& op_name, + PluginDeserializeFunc deserialize_func); + + bool Has(const std::string& op_name) { + return plugin_registry_.find(op_name) != plugin_registry_.end(); + } + + void DestroyPlugins(); + + protected: + std::unordered_map plugin_registry_; + + std::list> owned_plugins_; +}; + +class TrtPluginRegistrar { + public: + TrtPluginRegistrar(const std::string& name, + PluginDeserializeFunc deserialize_func) { + inference::Singleton::Global().RegisterPlugin( + name, deserialize_func); + } +}; + +#define REGISTER_TRT_PLUGIN(name, deserialize_func) \ + REGISTER_TRT_PLUGIN_UNIQ(__COUNTER__, name, deserialize_func) + +#define REGISTER_TRT_PLUGIN_UNIQ(ctr, name, deserialize_func) \ + static paddle::inference::tensorrt::plugin::TrtPluginRegistrar \ + trt_plugin_registrar##ctr __attribute__((unused)) = \ + paddle::inference::tensorrt::plugin::TrtPluginRegistrar( \ + name, deserialize_func) + +} // namespace plugin +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/plugin/trt_plugin_utils.h b/paddle/fluid/inference/tensorrt/plugin/trt_plugin_utils.h new file mode 100644 index 0000000000000000000000000000000000000000..1cae4ccae4cc593785d9b3b0e87523e740eef4ff --- /dev/null +++ b/paddle/fluid/inference/tensorrt/plugin/trt_plugin_utils.h @@ -0,0 +1,134 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include +#include +#include "paddle/fluid/platform/enforce.h" + +namespace paddle { +namespace inference { +namespace tensorrt { +namespace plugin { + +// Some trt base classes lack of the destructor. +// We use a assisted class to fix this. +struct DeleteHelper { + protected: + virtual ~DeleteHelper() {} +}; + +template +inline void SerializeValue(void** buffer, T const& value); + +template +inline void DeserializeValue(void const** buffer, size_t* buffer_size, + T* value); + +namespace details { + +template +struct Serializer {}; + +template +struct Serializer::value || + std::is_enum::value || + std::is_pod::value>::type> { + static size_t SerializedSize(T const& value) { return sizeof(T); } + + static void Serialize(void** buffer, T const& value) { + std::memcpy(*buffer, &value, sizeof(T)); + reinterpret_cast(*buffer) += sizeof(T); + } + + static void Deserialize(void const** buffer, size_t* buffer_size, T* value) { + assert(*buffer_size >= sizeof(T)); + std::memcpy(value, *buffer, sizeof(T)); + reinterpret_cast(*buffer) += sizeof(T); + *buffer_size -= sizeof(T); + } +}; + +template <> +struct Serializer { + static size_t SerializedSize(const char* value) { return strlen(value) + 1; } + + static void Serialize(void** buffer, const char* value) { + std::strcpy(static_cast(*buffer), value); // NOLINT + reinterpret_cast(*buffer) += strlen(value) + 1; + } + + static void Deserialize(void const** buffer, size_t* buffer_size, + const char** value) { + *value = static_cast(*buffer); + size_t data_size = strnlen(*value, *buffer_size) + 1; + assert(*buffer_size >= data_size); + reinterpret_cast(*buffer) += data_size; + *buffer_size -= data_size; + } +}; + +template +struct Serializer, + typename std::enable_if::value || + std::is_enum::value || + std::is_pod::value>::type> { + static size_t SerializedSize(std::vector const& value) { + return sizeof(value.size()) + value.size() * sizeof(T); + } + + static void Serialize(void** buffer, std::vector const& value) { + SerializeValue(buffer, value.size()); + size_t nbyte = value.size() * sizeof(T); + std::memcpy(*buffer, value.data(), nbyte); + reinterpret_cast(*buffer) += nbyte; + } + + static void Deserialize(void const** buffer, size_t* buffer_size, + std::vector* value) { + size_t size; + DeserializeValue(buffer, buffer_size, &size); + value->resize(size); + size_t nbyte = value->size() * sizeof(T); + PADDLE_ENFORCE_GE(*buffer_size, nbyte); + std::memcpy(value->data(), *buffer, nbyte); + reinterpret_cast(*buffer) += nbyte; + *buffer_size -= nbyte; + } +}; + +} // namespace details + +template +inline size_t SerializedSize(T const& value) { + return details::Serializer::SerializedSize(value); +} + +template +inline void SerializeValue(void** buffer, T const& value) { + return details::Serializer::Serialize(buffer, value); +} + +template +inline void DeserializeValue(void const** buffer, size_t* buffer_size, + T* value) { + return details::Serializer::Deserialize(buffer, buffer_size, value); +} + +} // namespace plugin +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/test_engine.cc b/paddle/fluid/inference/tensorrt/test_engine.cc index da1f6535cb3b2476cd475797861d6d2bb6d88856..a03dd45db0f80487cb4c2e6b68f94944e8558ae4 100644 --- a/paddle/fluid/inference/tensorrt/test_engine.cc +++ b/paddle/fluid/inference/tensorrt/test_engine.cc @@ -17,6 +17,8 @@ limitations under the License. */ #include #include +#include "paddle/fluid/framework/tensor.h" +#include "paddle/fluid/framework/tensor_util.h" #include "paddle/fluid/inference/tensorrt/engine.h" #include "paddle/fluid/platform/enforce.h" @@ -27,19 +29,34 @@ namespace tensorrt { class TensorRTEngineTest : public ::testing::Test { protected: void SetUp() override { - // ASSERT_EQ(0, cudaStreamCreate(&stream_)); - engine_ = new TensorRTEngine(10, 1 << 10, &stream_); + ctx_ = new platform::CUDADeviceContext(platform::CUDAPlace(0)); + + engine_ = new TensorRTEngine(10, 1 << 10); engine_->InitNetwork(); } void TearDown() override { - delete engine_; - cudaStreamDestroy(stream_); + if (engine_) { + delete engine_; + engine_ = nullptr; + } + } + + void PrepareInputOutput(const std::vector &input, + std::vector output_shape) { + TensorFromVector(input, *ctx_, &input_); + output_.Resize(framework::make_ddim(output_shape)); + } + + void GetOutput(std::vector *output) { + TensorToVector(output_, *ctx_, output); } protected: - TensorRTEngine* engine_; - cudaStream_t stream_; + framework::Tensor input_; + framework::Tensor output_; + TensorRTEngine *engine_; + platform::CUDADeviceContext *ctx_; }; TEST_F(TensorRTEngineTest, add_layer) { @@ -48,12 +65,14 @@ TEST_F(TensorRTEngineTest, add_layer) { float raw_weight[size] = {2.}; // Weight in CPU memory. float raw_bias[size] = {3.}; + std::vector buffers(2); // TRT binded inputs + LOG(INFO) << "create weights"; TensorRTEngine::Weight weight(nvinfer1::DataType::kFLOAT, raw_weight, size); TensorRTEngine::Weight bias(nvinfer1::DataType::kFLOAT, raw_bias, size); - auto* x = engine_->DeclareInput("x", nvinfer1::DataType::kFLOAT, + auto *x = engine_->DeclareInput("x", nvinfer1::DataType::kFLOAT, nvinfer1::DimsCHW{1, 1, 1}); - auto* fc_layer = TRT_ENGINE_ADD_LAYER(engine_, FullyConnected, *x, size, + auto *fc_layer = TRT_ENGINE_ADD_LAYER(engine_, FullyConnected, *x, size, weight.get(), bias.get()); PADDLE_ENFORCE(fc_layer != nullptr); @@ -63,18 +82,24 @@ TEST_F(TensorRTEngineTest, add_layer) { ASSERT_EQ(engine_->engine()->getNbBindings(), 2); // fill in real data - float x_v = 1234; - engine_->SetInputFromCPU("x", reinterpret_cast(&x_v), - 1 * sizeof(float)); + std::vector x_v = {1234}; + std::vector y_cpu; + PrepareInputOutput(x_v, {1}); + + auto *x_v_gpu_data = input_.mutable_data(ctx_->GetPlace()); + auto *y_gpu_data = output_.mutable_data(ctx_->GetPlace()); + + buffers[0] = reinterpret_cast(x_v_gpu_data); + buffers[1] = reinterpret_cast(y_gpu_data); + LOG(INFO) << "to execute"; - engine_->Execute(1); + engine_->Execute(1, &buffers, ctx_->stream()); LOG(INFO) << "to get output"; - float y_cpu; - engine_->GetOutputInCPU("y", &y_cpu, 1 * sizeof(float)); + GetOutput(&y_cpu); LOG(INFO) << "to checkout output"; - ASSERT_EQ(y_cpu, x_v * 2 + 3); + ASSERT_EQ(y_cpu[0], x_v[0] * 2 + 3); } TEST_F(TensorRTEngineTest, add_layer_multi_dim) { @@ -83,12 +108,13 @@ TEST_F(TensorRTEngineTest, add_layer_multi_dim) { // instead of row-major, which is [[1.0, 1.1], [3.3, 4.4]] float raw_weight[4] = {1.0, 1.1, 3.3, 4.4}; float raw_bias[2] = {1.3, 2.4}; + std::vector buffers(2); // TRT binded inputs TensorRTEngine::Weight weight(nvinfer1::DataType::kFLOAT, raw_weight, 4); TensorRTEngine::Weight bias(nvinfer1::DataType::kFLOAT, raw_bias, 2); - auto* x = engine_->DeclareInput("x", nvinfer1::DataType::kFLOAT, + auto *x = engine_->DeclareInput("x", nvinfer1::DataType::kFLOAT, nvinfer1::DimsCHW{1, 2, 1}); - auto* fc_layer = TRT_ENGINE_ADD_LAYER(engine_, FullyConnected, *x, 2, + auto *fc_layer = TRT_ENGINE_ADD_LAYER(engine_, FullyConnected, *x, 2, weight.get(), bias.get()); PADDLE_ENFORCE(fc_layer != nullptr); @@ -96,19 +122,27 @@ TEST_F(TensorRTEngineTest, add_layer_multi_dim) { engine_->FreezeNetwork(); ASSERT_EQ(engine_->engine()->getNbBindings(), 2); - float x_v[2] = {1.0, 2.0}; - engine_->SetInputFromCPU("x", reinterpret_cast(&x_v), - 2 * sizeof(float)); - engine_->Execute(1); + // fill in real data + std::vector x_v = {1.0, 2.0}; + std::vector y_cpu; + PrepareInputOutput(x_v, {2}); + + auto *x_v_gpu_data = input_.mutable_data(ctx_->GetPlace()); + auto *y_gpu_data = output_.mutable_data(ctx_->GetPlace()); + + buffers[0] = reinterpret_cast(x_v_gpu_data); + buffers[1] = reinterpret_cast(y_gpu_data); + + engine_->Execute(1, &buffers, ctx_->stream()); LOG(INFO) << "to get output"; - float y_cpu[2] = {-1., -1.}; + GetOutput(&y_cpu); auto dims = engine_->GetITensor("y")->getDimensions(); ASSERT_EQ(dims.nbDims, 3); ASSERT_EQ(dims.d[0], 2); ASSERT_EQ(dims.d[1], 1); - engine_->GetOutputInCPU("y", &y_cpu[0], 2 * sizeof(float)); + ASSERT_EQ(y_cpu[0], 4.5); ASSERT_EQ(y_cpu[1], 14.5); } @@ -117,12 +151,13 @@ TEST_F(TensorRTEngineTest, test_conv2d) { // Weight in CPU memory. float raw_weight[9] = {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0}; float raw_bias[1] = {0}; + std::vector buffers(2); // TRT binded inputs TensorRTEngine::Weight weight(nvinfer1::DataType::kFLOAT, raw_weight, 9); TensorRTEngine::Weight bias(nvinfer1::DataType::kFLOAT, raw_bias, 1); - auto* x = engine_->DeclareInput("x", nvinfer1::DataType::kFLOAT, + auto *x = engine_->DeclareInput("x", nvinfer1::DataType::kFLOAT, nvinfer1::Dims3{1, 3, 3}); - auto* conv_layer = + auto *conv_layer = TRT_ENGINE_ADD_LAYER(engine_, Convolution, *x, 1, nvinfer1::DimsHW{3, 3}, weight.get(), bias.get()); PADDLE_ENFORCE(conv_layer != nullptr); @@ -133,28 +168,36 @@ TEST_F(TensorRTEngineTest, test_conv2d) { engine_->FreezeNetwork(); ASSERT_EQ(engine_->engine()->getNbBindings(), 2); - float x_v[18] = {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, - 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0}; - engine_->SetInputFromCPU("x", reinterpret_cast(&x_v), - 18 * sizeof(float)); - engine_->Execute(2); + // fill in real data + std::vector x_v = {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, + 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0}; + std::vector y_cpu; + PrepareInputOutput(x_v, {18}); + + auto *x_v_gpu_data = input_.mutable_data(ctx_->GetPlace()); + auto *y_gpu_data = output_.mutable_data(ctx_->GetPlace()); + + buffers[0] = reinterpret_cast(x_v_gpu_data); + buffers[1] = reinterpret_cast(y_gpu_data); + + engine_->Execute(2, &buffers, ctx_->stream()); LOG(INFO) << "to get output"; - float* y_cpu = new float[18]; - engine_->GetOutputInCPU("y", &y_cpu[0], 18 * sizeof(float)); + GetOutput(&y_cpu); + ASSERT_EQ(y_cpu[0], 4.0); ASSERT_EQ(y_cpu[1], 6.0); } TEST_F(TensorRTEngineTest, test_pool2d) { // Weight in CPU memory. - auto* x = engine_->DeclareInput("x", nvinfer1::DataType::kFLOAT, + auto *x = engine_->DeclareInput("x", nvinfer1::DataType::kFLOAT, nvinfer1::Dims3{1, 2, 2}); + std::vector buffers(2); // TRT binded inputs nvinfer1::PoolingType pool_t = nvinfer1::PoolingType::kAVERAGE; - auto* pool_layer = - TRT_ENGINE_ADD_LAYER(engine_, Pooling, *const_cast(x), - pool_t, nvinfer1::DimsHW{2, 2}); + auto *pool_layer = TRT_ENGINE_ADD_LAYER(engine_, Pooling, *x, pool_t, + nvinfer1::DimsHW{2, 2}); PADDLE_ENFORCE(pool_layer != nullptr); pool_layer->setStride(nvinfer1::DimsHW{1, 1}); @@ -164,14 +207,21 @@ TEST_F(TensorRTEngineTest, test_pool2d) { engine_->FreezeNetwork(); ASSERT_EQ(engine_->engine()->getNbBindings(), 2); - float x_v[8] = {1.0, 2.0, 5.0, 0.0, 2.0, 3.0, 5.0, 10.0}; - engine_->SetInputFromCPU("x", reinterpret_cast(&x_v), - 8 * sizeof(float)); - engine_->Execute(2); + // fill in real data + std::vector x_v = {1.0, 2.0, 5.0, 0.0, 2.0, 3.0, 5.0, 10.0}; + std::vector y_cpu; + PrepareInputOutput(x_v, {2}); + + auto *x_v_gpu_data = input_.mutable_data(ctx_->GetPlace()); + auto *y_gpu_data = output_.mutable_data(ctx_->GetPlace()); + + buffers[0] = reinterpret_cast(x_v_gpu_data); + buffers[1] = reinterpret_cast(y_gpu_data); + + engine_->Execute(2, &buffers, ctx_->stream()); LOG(INFO) << "to get output"; - float* y_cpu = new float[2]; - engine_->GetOutputInCPU("y", &y_cpu[0], 2 * sizeof(float)); + GetOutput(&y_cpu); ASSERT_EQ(y_cpu[0], 2.0); ASSERT_EQ(y_cpu[1], 5.0); diff --git a/paddle/fluid/inference/tensorrt/trt_int8_calibrator.cc b/paddle/fluid/inference/tensorrt/trt_int8_calibrator.cc new file mode 100644 index 0000000000000000000000000000000000000000..4a85c8b8fe6d70052edd3be59f98582c9b2e86b9 --- /dev/null +++ b/paddle/fluid/inference/tensorrt/trt_int8_calibrator.cc @@ -0,0 +1,147 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/tensorrt/trt_int8_calibrator.h" +#include "glog/logging.h" + +namespace paddle { +namespace inference { +namespace tensorrt { + +// set the batch size before constructing the thread to execute engine +int TRTInt8Calibrator::getBatchSize() const { return batch_size_; } + +TRTInt8Calibrator::TRTInt8Calibrator( + const std::unordered_map& buffers, int batch_size, + std::string engine_name, const platform::Place place) + : batch_size_(batch_size), engine_name_(engine_name) { + int i = 0; + VLOG(4) << "Init a new calibrator: " << engine_name_; + for (const auto it : buffers) { + framework::Tensor temp_tensor; + std::string input_name = it.first; + int data_size = it.second; + int num_ele = data_size / sizeof(int16_t); + framework::DDim data_shape = framework::make_ddim({num_ele}); + temp_tensor.Resize(data_shape); + data_tensors_.push_back(temp_tensor); + data_buffers_[input_name] = std::pair( + static_cast(temp_tensor.mutable_data(place)), num_ele); + i += 1; + } +} + +TRTInt8Calibrator::TRTInt8Calibrator(const std::string& calib_data) + : batch_size_(0), + calib_running_(false), + data_is_set_(false), + done_(true), + calibration_table_(calib_data) {} + +void TRTInt8Calibrator::waitAndSetDone() { + std::unique_lock lk(mut_); + while ((calib_running_ || data_is_set_) && !done_) cond_.wait(lk); + if (!done_) { + done_ = true; + cond_.notify_all(); + } +} + +// There might be more than one input for trt subgraph, +// So, we use a map to store input information. +bool TRTInt8Calibrator::setBatch( + const std::unordered_map& data) { + VLOG(3) << "set batch: " << engine_name_; + std::unique_lock lk(mut_); + // There is a producer and a consumer. The producer set the batch data and + // the consumer get the batch data. The size of the data pool is one. + // So, the producer has to wait for the consumer to finish processing before + // they can set the data. + while ((calib_running_ || data_is_set_) && (!done_)) cond_.wait(lk); + // The done_ is set to true using waitAndSetDone, When all calibration data + // are processed. + if (done_) return false; + + // Sets the batch. + for (const auto& it : data) { + auto dataptr = data_buffers_.find(it.first); + if (dataptr == data_buffers_.end()) { + LOG(FATAL) << "FATAL " << engine_name_ << " input name '" << it.first + << "' does not match with the buffer names"; + } + const auto& d = dataptr->second; + PADDLE_ENFORCE( + cudaMemcpy(d.first, it.second, d.second, cudaMemcpyDeviceToDevice), + "Fail to cudaMemcpy %s for %s", engine_name_, it.first); + } + + data_is_set_ = true; + cond_.notify_all(); + return true; +} + +bool TRTInt8Calibrator::getBatch(void** bindings, const char** names, + int num_bindings) { + VLOG(4) << "get batch: " << engine_name_; + std::unique_lock lk(mut_); + // The consumer has just finished processing a data. + // The producer can set the data again. + calib_running_ = false; + cond_.notify_all(); + + // As long as there is data in the pool, the consumer can get it. + while (!data_is_set_ && !done_) cond_.wait(lk); + if (done_) return false; + + // Gets the batch + for (int i = 0; i < num_bindings; i++) { + auto it = data_buffers_.find(names[i]); + if (it == data_buffers_.end()) { + LOG(FATAL) << "Calibration engine asked for unknown tensor name '" + << names[i] << "' at position " << i; + } + bindings[i] = it->second.first; + } + + data_is_set_ = false; + calib_running_ = true; + VLOG(4) << "get batch done: " << engine_name_; + return true; +} + +void TRTInt8Calibrator::setDone() { + std::unique_lock lk(mut_); + done_ = true; + cond_.notify_all(); +} + +const void* TRTInt8Calibrator::readCalibrationCache(size_t& length) { + if (calibration_table_.empty()) return nullptr; + length = calibration_table_.size(); + return calibration_table_.data(); +} + +void TRTInt8Calibrator::writeCalibrationCache(const void* ptr, + std::size_t length) { + calibration_table_ = std::string((const char*)ptr, length); + VLOG(4) << "Got calibration data for " << engine_name_ << " " << ptr + << " length=" << length; +} +TRTInt8Calibrator::~TRTInt8Calibrator() { + VLOG(4) << "Destroying calibrator for " << engine_name_; +} + +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/trt_int8_calibrator.h b/paddle/fluid/inference/tensorrt/trt_int8_calibrator.h new file mode 100644 index 0000000000000000000000000000000000000000..5815bc9a1464293e0a56f05e34183580eac96cea --- /dev/null +++ b/paddle/fluid/inference/tensorrt/trt_int8_calibrator.h @@ -0,0 +1,128 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include // NOLINT +#include +#include +#include +#include +#include "paddle/fluid/framework/tensor.h" +#include "paddle/fluid/inference/tensorrt/engine.h" +#include "paddle/fluid/platform/place.h" + +namespace paddle { +namespace inference { +namespace tensorrt { + +class TensorRTEngine; + +struct TRTInt8Calibrator : public nvinfer1::IInt8EntropyCalibrator { + public: + TRTInt8Calibrator(const std::unordered_map& buffers, + int batch_size, std::string engine_name, + const platform::Place place); + + explicit TRTInt8Calibrator(const std::string& calibration_data); + ~TRTInt8Calibrator(); + + int getBatchSize() const override; + + bool getBatch(void* bindings[], const char* names[], + int num_bindings) override; + + bool setBatch(const std::unordered_map& data); + void setDone(); + void waitAndSetDone(); + + const void* readCalibrationCache(std::size_t& length) override; + void writeCalibrationCache(const void* ptr, std::size_t length) override; + const std::string& getCalibrationTableAsString() { + return calibration_table_; + } + + private: + const int batch_size_; + + bool calib_running_{true}; + bool data_is_set_{false}; + bool done_{false}; + + std::mutex mut_; + std::condition_variable cond_; + + std::unordered_map> data_buffers_; + std::vector data_tensors_; + + std::string engine_name_; + std::string calibration_table_; +}; + +class TRTCalibratorEngine { + public: + TRTCalibratorEngine() {} + std::unique_ptr calib_; + std::unique_ptr thr_; + std::unique_ptr engine_; +}; +/* + * Manager to control the TensorRT Int8 calibration creation and deltetion. + */ +class TRTCalibratorEngineManager { + public: + bool Has() const { return res_.size() > 0; } + bool Has(const std::string& name) const { + if (res_.count(name) == 0) return false; + return res_.at(name).get() != nullptr; + } + + // Get Int8Calibrator via name + TRTCalibratorEngine* Get(const std::string& name) const { + return res_.at(name).get(); + } + + // Look up or create a calibrator. + TRTCalibratorEngine* LookupOrCreate(const std::string& engine_name) { + if (res_.count(engine_name) == 0) { + auto* p = new TRTCalibratorEngine; + res_[engine_name].reset(p); + } + return res_.at(engine_name).get(); + } + + // Create an Int8Calibrator + TRTCalibratorEngine* Create(const std::string& engine_name) { + auto* p = new TRTCalibratorEngine; + res_[engine_name].reset(p); + return p; + } + + void DeleteALL() { + for (auto& item : res_) { + item.second.reset(nullptr); + } + } + + private: + std::unordered_map> res_; +}; + +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tests/api/CMakeLists.txt b/paddle/fluid/inference/tests/api/CMakeLists.txt index c3dd1f433691e1c96e9f38ef7b595befad26408f..8ecb0310c9775393631b99681e13cbea7a5b781e 100644 --- a/paddle/fluid/inference/tests/api/CMakeLists.txt +++ b/paddle/fluid/inference/tests/api/CMakeLists.txt @@ -1,17 +1,13 @@ -set(INFERENCE_URL "http://paddle-inference-dist.cdn.bcebos.com") -set(INFERENCE_DEMO_INSTALL_DIR "${THIRD_PARTY_PATH}/inference_demo" CACHE STRING - "A path setting inference demo download directories.") -set(INFERENCE_EXTRA_DEPS paddle_inference_api paddle_fluid_api ir_pass_manager analysis_predictor) -function (inference_download install_dir url filename) - message(STATUS "Download inference test stuff from ${url}/${filename}") - execute_process(COMMAND bash -c "mkdir -p ${install_dir}") - execute_process(COMMAND bash -c "cd ${install_dir} && wget -q ${url}/${filename}") - message(STATUS "finish downloading ${filename}") -endfunction() +set(INFERENCE_EXTRA_DEPS paddle_inference_api paddle_fluid_api ir_pass_manager analysis_predictor benchmark) -function (inference_download_and_uncompress install_dir url filename) - inference_download(${install_dir} ${url} ${filename}) - execute_process(COMMAND bash -c "cd ${install_dir} && tar xzf ${filename}") +if(WITH_GPU AND TENSORRT_FOUND) + set(INFERENCE_EXTRA_DEPS ${INFERENCE_EXTRA_DEPS} analysis ${analysis_deps} ir_pass_manager analysis_predictor) +endif() + +function(download_model install_dir model_name) + if (NOT EXISTS ${install_dir}) + inference_download_and_uncompress(${install_dir} ${INFERENCE_URL} ${model_name}) + endif() endfunction() function(download_model_and_data install_dir model_name data_name) @@ -23,26 +19,72 @@ endfunction() function(inference_analysis_api_test target install_dir filename) inference_analysis_test(${target} SRCS ${filename} - EXTRA_DEPS ${INFERENCE_EXTRA_DEPS} + EXTRA_DEPS ${INFERENCE_EXTRA_DEPS} benchmark ARGS --infer_model=${install_dir}/model --infer_data=${install_dir}/data.txt) endfunction() -# RNN1 -if(NOT APPLE) +function(inference_analysis_api_int8_test target model_dir data_dir filename) + inference_analysis_test(${target} SRCS ${filename} + EXTRA_DEPS ${INFERENCE_EXTRA_DEPS} benchmark + ARGS --infer_model=${model_dir}/model + --infer_data=${data_dir}/data.bin + --warmup_batch_size=100 + --batch_size=50 + --iterations=2) +endfunction() + +function(inference_analysis_api_test_with_fake_data target install_dir filename model_name) + download_model(${install_dir} ${model_name}) + inference_analysis_test(${target} SRCS ${filename} + EXTRA_DEPS ${INFERENCE_EXTRA_DEPS} + ARGS --infer_model=${install_dir}/model) +endfunction() + +function(inference_analysis_api_test_with_refer_result target install_dir filename) + inference_analysis_test(${target} SRCS ${filename} + EXTRA_DEPS ${INFERENCE_EXTRA_DEPS} + ARGS --infer_model=${install_dir}/model --infer_data=${install_dir}/data.txt + --refer_result=${install_dir}/result.txt) +endfunction() + +if(NOT APPLE AND WITH_MKLML) + # RNN1 set(RNN1_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/rnn1") download_model_and_data(${RNN1_INSTALL_DIR} "rnn1%2Fmodel.tar.gz" "rnn1%2Fdata.txt.tar.gz") - inference_analysis_api_test(test_analyzer_rnn1 ${RNN1_INSTALL_DIR} analyzer_rnn1_tester.cc) + inference_analysis_api_test(test_analyzer_rnn1 ${RNN1_INSTALL_DIR} analyzer_rnn1_tester.cc SERIAL) + + # seq_pool1 + set(SEQ_POOL1_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/seq_pool") + download_model_and_data(${SEQ_POOL1_INSTALL_DIR} "seq_pool1_model_.tar.gz" "seq_pool1_data.txt.tar.gz") + inference_analysis_api_test(test_analyzer_seq_pool1 ${SEQ_POOL1_INSTALL_DIR} analyzer_seq_pool1_tester.cc SERIAL) else() - # TODO: fix this test on MACOS, the reason is that - # fusion_seqexpand_concat_fc_op is not supported on MACOS - message(WARNING "These tests has been disabled in OSX before being fixed: \n test_analyzer_rnn1") + # TODO: fix this test on MACOS and OPENBLAS, the reason is that + # fusion_seqexpand_concat_fc_op is not supported on MACOS and OPENBLAS + message(WARNING "These tests has been disabled in OSX or WITH_MKL=OFF before being fixed: \n test_analyzer_rnn1") + message(WARNING "These tests has been disabled in OSX or WITH_MKL=OFF before being fixed: \n test_analyzer_seq_pool1") endif() + # RNN2 set(RNN2_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/rnn2") download_model_and_data(${RNN2_INSTALL_DIR} "rnn2_model.tar.gz" "rnn2_data.txt.tar.gz") inference_analysis_api_test(test_analyzer_rnn2 ${RNN2_INSTALL_DIR} analyzer_rnn2_tester.cc) +# TODO(luotao, Superjom) Disable DAM test, temporarily fix +# https://github.com/PaddlePaddle/Paddle/issues/15032#issuecomment-455990914. +# After inference framework refactor, will reopen it. +# normal DAM +set(DAM_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/dam") +download_model_and_data(${DAM_INSTALL_DIR} "DAM_model.tar.gz" "DAM_data.txt.tar.gz") +#inference_analysis_api_test(test_analyzer_dam ${DAM_INSTALL_DIR} analyzer_dam_tester.cc EXTRA_DEPS legacy_allocator SERIAL) + +# small DAM +set(DAM_SMALL_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/small_dam") +download_model_and_data(${DAM_SMALL_INSTALL_DIR} "dam_small_model.tar.gz" "dam_small_data.txt.tar.gz") +inference_analysis_test(test_analyzer_small_dam SRCS analyzer_dam_tester.cc + EXTRA_DEPS ${INFERENCE_EXTRA_DEPS} + ARGS --infer_model=${DAM_SMALL_INSTALL_DIR}/model --infer_data=${DAM_SMALL_INSTALL_DIR}/data.txt --max_turn_num=1 SERIAL) + # chinese_ner set(CHINESE_NER_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/chinese_ner") download_model_and_data(${CHINESE_NER_INSTALL_DIR} "chinese_ner_model.tar.gz" "chinese_ner-data.txt.tar.gz") @@ -53,6 +95,16 @@ set(LAC_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/lac") download_model_and_data(${LAC_INSTALL_DIR} "lac_model.tar.gz" "lac_data.txt.tar.gz") inference_analysis_api_test(test_analyzer_lac ${LAC_INSTALL_DIR} analyzer_lac_tester.cc) +# MM DNN +set(MM_DNN_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/mm_dnn") +download_model_and_data(${MM_DNN_INSTALL_DIR} "MM_DNN_model.tar.gz" "MM_DNN_data.txt.tar.gz") +inference_analysis_api_test(test_analyzer_mm_dnn ${MM_DNN_INSTALL_DIR} analyzer_mm_dnn_tester.cc SERIAL) + +# Pyramid DNN +set(PYRAMID_DNN_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/pyramid_dnn") +download_model_and_data(${PYRAMID_DNN_INSTALL_DIR} "PyramidDNN_model.tar.gz" "PyramidDNN_data.txt.tar.gz") +inference_analysis_api_test(test_analyzer_pyramid_dnn ${PYRAMID_DNN_INSTALL_DIR} analyzer_pyramid_dnn_tester.cc) + # text_classification set(TEXT_CLASSIFICATION_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/text_classification") download_model_and_data(${TEXT_CLASSIFICATION_INSTALL_DIR} "text-classification-Senta.tar.gz" "text_classification_data.txt.tar.gz") @@ -63,48 +115,93 @@ set(SEQ_CONV1_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/seq_conv1") download_model_and_data(${SEQ_CONV1_INSTALL_DIR} "seq_conv1_model.tar.gz" "seq_conv1_data.txt.tar.gz") inference_analysis_api_test(test_analyzer_seq_conv1 ${SEQ_CONV1_INSTALL_DIR} analyzer_seq_conv1_tester.cc) +# transformer, the dataset only works on batch_size=8 now +set(TRANSFORMER_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/transformer") +download_model_and_data(${TRANSFORMER_INSTALL_DIR} "temp%2Ftransformer_model.tar.gz" "temp%2Ftransformer_data.txt.tar.gz") +inference_analysis_test(test_analyzer_transformer SRCS analyzer_transformer_tester.cc + EXTRA_DEPS ${INFERENCE_EXTRA_DEPS} + ARGS --infer_model=${TRANSFORMER_INSTALL_DIR}/model --infer_data=${TRANSFORMER_INSTALL_DIR}/data.txt --batch_size=8 SERIAL) + # ocr set(OCR_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/ocr") if (NOT EXISTS ${OCR_INSTALL_DIR}) - inference_download_and_uncompress(${OCR_INSTALL_DIR} "http://paddlemodels.cdn.bcebos.com/" "inference-vis-demos%2Focr.tar.gz") + inference_download_and_uncompress(${OCR_INSTALL_DIR} "http://paddlemodels.bj.bcebos.com/" "inference-vis-demos%2Focr.tar.gz") +endif() +inference_analysis_api_test_with_refer_result(test_analyzer_ocr ${OCR_INSTALL_DIR} analyzer_vis_tester.cc SERIAL) + +# mobilenet with transpose op +set(MOBILENET_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/mobilenet") +if (NOT EXISTS ${MOBILENET_INSTALL_DIR}) + inference_download_and_uncompress(${MOBILENET_INSTALL_DIR} "http://paddlemodels.bj.bcebos.com/" "inference-vis-demos%2Fmobilenet.tar.gz") endif() -inference_analysis_api_test(test_analyzer_ocr ${OCR_INSTALL_DIR} analyzer_vis_tester.cc) +inference_analysis_api_test_with_refer_result(test_analyzer_mobilenet_transpose ${MOBILENET_INSTALL_DIR} analyzer_vis_tester.cc SERIAL) + +# googlenet +inference_analysis_api_test_with_fake_data(test_analyzer_googlenet + "${INFERENCE_DEMO_INSTALL_DIR}/googlenet" analyzer_resnet50_tester.cc "googlenet.tar.gz" SERIAL) # resnet50 -set(RESNET50_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/resnet50") -if (NOT EXISTS ${RESNET50_INSTALL_DIR}) - inference_download_and_uncompress(${RESNET50_INSTALL_DIR} ${INFERENCE_URL} "resnet50_model.tar.gz") +inference_analysis_api_test_with_fake_data(test_analyzer_resnet50 + "${INFERENCE_DEMO_INSTALL_DIR}/resnet50" analyzer_resnet50_tester.cc "resnet50_model.tar.gz" SERIAL) + +# mobilenet with depthwise_conv op +inference_analysis_api_test_with_fake_data(test_analyzer_mobilenet_depthwise_conv + "${INFERENCE_DEMO_INSTALL_DIR}/mobilenet_depthwise_conv" analyzer_resnet50_tester.cc "mobilenet_model.tar.gz" SERIAL) + +# int8 image classification tests +if(WITH_MKLDNN) + set(INT8_DATA_DIR "${INFERENCE_DEMO_INSTALL_DIR}/int8v2") + if (NOT EXISTS ${INT8_DATA_DIR}) + inference_download_and_uncompress(${INT8_DATA_DIR} "${INFERENCE_URL}/int8" "imagenet_val_100_tail.tar.gz") + endif() + + #resnet50 int8 + set(INT8_RESNET50_MODEL_DIR "${INT8_DATA_DIR}/resnet50") + if (NOT EXISTS ${INT8_RESNET50_MODEL_DIR}) + inference_download_and_uncompress(${INT8_RESNET50_MODEL_DIR} "${INFERENCE_URL}/int8" "resnet50_int8_model.tar.gz" ) + endif() + inference_analysis_api_int8_test(test_analyzer_int8_resnet50 ${INT8_RESNET50_MODEL_DIR} ${INT8_DATA_DIR} analyzer_int8_image_classification_tester.cc SERIAL) + + #mobilenet int8 + set(INT8_MOBILENET_MODEL_DIR "${INT8_DATA_DIR}/mobilenet") + if (NOT EXISTS ${INT8_MOBILENET_MODEL_DIR}) + inference_download_and_uncompress(${INT8_MOBILENET_MODEL_DIR} "${INFERENCE_URL}/int8" "mobilenetv1_int8_model.tar.gz" ) + endif() + inference_analysis_api_int8_test(test_analyzer_int8_mobilenet ${INT8_MOBILENET_MODEL_DIR} ${INT8_DATA_DIR} analyzer_int8_image_classification_tester.cc SERIAL) endif() -inference_analysis_test(test_analyzer_resnet50 SRCS analyzer_resnet50_tester.cc - EXTRA_DEPS ${INFERENCE_EXTRA_DEPS} ARGS --infer_model=${RESNET50_INSTALL_DIR}/model) + +# bert, max_len=20, embedding_dim=128 +set(BERT_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/bert_emb128") +download_model_and_data(${BERT_INSTALL_DIR} "bert_emb128_model.tar.gz" "bert_data_len20.txt.tar.gz") +inference_analysis_api_test(test_analyzer_bert ${BERT_INSTALL_DIR} analyzer_bert_tester.cc SERIAL) # anakin if (WITH_ANAKIN AND WITH_MKL) # only needed in CI - # anakin rnn1 - set(ANAKIN_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/anakin") - set(ANAKIN_RNN1_INSTALL_DIR "${ANAKIN_INSTALL_DIR}/rnn1") - inference_download(${ANAKIN_RNN1_INSTALL_DIR} ${INFERENCE_URL} "anakin_test%2Fditu_rnn.anakin2.model.bin") - inference_download(${ANAKIN_RNN1_INSTALL_DIR} ${INFERENCE_URL} "anakin_test%2Fditu_rnn_data.txt") - cc_test(test_anakin_rnn1 SRCS anakin_rnn1_tester.cc - ARGS --model=${ANAKIN_RNN1_INSTALL_DIR}/anakin_test%2Fditu_rnn.anakin2.model.bin - --datapath=${ANAKIN_RNN1_INSTALL_DIR}/anakin_test%2Fditu_rnn_data.txt - DEPS inference_anakin_api_shared SERIAL) - # anakin mobilenet - if(WITH_GPU) - set(ANAKIN_MOBILENET_INSTALL_DIR "${ANAKIN_INSTALL_DIR}/mobilenet") - inference_download(${ANAKIN_MOBILENET_INSTALL_DIR} ${INFERENCE_URL} "mobilenet_v2.anakin.bin") - cc_test(test_anakin_mobilenet SRCS anakin_mobilenet_tester.cc - ARGS --model=${ANAKIN_MOBILENET_INSTALL_DIR}/mobilenet_v2.anakin.bin - DEPS inference_anakin_api_shared dynload_cuda SERIAL) - endif() + # anakin rnn1 + set(ANAKIN_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/anakin") + set(ANAKIN_RNN1_INSTALL_DIR "${ANAKIN_INSTALL_DIR}/rnn1") + inference_download(${ANAKIN_RNN1_INSTALL_DIR} ${INFERENCE_URL} "anakin_test%2Fditu_rnn.anakin2.model.bin") + inference_download(${ANAKIN_RNN1_INSTALL_DIR} ${INFERENCE_URL} "anakin_test%2Fditu_rnn_data.txt") + cc_test(test_anakin_rnn1 SRCS anakin_rnn1_tester.cc + ARGS --model=${ANAKIN_RNN1_INSTALL_DIR}/anakin_test%2Fditu_rnn.anakin2.model.bin + --datapath=${ANAKIN_RNN1_INSTALL_DIR}/anakin_test%2Fditu_rnn_data.txt + DEPS inference_anakin_api_shared SERIAL) + # anakin mobilenet + if(WITH_GPU) + set(ANAKIN_MOBILENET_INSTALL_DIR "${ANAKIN_INSTALL_DIR}/mobilenet") + inference_download(${ANAKIN_MOBILENET_INSTALL_DIR} ${INFERENCE_URL} "mobilenet_v2.anakin.bin") + cc_test(test_anakin_mobilenet SRCS anakin_mobilenet_tester.cc + ARGS --model=${ANAKIN_MOBILENET_INSTALL_DIR}/mobilenet_v2.anakin.bin + DEPS inference_anakin_api_shared dynload_cuda SERIAL) + endif() endif() if(WITH_GPU AND TENSORRT_FOUND) - set(TRT_MODEL_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/trt") - if (NOT EXISTS ${TRT_MODEL_INSTALL_DIR}) - inference_download_and_uncompress(${TRT_MODEL_INSTALL_DIR} ${INFERENCE_URL}/tensorrt_test "trt_test_models.tar.gz") - endif() - cc_test(test_trt_models SRCS trt_models_tester.cc - ARGS --dirname=${TRT_MODEL_INSTALL_DIR}/trt_test_models - DEPS paddle_inference_tensorrt_subgraph_engine) + set(TRT_MODEL_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/trt") + if (NOT EXISTS ${TRT_MODEL_INSTALL_DIR}) + inference_download_and_uncompress(${TRT_MODEL_INSTALL_DIR} ${INFERENCE_URL}/tensorrt_test "trt_test_models.tar.gz") + endif() + inference_analysis_test(test_trt_models SRCS trt_models_tester.cc + EXTRA_DEPS ${INFERENCE_EXTRA_DEPS} + ARGS --infer_model=${TRT_MODEL_INSTALL_DIR}/trt_test_models SERIAL) endif() diff --git a/paddle/fluid/inference/tests/api/anakin_rnn1_tester.cc b/paddle/fluid/inference/tests/api/anakin_rnn1_tester.cc index c4022225fd4526998af8526d0afb87e7a5be6336..da42688f29f044639d6d97ff09f932490d4c3d54 100644 --- a/paddle/fluid/inference/tests/api/anakin_rnn1_tester.cc +++ b/paddle/fluid/inference/tests/api/anakin_rnn1_tester.cc @@ -13,7 +13,6 @@ See the License for the specific language governing permissions and limitations under the License. */ #include -#include #include #include #include diff --git a/paddle/fluid/inference/tests/api/analyzer_bert_tester.cc b/paddle/fluid/inference/tests/api/analyzer_bert_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..9b2e74ec16eb3b6e98bfcc8cc546ed74a7966f33 --- /dev/null +++ b/paddle/fluid/inference/tests/api/analyzer_bert_tester.cc @@ -0,0 +1,210 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/tests/api/tester_helper.h" + +namespace paddle { +namespace inference { + +using paddle::PaddleTensor; + +template +void GetValueFromStream(std::stringstream *ss, T *t) { + (*ss) >> (*t); +} + +template <> +void GetValueFromStream(std::stringstream *ss, std::string *t) { + *t = ss->str(); +} + +// Split string to vector +template +void Split(const std::string &line, char sep, std::vector *v) { + std::stringstream ss; + T t; + for (auto c : line) { + if (c != sep) { + ss << c; + } else { + GetValueFromStream(&ss, &t); + v->push_back(std::move(t)); + ss.str({}); + ss.clear(); + } + } + + if (!ss.str().empty()) { + GetValueFromStream(&ss, &t); + v->push_back(std::move(t)); + ss.str({}); + ss.clear(); + } +} + +// Parse tensor from string +template +bool ParseTensor(const std::string &field, paddle::PaddleTensor *tensor) { + std::vector data; + Split(field, ':', &data); + if (data.size() < 2) return false; + + std::string shape_str = data[0]; + + std::vector shape; + Split(shape_str, ' ', &shape); + + std::string mat_str = data[1]; + + std::vector mat; + Split(mat_str, ' ', &mat); + + tensor->shape = shape; + auto size = + std::accumulate(shape.begin(), shape.end(), 1, std::multiplies()) * + sizeof(T); + tensor->data.Resize(size); + std::copy(mat.begin(), mat.end(), static_cast(tensor->data.data())); + tensor->dtype = GetPaddleDType(); + + return true; +} + +// Parse input tensors from string +bool ParseLine(const std::string &line, + std::vector *tensors) { + std::vector fields; + Split(line, ';', &fields); + + if (fields.size() < 5) return false; + + tensors->clear(); + tensors->reserve(5); + + int i = 0; + // src_id + paddle::PaddleTensor src_id; + ParseTensor(fields[i++], &src_id); + tensors->push_back(src_id); + + // pos_id + paddle::PaddleTensor pos_id; + ParseTensor(fields[i++], &pos_id); + tensors->push_back(pos_id); + + // segment_id + paddle::PaddleTensor segment_id; + ParseTensor(fields[i++], &segment_id); + tensors->push_back(segment_id); + + // self_attention_bias + paddle::PaddleTensor self_attention_bias; + ParseTensor(fields[i++], &self_attention_bias); + tensors->push_back(self_attention_bias); + + // next_segment_index + paddle::PaddleTensor next_segment_index; + ParseTensor(fields[i++], &next_segment_index); + tensors->push_back(next_segment_index); + + return true; +} + +bool LoadInputData(std::vector> *inputs) { + if (FLAGS_infer_data.empty()) { + LOG(ERROR) << "please set input data path"; + return false; + } + + std::ifstream fin(FLAGS_infer_data); + std::string line; + int sample = 0; + + // The unit-test dataset only have 10 samples, each sample have 5 feeds. + while (std::getline(fin, line)) { + std::vector feed_data; + ParseLine(line, &feed_data); + inputs->push_back(std::move(feed_data)); + sample++; + if (!FLAGS_test_all_data && sample == FLAGS_batch_size) break; + } + LOG(INFO) << "number of samples: " << sample; + + return true; +} + +void SetConfig(AnalysisConfig *config) { config->SetModel(FLAGS_infer_model); } + +void profile(bool use_mkldnn = false) { + AnalysisConfig config; + SetConfig(&config); + + if (use_mkldnn) { + config.EnableMKLDNN(); + } + + std::vector> outputs; + std::vector> inputs; + LoadInputData(&inputs); + TestPrediction(reinterpret_cast(&config), + inputs, &outputs, FLAGS_num_threads); +} + +TEST(Analyzer_bert, profile) { profile(); } +#ifdef PADDLE_WITH_MKLDNN +TEST(Analyzer_bert, profile_mkldnn) { profile(true); } +#endif + +// Check the fuse status +TEST(Analyzer_bert, fuse_statis) { + AnalysisConfig cfg; + SetConfig(&cfg); + int num_ops; + auto predictor = CreatePaddlePredictor(cfg); + auto fuse_statis = GetFuseStatis( + static_cast(predictor.get()), &num_ops); + LOG(INFO) << "num_ops: " << num_ops; +} + +// Compare result of NativeConfig and AnalysisConfig +void compare(bool use_mkldnn = false) { + AnalysisConfig cfg; + SetConfig(&cfg); + if (use_mkldnn) { + cfg.EnableMKLDNN(); + } + + std::vector> inputs; + LoadInputData(&inputs); + CompareNativeAndAnalysis( + reinterpret_cast(&cfg), inputs); +} + +TEST(Analyzer_bert, compare) { compare(); } +#ifdef PADDLE_WITH_MKLDNN +TEST(Analyzer_bert, compare_mkldnn) { compare(true /* use_mkldnn */); } +#endif + +// Compare Deterministic result +TEST(Analyzer_bert, compare_determine) { + AnalysisConfig cfg; + SetConfig(&cfg); + + std::vector> inputs; + LoadInputData(&inputs); + CompareDeterministic(reinterpret_cast(&cfg), + inputs); +} +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tests/api/analyzer_dam_tester.cc b/paddle/fluid/inference/tests/api/analyzer_dam_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..e10d239a5d1b30e089a110c6155520e3b035860a --- /dev/null +++ b/paddle/fluid/inference/tests/api/analyzer_dam_tester.cc @@ -0,0 +1,319 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/tests/api/tester_helper.h" + +DEFINE_int32(max_turn_num, 9, + "The max turn number: 1 for the small and 9 for the normal."); + +namespace paddle { +namespace inference { + +constexpr int32_t kMaxTurnLen = 50; + +static std::vector result_data; + +struct DataRecord { + std::vector> *turns; + std::vector> *turns_mask; + std::vector> response; // response data : 1 + std::vector> response_mask; // response mask data : 1 + size_t batch_iter{0}; + size_t batch_size{1}; + size_t num_samples; // total number of samples + + DataRecord() { + turns = new std::vector>[FLAGS_max_turn_num]; // turns data : FLAGS_max_turn_num + turns_mask = new std::vector>[FLAGS_max_turn_num]; // turns mask data : FLAGS_max_turn_num + } + + explicit DataRecord(const std::string &path, int batch_size = 1) + : DataRecord() { + this->batch_size = batch_size; + Load(path); + } + + ~DataRecord() { + delete[] turns; + delete[] turns_mask; + } + + DataRecord NextBatch() { + DataRecord data; + size_t batch_end = batch_iter + batch_size; + // NOTE skip the final batch, if no enough data is provided. + if (batch_end <= response.size()) { + for (int i = 0; i < FLAGS_max_turn_num; ++i) { + data.turns[i].assign(turns[i].begin() + batch_iter, + turns[i].begin() + batch_end); + } + for (int i = 0; i < FLAGS_max_turn_num; ++i) { + data.turns_mask[i].assign(turns_mask[i].begin() + batch_iter, + turns_mask[i].begin() + batch_end); + } + data.response.assign(response.begin() + batch_iter, + response.begin() + batch_end); + data.response_mask.assign(response_mask.begin() + batch_iter, + response_mask.begin() + batch_end); + CHECK(!data.response.empty()); + CHECK(!data.response_mask.empty()); + CHECK_EQ(data.response.size(), data.response_mask.size()); + } + batch_iter += batch_size; + return data; + } + + void Load(const std::string &path) { + std::ifstream file(path); + std::string line; + size_t num_lines = 0; + result_data.clear(); + while (std::getline(file, line)) { + num_lines++; + std::vector data; + split(line, ',', &data); + CHECK_EQ(data.size(), (size_t)(2 * FLAGS_max_turn_num + 3)); + // load turn data + std::vector turns_tmp[FLAGS_max_turn_num]; + for (int i = 0; i < FLAGS_max_turn_num; ++i) { + split_to_int64(data[i], ' ', &turns_tmp[i]); + turns[i].push_back(std::move(turns_tmp[i])); + } + // load turn_mask data + std::vector turns_mask_tmp[FLAGS_max_turn_num]; + for (int i = 0; i < FLAGS_max_turn_num; ++i) { + split_to_float(data[FLAGS_max_turn_num + i], ' ', &turns_mask_tmp[i]); + turns_mask[i].push_back(std::move(turns_mask_tmp[i])); + } + // load response data + std::vector response_tmp; + split_to_int64(data[2 * FLAGS_max_turn_num], ' ', &response_tmp); + response.push_back(std::move(response_tmp)); + // load response_mask data + std::vector response_mask_tmp; + split_to_float(data[2 * FLAGS_max_turn_num + 1], ' ', &response_mask_tmp); + response_mask.push_back(std::move(response_mask_tmp)); + // load result data + float result_tmp; + result_tmp = std::stof(data[2 * FLAGS_max_turn_num + 2]); + result_data.push_back(result_tmp); + } + num_samples = num_lines; + } +}; + +void PrepareInputs(std::vector *input_slots, DataRecord *data, + int batch_size) { + PaddleTensor turns_tensor[FLAGS_max_turn_num]; + PaddleTensor turns_mask_tensor[FLAGS_max_turn_num]; + PaddleTensor response_tensor; + PaddleTensor response_mask_tensor; + std::string turn_pre = "turn_"; + std::string turn_mask_pre = "turn_mask_"; + + auto one_batch = data->NextBatch(); + PADDLE_ENFORCE(!one_batch.response.empty()); + int size = one_batch.response[0].size(); + CHECK_EQ(size, kMaxTurnLen); + // turn tensor assignment + for (int i = 0; i < FLAGS_max_turn_num; ++i) { + turns_tensor[i].name = turn_pre + std::to_string(i); + turns_tensor[i].shape.assign({batch_size, size, 1}); + turns_tensor[i].dtype = PaddleDType::INT64; + TensorAssignData(&turns_tensor[i], one_batch.turns[i]); + } + // turn mask tensor assignment + for (int i = 0; i < FLAGS_max_turn_num; ++i) { + turns_mask_tensor[i].name = turn_mask_pre + std::to_string(i); + turns_mask_tensor[i].shape.assign({batch_size, size, 1}); + turns_mask_tensor[i].dtype = PaddleDType::FLOAT32; + TensorAssignData(&turns_mask_tensor[i], one_batch.turns_mask[i]); + } + // response tensor assignment + response_tensor.name = "response"; + response_tensor.shape.assign({batch_size, size, 1}); + response_tensor.dtype = PaddleDType::INT64; + TensorAssignData(&response_tensor, one_batch.response); + // response mask tensor assignment + response_mask_tensor.name = "response_mask"; + response_mask_tensor.shape.assign({batch_size, size, 1}); + response_mask_tensor.dtype = PaddleDType::FLOAT32; + TensorAssignData(&response_mask_tensor, one_batch.response_mask); + + // Set inputs. + for (int i = 0; i < FLAGS_max_turn_num; ++i) { + input_slots->push_back(std::move(turns_tensor[i])); + } + for (int i = 0; i < FLAGS_max_turn_num; ++i) { + input_slots->push_back(std::move(turns_mask_tensor[i])); + } + input_slots->push_back(std::move(response_tensor)); + input_slots->push_back(std::move(response_mask_tensor)); +} + +void SetConfig(AnalysisConfig *cfg) { + cfg->SetModel(FLAGS_infer_model + "/__model__", FLAGS_infer_model + "/param"); + cfg->SwitchSpecifyInputNames(); + cfg->SwitchIrOptim(true); +} + +void SetInput(std::vector> *inputs) { + DataRecord data(FLAGS_infer_data, FLAGS_batch_size); + std::vector input_slots; + int test_batch_num = + FLAGS_test_all_data ? data.num_samples / FLAGS_batch_size : 1; + LOG(INFO) << "The number of samples to be test: " + << test_batch_num * FLAGS_batch_size; + for (int bid = 0; bid < test_batch_num; ++bid) { + input_slots.clear(); + PrepareInputs(&input_slots, &data, FLAGS_batch_size); + (*inputs).emplace_back(input_slots); + } +} + +// Easy for profiling independently. +void profile(bool use_mkldnn = false) { + AnalysisConfig cfg; + SetConfig(&cfg); + + if (use_mkldnn) { + cfg.EnableMKLDNN(); + // Enable all the mkldnn supported ops except conv3d in dam + std::unordered_set op_list = {"softmax", "elementwise_add", + "relu"}; + cfg.SetMKLDNNOp(op_list); + } + + std::vector> outputs; + std::vector> input_slots_all; + SetInput(&input_slots_all); + + TestPrediction(reinterpret_cast(&cfg), + input_slots_all, &outputs, FLAGS_num_threads); + + if (FLAGS_num_threads == 1 && !FLAGS_test_all_data) { + PADDLE_ENFORCE_GT(outputs.size(), 0); + auto output = outputs.back(); + PADDLE_ENFORCE_GT(output.size(), 0); + size_t size = GetSize(output[0]); + PADDLE_ENFORCE_GT(size, 0); + float *result = static_cast(output[0].data.data()); + for (size_t i = 0; i < size; i++) { + EXPECT_NEAR(result[i], result_data[i], 1e-3); + } + } +} + +TEST(Analyzer_dam, profile) { profile(); } +#ifdef PADDLE_WITH_MKLDNN +TEST(Analyzer_dam, profile_mkldnn) { profile(true /* use_mkldnn */); } +#endif + +// Check the fuse status +TEST(Analyzer_dam, fuse_statis) { + AnalysisConfig cfg; + SetConfig(&cfg); + + int num_ops; + auto predictor = CreatePaddlePredictor(cfg); + auto fuse_statis = GetFuseStatis( + static_cast(predictor.get()), &num_ops); + ASSERT_TRUE(fuse_statis.count("fc_fuse")); +} + +// Compare result of NativeConfig and AnalysisConfig +void compare(bool use_mkldnn = false) { + AnalysisConfig cfg; + SetConfig(&cfg); + if (use_mkldnn) { + cfg.EnableMKLDNN(); + // Enable all the mkldnn supported ops except conv3d in dam + std::unordered_set op_list = {"softmax", "elementwise_add", + "relu"}; + cfg.SetMKLDNNOp(op_list); + } + + std::vector> input_slots_all; + SetInput(&input_slots_all); + + CompareNativeAndAnalysis( + reinterpret_cast(&cfg), input_slots_all); +} + +// Compare result of NativeConfig and AnalysisConfig with memory optimization. +TEST(Analyzer_dam, compare_with_static_memory_optim) { + // The small dam will core in CI, but works in local. + if (FLAGS_max_turn_num == 9) { + AnalysisConfig cfg, cfg1; + DataRecord data(FLAGS_infer_data, FLAGS_batch_size); + + std::vector> input_slots_all; + SetInput(&input_slots_all); + // Run the first time to force to update memory cache + SetConfig(&cfg); + cfg.EnableMemoryOptim(true, true /*force update*/); + + CompareNativeAndAnalysis( + reinterpret_cast(&cfg), + input_slots_all); + + // Run second time to use the memory cache and perform memory optimization. + SetConfig(&cfg1); + cfg1.EnableMemoryOptim(true, false /*do not force update*/); + + CompareNativeAndAnalysis( + reinterpret_cast(&cfg1), + input_slots_all); + } +} + +TEST(Analyzer_dam, compare_with_dynamic_memory_optim) { + // The small dam will core in CI, but works in local. + if (FLAGS_max_turn_num == 9) { + AnalysisConfig cfg, cfg1; + DataRecord data(FLAGS_infer_data, FLAGS_batch_size); + + std::vector> input_slots_all; + SetInput(&input_slots_all); + // Run the first time to force to update memory cache + SetConfig(&cfg); + cfg.EnableMemoryOptim(); + + CompareNativeAndAnalysis( + reinterpret_cast(&cfg), + input_slots_all); + } +} + +TEST(Analyzer_dam, compare) { compare(); } + +#ifdef PADDLE_WITH_MKLDNN +TEST(Analyzer_dam, compare_mkldnn) { compare(true /* use_mkldnn */); } +#endif + +// Compare Deterministic result +TEST(Analyzer_dam, compare_determine) { + AnalysisConfig cfg; + SetConfig(&cfg); + + std::vector> input_slots_all; + SetInput(&input_slots_all); + CompareDeterministic(reinterpret_cast(&cfg), + input_slots_all); +} + +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tests/api/analyzer_int8_image_classification_tester.cc b/paddle/fluid/inference/tests/api/analyzer_int8_image_classification_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..fbf67d933786e3ee2baab7a20911da2837cdce4d --- /dev/null +++ b/paddle/fluid/inference/tests/api/analyzer_int8_image_classification_tester.cc @@ -0,0 +1,167 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include "paddle/fluid/inference/api/paddle_analysis_config.h" +#include "paddle/fluid/inference/tests/api/tester_helper.h" + +namespace paddle { +namespace inference { +namespace analysis { + +void SetConfig(AnalysisConfig *cfg) { + cfg->SetModel(FLAGS_infer_model); + cfg->DisableGpu(); + cfg->SwitchIrOptim(); + cfg->SwitchSpecifyInputNames(); + cfg->SetCpuMathLibraryNumThreads(FLAGS_paddle_num_threads); + cfg->EnableMKLDNN(); +} + +template +class TensorReader { + public: + TensorReader(std::ifstream &file, size_t beginning_offset, + std::vector shape, std::string name) + : file_(file), position(beginning_offset), shape_(shape), name_(name) { + numel = std::accumulate(shape_.begin(), shape_.end(), size_t{1}, + std::multiplies()); + } + + PaddleTensor NextBatch() { + PaddleTensor tensor; + tensor.name = name_; + tensor.shape = shape_; + tensor.dtype = GetPaddleDType(); + tensor.data.Resize(numel * sizeof(T)); + + file_.seekg(position); + file_.read(static_cast(tensor.data.data()), numel * sizeof(T)); + position = file_.tellg(); + + if (file_.eof()) LOG(ERROR) << name_ << ": reached end of stream"; + if (file_.fail()) + throw std::runtime_error(name_ + ": failed reading file."); + + return tensor; + } + + protected: + std::ifstream &file_; + size_t position; + std::vector shape_; + std::string name_; + size_t numel; +}; + +std::shared_ptr> GetWarmupData( + const std::vector> &test_data, + int num_images = FLAGS_warmup_batch_size) { + int test_data_batch_size = test_data[0][0].shape[0]; + auto iterations_max = test_data.size(); + PADDLE_ENFORCE( + static_cast(num_images) <= iterations_max * test_data_batch_size, + "The requested quantization warmup data size " + + std::to_string(num_images) + " is bigger than all test data size."); + + PaddleTensor images; + images.name = "image"; + images.shape = {num_images, 3, 224, 224}; + images.dtype = PaddleDType::FLOAT32; + images.data.Resize(sizeof(float) * num_images * 3 * 224 * 224); + + PaddleTensor labels; + labels.name = "label"; + labels.shape = {num_images, 1}; + labels.dtype = PaddleDType::INT64; + labels.data.Resize(sizeof(int64_t) * num_images); + + for (int i = 0; i < num_images; i++) { + auto batch = i / test_data_batch_size; + auto element_in_batch = i % test_data_batch_size; + std::copy_n(static_cast(test_data[batch][0].data.data()) + + element_in_batch * 3 * 224 * 224, + 3 * 224 * 224, + static_cast(images.data.data()) + i * 3 * 224 * 224); + + std::copy_n(static_cast(test_data[batch][1].data.data()) + + element_in_batch, + 1, static_cast(labels.data.data()) + i); + } + + auto warmup_data = std::make_shared>(2); + (*warmup_data)[0] = std::move(images); + (*warmup_data)[1] = std::move(labels); + return warmup_data; +} + +void SetInput(std::vector> *inputs, + int32_t batch_size = FLAGS_batch_size) { + std::ifstream file(FLAGS_infer_data, std::ios::binary); + if (!file) { + FAIL() << "Couldn't open file: " << FLAGS_infer_data; + } + + int64_t total_images{0}; + file.read(reinterpret_cast(&total_images), sizeof(total_images)); + LOG(INFO) << "Total images in file: " << total_images; + + std::vector image_batch_shape{batch_size, 3, 224, 224}; + std::vector label_batch_shape{batch_size, 1}; + auto images_offset_in_file = static_cast(file.tellg()); + auto labels_offset_in_file = + images_offset_in_file + sizeof(float) * total_images * 3 * 224 * 224; + + TensorReader image_reader(file, images_offset_in_file, + image_batch_shape, "image"); + TensorReader label_reader(file, labels_offset_in_file, + label_batch_shape, "label"); + + auto iterations_max = total_images / batch_size; + for (auto i = 0; i < iterations_max; i++) { + auto images = image_reader.NextBatch(); + auto labels = label_reader.NextBatch(); + inputs->emplace_back( + std::vector{std::move(images), std::move(labels)}); + } +} + +TEST(Analyzer_int8_resnet50, quantization) { + AnalysisConfig cfg; + SetConfig(&cfg); + + AnalysisConfig q_cfg; + SetConfig(&q_cfg); + + // read data from file and prepare batches with test data + std::vector> input_slots_all; + SetInput(&input_slots_all); + + // prepare warmup batch from input data read earlier + // warmup batch size can be different than batch size + std::shared_ptr> warmup_data = + GetWarmupData(input_slots_all); + + // configure quantizer + q_cfg.EnableMkldnnQuantizer(); + q_cfg.mkldnn_quantizer_config()->SetWarmupData(warmup_data); + q_cfg.mkldnn_quantizer_config()->SetWarmupBatchSize(FLAGS_warmup_batch_size); + + CompareQuantizedAndAnalysis(&cfg, &q_cfg, input_slots_all); +} + +} // namespace analysis +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tests/api/analyzer_lac_tester.cc b/paddle/fluid/inference/tests/api/analyzer_lac_tester.cc index 5fb551810fd4d1c56547a8aa581cb6c4587df031..142905dcd8d9964d93d0c5f7444823eef2b84900 100644 --- a/paddle/fluid/inference/tests/api/analyzer_lac_tester.cc +++ b/paddle/fluid/inference/tests/api/analyzer_lac_tester.cc @@ -18,8 +18,6 @@ namespace paddle { namespace inference { namespace analysis { -using contrib::AnalysisConfig; - struct DataRecord { std::vector data; std::vector lod; @@ -98,20 +96,17 @@ void GetOneBatch(std::vector *input_slots, DataRecord *data, auto one_batch = data->NextBatch(); PaddleTensor input_tensor; input_tensor.name = "word"; - input_tensor.shape.assign({static_cast(one_batch.data.size()), 1}); - input_tensor.lod.assign({one_batch.lod}); input_tensor.dtype = PaddleDType::INT64; - TensorAssignData(&input_tensor, {one_batch.data}); + TensorAssignData(&input_tensor, {one_batch.data}, one_batch.lod); PADDLE_ENFORCE_EQ(batch_size, static_cast(one_batch.lod.size() - 1)); input_slots->assign({input_tensor}); } void SetConfig(AnalysisConfig *cfg) { - cfg->model_dir = FLAGS_infer_model; - cfg->use_gpu = false; - cfg->device = 0; - cfg->specify_input_name = true; - cfg->enable_ir_optim = true; + cfg->SetModel(FLAGS_infer_model); + cfg->DisableGpu(); + cfg->SwitchSpecifyInputNames(); + cfg->SwitchIrOptim(); } void SetInput(std::vector> *inputs) { @@ -129,11 +124,12 @@ void SetInput(std::vector> *inputs) { TEST(Analyzer_LAC, profile) { AnalysisConfig cfg; SetConfig(&cfg); - std::vector outputs; + std::vector> outputs; std::vector> input_slots_all; SetInput(&input_slots_all); - TestPrediction(cfg, input_slots_all, &outputs, FLAGS_num_threads); + TestPrediction(reinterpret_cast(&cfg), + input_slots_all, &outputs, FLAGS_num_threads); if (FLAGS_num_threads == 1 && !FLAGS_test_all_data) { // the first inference result @@ -141,11 +137,13 @@ TEST(Analyzer_LAC, profile) { 24, 25, 25, 25, 38, 30, 31, 14, 15, 44, 24, 25, 25, 25, 25, 25, 44, 24, 25, 25, 25, 36, 42, 43, 44, 14, 15, 44, 14, 15, 44, 14, 15, 44, 38, 39, 14, 15, 44, 22, 23, 23, 23, 23, 23, 23, 23}; - PADDLE_ENFORCE_EQ(outputs.size(), 1UL); - size_t size = GetSize(outputs[0]); + PADDLE_ENFORCE_GT(outputs.size(), 0); + auto output = outputs.back(); + PADDLE_ENFORCE_EQ(output.size(), 1UL); + size_t size = GetSize(output[0]); size_t batch1_size = sizeof(lac_ref_data) / sizeof(int64_t); PADDLE_ENFORCE_GE(size, batch1_size); - int64_t *pdata = static_cast(outputs[0].data.data()); + int64_t *pdata = static_cast(output[0].data.data()); for (size_t i = 0; i < batch1_size; ++i) { EXPECT_EQ(pdata[i], lac_ref_data[i]); } @@ -175,7 +173,19 @@ TEST(Analyzer_LAC, compare) { std::vector> input_slots_all; SetInput(&input_slots_all); - CompareNativeAndAnalysis(cfg, input_slots_all); + CompareNativeAndAnalysis( + reinterpret_cast(&cfg), input_slots_all); +} + +// Compare Deterministic result +TEST(Analyzer_LAC, compare_determine) { + AnalysisConfig cfg; + SetConfig(&cfg); + + std::vector> input_slots_all; + SetInput(&input_slots_all); + CompareDeterministic(reinterpret_cast(&cfg), + input_slots_all); } } // namespace analysis diff --git a/paddle/fluid/inference/tests/api/analyzer_mm_dnn_tester.cc b/paddle/fluid/inference/tests/api/analyzer_mm_dnn_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..2eb347a44b394a55706d5aa88bee7fe1fcc7838e --- /dev/null +++ b/paddle/fluid/inference/tests/api/analyzer_mm_dnn_tester.cc @@ -0,0 +1,174 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/tests/api/tester_helper.h" + +namespace paddle { +namespace inference { + +struct DataRecord { + std::vector> query, title; + std::vector lod1, lod2; + size_t batch_iter{0}, batch_size{1}, num_samples; // total number of samples + DataRecord() = default; + explicit DataRecord(const std::string &path, int batch_size = 1) + : batch_size(batch_size) { + Load(path); + } + DataRecord NextBatch() { + DataRecord data; + size_t batch_end = batch_iter + batch_size; + // NOTE skip the final batch, if no enough data is provided. + if (batch_end <= query.size()) { + GetInputPerBatch(query, &data.query, &data.lod1, batch_iter, batch_end); + GetInputPerBatch(title, &data.title, &data.lod2, batch_iter, batch_end); + } + batch_iter += batch_size; + return data; + } + void Load(const std::string &path) { + std::ifstream file(path); + std::string line; + int num_lines = 0; + while (std::getline(file, line)) { + num_lines++; + std::vector data; + split(line, '\t', &data); + // load query data + std::vector query_data; + split_to_int64(data[0], ' ', &query_data); + // load title data + std::vector title_data; + split_to_int64(data[1], ' ', &title_data); + query.push_back(std::move(query_data)); + title.push_back(std::move(title_data)); + } + num_samples = num_lines; + } +}; + +void PrepareInputs(std::vector *input_slots, DataRecord *data, + int batch_size) { + PaddleTensor lod_query_tensor, lod_title_tensor; + lod_query_tensor.name = "left"; + lod_title_tensor.name = "right"; + auto one_batch = data->NextBatch(); + // assign data + TensorAssignData(&lod_query_tensor, one_batch.query, one_batch.lod1); + TensorAssignData(&lod_title_tensor, one_batch.title, one_batch.lod2); + // Set inputs. + input_slots->assign({lod_query_tensor, lod_title_tensor}); + for (auto &tensor : *input_slots) { + tensor.dtype = PaddleDType::INT64; + } +} + +void SetConfig(AnalysisConfig *cfg) { + cfg->SetModel(FLAGS_infer_model); + cfg->DisableGpu(); + cfg->SwitchSpecifyInputNames(); + cfg->SwitchIrOptim(); +} + +void SetInput(std::vector> *inputs) { + DataRecord data(FLAGS_infer_data, FLAGS_batch_size); + std::vector input_slots; + int epoch = FLAGS_test_all_data ? data.num_samples / FLAGS_batch_size : 1; + LOG(INFO) << "number of samples: " << epoch * FLAGS_batch_size; + for (int bid = 0; bid < epoch; ++bid) { + PrepareInputs(&input_slots, &data, FLAGS_batch_size); + (*inputs).emplace_back(input_slots); + } +} + +// Easy for profiling independently. +void profile(bool use_mkldnn = false) { + AnalysisConfig cfg; + SetConfig(&cfg); + std::vector> outputs; + + if (use_mkldnn) { + cfg.EnableMKLDNN(); + } + + std::vector> input_slots_all; + SetInput(&input_slots_all); + TestPrediction(reinterpret_cast(&cfg), + input_slots_all, &outputs, FLAGS_num_threads); + + if (FLAGS_num_threads == 1 && !FLAGS_test_all_data) { + PADDLE_ENFORCE_GT(outputs.size(), 0); + PADDLE_ENFORCE_EQ(outputs.back().size(), 2UL); + for (auto &output : outputs.back()) { + size_t size = GetSize(output); + PADDLE_ENFORCE_GT(size, 0); + float *result = static_cast(output.data.data()); + // output is probability, which is in (-1, 1). + for (size_t i = 0; i < size; i++) { + EXPECT_GT(result[i], -1); + EXPECT_LT(result[i], 1); + } + } + } +} + +TEST(Analyzer_MM_DNN, profile) { profile(); } +#ifdef PADDLE_WITH_MKLDNN +TEST(Analyzer_MM_DNN, profile_mkldnn) { profile(true /* use_mkldnn */); } +#endif + +// Check the fuse status +TEST(Analyzer_MM_DNN, fuse_statis) { + AnalysisConfig cfg; + SetConfig(&cfg); + + int num_ops; + auto predictor = CreatePaddlePredictor(cfg); + auto fuse_statis = GetFuseStatis( + static_cast(predictor.get()), &num_ops); +} + +// Compare result of NativeConfig and AnalysisConfig +void compare(bool use_mkldnn = false) { + AnalysisConfig cfg; + SetConfig(&cfg); + + if (use_mkldnn) { + cfg.EnableMKLDNN(); + } + + std::vector> input_slots_all; + SetInput(&input_slots_all); + CompareNativeAndAnalysis( + reinterpret_cast(&cfg), input_slots_all); +} + +TEST(Analyzer_MM_DNN, compare) { compare(); } +#ifdef PADDLE_WITH_MKLDNN +TEST(Analyzer_MM_DNN, compare_mkldnn) { compare(true /* use_mkldnn */); } +#endif + +// Compare Deterministic result +TEST(Analyzer_MM_DNN, compare_determine) { + AnalysisConfig cfg; + SetConfig(&cfg); + + std::vector> input_slots_all; + SetInput(&input_slots_all); + CompareDeterministic(reinterpret_cast(&cfg), + input_slots_all); +} + +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tests/api/analyzer_ner_tester.cc b/paddle/fluid/inference/tests/api/analyzer_ner_tester.cc index 577b97e271aacab5d6740de7c8bc00bc87ae54dd..36e07d5f55600dc7aa96227289f707fb19f92d56 100644 --- a/paddle/fluid/inference/tests/api/analyzer_ner_tester.cc +++ b/paddle/fluid/inference/tests/api/analyzer_ner_tester.cc @@ -16,15 +16,11 @@ namespace paddle { namespace inference { -using contrib::AnalysisConfig; struct DataRecord { - std::vector> word_data_all, mention_data_all; - std::vector> rnn_word_datas, rnn_mention_datas; + std::vector> word, mention; std::vector lod; // two inputs have the same lod info. - size_t batch_iter{0}; - size_t batch_size{1}; - size_t num_samples; // total number of samples + size_t batch_iter{0}, batch_size{1}, num_samples; // total number of samples DataRecord() = default; explicit DataRecord(const std::string &path, int batch_size = 1) : batch_size(batch_size) { @@ -34,22 +30,10 @@ struct DataRecord { DataRecord data; size_t batch_end = batch_iter + batch_size; // NOTE skip the final batch, if no enough data is provided. - if (batch_end <= word_data_all.size()) { - data.word_data_all.assign(word_data_all.begin() + batch_iter, - word_data_all.begin() + batch_end); - data.mention_data_all.assign(mention_data_all.begin() + batch_iter, - mention_data_all.begin() + batch_end); - // Prepare LoDs - data.lod.push_back(0); - CHECK(!data.word_data_all.empty()); - CHECK(!data.mention_data_all.empty()); - CHECK_EQ(data.word_data_all.size(), data.mention_data_all.size()); - for (size_t j = 0; j < data.word_data_all.size(); j++) { - data.rnn_word_datas.push_back(data.word_data_all[j]); - data.rnn_mention_datas.push_back(data.mention_data_all[j]); - // calculate lod - data.lod.push_back(data.lod.back() + data.word_data_all[j].size()); - } + if (batch_end <= word.size()) { + GetInputPerBatch(word, &data.word, &data.lod, batch_iter, batch_end); + GetInputPerBatch(mention, &data.mention, &data.lod, batch_iter, + batch_end); } batch_iter += batch_size; return data; @@ -68,27 +52,22 @@ struct DataRecord { // load mention data std::vector mention_data; split_to_int64(data[3], ' ', &mention_data); - word_data_all.push_back(std::move(word_data)); - mention_data_all.push_back(std::move(mention_data)); + word.push_back(std::move(word_data)); + mention.push_back(std::move(mention_data)); } num_samples = num_lines; } }; -void PrepareInputs(std::vector *input_slots, DataRecord *data, - int batch_size) { +void PrepareInputs(std::vector *input_slots, DataRecord *data) { PaddleTensor lod_word_tensor, lod_mention_tensor; lod_word_tensor.name = "word"; lod_mention_tensor.name = "mention"; auto one_batch = data->NextBatch(); - int size = one_batch.lod[one_batch.lod.size() - 1]; // token batch size - lod_word_tensor.shape.assign({size, 1}); - lod_word_tensor.lod.assign({one_batch.lod}); - lod_mention_tensor.shape.assign({size, 1}); - lod_mention_tensor.lod.assign({one_batch.lod}); // assign data - TensorAssignData(&lod_word_tensor, one_batch.rnn_word_datas); - TensorAssignData(&lod_mention_tensor, one_batch.rnn_mention_datas); + TensorAssignData(&lod_word_tensor, one_batch.word, one_batch.lod); + TensorAssignData(&lod_mention_tensor, one_batch.mention, + one_batch.lod); // Set inputs. input_slots->assign({lod_word_tensor, lod_mention_tensor}); for (auto &tensor : *input_slots) { @@ -96,13 +75,20 @@ void PrepareInputs(std::vector *input_slots, DataRecord *data, } } -void SetConfig(contrib::AnalysisConfig *cfg) { - cfg->prog_file = FLAGS_infer_model + "/__model__"; - cfg->param_file = FLAGS_infer_model + "/param"; - cfg->use_gpu = false; - cfg->device = 0; - cfg->specify_input_name = true; - cfg->enable_ir_optim = true; +void SetConfig(AnalysisConfig *cfg, bool memory_load = false) { + if (memory_load) { + std::string buffer_prog, buffer_param; + ReadBinaryFile(FLAGS_infer_model + "/__model__", &buffer_prog); + ReadBinaryFile(FLAGS_infer_model + "/param", &buffer_param); + cfg->SetModelBuffer(&buffer_prog[0], buffer_prog.size(), &buffer_param[0], + buffer_param.size()); + } else { + cfg->SetModel(FLAGS_infer_model + "/__model__", + FLAGS_infer_model + "/param"); + } + cfg->DisableGpu(); + cfg->SwitchSpecifyInputNames(); + cfg->SwitchIrOptim(); } void SetInput(std::vector> *inputs) { @@ -111,38 +97,47 @@ void SetInput(std::vector> *inputs) { int epoch = FLAGS_test_all_data ? data.num_samples / FLAGS_batch_size : 1; LOG(INFO) << "number of samples: " << epoch * FLAGS_batch_size; for (int bid = 0; bid < epoch; ++bid) { - PrepareInputs(&input_slots, &data, FLAGS_batch_size); + PrepareInputs(&input_slots, &data); (*inputs).emplace_back(input_slots); } } // Easy for profiling independently. -TEST(Analyzer_Chinese_ner, profile) { - contrib::AnalysisConfig cfg; - SetConfig(&cfg); - std::vector outputs; +void profile(bool memory_load = false) { + AnalysisConfig cfg; + SetConfig(&cfg, memory_load); + std::vector> outputs; std::vector> input_slots_all; SetInput(&input_slots_all); - TestPrediction(cfg, input_slots_all, &outputs, FLAGS_num_threads); + TestPrediction(reinterpret_cast(&cfg), + input_slots_all, &outputs, FLAGS_num_threads); if (FLAGS_num_threads == 1 && !FLAGS_test_all_data) { // the first inference result const int chinese_ner_result_data[] = {30, 45, 41, 48, 17, 26, 48, 39, 38, 16, 25}; - PADDLE_ENFORCE_EQ(outputs.size(), 1UL); - size_t size = GetSize(outputs[0]); + PADDLE_ENFORCE_GT(outputs.size(), 0); + auto output = outputs.back(); + PADDLE_ENFORCE_EQ(output.size(), 1UL); + size_t size = GetSize(output[0]); PADDLE_ENFORCE_GT(size, 0); - int64_t *result = static_cast(outputs[0].data.data()); + int64_t *result = static_cast(output[0].data.data()); for (size_t i = 0; i < std::min(11UL, size); i++) { EXPECT_EQ(result[i], chinese_ner_result_data[i]); } } } +TEST(Analyzer_Chinese_ner, profile) { profile(); } + +TEST(Analyzer_Chinese_ner, profile_memory_load) { + profile(true /* memory_load */); +} + // Check the fuse status TEST(Analyzer_Chinese_ner, fuse_statis) { - contrib::AnalysisConfig cfg; + AnalysisConfig cfg; SetConfig(&cfg); int num_ops; @@ -158,12 +153,24 @@ TEST(Analyzer_Chinese_ner, fuse_statis) { // Compare result of NativeConfig and AnalysisConfig TEST(Analyzer_Chinese_ner, compare) { - contrib::AnalysisConfig cfg; + AnalysisConfig cfg; + SetConfig(&cfg); + + std::vector> input_slots_all; + SetInput(&input_slots_all); + CompareNativeAndAnalysis( + reinterpret_cast(&cfg), input_slots_all); +} + +// Compare Deterministic result +TEST(Analyzer_Chinese_ner, compare_determine) { + AnalysisConfig cfg; SetConfig(&cfg); std::vector> input_slots_all; SetInput(&input_slots_all); - CompareNativeAndAnalysis(cfg, input_slots_all); + CompareDeterministic(reinterpret_cast(&cfg), + input_slots_all); } } // namespace inference diff --git a/paddle/fluid/inference/tests/api/analyzer_pyramid_dnn_tester.cc b/paddle/fluid/inference/tests/api/analyzer_pyramid_dnn_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..9443b08063b8f61d3d6b291a7217d645d8825c54 --- /dev/null +++ b/paddle/fluid/inference/tests/api/analyzer_pyramid_dnn_tester.cc @@ -0,0 +1,199 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/tests/api/tester_helper.h" + +namespace paddle { +namespace inference { + +struct DataRecord { + std::vector> query_basic, query_phrase, title_basic, + title_phrase; + std::vector lod1, lod2, lod3, lod4; + size_t batch_iter{0}, batch_size{1}, num_samples; // total number of samples + DataRecord() = default; + explicit DataRecord(const std::string &path, int batch_size = 1) + : batch_size(batch_size) { + Load(path); + } + DataRecord NextBatch() { + DataRecord data; + size_t batch_end = batch_iter + batch_size; + // NOTE skip the final batch, if no enough data is provided. + if (batch_end <= query_basic.size()) { + GetInputPerBatch(query_basic, &data.query_basic, &data.lod1, batch_iter, + batch_end); + GetInputPerBatch(query_phrase, &data.query_phrase, &data.lod2, batch_iter, + batch_end); + GetInputPerBatch(title_basic, &data.title_basic, &data.lod3, batch_iter, + batch_end); + GetInputPerBatch(title_phrase, &data.title_phrase, &data.lod4, batch_iter, + batch_end); + } + batch_iter += batch_size; + return data; + } + void Load(const std::string &path) { + std::ifstream file(path); + std::string line; + int num_lines = 0; + while (std::getline(file, line)) { + std::vector data; + split(line, ';', &data); + // load query data + std::vector query_basic_data; + split_to_int64(data[1], ' ', &query_basic_data); + std::vector query_phrase_data; + split_to_int64(data[2], ' ', &query_phrase_data); + // load title data + std::vector title_basic_data; + split_to_int64(data[3], ' ', &title_basic_data); + std::vector title_phrase_data; + split_to_int64(data[4], ' ', &title_phrase_data); + // filter the empty data + bool flag = + data[1].size() && data[2].size() && data[3].size() && data[4].size(); + if (flag) { + query_basic.push_back(std::move(query_basic_data)); + query_phrase.push_back(std::move(query_phrase_data)); + title_basic.push_back(std::move(title_basic_data)); + title_phrase.push_back(std::move(title_phrase_data)); + num_lines++; + } + } + num_samples = num_lines; + } +}; + +void PrepareInputs(std::vector *input_slots, DataRecord *data, + int batch_size) { + PaddleTensor query_basic_tensor, query_phrase_tensor, title_basic_tensor, + title_phrase_tensor; + query_basic_tensor.name = "query_basic"; + query_phrase_tensor.name = "query_phrase"; + title_basic_tensor.name = "pos_title_basic"; + title_phrase_tensor.name = "pos_title_phrase"; + auto one_batch = data->NextBatch(); + // assign data + TensorAssignData(&query_basic_tensor, one_batch.query_basic, + one_batch.lod1); + TensorAssignData(&query_phrase_tensor, one_batch.query_phrase, + one_batch.lod2); + TensorAssignData(&title_basic_tensor, one_batch.title_basic, + one_batch.lod3); + TensorAssignData(&title_phrase_tensor, one_batch.title_phrase, + one_batch.lod4); + // Set inputs. + input_slots->assign({query_basic_tensor, query_phrase_tensor, + title_basic_tensor, title_phrase_tensor}); + for (auto &tensor : *input_slots) { + tensor.dtype = PaddleDType::INT64; + } +} + +void SetConfig(AnalysisConfig *cfg) { + cfg->SetModel(FLAGS_infer_model); + cfg->DisableGpu(); + cfg->SwitchSpecifyInputNames(); + cfg->SwitchIrOptim(); + if (FLAGS_zero_copy) { + cfg->SwitchUseFeedFetchOps(false); + } +} + +void SetInput(std::vector> *inputs) { + DataRecord data(FLAGS_infer_data, FLAGS_batch_size); + std::vector input_slots; + int epoch = FLAGS_test_all_data ? data.num_samples / FLAGS_batch_size : 1; + LOG(INFO) << "number of samples: " << epoch * FLAGS_batch_size; + for (int bid = 0; bid < epoch; ++bid) { + PrepareInputs(&input_slots, &data, FLAGS_batch_size); + (*inputs).emplace_back(input_slots); + } +} + +// Easy for profiling independently. +TEST(Analyzer_Pyramid_DNN, profile) { + AnalysisConfig cfg; + SetConfig(&cfg); + std::vector> outputs; + + std::vector> input_slots_all; + SetInput(&input_slots_all); + TestPrediction(reinterpret_cast(&cfg), + input_slots_all, &outputs, FLAGS_num_threads); + + if (FLAGS_num_threads == 1 && !FLAGS_test_all_data && !FLAGS_zero_copy) { + PADDLE_ENFORCE_GT(outputs.size(), 0); + auto output = outputs.back(); + PADDLE_ENFORCE_EQ(output.size(), 1UL); + size_t size = GetSize(output[0]); + PADDLE_ENFORCE_GT(size, 0); + float *result = static_cast(output[0].data.data()); + // output is probability, which is in (0, 1). + for (size_t i = 0; i < size; i++) { + EXPECT_GT(result[i], 0); + EXPECT_LT(result[i], 1); + } + } +} + +// Check the fuse status +TEST(Analyzer_Pyramid_DNN, fuse_statis) { + AnalysisConfig cfg; + SetConfig(&cfg); + + int num_ops; + auto predictor = CreatePaddlePredictor(cfg); + auto fuse_statis = GetFuseStatis( + static_cast(predictor.get()), &num_ops); +} + +// Compare result of NativeConfig and AnalysisConfig +TEST(Analyzer_Pyramid_DNN, compare) { + AnalysisConfig cfg; + SetConfig(&cfg); + + std::vector> input_slots_all; + SetInput(&input_slots_all); + CompareNativeAndAnalysis( + reinterpret_cast(&cfg), input_slots_all); +} + +// Compare result of AnalysisConfig and AnalysisConfig + ZeroCopy +TEST(Analyzer_Pyramid_DNN, compare_zero_copy) { + AnalysisConfig cfg; + SetConfig(&cfg); + + std::vector> input_slots_all; + SetInput(&input_slots_all); + std::vector outputs_name; + outputs_name.emplace_back("cos_sim_2.tmp_0"); + CompareAnalysisAndZeroCopy(reinterpret_cast(&cfg), + input_slots_all, outputs_name); +} + +// Compare Deterministic result +TEST(Analyzer_Pyramid_DNN, compare_determine) { + AnalysisConfig cfg; + SetConfig(&cfg); + + std::vector> input_slots_all; + SetInput(&input_slots_all); + CompareDeterministic(reinterpret_cast(&cfg), + input_slots_all); +} + +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tests/api/analyzer_resnet50_tester.cc b/paddle/fluid/inference/tests/api/analyzer_resnet50_tester.cc index 67668298440e9af279e792f786a8123b71172a66..d4330e6cddf8818ace01be2f13a4c18a192c46e1 100644 --- a/paddle/fluid/inference/tests/api/analyzer_resnet50_tester.cc +++ b/paddle/fluid/inference/tests/api/analyzer_resnet50_tester.cc @@ -21,57 +21,35 @@ namespace inference { namespace analysis { void SetConfig(AnalysisConfig *cfg) { - cfg->param_file = FLAGS_infer_model + "/params"; - cfg->prog_file = FLAGS_infer_model + "/model"; - cfg->use_gpu = false; - cfg->device = 0; - cfg->enable_ir_optim = true; - cfg->specify_input_name = true; + cfg->SetModel(FLAGS_infer_model + "/model", FLAGS_infer_model + "/params"); + cfg->DisableGpu(); + cfg->SwitchIrOptim(); + cfg->SwitchSpecifyInputNames(); + cfg->SetCpuMathLibraryNumThreads(FLAGS_paddle_num_threads); } void SetInput(std::vector> *inputs) { - PADDLE_ENFORCE_EQ(FLAGS_test_all_data, 0, "Only have single batch of data."); - - PaddleTensor input; - // channel=3, height/width=318 - std::vector shape({FLAGS_batch_size, 3, 318, 318}); - input.shape = shape; - input.dtype = PaddleDType::FLOAT32; - - // fill input data, for profile easily, do not use random data here. - size_t size = FLAGS_batch_size * 3 * 318 * 318; - input.data.Resize(size * sizeof(float)); - float *input_data = static_cast(input.data.data()); - for (size_t i = 0; i < size; i++) { - *(input_data + i) = static_cast(i) / size; - } - - std::vector input_slots; - input_slots.assign({input}); - (*inputs).emplace_back(input_slots); + SetFakeImageInput(inputs, FLAGS_infer_model); } // Easy for profiling independently. void profile(bool use_mkldnn = false) { AnalysisConfig cfg; SetConfig(&cfg); - cfg._use_mkldnn = use_mkldnn; - std::vector outputs; + + if (use_mkldnn) { + cfg.EnableMKLDNN(); + } + std::vector> outputs; std::vector> input_slots_all; SetInput(&input_slots_all); - TestPrediction(cfg, input_slots_all, &outputs, FLAGS_num_threads); - - if (FLAGS_num_threads == 1 && !FLAGS_test_all_data) { - PADDLE_ENFORCE_EQ(outputs.size(), 1UL); - size_t size = GetSize(outputs[0]); - // output is a 512-dimension feature - EXPECT_EQ(size, 512 * FLAGS_batch_size); - } + TestPrediction(reinterpret_cast(&cfg), + input_slots_all, &outputs, FLAGS_num_threads); } TEST(Analyzer_resnet50, profile) { profile(); } -#ifndef PADDLE_WITH_MKLDNN +#ifdef PADDLE_WITH_MKLDNN TEST(Analyzer_resnet50, profile_mkldnn) { profile(true /* use_mkldnn */); } #endif @@ -83,19 +61,21 @@ TEST(Analyzer_resnet50, fuse_statis) { auto predictor = CreatePaddlePredictor(cfg); auto fuse_statis = GetFuseStatis( static_cast(predictor.get()), &num_ops); - ASSERT_TRUE(fuse_statis.count("fc_fuse")); - EXPECT_EQ(fuse_statis.at("fc_fuse"), 1); + LOG(INFO) << "num_ops: " << num_ops; } // Compare result of NativeConfig and AnalysisConfig void compare(bool use_mkldnn = false) { AnalysisConfig cfg; SetConfig(&cfg); - cfg._use_mkldnn = use_mkldnn; + if (use_mkldnn) { + cfg.EnableMKLDNN(); + } std::vector> input_slots_all; SetInput(&input_slots_all); - CompareNativeAndAnalysis(cfg, input_slots_all); + CompareNativeAndAnalysis( + reinterpret_cast(&cfg), input_slots_all); } TEST(Analyzer_resnet50, compare) { compare(); } @@ -103,6 +83,17 @@ TEST(Analyzer_resnet50, compare) { compare(); } TEST(Analyzer_resnet50, compare_mkldnn) { compare(true /* use_mkldnn */); } #endif +// Compare Deterministic result +TEST(Analyzer_resnet50, compare_determine) { + AnalysisConfig cfg; + SetConfig(&cfg); + + std::vector> input_slots_all; + SetInput(&input_slots_all); + CompareDeterministic(reinterpret_cast(&cfg), + input_slots_all); +} + } // namespace analysis } // namespace inference } // namespace paddle diff --git a/paddle/fluid/inference/tests/api/analyzer_rnn1_tester.cc b/paddle/fluid/inference/tests/api/analyzer_rnn1_tester.cc index 6399476680c0af83a6d26aea952c58543bdce9ae..54fd3a4a4caba52110ab636e6d44ee2a473f0cb0 100644 --- a/paddle/fluid/inference/tests/api/analyzer_rnn1_tester.cc +++ b/paddle/fluid/inference/tests/api/analyzer_rnn1_tester.cc @@ -20,7 +20,6 @@ namespace paddle { namespace inference { using namespace framework; // NOLINT -using namespace contrib; // NOLINT struct DataRecord { std::vector>> link_step_data_all; @@ -204,13 +203,13 @@ void PrepareZeroCopyInputs(ZeroCopyTensor *lod_attention_tensor, } void SetConfig(AnalysisConfig *cfg) { - cfg->prog_file = FLAGS_infer_model + "/__model__"; - cfg->param_file = FLAGS_infer_model + "/param"; - cfg->use_gpu = false; - cfg->device = 0; - cfg->specify_input_name = true; - cfg->enable_ir_optim = true; - cfg->ir_passes.clear(); // Do not exclude any pass. + cfg->SetModel(FLAGS_infer_model + "/__model__", FLAGS_infer_model + "/param"); + cfg->DisableGpu(); + cfg->SwitchSpecifyInputNames(); + cfg->SwitchIrOptim(); + if (FLAGS_zero_copy) { + cfg->SwitchUseFeedFetchOps(false); + } } void SetInput(std::vector> *inputs) { @@ -226,18 +225,21 @@ void SetInput(std::vector> *inputs) { // Easy for profiling independently. TEST(Analyzer_rnn1, profile) { - contrib::AnalysisConfig cfg; + AnalysisConfig cfg; SetConfig(&cfg); - std::vector outputs; + cfg.DisableGpu(); + cfg.SwitchIrDebug(); + std::vector> outputs; std::vector> input_slots_all; SetInput(&input_slots_all); - TestPrediction(cfg, input_slots_all, &outputs, FLAGS_num_threads); + TestPrediction(reinterpret_cast(&cfg), + input_slots_all, &outputs, FLAGS_num_threads); } // Check the fuse status TEST(Analyzer_rnn1, fuse_statis) { - contrib::AnalysisConfig cfg; + AnalysisConfig cfg; SetConfig(&cfg); int num_ops; @@ -254,241 +256,49 @@ TEST(Analyzer_rnn1, fuse_statis) { // Compare result of NativeConfig and AnalysisConfig TEST(Analyzer_rnn1, compare) { - contrib::AnalysisConfig cfg; + AnalysisConfig cfg; SetConfig(&cfg); std::vector> input_slots_all; SetInput(&input_slots_all); - CompareNativeAndAnalysis(cfg, input_slots_all); + CompareNativeAndAnalysis( + reinterpret_cast(&cfg), input_slots_all); } -// Test Multi-Thread. -TEST(Analyzer_rnn1, multi_thread) { - contrib::AnalysisConfig cfg; +// Compare Deterministic result +TEST(Analyzer_rnn1, compare_determine) { + AnalysisConfig cfg; SetConfig(&cfg); - std::vector outputs; std::vector> input_slots_all; SetInput(&input_slots_all); - TestPrediction(cfg, input_slots_all, &outputs, 4 /* multi_thread */); -} - -bool CompareTensors(const framework::Scope &a_scope, - const framework::Scope &b_scope, - const std::vector &tensors) { - for (auto &x : tensors) { - auto *a_var = a_scope.FindVar(x); - auto *b_var = b_scope.FindVar(x); - if (a_var && b_var) { - if (a_var->Type() == typeid(framework::LoDTensor) || - a_var->Type() == typeid(framework::Tensor)) { - LOG(INFO) << "comparing tensor " << x; - auto &a_t = a_var->Get(); - auto &b_t = b_var->Get(); - if (!inference::CompareTensor(a_t, b_t)) { - LOG(ERROR) << string::Sprintf("tensor %s not match in two scopes", x); - } - } else { - LOG(INFO) << "skip no tensor " << x; - } - } else { - LOG(INFO) << "skip tensor " << x; - } - } - return true; + CompareDeterministic(reinterpret_cast(&cfg), + input_slots_all); } -// Validate that the AnalysisPredictor + ZeroCopyTensor really works by testing -// on the complex RNN1 model. -TEST(Analyzer_rnn1, ZeroCopy) { - AnalysisConfig config; - SetConfig(&config); - config.use_feed_fetch_ops = false; - - PaddlePlace place; - int output_size{0}; - - auto predictor = CreatePaddlePredictor(config); - - config.use_feed_fetch_ops = true; - auto native_predictor = CreatePaddlePredictor(config); - - config.use_feed_fetch_ops = true; // the analysis predictor needs feed/fetch. - auto analysis_predictor = CreatePaddlePredictor(config); - -#define NEW_TENSOR(name__) \ - auto name__##_tensor = predictor->GetInputTensor(#name__); - NEW_TENSOR(data_lod_attention); - NEW_TENSOR(cell_init); - NEW_TENSOR(data); - NEW_TENSOR(week); - NEW_TENSOR(minute); - NEW_TENSOR(hidden_init); - - // Prepare data for AnalysisPredictor - DataRecord data(FLAGS_infer_data, FLAGS_batch_size); - PrepareZeroCopyInputs(data_lod_attention_tensor.get(), cell_init_tensor.get(), - data_tensor.get(), hidden_init_tensor.get(), - week_tensor.get(), minute_tensor.get(), &data, - FLAGS_batch_size); - - // Prepare data for NativePredictor - std::vector> native_inputs; - SetInput(&native_inputs); - std::vector native_outputs; - std::vector analysis_outputs; - - auto output_tensor = predictor->GetOutputTensor("final_output.tmp_1"); - // Run analysis predictor - - int num_ops; - auto fuse_statis = GetFuseStatis(predictor.get(), &num_ops); - ASSERT_TRUE(fuse_statis.count("fc_fuse")); - ASSERT_EQ(fuse_statis.at("fc_fuse"), 1); - ASSERT_EQ(fuse_statis.at("fc_nobias_lstm_fuse"), 2); // bi-directional LSTM - ASSERT_EQ(fuse_statis.at("seq_concat_fc_fuse"), 1); - ASSERT_EQ(num_ops, - 13); // After graph optimization, only 13 operators exists. - - Timer timer; - double total_time{0}; - double native_total_time{0}; - double analysis_total_time{0.}; - - for (int i = 0; i < FLAGS_repeat; i++) { - timer.tic(); - predictor->ZeroCopyRun(); - total_time += timer.toc(); - } - - auto *output_data = output_tensor->data(&place, &output_size); - ASSERT_GT(output_size, 0); // more than one output! - - for (int i = 0; i < FLAGS_repeat; i++) { - // Run native predictor. - timer.tic(); - ASSERT_TRUE(native_predictor->Run(native_inputs.front(), &native_outputs)); - native_total_time += timer.toc(); - } - - for (int i = 0; i < FLAGS_repeat; i++) { - timer.tic(); - ASSERT_TRUE( - analysis_predictor->Run(native_inputs.front(), &analysis_outputs)); - analysis_total_time += timer.toc(); - } - - if (!FLAGS_with_precision_check) { - return; - } - int native_output_size = VecReduceToInt(native_outputs.front().shape); - - EXPECT_EQ(native_output_size, output_size); - - // Compare tensors between analysis and zerocopy - auto *p0 = static_cast(predictor.get()); - auto *p1 = static_cast(analysis_predictor.get()); - auto *p2 = static_cast(native_predictor.get()); - - std::vector tensor_names; - for (auto &var_desc : p0->program().Block(0).AllVars()) { - tensor_names.push_back(var_desc->Name()); - } - - LOG(INFO) << "Comparing tensors"; - ASSERT_TRUE( - CompareTensors(*p0->scope(), *p1->scope(), {"final_output.tmp_1"})); - ASSERT_TRUE( - CompareTensors(*p0->scope(), *p2->scope(), {"final_output.tmp_1"})); - - LOG(INFO) << "output1 " << inference::LoDTensorSummary( - p0->scope() - ->FindVar("final_output.tmp_1") - ->Get()); - LOG(INFO) << "output2 " << inference::LoDTensorSummary( - p1->scope() - ->FindVar("final_output.tmp_1") - ->Get()); - LOG(INFO) << "output3 " << inference::LoDTensorSummary( - p2->scope() - ->FindVar("final_output.tmp_1") - ->Get()); - - for (int i = 0; i < output_size; i++) { - LOG(INFO) << output_data[i] << " " - << static_cast(native_outputs.front().data.data())[i] - << " " - << static_cast(analysis_outputs.front().data.data())[i]; - EXPECT_NEAR(output_data[i], - static_cast(native_outputs.front().data.data())[i], - 1e-3); - } - - LOG(INFO) << "batch_size: " << FLAGS_batch_size; +// Test Multi-Thread. +TEST(Analyzer_rnn1, multi_thread) { + AnalysisConfig cfg; + SetConfig(&cfg); + std::vector> outputs; - LOG(INFO) << "zero average time: " - << total_time / (FLAGS_repeat * FLAGS_batch_size); - LOG(INFO) << "analysis average time: " - << analysis_total_time / (FLAGS_repeat * FLAGS_batch_size); - LOG(INFO) << "native average time: " - << native_total_time / (FLAGS_repeat * FLAGS_batch_size); + std::vector> input_slots_all; + SetInput(&input_slots_all); + TestPrediction(reinterpret_cast(&cfg), + input_slots_all, &outputs, 2 /* multi_thread */); } -TEST(Analyzer_rnn1, ZeroCopyMultiThread) { - AnalysisConfig config; - SetConfig(&config); - config.use_feed_fetch_ops = false; - -#define NEW_TENSOR(name__) \ - auto name__##_tensor = predictor->GetInputTensor(#name__); - - auto base_predictor = CreatePaddlePredictor(config); - double total_time_of_threads{0}; - std::vector threads; - std::vector> predictors; - for (int tid = 0; tid < FLAGS_num_threads; tid++) { - predictors.emplace_back(CreatePaddlePredictor(config)); - } - - for (int tid = 0; tid < FLAGS_num_threads; tid++) { - threads.emplace_back([config, &total_time_of_threads, &predictors, tid] { - // auto predictor = base_predictor->Clone(); - auto &predictor = predictors[tid]; - NEW_TENSOR(data_lod_attention); - NEW_TENSOR(cell_init); - NEW_TENSOR(data); - NEW_TENSOR(week); - NEW_TENSOR(minute); - NEW_TENSOR(hidden_init); - - // Prepare data for AnalysisPredictor - DataRecord data(FLAGS_infer_data, FLAGS_batch_size); - Timer timer; - double total_time{0}; - - for (int i = 0; i < FLAGS_repeat; i++) { - PrepareZeroCopyInputs(data_lod_attention_tensor.get(), - cell_init_tensor.get(), data_tensor.get(), - hidden_init_tensor.get(), week_tensor.get(), - minute_tensor.get(), &data, FLAGS_batch_size); - - timer.tic(); - predictor->ZeroCopyRun(); - total_time += timer.toc(); - } - - total_time_of_threads += total_time; - - LOG(INFO) << "thread time: " << total_time / FLAGS_repeat; - }); - } - - for (auto &t : threads) { - t.join(); - } +// Compare result of AnalysisConfig and AnalysisConfig + ZeroCopy +TEST(Analyzer_rnn1, compare_zero_copy) { + AnalysisConfig cfg; + SetConfig(&cfg); - LOG(INFO) << "average time: " - << total_time_of_threads / FLAGS_num_threads / FLAGS_repeat; + std::vector> input_slots_all; + SetInput(&input_slots_all); + std::vector outputs_name; + outputs_name.emplace_back("final_output.tmp_1"); + CompareAnalysisAndZeroCopy(reinterpret_cast(&cfg), + input_slots_all, outputs_name); } } // namespace inference diff --git a/paddle/fluid/inference/tests/api/analyzer_rnn2_tester.cc b/paddle/fluid/inference/tests/api/analyzer_rnn2_tester.cc index ba04d030b94c0924311dcff5c6a34270a764f877..9ccbf58cbd2bbaab9b1a132c27e50356e1a5df37 100644 --- a/paddle/fluid/inference/tests/api/analyzer_rnn2_tester.cc +++ b/paddle/fluid/inference/tests/api/analyzer_rnn2_tester.cc @@ -18,12 +18,12 @@ namespace paddle { namespace inference { using namespace framework; // NOLINT +static std::vector result_data; struct DataRecord { std::vector>> link_step_data_all; std::vector lod; std::vector> rnn_link_data; - std::vector result_data; size_t num_samples; // total number of samples size_t batch_iter{0}; size_t batch_size{1}; @@ -57,6 +57,7 @@ struct DataRecord { std::ifstream file(path); std::string line; int num_lines = 0; + result_data.clear(); while (std::getline(file, line)) { num_lines++; std::vector data; @@ -104,12 +105,10 @@ void PrepareInputs(std::vector *input_slots, DataRecord *data, } void SetConfig(AnalysisConfig *cfg) { - cfg->prog_file = FLAGS_infer_model + "/__model__"; - cfg->param_file = FLAGS_infer_model + "/param"; - cfg->use_gpu = false; - cfg->device = 0; - cfg->specify_input_name = true; - cfg->enable_ir_optim = true; + cfg->SetModel(FLAGS_infer_model + "/__model__", FLAGS_infer_model + "/param"); + cfg->DisableGpu(); + cfg->SwitchSpecifyInputNames(); + cfg->SwitchIrOptim(); } void SetInput(std::vector> *inputs) { @@ -127,21 +126,23 @@ void SetInput(std::vector> *inputs) { TEST(Analyzer_rnn2, profile) { AnalysisConfig cfg; SetConfig(&cfg); - std::vector outputs; + std::vector> outputs; std::vector> input_slots_all; SetInput(&input_slots_all); - TestPrediction(cfg, input_slots_all, &outputs, FLAGS_num_threads); + TestPrediction(reinterpret_cast(&cfg), + input_slots_all, &outputs, FLAGS_num_threads); if (FLAGS_num_threads == 1 && !FLAGS_test_all_data) { // the first inference result - DataRecord data(FLAGS_infer_data, FLAGS_batch_size); PADDLE_ENFORCE_GT(outputs.size(), 0); - size_t size = GetSize(outputs[0]); + auto output = outputs.back(); + PADDLE_ENFORCE_GT(output.size(), 0); + size_t size = GetSize(output[0]); PADDLE_ENFORCE_GT(size, 0); - float *result = static_cast(outputs[0].data.data()); + float *result = static_cast(output[0].data.data()); for (size_t i = 0; i < size; i++) { - EXPECT_NEAR(result[i], data.result_data[i], 1e-3); + EXPECT_NEAR(result[i], result_data[i], 1e-3); } } } @@ -153,7 +154,19 @@ TEST(Analyzer_rnn2, compare) { std::vector> input_slots_all; SetInput(&input_slots_all); - CompareNativeAndAnalysis(cfg, input_slots_all); + CompareNativeAndAnalysis( + reinterpret_cast(&cfg), input_slots_all); +} + +// Compare Deterministic result +TEST(Analyzer_rnn2, compare_determine) { + AnalysisConfig cfg; + SetConfig(&cfg); + + std::vector> input_slots_all; + SetInput(&input_slots_all); + CompareDeterministic(reinterpret_cast(&cfg), + input_slots_all); } } // namespace inference diff --git a/paddle/fluid/inference/tests/api/analyzer_seq_conv1_tester.cc b/paddle/fluid/inference/tests/api/analyzer_seq_conv1_tester.cc index cb4671c4379b5f6f144bfd5330866aa38163f4d4..9f23b9f037bcaeb758312d011067ae29c82e73cd 100644 --- a/paddle/fluid/inference/tests/api/analyzer_seq_conv1_tester.cc +++ b/paddle/fluid/inference/tests/api/analyzer_seq_conv1_tester.cc @@ -18,12 +18,9 @@ namespace paddle { namespace inference { struct DataRecord { - std::vector> title1_all, title2_all, title3_all, l1_all; std::vector> title1, title2, title3, l1; - std::vector title1_lod, title2_lod, title3_lod, l1_lod; - size_t batch_iter{0}; - size_t batch_size{1}; - size_t num_samples; // total number of samples + std::vector lod1, lod2, lod3, l1_lod; + size_t batch_iter{0}, batch_size{1}, num_samples; // total number of samples DataRecord() = default; explicit DataRecord(const std::string &path, int batch_size = 1) : batch_size(batch_size) { @@ -33,41 +30,11 @@ struct DataRecord { DataRecord data; size_t batch_end = batch_iter + batch_size; // NOTE skip the final batch, if no enough data is provided. - if (batch_end <= title1_all.size()) { - data.title1_all.assign(title1_all.begin() + batch_iter, - title1_all.begin() + batch_end); - data.title2_all.assign(title2_all.begin() + batch_iter, - title2_all.begin() + batch_end); - data.title3_all.assign(title3_all.begin() + batch_iter, - title3_all.begin() + batch_end); - data.l1_all.assign(l1_all.begin() + batch_iter, - l1_all.begin() + batch_end); - // Prepare LoDs - data.title1_lod.push_back(0); - data.title2_lod.push_back(0); - data.title3_lod.push_back(0); - data.l1_lod.push_back(0); - CHECK(!data.title1_all.empty()); - CHECK(!data.title2_all.empty()); - CHECK(!data.title3_all.empty()); - CHECK(!data.l1_all.empty()); - CHECK_EQ(data.title1_all.size(), data.title2_all.size()); - CHECK_EQ(data.title1_all.size(), data.title3_all.size()); - CHECK_EQ(data.title1_all.size(), data.l1_all.size()); - for (size_t j = 0; j < data.title1_all.size(); j++) { - data.title1.push_back(data.title1_all[j]); - data.title2.push_back(data.title2_all[j]); - data.title3.push_back(data.title3_all[j]); - data.l1.push_back(data.l1_all[j]); - // calculate lod - data.title1_lod.push_back(data.title1_lod.back() + - data.title1_all[j].size()); - data.title2_lod.push_back(data.title2_lod.back() + - data.title2_all[j].size()); - data.title3_lod.push_back(data.title3_lod.back() + - data.title3_all[j].size()); - data.l1_lod.push_back(data.l1_lod.back() + data.l1_all[j].size()); - } + if (batch_end <= title1.size()) { + GetInputPerBatch(title1, &data.title1, &data.lod1, batch_iter, batch_end); + GetInputPerBatch(title2, &data.title2, &data.lod2, batch_iter, batch_end); + GetInputPerBatch(title3, &data.title3, &data.lod3, batch_iter, batch_end); + GetInputPerBatch(l1, &data.l1, &data.l1_lod, batch_iter, batch_end); } batch_iter += batch_size; return data; @@ -92,10 +59,10 @@ struct DataRecord { // load l1 data std::vector l1_data; split_to_int64(data[3], ' ', &l1_data); - title1_all.push_back(std::move(title1_data)); - title2_all.push_back(std::move(title2_data)); - title3_all.push_back(std::move(title3_data)); - l1_all.push_back(std::move(l1_data)); + title1.push_back(std::move(title1_data)); + title2.push_back(std::move(title2_data)); + title3.push_back(std::move(title3_data)); + l1.push_back(std::move(l1_data)); } num_samples = num_lines; } @@ -109,24 +76,11 @@ void PrepareInputs(std::vector *input_slots, DataRecord *data, title3_tensor.name = "title3"; l1_tensor.name = "l1"; auto one_batch = data->NextBatch(); - int title1_size = one_batch.title1_lod[one_batch.title1_lod.size() - 1]; - title1_tensor.shape.assign({title1_size, 1}); - title1_tensor.lod.assign({one_batch.title1_lod}); - int title2_size = one_batch.title2_lod[one_batch.title2_lod.size() - 1]; - title2_tensor.shape.assign({title2_size, 1}); - title2_tensor.lod.assign({one_batch.title2_lod}); - int title3_size = one_batch.title3_lod[one_batch.title3_lod.size() - 1]; - title3_tensor.shape.assign({title3_size, 1}); - title3_tensor.lod.assign({one_batch.title3_lod}); - int l1_size = one_batch.l1_lod[one_batch.l1_lod.size() - 1]; - l1_tensor.shape.assign({l1_size, 1}); - l1_tensor.lod.assign({one_batch.l1_lod}); - // assign data - TensorAssignData(&title1_tensor, one_batch.title1); - TensorAssignData(&title2_tensor, one_batch.title2); - TensorAssignData(&title3_tensor, one_batch.title3); - TensorAssignData(&l1_tensor, one_batch.l1); + TensorAssignData(&title1_tensor, one_batch.title1, one_batch.lod1); + TensorAssignData(&title2_tensor, one_batch.title2, one_batch.lod2); + TensorAssignData(&title3_tensor, one_batch.title3, one_batch.lod3); + TensorAssignData(&l1_tensor, one_batch.l1, one_batch.l1_lod); // Set inputs. input_slots->assign({title1_tensor, title2_tensor, title3_tensor, l1_tensor}); for (auto &tensor : *input_slots) { @@ -135,11 +89,10 @@ void PrepareInputs(std::vector *input_slots, DataRecord *data, } void SetConfig(AnalysisConfig *cfg) { - cfg->model_dir = FLAGS_infer_model; - cfg->use_gpu = false; - cfg->device = 0; - cfg->specify_input_name = true; - cfg->enable_ir_optim = true; + cfg->SetModel(FLAGS_infer_model); + cfg->DisableGpu(); + cfg->SwitchSpecifyInputNames(); + cfg->SwitchIrOptim(); } void SetInput(std::vector> *inputs) { @@ -157,18 +110,21 @@ void SetInput(std::vector> *inputs) { TEST(Analyzer_seq_conv1, profile) { AnalysisConfig cfg; SetConfig(&cfg); - std::vector outputs; + std::vector> outputs; std::vector> input_slots_all; SetInput(&input_slots_all); - TestPrediction(cfg, input_slots_all, &outputs, FLAGS_num_threads); + TestPrediction(reinterpret_cast(&cfg), + input_slots_all, &outputs, FLAGS_num_threads); if (FLAGS_num_threads == 1 && !FLAGS_test_all_data) { // the first inference result - PADDLE_ENFORCE_EQ(outputs.size(), 1UL); - size_t size = GetSize(outputs[0]); + PADDLE_ENFORCE_GT(outputs.size(), 0); + auto output = outputs.back(); + PADDLE_ENFORCE_EQ(output.size(), 1UL); + size_t size = GetSize(output[0]); PADDLE_ENFORCE_GT(size, 0); - float *result = static_cast(outputs[0].data.data()); + float *result = static_cast(output[0].data.data()); // output is probability, which is in (0, 1). for (size_t i = 0; i < size; i++) { EXPECT_GT(result[i], 0); @@ -183,7 +139,13 @@ TEST(Analyzer_seq_conv1, fuse_statis) { SetConfig(&cfg); int num_ops; auto predictor = CreatePaddlePredictor(cfg); - GetFuseStatis(predictor.get(), &num_ops); + + auto fuse_statis = GetFuseStatis(predictor.get(), &num_ops); + ASSERT_TRUE(fuse_statis.count("fc_fuse")); + ASSERT_TRUE(fuse_statis.count("seqconv_eltadd_relu_fuse")); + EXPECT_EQ(fuse_statis.at("fc_fuse"), 2); + EXPECT_EQ(fuse_statis.at("seqconv_eltadd_relu_fuse"), 6); + EXPECT_EQ(num_ops, 32); } // Compare result of NativeConfig and AnalysisConfig @@ -193,7 +155,19 @@ TEST(Analyzer_seq_conv1, compare) { std::vector> input_slots_all; SetInput(&input_slots_all); - CompareNativeAndAnalysis(cfg, input_slots_all); + CompareNativeAndAnalysis( + reinterpret_cast(&cfg), input_slots_all); +} + +// Compare Deterministic result +TEST(Analyzer_seq_conv1, compare_determine) { + AnalysisConfig cfg; + SetConfig(&cfg); + + std::vector> input_slots_all; + SetInput(&input_slots_all); + CompareDeterministic(reinterpret_cast(&cfg), + input_slots_all); } } // namespace inference diff --git a/paddle/fluid/inference/tests/api/analyzer_seq_pool1_tester.cc b/paddle/fluid/inference/tests/api/analyzer_seq_pool1_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..3cebf8e96984fad0de8d8c6775990f7c6a6cabe5 --- /dev/null +++ b/paddle/fluid/inference/tests/api/analyzer_seq_pool1_tester.cc @@ -0,0 +1,227 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include +#include "paddle/fluid/inference/tests/api/tester_helper.h" + +namespace paddle { +namespace inference { +namespace analysis { + +// diff: similarity_norm.tmp_0, for speed: fc_4.tmp_1 +static const char out_var_name[] = "reduce_sum_0.tmp_0"; + +// for diff: 154, for speed 111 +constexpr int num_slots = 154; + +struct OneSlotInBatch { + std::string name; + std::vector> data; + std::vector shape; + std::vector lod; +}; + +struct DataRecord { + std::vector> batched_data; + std::map>> datasets; + size_t batch_iter{0}, num_samples; // total number of samples + + DataRecord() = default; + explicit DataRecord(const std::string &path, int batch_size = 1) { + Load(path); + Prepare(batch_size); + } + + void Load(const std::string &path) { + std::ifstream file(path); + std::string line; + int num_lines = 0; + while (std::getline(file, line)) { + num_lines++; + std::vector data; + split(line, '\t', &data); + std::vector slot_data; + split_to_float(data[1], ' ', &slot_data); + std::string name = data[0]; + PADDLE_ENFORCE_EQ(slot_data.size() % 11, 0UL, + "line %d, %s should be divisible", num_lines, name); + datasets[name].emplace_back(std::move(slot_data)); + } + num_samples = num_lines / num_slots; + PADDLE_ENFORCE_EQ(num_samples * num_slots, static_cast(num_lines), + "num samples should be divisible"); + PADDLE_ENFORCE_GT(num_samples, 0UL); + } + + void Prepare(int bs) { + for (auto it = datasets.begin(); it != datasets.end(); ++it) { + PADDLE_ENFORCE_EQ(it->second.size(), num_samples, + "size of each slot should be equal"); + } + size_t num_batches = num_samples / bs; + EXPECT_GT(num_batches, 0); + batched_data.resize(num_batches); + for (auto &one_batch : batched_data) { + one_batch.resize(datasets.size()); + size_t i = 0; + for (auto it = datasets.begin(); it != datasets.end(); ++it) { + auto &slot = one_batch[i]; + slot.name = it->first; + slot.data.resize(bs); + slot.lod.resize(bs + 1); + slot.lod[0] = 0; + auto &lod = slot.lod; + auto &datas = it->second; + for (int k = 0; k < bs; ++k) { + size_t id = k + batch_iter * bs; + std::copy(datas[id].begin(), datas[id].end(), + std::back_inserter(slot.data[k])); + size_t len = datas[id].size() / 11; + PADDLE_ENFORCE_EQ(len * 11, datas[id].size(), + "%s %d size should be divisible", slot.name, id); + lod[k + 1] = lod[k] + len; + } + slot.shape.assign({static_cast(lod[bs]), 11}); + i++; + } + } + } + + const std::vector &NextBatch() { + if (batch_iter >= batched_data.size() - 1) { + batch_iter = -1; + } + return batched_data[++batch_iter]; + } +}; + +static void TensorAssignSlot(PaddleTensor *tensor, const OneSlotInBatch &slot) { + tensor->name = slot.name + "_embed"; + tensor->shape = slot.shape; + tensor->dtype = PaddleDType::FLOAT32; + tensor->lod.clear(); + tensor->lod.emplace_back(slot.lod); + TensorAssignData(tensor, slot.data); +} + +void PrepareInputs(std::vector *input_slots, DataRecord *data) { + const auto &one_batch = data->NextBatch(); + input_slots->resize(one_batch.size()); + for (size_t i = 0; i < one_batch.size(); ++i) { + auto &slot = one_batch[i]; + TensorAssignSlot(&((*input_slots)[i]), slot); + } +} + +void SetInput(std::vector> *inputs) { + DataRecord data(FLAGS_infer_data, FLAGS_batch_size); + std::vector input_slots; + int epoch = FLAGS_test_all_data ? data.batched_data.size() : 1; + LOG(INFO) << "number of samples: " + << data.batched_data.size() * FLAGS_batch_size; + for (int bid = 0; bid < epoch; ++bid) { + PrepareInputs(&input_slots, &data); + (*inputs).emplace_back(input_slots); + } +} + +void SetConfig(AnalysisConfig *cfg, bool use_mkldnn = false) { + cfg->SetModel(FLAGS_infer_model + "/model", FLAGS_infer_model + "/params"); + cfg->DisableGpu(); + cfg->SwitchSpecifyInputNames(); + cfg->SwitchIrDebug(); + cfg->SetCpuMathLibraryNumThreads(FLAGS_paddle_num_threads); + if (FLAGS_zero_copy) { + cfg->SwitchUseFeedFetchOps(false); + } + if (use_mkldnn) { + cfg->EnableMKLDNN(); + } + // Enable seqpool_concat_fuse_pass, disabled by default since it takes much + // time + cfg->pass_builder()->InsertPass(2, "seqpool_concat_fuse_pass"); +} + +void profile(bool use_mkldnn = false) { + AnalysisConfig cfg; + SetConfig(&cfg, use_mkldnn); + + std::vector> outputs; + std::vector> input_slots_all; + SetInput(&input_slots_all); + TestPrediction(reinterpret_cast(&cfg), + input_slots_all, &outputs, FLAGS_num_threads); +} + +TEST(Analyzer_seq_pool1, profile) { profile(); } + +// Compare result of NativeConfig and AnalysisConfig +TEST(Analyzer_seq_pool1, compare) { + AnalysisConfig cfg; + SetConfig(&cfg); + + std::vector> input_slots_all; + SetInput(&input_slots_all); + CompareNativeAndAnalysis( + reinterpret_cast(&cfg), input_slots_all); +} + +// Compare Deterministic result +TEST(Analyzer_seq_pool1, compare_determine) { + AnalysisConfig cfg; + SetConfig(&cfg); + + std::vector> input_slots_all; + SetInput(&input_slots_all); + CompareDeterministic(reinterpret_cast(&cfg), + input_slots_all); +} + +// Check the fuse status +TEST(Analyzer_seq_pool1, fuse_statis) { + AnalysisConfig cfg; + SetConfig(&cfg); + int num_ops; + auto predictor = CreatePaddlePredictor(cfg); + auto fuse_statis = GetFuseStatis(predictor.get(), &num_ops); + ASSERT_TRUE(fuse_statis.count("fc_fuse")); + ASSERT_TRUE(fuse_statis.count("seqpool_concat_fuse")); + ASSERT_TRUE(fuse_statis.count("squared_mat_sub_fuse")); + ASSERT_TRUE(fuse_statis.count("repeated_fc_relu_fuse")); + ASSERT_EQ(fuse_statis.at("fc_fuse"), 10); + EXPECT_EQ(fuse_statis.at("seqpool_concat_fuse"), 2); + EXPECT_EQ(fuse_statis.at("squared_mat_sub_fuse"), 2); + EXPECT_EQ(fuse_statis.at("repeated_fc_relu_fuse"), 2); + LOG(INFO) << "num_ops: " << num_ops; + EXPECT_EQ(num_ops, 171); +} + +// Compare result of AnalysisConfig and AnalysisConfig + ZeroCopy +TEST(Analyzer_seq_pool1, compare_zero_copy) { + AnalysisConfig cfg; + SetConfig(&cfg); + + std::vector> input_slots_all; + SetInput(&input_slots_all); + std::vector outputs_name; + outputs_name.emplace_back(out_var_name); + CompareAnalysisAndZeroCopy(reinterpret_cast(&cfg), + input_slots_all, outputs_name); +} + +} // namespace analysis +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tests/api/analyzer_text_classification_tester.cc b/paddle/fluid/inference/tests/api/analyzer_text_classification_tester.cc index ca19475bda372398d425b0fa6f9a732cd79a8166..54492dbc238bbaf25f86b300fdd6585f74365088 100644 --- a/paddle/fluid/inference/tests/api/analyzer_text_classification_tester.cc +++ b/paddle/fluid/inference/tests/api/analyzer_text_classification_tester.cc @@ -47,11 +47,10 @@ struct DataReader { }; void SetConfig(AnalysisConfig *cfg) { - cfg->model_dir = FLAGS_infer_model; - cfg->use_gpu = false; - cfg->device = 0; - cfg->specify_input_name = true; - cfg->enable_ir_optim = true; + cfg->SetModel(FLAGS_infer_model); + cfg->DisableGpu(); + cfg->SwitchSpecifyInputNames(); + cfg->SwitchIrOptim(); } void SetInput(std::vector> *inputs) { @@ -70,16 +69,19 @@ void SetInput(std::vector> *inputs) { TEST(Analyzer_Text_Classification, profile) { AnalysisConfig cfg; SetConfig(&cfg); - std::vector outputs; + cfg.SwitchIrDebug(); + std::vector> outputs; std::vector> input_slots_all; SetInput(&input_slots_all); - TestPrediction(cfg, input_slots_all, &outputs, FLAGS_num_threads); + TestPrediction(reinterpret_cast(&cfg), + input_slots_all, &outputs, FLAGS_num_threads); if (FLAGS_num_threads == 1) { // Get output - LOG(INFO) << "get outputs " << outputs.size(); - for (auto &output : outputs) { + PADDLE_ENFORCE_GT(outputs.size(), 0); + LOG(INFO) << "get outputs " << outputs.back().size(); + for (auto &output : outputs.back()) { LOG(INFO) << "output.shape: " << to_string(output.shape); // no lod ? CHECK_EQ(output.lod.size(), 0UL); @@ -98,23 +100,35 @@ TEST(Analyzer_Text_Classification, profile) { TEST(Analyzer_Text_Classification, compare) { AnalysisConfig cfg; SetConfig(&cfg); + cfg.EnableMemoryOptim(); std::vector> input_slots_all; SetInput(&input_slots_all); - CompareNativeAndAnalysis(cfg, input_slots_all); + CompareNativeAndAnalysis( + reinterpret_cast(&cfg), input_slots_all); +} + +// Compare Deterministic result +TEST(Analyzer_Text_Classification, compare_determine) { + AnalysisConfig cfg; + SetConfig(&cfg); + + std::vector> input_slots_all; + SetInput(&input_slots_all); + CompareDeterministic(reinterpret_cast(&cfg), + input_slots_all); } TEST(Analyzer_Text_Classification, compare_against_embedding_fc_lstm_fused) { AnalysisConfig cfg; SetConfig(&cfg); // Enable embedding_fc_lstm_fuse_pass (disabled by default) - auto it = std::find(cfg.ir_passes.begin(), cfg.ir_passes.end(), - "embedding_fc_lstm_fuse_pass"); - if (it != cfg.ir_passes.end()) cfg.ir_passes.erase(it); + cfg.pass_builder()->InsertPass(2, "embedding_fc_lstm_fuse_pass"); std::vector> input_slots_all; SetInput(&input_slots_all); - CompareNativeAndAnalysis(cfg, input_slots_all); + CompareNativeAndAnalysis( + reinterpret_cast(&cfg), input_slots_all); } } // namespace inference diff --git a/paddle/fluid/inference/tests/api/analyzer_transformer_tester.cc b/paddle/fluid/inference/tests/api/analyzer_transformer_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..bd4f1b61973fb0de06dcc288e329c94756d5ed47 --- /dev/null +++ b/paddle/fluid/inference/tests/api/analyzer_transformer_tester.cc @@ -0,0 +1,241 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/tests/api/tester_helper.h" + +namespace paddle { +namespace inference { + +struct DataRecord { + std::vector> src_word, src_pos, trg_word, init_idx; + std::vector> src_slf_attn_bias, init_score, + trg_src_attn_bias; + std::vector> batch_data_shape; + std::vector> lod; + size_t batch_iter{0}, batch_size{1}, num_samples; // total number of samples + DataRecord() = default; + explicit DataRecord(const std::string &path, int batch_size = 1) + : batch_size(batch_size) { + Load(path); + } + DataRecord NextBatch() { + DataRecord data; + size_t batch_end = batch_iter + batch_size; + // NOTE skip the final batch, if no enough data is provided. + if (batch_end <= src_word.size()) { + data.src_word.assign(src_word.begin() + batch_iter, + src_word.begin() + batch_end); + data.src_pos.assign(src_pos.begin() + batch_iter, + src_pos.begin() + batch_end); + data.src_slf_attn_bias.assign(src_slf_attn_bias.begin() + batch_iter, + src_slf_attn_bias.begin() + batch_end); + data.trg_word.assign(trg_word.begin() + batch_iter, + trg_word.begin() + batch_end); + data.init_score.assign(init_score.begin() + batch_iter, + init_score.begin() + batch_end); + data.init_idx.assign(init_idx.begin() + batch_iter, + init_idx.begin() + batch_end); + data.trg_src_attn_bias.assign(trg_src_attn_bias.begin() + batch_iter, + trg_src_attn_bias.begin() + batch_end); + std::vector batch_shape = + *(batch_data_shape.begin() + batch_iter); + data.batch_data_shape.push_back(batch_shape); + data.lod.resize(2); + for (int i = 0; i < batch_shape[0] + 1; i++) { + data.lod[0].push_back(i); + data.lod[1].push_back(i); + } + } + batch_iter += batch_size; + return data; + } + void Load(const std::string &path) { + std::ifstream file(path); + std::string line; + size_t num_lines = 0; + while (std::getline(file, line)) { + num_lines++; + std::vector data; + split(line, ',', &data); + CHECK_EQ(data.size(), static_cast(8)); + // load src_word + std::vector src_word_data; + split_to_int64(data[0], ' ', &src_word_data); + src_word.push_back(std::move(src_word_data)); + // load src_pos + std::vector src_pos_data; + split_to_int64(data[1], ' ', &src_pos_data); + src_pos.push_back(std::move(src_pos_data)); + // load src_slf_attn_bias + std::vector src_slf_attn_bias_data; + split_to_float(data[2], ' ', &src_slf_attn_bias_data); + src_slf_attn_bias.push_back(std::move(src_slf_attn_bias_data)); + // load trg_word + std::vector trg_word_data; + split_to_int64(data[3], ' ', &trg_word_data); + trg_word.push_back(std::move(trg_word_data)); + // load init_score + std::vector init_score_data; + split_to_float(data[4], ' ', &init_score_data); + init_score.push_back(std::move(init_score_data)); + // load init_idx + std::vector init_idx_data; + split_to_int64(data[5], ' ', &init_idx_data); + init_idx.push_back(std::move(init_idx_data)); + // load trg_src_attn_bias + std::vector trg_src_attn_bias_data; + split_to_float(data[6], ' ', &trg_src_attn_bias_data); + trg_src_attn_bias.push_back(std::move(trg_src_attn_bias_data)); + // load shape for variant data shape + std::vector batch_data_shape_data; + split_to_int(data[7], ' ', &batch_data_shape_data); + batch_data_shape.push_back(std::move(batch_data_shape_data)); + } + num_samples = num_lines; + } +}; + +void PrepareInputs(std::vector *input_slots, DataRecord *data, + int batch_size) { + auto one_batch = data->NextBatch(); + batch_size = one_batch.batch_data_shape[0][0]; + auto n_head = one_batch.batch_data_shape[0][1]; + auto trg_seq_len = one_batch.batch_data_shape[0][2]; // 1 for inference + auto src_seq_len = one_batch.batch_data_shape[0][3]; + + PaddleTensor src_word, src_pos, src_slf_attn_bias, trg_word, init_score, + init_idx, trg_src_attn_bias; + + src_word.name = "src_word"; + src_word.shape.assign({batch_size, src_seq_len, 1}); + src_word.dtype = PaddleDType::INT64; + TensorAssignData(&src_word, one_batch.src_word); + + src_pos.name = "src_pos"; + src_pos.shape.assign({batch_size, src_seq_len, 1}); + src_pos.dtype = PaddleDType::INT64; + TensorAssignData(&src_pos, one_batch.src_pos); + + src_slf_attn_bias.name = "src_slf_attn_bias"; + src_slf_attn_bias.shape.assign( + {batch_size, n_head, src_seq_len, src_seq_len}); + src_slf_attn_bias.dtype = PaddleDType::FLOAT32; + TensorAssignData(&src_slf_attn_bias, one_batch.src_slf_attn_bias); + + trg_word.name = "trg_word"; + trg_word.shape.assign({batch_size, 1}); + trg_word.dtype = PaddleDType::INT64; + trg_word.lod.assign(one_batch.lod.begin(), one_batch.lod.end()); + TensorAssignData(&trg_word, one_batch.trg_word); + + init_score.name = "init_score"; + init_score.shape.assign({batch_size, 1}); + init_score.dtype = PaddleDType::FLOAT32; + init_score.lod.assign(one_batch.lod.begin(), one_batch.lod.end()); + TensorAssignData(&init_score, one_batch.init_score); + + init_idx.name = "init_idx"; + init_idx.shape.assign({batch_size}); + init_idx.dtype = PaddleDType::INT32; + TensorAssignData(&init_idx, one_batch.init_idx); + + trg_src_attn_bias.name = "trg_src_attn_bias"; + trg_src_attn_bias.shape.assign( + {batch_size, n_head, trg_seq_len, src_seq_len}); + trg_src_attn_bias.dtype = PaddleDType::FLOAT32; + TensorAssignData(&trg_src_attn_bias, one_batch.trg_src_attn_bias); + + input_slots->assign({src_word, src_pos, src_slf_attn_bias, trg_word, + init_score, init_idx, trg_src_attn_bias}); +} + +void SetConfig(AnalysisConfig *cfg) { + cfg->SetModel(FLAGS_infer_model + "/model", FLAGS_infer_model + "/params"); + cfg->DisableGpu(); + cfg->SwitchSpecifyInputNames(); + cfg->SwitchIrOptim(); + cfg->SetCpuMathLibraryNumThreads(FLAGS_paddle_num_threads); +} + +void SetInput(std::vector> *inputs) { + DataRecord data(FLAGS_infer_data, FLAGS_batch_size); + std::vector input_slots; + int test_batch_num = + FLAGS_test_all_data ? data.num_samples / FLAGS_batch_size : 1; + LOG(INFO) << "The number of samples to be test: " + << test_batch_num * FLAGS_batch_size; + for (int bid = 0; bid < test_batch_num; ++bid) { + input_slots.clear(); + PrepareInputs(&input_slots, &data, FLAGS_batch_size); + (*inputs).emplace_back(input_slots); + } +} + +// Easy for profiling independently. +void profile(bool use_mkldnn = false) { + AnalysisConfig cfg; + SetConfig(&cfg); + std::vector> outputs; + if (use_mkldnn) { + cfg.EnableMKLDNN(); + } + + std::vector> input_slots_all; + SetInput(&input_slots_all); + TestPrediction(reinterpret_cast(&cfg), + input_slots_all, &outputs, FLAGS_num_threads); +} + +TEST(Analyzer_Transformer, profile) { profile(); } +#ifdef PADDLE_WITH_MKLDNN +TEST(Analyzer_Transformer, profile_mkldnn) { profile(true); } +#endif + +// Check the fuse status +TEST(Analyzer_Transformer, fuse_statis) { + AnalysisConfig cfg; + SetConfig(&cfg); + + int num_ops; + auto predictor = CreatePaddlePredictor(cfg); + auto fuse_statis = GetFuseStatis( + static_cast(predictor.get()), &num_ops); +} + +// Compare result of NativeConfig and AnalysisConfig +// void compare(bool use_mkldnn = false) { +// AnalysisConfig cfg; +// SetConfig(&cfg); +// if (use_mkldnn) { +// cfg.EnableMKLDNN(); +// } +// +// std::vector> input_slots_all; +// SetInput(&input_slots_all); +// CompareNativeAndAnalysis( +// reinterpret_cast(&cfg), +// input_slots_all); +// } + +// TODO(yihuaxu): +// Disable compare and compare_mkldnn temporary, see +// https://github.com/paddlePaddle/Paddle/issues/16316 for details. +// TEST(Analyzer_Transformer, compare) { compare(); } +// #ifdef PADDLE_WITH_MKLDNN +// TEST(Analyzer_Transformer, compare_mkldnn) { compare(true /* use_mkldnn */); +// } +// #endif + +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tests/api/analyzer_vis_tester.cc b/paddle/fluid/inference/tests/api/analyzer_vis_tester.cc index 8933296490793a7693124eba23f8cf0801881e14..fb47048cd0ccc887927cb4b533d45df11ef633eb 100644 --- a/paddle/fluid/inference/tests/api/analyzer_vis_tester.cc +++ b/paddle/fluid/inference/tests/api/analyzer_vis_tester.cc @@ -12,6 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ +#include #include #include #include "paddle/fluid/inference/tests/api/tester_helper.h" @@ -19,7 +20,6 @@ limitations under the License. */ namespace paddle { namespace inference { namespace analysis { -using contrib::AnalysisConfig; struct Record { std::vector data; @@ -51,14 +51,13 @@ Record ProcessALine(const std::string &line) { } void SetConfig(AnalysisConfig *cfg) { - cfg->param_file = FLAGS_infer_model + "/__params__"; - cfg->prog_file = FLAGS_infer_model + "/__model__"; - cfg->use_gpu = false; - cfg->device = 0; - cfg->enable_ir_optim = true; - cfg->specify_input_name = true; + cfg->SetModel(FLAGS_infer_model + "/__model__", + FLAGS_infer_model + "/__params__"); + cfg->DisableGpu(); + cfg->SwitchIrDebug(); + cfg->SwitchSpecifyInputNames(false); // TODO(TJ): fix fusion gru - cfg->ir_passes.push_back("fc_gru_fuse_pass"); + cfg->pass_builder()->DeletePass("fc_gru_fuse_pass"); } void SetInput(std::vector> *inputs) { @@ -84,24 +83,30 @@ void SetInput(std::vector> *inputs) { void profile(bool use_mkldnn = false) { AnalysisConfig cfg; SetConfig(&cfg); - cfg._use_mkldnn = use_mkldnn; - std::vector outputs; + if (use_mkldnn) { + cfg.EnableMKLDNN(); + } + // cfg.pass_builder()->TurnOnDebug(); + std::vector> outputs; std::vector> input_slots_all; SetInput(&input_slots_all); - TestPrediction(cfg, input_slots_all, &outputs, FLAGS_num_threads); - + TestPrediction(reinterpret_cast(&cfg), + input_slots_all, &outputs, FLAGS_num_threads); if (FLAGS_num_threads == 1 && !FLAGS_test_all_data) { - const float ocr_result_data[] = { - 5.273636460856323538e-08, 3.296741795111302054e-07, - 1.873261190610264748e-08, 3.403730275408634043e-08, - 3.383312474625199684e-08}; - PADDLE_ENFORCE_EQ(outputs.size(), 1UL); - size_t size = GetSize(outputs[0]); - PADDLE_ENFORCE_GT(size, 0); - float *result = static_cast(outputs[0].data.data()); - for (size_t i = 0; i < std::min(5UL, size); i++) { - EXPECT_NEAR(result[i], ocr_result_data[i], 1e-3); + std::string line; + std::ifstream file(FLAGS_refer_result); + std::getline(file, line); + auto refer = ProcessALine(line); + file.close(); + + PADDLE_ENFORCE_GT(outputs.size(), 0); + auto &output = outputs.back().front(); + size_t numel = output.data.length() / PaddleDtypeSize(output.dtype); + CHECK_EQ(numel, refer.data.size()); + for (size_t i = 0; i < numel; ++i) { + EXPECT_NEAR(static_cast(output.data.data())[i], refer.data[i], + 1e-5); } } } @@ -125,11 +130,14 @@ TEST(Analyzer_vis, fuse_statis) { void compare(bool use_mkldnn = false) { AnalysisConfig cfg; SetConfig(&cfg); - cfg._use_mkldnn = use_mkldnn; + if (use_mkldnn) { + cfg.EnableMKLDNN(); + } std::vector> input_slots_all; SetInput(&input_slots_all); - CompareNativeAndAnalysis(cfg, input_slots_all); + CompareNativeAndAnalysis( + reinterpret_cast(&cfg), input_slots_all); } TEST(Analyzer_vis, compare) { compare(); } @@ -137,6 +145,17 @@ TEST(Analyzer_vis, compare) { compare(); } TEST(Analyzer_vis, compare_mkldnn) { compare(true /* use_mkldnn */); } #endif +// Compare Deterministic result +TEST(Analyzer_vis, compare_determine) { + AnalysisConfig cfg; + SetConfig(&cfg); + + std::vector> input_slots_all; + SetInput(&input_slots_all); + CompareDeterministic(reinterpret_cast(&cfg), + input_slots_all); +} + } // namespace analysis } // namespace inference } // namespace paddle diff --git a/paddle/fluid/inference/tests/api/config_printer.h b/paddle/fluid/inference/tests/api/config_printer.h new file mode 100644 index 0000000000000000000000000000000000000000..b0c23fbd534847c8aad244749761e9c072148796 --- /dev/null +++ b/paddle/fluid/inference/tests/api/config_printer.h @@ -0,0 +1,89 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include "paddle/fluid/inference/api/paddle_inference_api.h" + +namespace paddle { +namespace inference { + +thread_local int num_spaces = 0; + +static std::string GenSpaces(int num_spaces) { + std::ostringstream os; + for (int i = 0; i < num_spaces; ++i) { + os << " "; + } + return os.str(); +} + +std::ostream &operator<<(std::ostream &os, + const PaddlePredictor::Config &config) { + os << GenSpaces(num_spaces) << "PaddlePredictor::Config {\n"; + num_spaces++; + os << GenSpaces(num_spaces) << "model_dir: " << config.model_dir << "\n"; + num_spaces--; + os << GenSpaces(num_spaces) << "}\n"; + return os; +} + +std::ostream &operator<<(std::ostream &os, const NativeConfig &config) { + os << GenSpaces(num_spaces) << "NativeConfig {\n"; + num_spaces++; + os << *reinterpret_cast(&config); + os << GenSpaces(num_spaces) << "use_gpu: " << config.use_gpu << "\n"; + os << GenSpaces(num_spaces) << "device: " << config.device << "\n"; + os << GenSpaces(num_spaces) + << "fraction_of_gpu_memory: " << config.fraction_of_gpu_memory << "\n"; + os << GenSpaces(num_spaces) + << "specify_input_name: " << config.specify_input_name << "\n"; + os << GenSpaces(num_spaces) + << "cpu_num_threads: " << config.cpu_math_library_num_threads() << "\n"; + num_spaces--; + os << GenSpaces(num_spaces) << "}\n"; + return os; +} + +std::ostream &operator<<(std::ostream &os, const AnalysisConfig &config) { + os << GenSpaces(num_spaces) << "AnalysisConfig {\n"; + num_spaces++; + os << config.ToNativeConfig(); + if (!config.model_from_memory()) { + os << GenSpaces(num_spaces) << "prog_file: " << config.prog_file() << "\n"; + os << GenSpaces(num_spaces) << "param_file: " << config.params_file() + << "\n"; + } else { + os << GenSpaces(num_spaces) + << "prog_file and param_file: load from memory \n"; + } + os << GenSpaces(num_spaces) << "enable_ir_optim: " << config.ir_optim() + << "\n"; + os << GenSpaces(num_spaces) << "enable_ir_optim: " << config.ir_optim() + << "\n"; + os << GenSpaces(num_spaces) + << "use_feed_fetch_ops: " << config.use_feed_fetch_ops_enabled() << "\n"; + os << GenSpaces(num_spaces) + << "use_tensorrt: " << config.tensorrt_engine_enabled() << "\n"; + os << GenSpaces(num_spaces) << "use_mkldnn: " << config.mkldnn_enabled() + << "\n"; + num_spaces--; + os << GenSpaces(num_spaces) << "}\n"; + return os; +} + +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py b/paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py new file mode 100644 index 0000000000000000000000000000000000000000..842865933f2b4741aea034b19952d4c59344ba06 --- /dev/null +++ b/paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py @@ -0,0 +1,222 @@ +# copyright (c) 2019 paddlepaddle authors. all rights reserved. +# licensed under the apache license, version 2.0 (the "license"); +# you may not use this file except in compliance with the license. +# you may obtain a copy of the license at +# +# http://www.apache.org/licenses/license-2.0 +# +# unless required by applicable law or agreed to in writing, software +# distributed under the license is distributed on an "as is" basis, +# without warranties or conditions of any kind, either express or implied. +# see the license for the specific language governing permissions and +# limitations under the license. +import hashlib +import unittest +import os +import numpy as np +import time +import sys +import random +import functools +import contextlib +from PIL import Image, ImageEnhance +import math +from paddle.dataset.common import download, md5file +import tarfile + +random.seed(0) +np.random.seed(0) + +DATA_DIM = 224 +SIZE_FLOAT32 = 4 +SIZE_INT64 = 8 +FULL_SIZE_BYTES = 30106000008 +FULL_IMAGES = 50000 +DATA_DIR_NAME = 'ILSVRC2012' +IMG_DIR_NAME = 'var' +TARGET_HASH = '8dc592db6dcc8d521e4d5ba9da5ca7d2' +img_mean = np.array([0.485, 0.456, 0.406]).reshape((3, 1, 1)) +img_std = np.array([0.229, 0.224, 0.225]).reshape((3, 1, 1)) + + +def resize_short(img, target_size): + percent = float(target_size) / min(img.size[0], img.size[1]) + resized_width = int(round(img.size[0] * percent)) + resized_height = int(round(img.size[1] * percent)) + img = img.resize((resized_width, resized_height), Image.LANCZOS) + return img + + +def crop_image(img, target_size, center): + width, height = img.size + size = target_size + if center == True: + w_start = (width - size) / 2 + h_start = (height - size) / 2 + else: + w_start = np.random.randint(0, width - size + 1) + h_start = np.random.randint(0, height - size + 1) + w_end = w_start + size + h_end = h_start + size + img = img.crop((w_start, h_start, w_end, h_end)) + return img + + +def process_image(img_path, mode, color_jitter, rotate): + img = Image.open(img_path) + img = resize_short(img, target_size=256) + img = crop_image(img, target_size=DATA_DIM, center=True) + if img.mode != 'RGB': + img = img.convert('RGB') + img = np.array(img).astype('float32').transpose((2, 0, 1)) / 255 + img -= img_mean + img /= img_std + return img + + +def download_concat(cache_folder, zip_path): + data_urls = [] + data_md5s = [] + data_urls.append( + 'https://paddle-inference-dist.bj.bcebos.com/int8/ILSVRC2012_img_val.tar.gz.partaa' + ) + data_md5s.append('60f6525b0e1d127f345641d75d41f0a8') + data_urls.append( + 'https://paddle-inference-dist.bj.bcebos.com/int8/ILSVRC2012_img_val.tar.gz.partab' + ) + data_md5s.append('1e9f15f64e015e58d6f9ec3210ed18b5') + file_names = [] + print("Downloading full ImageNet Validation dataset ...") + for i in range(0, len(data_urls)): + download(data_urls[i], cache_folder, data_md5s[i]) + file_name = os.path.join(cache_folder, data_urls[i].split('/')[-1]) + file_names.append(file_name) + print("Downloaded part {0}\n".format(file_name)) + if not os.path.exists(zip_path): + with open(zip_path, "w+") as outfile: + for fname in file_names: + with open(fname) as infile: + outfile.write(infile.read()) + + +def extract(zip_path, extract_folder): + data_dir = os.path.join(extract_folder, DATA_DIR_NAME) + img_dir = os.path.join(data_dir, IMG_DIR_NAME) + print("Extracting...\n") + + if not (os.path.exists(img_dir) and + len(os.listdir(img_dir)) == FULL_IMAGES): + tar = tarfile.open(zip_path) + tar.extractall(path=extract_folder) + tar.close() + print('Extracted. Full Imagenet Validation dataset is located at {0}\n'. + format(data_dir)) + + +def print_processbar(done, total): + done_filled = done * '=' + empty_filled = (total - done) * ' ' + percentage_done = done * 100 / total + sys.stdout.write("\r[%s%s]%d%%" % + (done_filled, empty_filled, percentage_done)) + sys.stdout.flush() + + +def check_integrity(filename, target_hash): + print('\nThe binary file exists. Checking file integrity...\n') + md = hashlib.md5() + count = 0 + total_parts = 50 + chunk_size = 8192 + onepart = FULL_SIZE_BYTES / chunk_size / total_parts + with open(filename) as ifs: + while True: + buf = ifs.read(8192) + if count % onepart == 0: + done = count / onepart + print_processbar(done, total_parts) + count = count + 1 + if not buf: + break + md.update(buf) + hash1 = md.hexdigest() + if hash1 == target_hash: + return True + else: + return False + + +def convert(file_list, data_dir, output_file): + print('Converting 50000 images to binary file ...\n') + with open(file_list) as flist: + lines = [line.strip() for line in flist] + num_images = len(lines) + with open(output_file, "w+b") as ofs: + #save num_images(int64_t) to file + ofs.seek(0) + num = np.array(int(num_images)).astype('int64') + ofs.write(num.tobytes()) + per_parts = 1000 + full_parts = FULL_IMAGES / per_parts + print_processbar(0, full_parts) + for idx, line in enumerate(lines): + img_path, label = line.split() + img_path = os.path.join(data_dir, img_path) + if not os.path.exists(img_path): + continue + + #save image(float32) to file + img = process_image( + img_path, 'val', color_jitter=False, rotate=False) + np_img = np.array(img) + ofs.seek(SIZE_INT64 + SIZE_FLOAT32 * DATA_DIM * DATA_DIM * 3 * + idx) + ofs.write(np_img.astype('float32').tobytes()) + ofs.flush() + + #save label(int64_t) to file + label_int = (int)(label) + np_label = np.array(label_int) + ofs.seek(SIZE_INT64 + SIZE_FLOAT32 * DATA_DIM * DATA_DIM * 3 * + num_images + idx * SIZE_INT64) + ofs.write(np_label.astype('int64').tobytes()) + ofs.flush() + if (idx + 1) % per_parts == 0: + done = (idx + 1) / per_parts + print_processbar(done, full_parts) + print("Conversion finished.") + + +def run_convert(): + print('Start to download and convert 50000 images to binary file...') + cache_folder = os.path.expanduser('~/.cache/paddle/dataset/int8/download') + extract_folder = os.path.join(cache_folder, 'full_data') + data_dir = os.path.join(extract_folder, DATA_DIR_NAME) + file_list = os.path.join(data_dir, 'val_list.txt') + zip_path = os.path.join(cache_folder, 'full_imagenet_val.tar.gz') + output_file = os.path.join(cache_folder, 'int8_full_val.bin') + retry = 0 + try_limit = 3 + + while not (os.path.exists(output_file) and + os.path.getsize(output_file) == FULL_SIZE_BYTES and + check_integrity(output_file, TARGET_HASH)): + if os.path.exists(output_file): + sys.stderr.write( + "\n\nThe existing binary file is broken. Start to generate new one...\n\n". + format(output_file)) + os.remove(output_file) + if retry < try_limit: + retry = retry + 1 + else: + raise RuntimeError( + "Can not convert the dataset to binary file with try limit {0}". + format(try_limit)) + download_concat(cache_folder, zip_path) + extract(zip_path, extract_folder) + convert(file_list, data_dir, output_file) + print("\nSuccess! The binary file can be found at {0}".format(output_file)) + + +if __name__ == '__main__': + run_convert() diff --git a/paddle/fluid/inference/tests/api/int8_mkldnn_quantization.md b/paddle/fluid/inference/tests/api/int8_mkldnn_quantization.md new file mode 100644 index 0000000000000000000000000000000000000000..cbeef5fb9da42388eade6fa90344abf77cb59bd6 --- /dev/null +++ b/paddle/fluid/inference/tests/api/int8_mkldnn_quantization.md @@ -0,0 +1,70 @@ +# INT8 MKL-DNN quantization + +This document describes how to use Paddle inference Engine to convert the FP32 model to INT8 model on ResNet-50 and MobileNet-V1. We provide the instructions on enabling INT8 MKL-DNN quantization in Paddle inference and show the ResNet-50 and MobileNet-V1 results in accuracy and performance. + +## 0. Install PaddlePaddle +Follow PaddlePaddle [installation instruction](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification#installation) to install PaddlePaddle. If you build PaddlePaddle yourself, please use the following cmake arguments. +``` +cmake .. -DWITH_TESTING=ON -WITH_FLUID_ONLY=ON -DWITH_GPU=OFF -DWITH_MKL=ON -WITH_SWIG_PY=OFF -DWITH_INFERENCE_API_TEST=ON -DON_INFER=ON + +``` +Note: MKL-DNN and MKL are required. + +## 1. Enable INT8 MKL-DNN quantization +For reference, please examine the code of unit test enclosed in [analyzer_int8_image_classification_tester.cc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/tests/api/analyzer_int8_image_classification_tester.cc). + +* ### Create Analysis config +INT8 quantization is one of the optimizations in analysis config. More information about analysis config can be found [here](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/advanced_usage/deploy/inference/native_infer_en.md#upgrade-performance-based-on-contribanalysisconfig-prerelease) + +* ### Create quantize config by analysis config +We enable the MKL-DNN quantization procedure by calling an appropriate method from analysis config. Afterwards, all the required quantization parameters (quantization op names, quantization strategies etc.) can be set through quantizer config which is present in the analysis config. It is also necessary to specify a pre-processed warmup dataset and desired batch size. + +```cpp +//Enable MKL-DNN quantization +cfg.EnableMkldnnQuantizer(); + +//use analysis config to call the MKL-DNN quantization config +cfg.mkldnn_quantizer_config()->SetWarmupData(warmup_data); +cfg.mkldnn_quantizer_config()->SetWarmupBatchSize(100); +``` + +## 2. Accuracy and Performance benchmark + +We provide the results of accuracy and performance measured on Intel(R) Xeon(R) Gold 6271 on single core. + + >**I. Top-1 Accuracy on Intel(R) Xeon(R) Gold 6271** + +| Model | Dataset | FP32 Accuracy | INT8 Accuracy | Accuracy Diff | +| :------------: | :------------: | :------------: | :------------: | :------------: | +| ResNet-50 | Full ImageNet Val | 76.63% | 76.48% | 0.15% | +| MobileNet-V1 | Full ImageNet Val | 70.78% | 70.36% | 0.42% | + + >**II. Throughput on Intel(R) Xeon(R) Gold 6271 (batch size 1 on single core)** + +| Model | Dataset | FP32 Throughput | INT8 Throughput | Ratio(INT8/FP32) | +| :------------: | :------------: | :------------: | :------------: | :------------: | +| ResNet-50 | Full ImageNet Val | 13.17 images/s | 49.84 images/s | 3.78 | +| MobileNet-V1 | Full ImageNet Val | 75.49 images/s | 232.38 images/s | 3.07 | + +Notes: +* Measurement of accuracy requires a model which accepts two inputs: data and labels. +* Different sampling batch size data may cause slight difference on INT8 top accuracy. +* CAPI performance data is better than python API performance data because of the python overhead. Especially for the small computational model, python overhead will be more obvious. + + +## 3. Commands to reproduce the above accuracy and performance benchmark +* #### Full dataset (Single core) + * ##### Download full ImageNet Validation Dataset +```bash +cd /PATH/TO/PADDLE/build +python ../paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py +``` +The converted data binary file is saved by default in ~/.cache/paddle/dataset/int8/download/int8_full_val.bin + * ##### ResNet50 Full dataset benchmark +```bash +./paddle/fluid/inference/tests/api/test_analyzer_int8_resnet50 --infer_model=third_party/inference_demo/int8v2/resnet50/model --infer_data=/path/to/converted/int8_full_val.bin --batch_size=1 --paddle_num_threads=1 +``` + * ##### Mobilenet-v1 Full dataset benchmark +```bash +./paddle/fluid/inference/tests/api/test_analyzer_int8_mobilenet --infer_model=third_party/inference_demo/int8v2/mobilenet/model --infer_data=/path/to/converted/int8_full_val.bin --batch_size=1 --paddle_num_threads=1 +``` diff --git a/paddle/fluid/inference/tests/api/tester_helper.h b/paddle/fluid/inference/tests/api/tester_helper.h index b1ee1080030b23e1ef7adefe3a0880f38e9099f5..d13469a8482304d04b99c96e70bac5c8b90e4043 100644 --- a/paddle/fluid/inference/tests/api/tester_helper.h +++ b/paddle/fluid/inference/tests/api/tester_helper.h @@ -15,32 +15,80 @@ #pragma once #include + #include +#include #include #include // NOLINT +#include #include +#ifdef WITH_GPERFTOOLS +#include +#endif #include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/scope.h" #include "paddle/fluid/inference/analysis/analyzer.h" #include "paddle/fluid/inference/analysis/ut_helper.h" #include "paddle/fluid/inference/api/analysis_predictor.h" #include "paddle/fluid/inference/api/helper.h" #include "paddle/fluid/inference/api/paddle_inference_pass.h" +#include "paddle/fluid/inference/tests/api/config_printer.h" +#include "paddle/fluid/inference/tests/test_helper.h" +#include "paddle/fluid/inference/utils/benchmark.h" #include "paddle/fluid/platform/profiler.h" +DEFINE_string(model_name, "", "model name"); DEFINE_string(infer_model, "", "model path"); DEFINE_string(infer_data, "", "data file"); -DEFINE_int32(batch_size, 1, "batch size."); +DEFINE_string(refer_result, "", "reference result for comparison"); +DEFINE_int32(batch_size, 1, "batch size"); +DEFINE_int32(warmup_batch_size, 100, "batch size for quantization warmup"); +// setting iterations to 0 means processing the whole dataset +DEFINE_int32(iterations, 0, "number of batches to process"); DEFINE_int32(repeat, 1, "Running the inference program repeat times."); DEFINE_bool(test_all_data, false, "Test the all dataset in data file."); DEFINE_int32(num_threads, 1, "Running the inference program in multi-threads."); DEFINE_bool(use_analysis, true, "Running the inference program in analysis mode."); +DEFINE_bool(record_benchmark, false, + "Record benchmark after profiling the model"); +DEFINE_double(accuracy, 1e-3, "Result Accuracy."); +DEFINE_double(quantized_accuracy, 1e-2, "Result Quantized Accuracy."); +DEFINE_bool(zero_copy, false, "Use ZeroCopy to speedup Feed/Fetch."); +DEFINE_bool(warmup, false, + "Use warmup to calculate elapsed_time more accurately. " + "To reduce CI time, it sets false in default."); + +DECLARE_bool(profile); +DECLARE_int32(paddle_num_threads); namespace paddle { namespace inference { -using contrib::AnalysisConfig; +template +constexpr paddle::PaddleDType GetPaddleDType(); + +template <> +constexpr paddle::PaddleDType GetPaddleDType() { + return paddle::PaddleDType::INT64; +} + +template <> +constexpr paddle::PaddleDType GetPaddleDType() { + return paddle::PaddleDType::FLOAT32; +} + +void PrintConfig(const PaddlePredictor::Config *config, bool use_analysis) { + const auto *analysis_config = + reinterpret_cast(config); + if (use_analysis) { + LOG(INFO) << *analysis_config; + return; + } + LOG(INFO) << analysis_config->ToNativeConfig(); +} +// Compare result between two PaddleTensor void CompareResult(const std::vector &outputs, const std::vector &ref_outputs) { EXPECT_GT(outputs.size(), 0UL); @@ -50,7 +98,7 @@ void CompareResult(const std::vector &outputs, auto &ref_out = ref_outputs[i]; size_t size = VecReduceToInt(out.shape); size_t ref_size = VecReduceToInt(ref_out.shape); - EXPECT_GT(size, 0); + EXPECT_GT(size, 0UL); EXPECT_EQ(size, ref_size); EXPECT_EQ(out.dtype, ref_out.dtype); switch (out.dtype) { @@ -66,7 +114,59 @@ void CompareResult(const std::vector &outputs, float *pdata = static_cast(out.data.data()); float *pdata_ref = static_cast(ref_out.data.data()); for (size_t j = 0; j < size; ++j) { - EXPECT_NEAR(pdata_ref[j], pdata[j], 1e-3); + CHECK_LE(std::abs(pdata_ref[j] - pdata[j]), FLAGS_accuracy); + } + break; + } + case PaddleDType::INT32: { + int32_t *pdata = static_cast(out.data.data()); + int32_t *pdata_ref = static_cast(ref_out.data.data()); + for (size_t j = 0; j < size; ++j) { + EXPECT_EQ(pdata_ref[j], pdata[j]); + } + break; + } + } + } +} + +// Compare result between a PaddleTensor and a ZeroCopyTensor +void CompareResult(const std::vector &outputs, + const std::vector &ref_outputs) { + EXPECT_GT(outputs.size(), 0UL); + EXPECT_EQ(outputs.size(), ref_outputs.size()); + for (size_t i = 0; i < outputs.size(); i++) { + auto &out = outputs[i]; + auto &ref_out = ref_outputs[i]; + size_t size = VecReduceToInt(out.shape); + EXPECT_GT(size, 0UL); + int ref_size = 0; // this is the number of elements not memory size + PaddlePlace place; + switch (out.dtype) { + case PaddleDType::INT64: { + int64_t *pdata = static_cast(out.data.data()); + int64_t *pdata_ref = ref_out.data(&place, &ref_size); + EXPECT_EQ(size, ref_size); + for (size_t j = 0; j < size; ++j) { + EXPECT_EQ(pdata_ref[j], pdata[j]); + } + break; + } + case PaddleDType::FLOAT32: { + float *pdata = static_cast(out.data.data()); + float *pdata_ref = ref_out.data(&place, &ref_size); + EXPECT_EQ(size, ref_size); + for (size_t j = 0; j < size; ++j) { + CHECK_LE(std::abs(pdata_ref[j] - pdata[j]), FLAGS_accuracy); + } + break; + } + case PaddleDType::INT32: { + int32_t *pdata = static_cast(out.data.data()); + int32_t *pdata_ref = ref_out.data(&place, &ref_size); + EXPECT_EQ(size, ref_size); + for (size_t j = 0; j < size; ++j) { + EXPECT_EQ(pdata_ref[j], pdata[j]); } break; } @@ -75,83 +175,236 @@ void CompareResult(const std::vector &outputs, } std::unique_ptr CreateTestPredictor( - const AnalysisConfig &config, bool use_analysis = true) { + const PaddlePredictor::Config *config, bool use_analysis = true) { + const auto *analysis_config = + reinterpret_cast(config); if (use_analysis) { - return CreatePaddlePredictor(config); - } else { - return CreatePaddlePredictor(config); + return CreatePaddlePredictor(*analysis_config); } + auto native_config = analysis_config->ToNativeConfig(); + return CreatePaddlePredictor(native_config); } size_t GetSize(const PaddleTensor &out) { return VecReduceToInt(out.shape); } std::unordered_map GetFuseStatis(PaddlePredictor *predictor, int *num_ops) { + std::unordered_map res; auto *analysis_predictor = static_cast(predictor); - auto &fuse_statis = analysis_predictor->analysis_argument() - .Get>( - framework::ir::kFuseStatisAttr); - for (auto &item : fuse_statis) { + auto *fusion_status = + analysis_predictor->analysis_argument().fusion_statis_ptr(); + if (!fusion_status) { + return res; + } + for (auto &item : *fusion_status) { LOG(INFO) << "fused " << item.first << " " << item.second; } int num = 0; for (auto &node : - analysis_predictor->analysis_argument().main_dfg->nodes.nodes()) { - if (node->IsFunction()) { + analysis_predictor->analysis_argument().main_graph().Nodes()) { + if (node->IsOp()) { ++num; } } *num_ops = num; - return fuse_statis; + return *fusion_status; } -void TestOneThreadPrediction( - const AnalysisConfig &config, - const std::vector> &inputs, - std::vector *outputs, bool use_analysis = true) { +void SetFakeImageInput(std::vector> *inputs, + const std::string &dirname, bool is_combined = true, + std::string model_filename = "model", + std::string params_filename = "params", + const std::vector *feed_names = nullptr, + const int continuous_inuput_index = 0) { + // Set fake_image_data + PADDLE_ENFORCE_EQ(FLAGS_test_all_data, 0, "Only have single batch of data."); + std::vector> feed_target_shapes = GetFeedTargetShapes( + dirname, is_combined, model_filename, params_filename); + std::ostringstream os; + for (size_t i = 0; i < feed_target_shapes.size(); ++i) { + os << "feed target " << i << ": {" << feed_target_shapes[i][0]; + for (size_t j = 1; j < feed_target_shapes[i].size(); ++j) { + os << ", " << feed_target_shapes[i][j]; + } + os << "}\n"; + } + LOG(INFO) << os.str(); + if (feed_names) { + PADDLE_ENFORCE_EQ(feed_names->size(), feed_target_shapes.size()); + } + std::vector input_slots(feed_target_shapes.size()); + for (size_t i = 0; i < feed_target_shapes.size(); ++i) { + const auto &feed_shape = feed_target_shapes[i]; + auto &input = input_slots[i]; + std::vector shape({FLAGS_batch_size}); + for (size_t s = 1; s < feed_shape.size(); ++s) { + shape.push_back(static_cast(feed_shape[s])); + } + if (feed_names) { + input.name = (*feed_names)[i]; + } + input.shape = shape; + input.dtype = PaddleDType::FLOAT32; + size_t len = std::accumulate(shape.begin(), shape.end(), size_t{1}, + [](int a, int b) { return a * b; }); + input.data.Resize(len * sizeof(float)); + input.lod.assign({{0, static_cast(FLAGS_batch_size)}}); + float *input_data = static_cast(input.data.data()); + // fill input data, for profile easily, do not use random data here. + for (size_t j = 0; j < len; ++j) { + *(input_data + j) = + static_cast((j + continuous_inuput_index) % len) / len; + } + } + (*inputs).emplace_back(input_slots); +} + +void GetInputPerBatch(const std::vector> &in, + std::vector> *out, + std::vector *lod, size_t batch_iter, + size_t batch_end) { + lod->clear(); + lod->push_back(0); + for (auto it = in.begin() + batch_iter; it < in.begin() + batch_end; it++) { + out->push_back(*it); + lod->push_back(lod->back() + (*it).size()); // calculate lod + } +} + +void ConvertPaddleTensorToZeroCopyTensor( + PaddlePredictor *predictor, const std::vector &inputs) { + for (size_t i = 0; i < inputs.size(); i++) { + auto input = inputs[i]; + auto tensor = predictor->GetInputTensor(input.name); + tensor->Reshape(input.shape); + tensor->SetLoD({input.lod}); + if (input.dtype == PaddleDType::INT64) { + ZeroCopyTensorAssignData(tensor.get(), input.data); + } else if (input.dtype == PaddleDType::FLOAT32) { + ZeroCopyTensorAssignData(tensor.get(), input.data); + } else if (input.dtype == PaddleDType::INT32) { + ZeroCopyTensorAssignData(tensor.get(), input.data); + } else { + LOG(ERROR) << "unsupported feed type " << input.dtype; + } + } +} + +void PredictionWarmUp(PaddlePredictor *predictor, + const std::vector> &inputs, + std::vector> *outputs, + int num_threads, int tid) { int batch_size = FLAGS_batch_size; + LOG(INFO) << "Running thread " << tid << ", warm up run..."; + if (FLAGS_zero_copy) { + ConvertPaddleTensorToZeroCopyTensor(predictor, inputs[0]); + } + outputs->resize(1); + Timer warmup_timer; + warmup_timer.tic(); + if (!FLAGS_zero_copy) { + predictor->Run(inputs[0], &(*outputs)[0], batch_size); + } else { + predictor->ZeroCopyRun(); + } + PrintTime(batch_size, 1, num_threads, tid, warmup_timer.toc(), 1); + if (FLAGS_profile) { + paddle::platform::ResetProfiler(); + } +} + +void PredictionRun(PaddlePredictor *predictor, + const std::vector> &inputs, + std::vector> *outputs, + int num_threads, int tid) { int num_times = FLAGS_repeat; - auto predictor = CreateTestPredictor(config, use_analysis); - Timer timer; - timer.tic(); - for (int i = 0; i < num_times; i++) { - for (size_t j = 0; j < inputs.size(); j++) { - predictor->Run(inputs[j], outputs); + int iterations = inputs.size(); // process the whole dataset ... + if (FLAGS_iterations > 0 && + FLAGS_iterations < static_cast(inputs.size())) + iterations = + FLAGS_iterations; // ... unless the number of iterations is set + outputs->resize(iterations); + LOG(INFO) << "Thread " << tid << ", number of threads " << num_threads + << ", run " << num_times << " times..."; + Timer run_timer; + double elapsed_time = 0; +#ifdef WITH_GPERFTOOLS + ProfilerStart("paddle_inference.prof"); +#endif + if (!FLAGS_zero_copy) { + run_timer.tic(); + for (int i = 0; i < iterations; i++) { + for (int j = 0; j < num_times; j++) { + predictor->Run(inputs[i], &(*outputs)[i], FLAGS_batch_size); + } + } + elapsed_time = run_timer.toc(); + } else { + for (int i = 0; i < iterations; i++) { + ConvertPaddleTensorToZeroCopyTensor(predictor, inputs[i]); + run_timer.tic(); + for (int j = 0; j < num_times; j++) { + predictor->ZeroCopyRun(); + } + elapsed_time += run_timer.toc(); } } - PrintTime(batch_size, num_times, 1, 0, timer.toc() / num_times, - inputs.size()); +#ifdef WITH_GPERFTOOLS + ProfilerStop(); +#endif + + auto batch_latency = elapsed_time / (iterations * num_times); + PrintTime(FLAGS_batch_size, num_times, num_threads, tid, batch_latency, + iterations); + if (FLAGS_record_benchmark) { + Benchmark benchmark; + benchmark.SetName(FLAGS_model_name); + benchmark.SetBatchSize(FLAGS_batch_size); + benchmark.SetLatency(batch_latency); + benchmark.PersistToFile("benchmark_record.txt"); + } +} + +void TestOneThreadPrediction( + const PaddlePredictor::Config *config, + const std::vector> &inputs, + std::vector> *outputs, bool use_analysis = true) { + auto predictor = CreateTestPredictor(config, use_analysis); + if (FLAGS_warmup) { + PredictionWarmUp(predictor.get(), inputs, outputs, 1, 0); + } + PredictionRun(predictor.get(), inputs, outputs, 1, 0); } void TestMultiThreadPrediction( - const AnalysisConfig &config, + const PaddlePredictor::Config *config, const std::vector> &inputs, - std::vector *outputs, int num_threads, + std::vector> *outputs, int num_threads, bool use_analysis = true) { - int batch_size = FLAGS_batch_size; - int num_times = FLAGS_repeat; std::vector threads; std::vector> predictors; - // TODO(yanchunwei): Bug here, the analyzer phase can't be parallelled - // because AttentionLSTM's hard code nodeid will be damanged. - for (int tid = 0; tid < num_threads; ++tid) { - predictors.emplace_back(CreateTestPredictor(config, use_analysis)); + predictors.emplace_back(CreateTestPredictor(config, use_analysis)); + for (int tid = 1; tid < num_threads; tid++) { + predictors.emplace_back(predictors.front()->Clone()); } + for (int tid = 0; tid < num_threads; ++tid) { threads.emplace_back([&, tid]() { // Each thread should have local inputs and outputs. // The inputs of each thread are all the same. - std::vector> inputs_tid = inputs; - std::vector outputs_tid; - Timer timer; - timer.tic(); - for (int i = 0; i < num_times; i++) { - for (size_t j = 0; j < inputs_tid.size(); j++) { - predictors[tid]->Run(inputs_tid[j], &outputs_tid); - } + std::vector> outputs_tid; + auto &predictor = predictors[tid]; +#ifdef PADDLE_WITH_MKLDNN + if (use_analysis) { + static_cast(predictor.get()) + ->SetMkldnnThreadID(static_cast(tid) + 1); + } +#endif + if (FLAGS_warmup) { + PredictionWarmUp(predictor.get(), inputs, &outputs_tid, num_threads, + tid); } - PrintTime(batch_size, num_times, num_threads, tid, - timer.toc() / num_times, inputs_tid.size()); + PredictionRun(predictor.get(), inputs, &outputs_tid, num_threads, tid); }); } for (int i = 0; i < num_threads; ++i) { @@ -159,12 +412,11 @@ void TestMultiThreadPrediction( } } -void TestPrediction(const AnalysisConfig &config, +void TestPrediction(const PaddlePredictor::Config *config, const std::vector> &inputs, - std::vector *outputs, int num_threads, - bool use_analysis = FLAGS_use_analysis) { - LOG(INFO) << "use_analysis: " << use_analysis - << ", use_mkldnn: " << config._use_mkldnn; + std::vector> *outputs, + int num_threads, bool use_analysis = FLAGS_use_analysis) { + PrintConfig(config, use_analysis); if (num_threads == 1) { TestOneThreadPrediction(config, inputs, outputs, use_analysis); } else { @@ -173,16 +425,133 @@ void TestPrediction(const AnalysisConfig &config, } } +void CompareTopAccuracy( + const std::vector> &output_slots_quant, + const std::vector> &output_slots_ref) { + if (output_slots_quant.size() == 0 || output_slots_ref.size() == 0) + throw std::invalid_argument( + "CompareTopAccuracy: output_slots vector is empty."); + + float total_accs1_quant{0}; + float total_accs1_ref{0}; + for (size_t i = 0; i < output_slots_quant.size(); ++i) { + PADDLE_ENFORCE(output_slots_quant[i].size() >= 2UL); + PADDLE_ENFORCE(output_slots_ref[i].size() >= 2UL); + // second output: acc_top1 + if (output_slots_quant[i][1].lod.size() > 0 || + output_slots_ref[i][1].lod.size() > 0) + throw std::invalid_argument( + "CompareTopAccuracy: top1 accuracy output has nonempty LoD."); + if (output_slots_quant[i][1].dtype != paddle::PaddleDType::FLOAT32 || + output_slots_ref[i][1].dtype != paddle::PaddleDType::FLOAT32) + throw std::invalid_argument( + "CompareTopAccuracy: top1 accuracy output is of a wrong type."); + total_accs1_quant += + *static_cast(output_slots_quant[i][1].data.data()); + total_accs1_ref += + *static_cast(output_slots_ref[i][1].data.data()); + } + float avg_acc1_quant = total_accs1_quant / output_slots_quant.size(); + float avg_acc1_ref = total_accs1_ref / output_slots_ref.size(); + + LOG(INFO) << "Avg top1 INT8 accuracy: " << std::fixed << std::setw(6) + << std::setprecision(4) << avg_acc1_quant; + LOG(INFO) << "Avg top1 FP32 accuracy: " << std::fixed << std::setw(6) + << std::setprecision(4) << avg_acc1_ref; + LOG(INFO) << "Accepted accuracy drop threshold: " << FLAGS_quantized_accuracy; + CHECK_LE(std::abs(avg_acc1_quant - avg_acc1_ref), FLAGS_quantized_accuracy); +} + +void CompareDeterministic( + const PaddlePredictor::Config *config, + const std::vector> &inputs) { + int batch_size = FLAGS_batch_size; + int num_times = FLAGS_repeat; + auto predictor = CreateTestPredictor(config, FLAGS_use_analysis); + + std::vector warmup_outputs, outputs; + // run num_times to Compare Deterministic Result. + for (size_t j = 0; j < inputs.size(); j++) { + // warmup run + predictor->Run(inputs[j], &warmup_outputs, batch_size); + for (int i = 0; i < num_times; i++) { + predictor->Run(inputs[j], &outputs, batch_size); + CompareResult(outputs, warmup_outputs); + } + } +} + void CompareNativeAndAnalysis( - const AnalysisConfig &config, + const PaddlePredictor::Config *config, const std::vector> &inputs) { - LOG(INFO) << "use_mkldnn: " << config._use_mkldnn; - std::vector native_outputs, analysis_outputs; + PrintConfig(config, true); + std::vector> native_outputs, analysis_outputs; TestOneThreadPrediction(config, inputs, &native_outputs, false); TestOneThreadPrediction(config, inputs, &analysis_outputs, true); + PADDLE_ENFORCE(native_outputs.size() > 0, "Native output is empty."); + PADDLE_ENFORCE(analysis_outputs.size() > 0, "Analysis output is empty."); + CompareResult(analysis_outputs.back(), native_outputs.back()); +} + +void CompareQuantizedAndAnalysis( + const AnalysisConfig *config, const AnalysisConfig *qconfig, + const std::vector> &inputs) { + PADDLE_ENFORCE_EQ(inputs[0][0].shape[0], FLAGS_batch_size, + "Input data has to be packed batch by batch."); + LOG(INFO) << "FP32 & INT8 prediction run: batch_size " << FLAGS_batch_size + << ", warmup batch size " << FLAGS_warmup_batch_size << "."; + + LOG(INFO) << "--- FP32 prediction start ---"; + auto *cfg = reinterpret_cast(config); + PrintConfig(cfg, true); + std::vector> analysis_outputs; + TestOneThreadPrediction(cfg, inputs, &analysis_outputs, true); + + LOG(INFO) << "--- INT8 prediction start ---"; + auto *qcfg = reinterpret_cast(qconfig); + PrintConfig(qcfg, true); + std::vector> quantized_outputs; + TestOneThreadPrediction(qcfg, inputs, &quantized_outputs, true); + + LOG(INFO) << "--- comparing outputs --- "; + CompareTopAccuracy(quantized_outputs, analysis_outputs); +} + +void CompareNativeAndAnalysis( + PaddlePredictor *native_pred, PaddlePredictor *analysis_pred, + const std::vector> &inputs) { + int batch_size = FLAGS_batch_size; + std::vector native_outputs, analysis_outputs; + native_pred->Run(inputs[0], &native_outputs, batch_size); + analysis_pred->Run(inputs[0], &analysis_outputs, batch_size); CompareResult(analysis_outputs, native_outputs); } +void CompareAnalysisAndZeroCopy( + PaddlePredictor::Config *config, + const std::vector> &inputs, + const std::vector &outputs_name) { + int batch_size = FLAGS_batch_size; + // analysis + std::vector analysis_outputs; + auto predictor = CreateTestPredictor(config, true); + predictor->Run(inputs[0], &analysis_outputs, batch_size); + // analysis + zero_copy + std::vector zerocopy_outputs; + reinterpret_cast(config)->SwitchUseFeedFetchOps(false); + predictor = CreateTestPredictor(config, true); + ConvertPaddleTensorToZeroCopyTensor(predictor.get(), inputs[0]); + predictor->ZeroCopyRun(); + for (size_t i = 0; i < outputs_name.size(); i++) { + ZeroCopyTensor zerocopy_output = + *predictor->GetOutputTensor(outputs_name[i]).get(); + zerocopy_outputs.emplace_back(zerocopy_output); + LOG(INFO) << "ZeroCopy output: " << DescribeZeroCopyTensor(zerocopy_output); + } + // compare + CompareResult(analysis_outputs, zerocopy_outputs); +} + template std::string LoDTensorSummary(const framework::LoDTensor &tensor) { std::stringstream ss; @@ -254,9 +623,9 @@ static bool CompareTensorData(const framework::LoDTensor &a, const framework::LoDTensor &b) { auto a_shape = framework::vectorize(a.dims()); auto b_shape = framework::vectorize(b.dims()); - size_t a_size = std::accumulate(a_shape.begin(), a_shape.end(), 1, + size_t a_size = std::accumulate(a_shape.begin(), a_shape.end(), size_t{1}, [](int a, int b) { return a * b; }); - size_t b_size = std::accumulate(b_shape.begin(), b_shape.end(), 1, + size_t b_size = std::accumulate(b_shape.begin(), b_shape.end(), size_t{1}, [](int a, int b) { return a * b; }); if (a_size != b_size) { LOG(ERROR) << string::Sprintf("tensor data size not match, %d != %d", @@ -264,7 +633,7 @@ static bool CompareTensorData(const framework::LoDTensor &a, } for (size_t i = 0; i < a_size; i++) { - if (a.type() == typeid(float)) { + if (a.type() == framework::proto::VarType::FP32) { const auto *a_data = a.data(); const auto *b_data = b.data(); if (std::abs(a_data[i] - b_data[i]) > 1e-3) { @@ -273,7 +642,7 @@ static bool CompareTensorData(const framework::LoDTensor &a, b_data[i]); return false; } - } else if (a.type() == typeid(int64_t)) { + } else if (a.type() == framework::proto::VarType::INT64) { const auto *a_data = a.data(); const auto *b_data = b.data(); if (std::abs(a_data[i] - b_data[i]) > 1e-3) { diff --git a/paddle/fluid/inference/tests/api/trt_models_tester.cc b/paddle/fluid/inference/tests/api/trt_models_tester.cc index 91111f2af56065bbf57ba3a41bddd55ecced1060..98ce225a0476b38c021b0b81489f69d7953ae456 100644 --- a/paddle/fluid/inference/tests/api/trt_models_tester.cc +++ b/paddle/fluid/inference/tests/api/trt_models_tester.cc @@ -1,103 +1,199 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ #include #include #include -#include "paddle/fluid/inference/analysis/analyzer.h" -#include "paddle/fluid/inference/api/paddle_inference_api.h" + +#include "paddle/fluid/inference/tests/api/tester_helper.h" namespace paddle { -using paddle::contrib::MixedRTConfig; - -DEFINE_string(dirname, "", "Directory of the inference model."); - -NativeConfig GetConfigNative() { - NativeConfig config; - config.model_dir = FLAGS_dirname; - // LOG(INFO) << "dirname " << config.model_dir; - config.fraction_of_gpu_memory = 0.45; - config.use_gpu = true; - config.device = 0; - return config; +namespace inference { + +DEFINE_bool(use_tensorrt, true, "Test the performance of TensorRT engine."); +DEFINE_string(prog_filename, "", "Name of model file."); +DEFINE_string(param_filename, "", "Name of parameters file."); + +template +void SetConfig(ConfigType* config, std::string model_dir, bool use_gpu, + bool use_tensorrt = false, int batch_size = -1) { + if (!FLAGS_prog_filename.empty() && !FLAGS_param_filename.empty()) { + config->prog_file = model_dir + "/" + FLAGS_prog_filename; + config->param_file = model_dir + "/" + FLAGS_param_filename; + } else { + config->model_dir = model_dir; + } + if (use_gpu) { + config->use_gpu = true; + config->device = 0; + config->fraction_of_gpu_memory = 0.15; + } } -MixedRTConfig GetConfigTRT() { - MixedRTConfig config; - config.model_dir = FLAGS_dirname; - config.use_gpu = true; - config.fraction_of_gpu_memory = 0.2; - config.device = 0; - config.max_batch_size = 3; - return config; +template <> +void SetConfig(AnalysisConfig* config, std::string model_dir, + bool use_gpu, bool use_tensorrt, + int batch_size) { + if (!FLAGS_prog_filename.empty() && !FLAGS_param_filename.empty()) { + config->SetModel(model_dir + "/" + FLAGS_prog_filename, + model_dir + "/" + FLAGS_param_filename); + } else { + config->SetModel(model_dir); + } + if (use_gpu) { + config->EnableUseGpu(100, 0); + if (use_tensorrt) { + config->EnableTensorRtEngine(1 << 10, batch_size, 3, + AnalysisConfig::Precision::kFloat32, false); + config->pass_builder()->DeletePass("conv_bn_fuse_pass"); + config->pass_builder()->DeletePass("fc_fuse_pass"); + config->pass_builder()->TurnOnDebug(); + } else { + config->SwitchIrOptim(); + } + } } -void CompareTensorRTWithFluid(int batch_size, std::string model_dirname) { - NativeConfig config0 = GetConfigNative(); - config0.model_dir = model_dirname; - - MixedRTConfig config1 = GetConfigTRT(); - config1.model_dir = model_dirname; - config1.max_batch_size = batch_size; - - auto predictor0 = CreatePaddlePredictor(config0); - auto predictor1 = CreatePaddlePredictor(config1); - // Prepare inputs - int height = 224; - int width = 224; - float *data = new float[batch_size * 3 * height * width]; - memset(data, 0, sizeof(float) * (batch_size * 3 * height * width)); - data[0] = 1.0f; - - // Prepare inputs - PaddleTensor tensor; - tensor.name = "input_0"; - tensor.shape = std::vector({batch_size, 3, height, width}); - tensor.data = PaddleBuf(static_cast(data), - sizeof(float) * (batch_size * 3 * height * width)); - tensor.dtype = PaddleDType::FLOAT32; - std::vector paddle_tensor_feeds(1, tensor); - - // Prepare outputs - std::vector outputs0; - std::vector outputs1; - CHECK(predictor0->Run(paddle_tensor_feeds, &outputs0)); - - CHECK(predictor1->Run(paddle_tensor_feeds, &outputs1, batch_size)); - - // Get output. - ASSERT_EQ(outputs0.size(), 1UL); - ASSERT_EQ(outputs1.size(), 1UL); - - const size_t num_elements = outputs0.front().data.length() / sizeof(float); - const size_t num_elements1 = outputs1.front().data.length() / sizeof(float); - EXPECT_EQ(num_elements, num_elements1); - - auto *data0 = static_cast(outputs0.front().data.data()); - auto *data1 = static_cast(outputs1.front().data.data()); - - ASSERT_GT(num_elements, 0UL); - for (size_t i = 0; i < std::min(num_elements, num_elements1); i++) { - EXPECT_NEAR(data0[i], data1[i], 1e-3); +void profile(std::string model_dir, bool use_analysis, bool use_tensorrt) { + std::vector> inputs_all; + if (!FLAGS_prog_filename.empty() && !FLAGS_param_filename.empty()) { + SetFakeImageInput(&inputs_all, model_dir, true, FLAGS_prog_filename, + FLAGS_param_filename); + } else { + SetFakeImageInput(&inputs_all, model_dir, false, "__model__", ""); + } + + std::vector> outputs; + if (use_analysis || use_tensorrt) { + AnalysisConfig config; + config.EnableUseGpu(100, 0); + config.pass_builder()->TurnOnDebug(); + SetConfig(&config, model_dir, true, use_tensorrt, + FLAGS_batch_size); + TestPrediction(reinterpret_cast(&config), + inputs_all, &outputs, FLAGS_num_threads, true); + } else { + NativeConfig config; + SetConfig(&config, model_dir, true, false); + TestPrediction(reinterpret_cast(&config), + inputs_all, &outputs, FLAGS_num_threads, false); } } -TEST(trt_models_test, main) { - std::vector infer_models = {"mobilenet", "resnet50", - "resnext50"}; - for (auto &model_dir : infer_models) { - CompareTensorRTWithFluid(1, FLAGS_dirname + "/" + model_dir); +void compare(std::string model_dir, bool use_tensorrt) { + std::vector> inputs_all; + if (!FLAGS_prog_filename.empty() && !FLAGS_param_filename.empty()) { + SetFakeImageInput(&inputs_all, model_dir, true, FLAGS_prog_filename, + FLAGS_param_filename); + } else { + SetFakeImageInput(&inputs_all, model_dir, false, "__model__", ""); + } + + AnalysisConfig analysis_config; + SetConfig(&analysis_config, model_dir, true, use_tensorrt, + FLAGS_batch_size); + CompareNativeAndAnalysis( + reinterpret_cast(&analysis_config), + inputs_all); +} + +void compare_continuous_input(std::string model_dir, bool use_tensorrt) { + AnalysisConfig analysis_config; + SetConfig(&analysis_config, model_dir, true, use_tensorrt, + FLAGS_batch_size); + auto config = + reinterpret_cast(&analysis_config); + auto native_pred = CreateTestPredictor(config, false); + auto analysis_pred = CreateTestPredictor(config, true); + for (int i = 0; i < 100; i++) { + std::vector> inputs_all; + if (!FLAGS_prog_filename.empty() && !FLAGS_param_filename.empty()) { + SetFakeImageInput(&inputs_all, model_dir, true, FLAGS_prog_filename, + FLAGS_param_filename, nullptr, i); + } else { + SetFakeImageInput(&inputs_all, model_dir, false, "__model__", "", nullptr, + i); + } + CompareNativeAndAnalysis(native_pred.get(), analysis_pred.get(), + inputs_all); + } +} + +TEST(TensorRT_mobilenet, compare) { + std::string model_dir = FLAGS_infer_model + "/mobilenet"; + compare(model_dir, /* use_tensorrt */ true); +} + +TEST(TensorRT_resnet50, compare) { + std::string model_dir = FLAGS_infer_model + "/resnet50"; + compare(model_dir, /* use_tensorrt */ true); +} + +TEST(TensorRT_resnext50, compare) { + std::string model_dir = FLAGS_infer_model + "/resnext50"; + compare(model_dir, /* use_tensorrt */ true); +} + +TEST(TensorRT_resnext50, profile) { + std::string model_dir = FLAGS_infer_model + "/resnext50"; + // Set FLAGS_record_benchmark to true to record benchmark to file. + // FLAGS_record_benchmark=true; + FLAGS_model_name = "resnext50"; + profile(model_dir, /* use_analysis */ true, FLAGS_use_tensorrt); +} + +TEST(resnext50, compare_analysis_native) { + std::string model_dir = FLAGS_infer_model + "/resnext50"; + compare(model_dir, false /*use tensorrt*/); +} + +TEST(TensorRT_mobilenet, analysis) { + std::string model_dir = FLAGS_infer_model + "/" + "mobilenet"; + compare(model_dir, false /* use_tensorrt */); +} + +TEST(AnalysisPredictor, use_gpu) { + std::string model_dir = FLAGS_infer_model + "/" + "mobilenet"; + AnalysisConfig config; + config.EnableUseGpu(100, 0); + config.SetModel(model_dir); + config.pass_builder()->TurnOnDebug(); + + std::vector> inputs_all; + auto predictor = CreatePaddlePredictor(config); + SetFakeImageInput(&inputs_all, model_dir, false, "__model__", ""); + + std::vector outputs; + for (auto& input : inputs_all) { + ASSERT_TRUE(predictor->Run(input, &outputs)); } } + +TEST(TensorRT_mobilenet, profile) { + std::string model_dir = FLAGS_infer_model + "/" + "mobilenet"; + profile(model_dir, true, false); +} + +TEST(resnet50, compare_continuous_input) { + std::string model_dir = FLAGS_infer_model + "/resnet50"; + compare_continuous_input(model_dir, true); +} + +TEST(resnet50, compare_continuous_input_native) { + std::string model_dir = FLAGS_infer_model + "/resnet50"; + compare_continuous_input(model_dir, false); +} + +} // namespace inference } // namespace paddle diff --git a/paddle/fluid/inference/tests/book/test_inference_nlp.cc b/paddle/fluid/inference/tests/book/test_inference_nlp.cc index cbcfc964c91c33ab41a72ad7fec759086ad887cc..5c1204b9e6b78e42d999b12a2b7be6f822ecf818 100644 --- a/paddle/fluid/inference/tests/book/test_inference_nlp.cc +++ b/paddle/fluid/inference/tests/book/test_inference_nlp.cc @@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include #include #include #include // NOLINT diff --git a/paddle/fluid/inference/tests/test.cmake b/paddle/fluid/inference/tests/test.cmake new file mode 100644 index 0000000000000000000000000000000000000000..fc6de70f5a89331cb8940b34c1c9ff5a164c2894 --- /dev/null +++ b/paddle/fluid/inference/tests/test.cmake @@ -0,0 +1,59 @@ +include(ExternalProject) +set(INFERENCE_URL "http://paddle-inference-dist.bj.bcebos.com" CACHE STRING "inference download url") +set(INFERENCE_DEMO_INSTALL_DIR "${THIRD_PARTY_PATH}/inference_demo" CACHE STRING + "A path setting inference demo download directories.") + +function(inference_download INSTALL_DIR URL FILENAME) + message(STATUS "Download inference test stuff from ${URL}/${FILENAME}") + string(REGEX REPLACE "[-%.]" "_" FILENAME_EX ${FILENAME}) + ExternalProject_Add( + extern_inference_download_${FILENAME_EX} + ${EXTERNAL_PROJECT_LOG_ARGS} + PREFIX ${INSTALL_DIR} + URL ${URL}/${FILENAME} + DOWNLOAD_COMMAND wget --no-check-certificate -q -O ${INSTALL_DIR}/${FILENAME} ${URL}/${FILENAME} + DOWNLOAD_DIR ${INSTALL_DIR} + DOWNLOAD_NO_PROGRESS 1 + CONFIGURE_COMMAND "" + BUILD_COMMAND "" + UPDATE_COMMAND "" + INSTALL_COMMAND "" + ) +endfunction() + +function(inference_download_and_uncompress INSTALL_DIR URL FILENAME) + message(STATUS "Download inference test stuff from ${URL}/${FILENAME}") + string(REGEX REPLACE "[-%.]" "_" FILENAME_EX ${FILENAME}) + set(EXTERNAL_PROJECT_NAME "extern_inference_download_${FILENAME_EX}") + set(UNPACK_DIR "${INSTALL_DIR}/src/${EXTERNAL_PROJECT_NAME}") + ExternalProject_Add( + ${EXTERNAL_PROJECT_NAME} + ${EXTERNAL_PROJECT_LOG_ARGS} + PREFIX ${INSTALL_DIR} + DOWNLOAD_COMMAND wget --no-check-certificate -q -O ${INSTALL_DIR}/${FILENAME} ${URL}/${FILENAME} && + ${CMAKE_COMMAND} -E tar xzf ${INSTALL_DIR}/${FILENAME} + DOWNLOAD_DIR ${INSTALL_DIR} + DOWNLOAD_NO_PROGRESS 1 + CONFIGURE_COMMAND "" + BUILD_COMMAND "" + UPDATE_COMMAND "" + INSTALL_COMMAND "" + ) +endfunction() + +set(WORD2VEC_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/word2vec") +if(NOT EXISTS ${WORD2VEC_INSTALL_DIR} AND NOT WIN32) + inference_download_and_uncompress(${WORD2VEC_INSTALL_DIR} ${INFERENCE_URL} "word2vec.inference.model.tar.gz") +endif() +set(WORD2VEC_MODEL_DIR "${WORD2VEC_INSTALL_DIR}/word2vec.inference.model") + +function (inference_base_test TARGET) + set(options "") + set(oneValueArgs "") + set(multiValueArgs SRCS ARGS DEPS) + cmake_parse_arguments(base_test "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) + if(WITH_GPU) + set(mem_opt "--fraction_of_gpu_memory_to_use=0.5") + endif() + cc_test(${TARGET} SRCS ${base_test_SRCS} DEPS ${base_test_DEPS} ARGS ${mem_opt} ${base_test_ARGS}) +endfunction() diff --git a/paddle/fluid/inference/tests/test_helper.h b/paddle/fluid/inference/tests/test_helper.h index 94f0550df57e79fa68c135f5c9c4b7effe6ac156..861f69f4d2143b16bdec546d92ce7bd13ca53ed3 100644 --- a/paddle/fluid/inference/tests/test_helper.h +++ b/paddle/fluid/inference/tests/test_helper.h @@ -18,9 +18,9 @@ limitations under the License. */ #include #include -#include "paddle/fluid/framework/ir/graph_to_program_pass.h" #include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/inference/io.h" +#include "paddle/fluid/platform/port.h" #include "paddle/fluid/platform/profiler.h" DECLARE_bool(use_mkldnn); @@ -94,15 +94,15 @@ void CheckError(const paddle::framework::LoDTensor& output1, std::unique_ptr InitProgram( paddle::framework::Executor* executor, paddle::framework::Scope* scope, - const std::string& dirname, const bool is_combined = false) { + const std::string& dirname, const bool is_combined = false, + const std::string& prog_filename = "__model_combined__", + const std::string& param_filename = "__params_combined__") { std::unique_ptr inference_program; if (is_combined) { // All parameters are saved in a single file. // Hard-coding the file names of program and parameters in unittest. // The file names should be consistent with that used in Python API // `fluid.io.save_inference_model`. - std::string prog_filename = "__model_combined__"; - std::string param_filename = "__params_combined__"; inference_program = paddle::inference::Load(executor, scope, dirname + "/" + prog_filename, dirname + "/" + param_filename); @@ -115,12 +115,15 @@ std::unique_ptr InitProgram( } std::vector> GetFeedTargetShapes( - const std::string& dirname, const bool is_combined = false) { + const std::string& dirname, const bool is_combined = false, + const std::string& prog_filename = "__model_combined__", + const std::string& param_filename = "__params_combined__") { auto place = paddle::platform::CPUPlace(); auto executor = paddle::framework::Executor(place); auto* scope = new paddle::framework::Scope(); - auto inference_program = InitProgram(&executor, scope, dirname, is_combined); + auto inference_program = InitProgram(&executor, scope, dirname, is_combined, + prog_filename, param_filename); auto& global_block = inference_program->Block(0); const std::vector& feed_target_names = @@ -136,15 +139,6 @@ std::vector> GetFeedTargetShapes( return feed_target_shapes; } -void Compile(paddle::framework::ProgramDesc* program) { - std::unique_ptr g( - new paddle::framework::ir::Graph(*program)); - auto pass = paddle::framework::ir::PassRegistry::Instance().Get( - "graph_to_program_pass"); - pass->SetNotOwned("program", program); - pass->Apply(std::move(g)); -} - template void TestInference(const std::string& dirname, const std::vector& cpu_feeds, @@ -177,12 +171,9 @@ void TestInference(const std::string& dirname, // Enable the profiler paddle::platform::EnableProfiler(state); { - paddle::platform::RecordEvent record_event( - "init_program", - paddle::platform::DeviceContextPool::Instance().Get(place)); + paddle::platform::RecordEvent record_event("init_program"); inference_program = InitProgram(&executor, scope, dirname, is_combined); } - Compile(inference_program.get()); // Disable the profiler and print the timing information paddle::platform::DisableProfiler(paddle::platform::EventSortingKey::kDefault, @@ -237,9 +228,7 @@ void TestInference(const std::string& dirname, // Run repeat times to profile the performance for (int i = 0; i < repeat; ++i) { - paddle::platform::RecordEvent record_event( - "run_inference", - paddle::platform::DeviceContextPool::Instance().Get(place)); + paddle::platform::RecordEvent record_event("run_inference"); if (PrepareContext) { // Note: if you change the inference_program, you need to call @@ -261,5 +250,3 @@ void TestInference(const std::string& dirname, delete scope; } - -USE_PASS(graph_to_program_pass); diff --git a/paddle/fluid/inference/utils/CMakeLists.txt b/paddle/fluid/inference/utils/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..2104e4ac7222258ee025bd5acd60b1db251df654 --- /dev/null +++ b/paddle/fluid/inference/utils/CMakeLists.txt @@ -0,0 +1,2 @@ +cc_library(benchmark SRCS benchmark.cc DEPS enforce) +cc_test(test_benchmark SRCS benchmark_tester.cc DEPS benchmark) diff --git a/paddle/fluid/inference/utils/benchmark.cc b/paddle/fluid/inference/utils/benchmark.cc new file mode 100644 index 0000000000000000000000000000000000000000..0bd526bcac2d9ceda95730dc3c5210aed8ccfb5c --- /dev/null +++ b/paddle/fluid/inference/utils/benchmark.cc @@ -0,0 +1,49 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/utils/benchmark.h" +#include +#include "paddle/fluid/platform/enforce.h" + +namespace paddle { +namespace inference { + +std::string Benchmark::SerializeToString() const { + std::stringstream ss; + ss << "-----------------------------------------------------\n"; + ss << "name\t"; + ss << "batch_size\t"; + ss << "num_threads\t"; + ss << "latency\t"; + ss << "qps"; + ss << '\n'; + + ss << name_ << "\t"; + ss << batch_size_ << "\t\t"; + ss << num_threads_ << "\t"; + ss << latency_ << "\t"; + ss << 1000.0 / latency_; + ss << '\n'; + return ss.str(); +} +void Benchmark::PersistToFile(const std::string &path) const { + std::ofstream file(path, std::ios::app); + PADDLE_ENFORCE(file.is_open(), "Can not open %s to add benchmark", path); + file << SerializeToString(); + file.flush(); + file.close(); +} + +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/utils/benchmark.h b/paddle/fluid/inference/utils/benchmark.h new file mode 100644 index 0000000000000000000000000000000000000000..a1304cf4e7778f74e15e4fe5e2f405fd3c185eb4 --- /dev/null +++ b/paddle/fluid/inference/utils/benchmark.h @@ -0,0 +1,54 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include + +namespace paddle { +namespace inference { + +/* + * Helper class to calculate the performance. + */ +struct Benchmark { + int batch_size() const { return batch_size_; } + void SetBatchSize(int x) { batch_size_ = x; } + + int num_threads() const { return num_threads_; } + void SetNumThreads(int x) { num_threads_ = x; } + + bool use_gpu() const { return use_gpu_; } + void SetUseGpu() { use_gpu_ = true; } + + float latency() const { return latency_; } + void SetLatency(float x) { latency_ = x; } + + const std::string& name() const { return name_; } + void SetName(const std::string& name) { name_ = name; } + + std::string SerializeToString() const; + void PersistToFile(const std::string& path) const; + + private: + bool use_gpu_{false}; + int batch_size_{0}; + float latency_; + int num_threads_{1}; + std::string name_; +}; + +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/utils/benchmark_tester.cc b/paddle/fluid/inference/utils/benchmark_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..0c48c2db9b691ae8cf587f2729c2789d4ce2dbe1 --- /dev/null +++ b/paddle/fluid/inference/utils/benchmark_tester.cc @@ -0,0 +1,39 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/utils/benchmark.h" +#include +#include + +using namespace paddle::inference; // NOLINT +TEST(Benchmark, basic) { + Benchmark benchmark; + benchmark.SetName("key0"); + benchmark.SetBatchSize(10); + benchmark.SetUseGpu(); + benchmark.SetLatency(220); + LOG(INFO) << "benchmark:\n" << benchmark.SerializeToString(); +} + +TEST(Benchmark, PersistToFile) { + Benchmark benchmark; + benchmark.SetName("key0"); + benchmark.SetBatchSize(10); + benchmark.SetUseGpu(); + benchmark.SetLatency(220); + + benchmark.PersistToFile("1.log"); + benchmark.PersistToFile("2.log"); + benchmark.PersistToFile("3.log"); +} diff --git a/paddle/fluid/inference/utils/singleton.h b/paddle/fluid/inference/utils/singleton.h index cfb89e704457a11a3cd6e89dba5efad5acae0bce..990bef359499834c3a7cb025c3fb1d94ceea958e 100644 --- a/paddle/fluid/inference/utils/singleton.h +++ b/paddle/fluid/inference/utils/singleton.h @@ -45,13 +45,13 @@ struct Registry { } template - static void Register(const std::string& name) { + void Register(const std::string& name) { PADDLE_ENFORCE_EQ(items_.count(name), 0); items_[name] = new ItemChild; } - static ItemParent* Lookup(const std::string& name, - const std::string& default_name = "") { + ItemParent* Lookup(const std::string& name, + const std::string& default_name = "") { auto it = items_.find(name); if (it == items_.end()) { if (default_name == "") @@ -70,11 +70,8 @@ struct Registry { private: Registry() = default; - static std::unordered_map items_; + std::unordered_map items_; }; -template -std::unordered_map Registry::items_; - } // namespace inference } // namespace paddle diff --git a/paddle/fluid/memory/CMakeLists.txt b/paddle/fluid/memory/CMakeLists.txt index 709fc7e12e1db537ceece30c405c0e8a2582e8ca..7eb663ea280e65f3c10304aa47c9970df099b901 100644 --- a/paddle/fluid/memory/CMakeLists.txt +++ b/paddle/fluid/memory/CMakeLists.txt @@ -1,15 +1,12 @@ add_subdirectory(detail) - -cc_library(malloc SRCS malloc.cc DEPS buddy_allocator place enforce) +add_subdirectory(allocation) +cc_library(malloc SRCS malloc.cc DEPS place enforce allocator_facade profiler) cc_library(memcpy SRCS memcpy.cc DEPS place) cc_library(memory DEPS malloc memcpy) - -cc_test(malloc_test SRCS malloc_test.cc DEPS malloc) - #if (WITH_GPU) # nv_test(pinned_memory_test SRCS pinned_memory_test.cu DEPS place memory) #endif() diff --git a/paddle/fluid/memory/allocation/CMakeLists.txt b/paddle/fluid/memory/allocation/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..ac77c3d2a500816a4eb41ed13f23ee628290f287 --- /dev/null +++ b/paddle/fluid/memory/allocation/CMakeLists.txt @@ -0,0 +1,66 @@ +cc_library(allocator SRCS allocator.cc DEPS place) +cc_library(cpu_allocator SRCS cpu_allocator.cc DEPS allocator) +cc_library(best_fit_allocator SRCS best_fit_allocator.cc DEPS allocator) +cc_library(locked_allocator SRCS locked_allocator.cc DEPS allocator) +cc_library(buffered_allocator SRCS buffered_allocator.cc DEPS allocator) +cc_library(legacy_allocator SRCS legacy_allocator.cc DEPS allocator buddy_allocator profiler) +cc_test(buffered_allocator_test SRCS buffered_allocator_test.cc DEPS best_fit_allocator locked_allocator buffered_allocator cpu_allocator) + +if (WITH_GPU) + nv_library(cuda_allocator SRCS cuda_allocator.cc DEPS allocator cuda_device_guard) +endif() + +cc_library(retry_allocator SRCS retry_allocator.cc DEPS allocator) + +if (WITH_GPU) + nv_test(best_fit_allocator_test + SRCS best_fit_allocator_test.cc + best_fit_allocator_test.cu + DEPS best_fit_allocator + locked_allocator + cpu_allocator + cuda_allocator + device_context + memcpy) +else() + cc_test(best_fit_allocator_test + SRCS best_fit_allocator_test.cc + DEPS best_fit_allocator + locked_allocator + cpu_allocator) +endif() + +nv_library(pinned_allocator SRCS pinned_allocator.cc DEPS allocator) +if (WITH_GPU) + set(AllocatorFacadeDeps gpu_info cuda_allocator pinned_allocator cuda_device_guard) +else () + set(AllocatorFacadeDeps) +endif() + +cc_library(aligned_allocator SRCS aligned_allocator.cc DEPS allocator) +cc_library(auto_increment_allocator SRCS auto_increment_allocator.cc DEPS allocator) +cc_library(zero_size_allocator SRCS zero_size_allocator.cc DEPS allocator) +cc_library(conditional_allocator SRCS conditional_allocator.cc DEPS allocator) +cc_library(allocator_strategy SRCS allocator_strategy.cc DEPS gflags) +cc_library(allocator_facade SRCS allocator_facade.cc DEPS + ${AllocatorFacadeDeps} + cpu_allocator + locked_allocator + best_fit_allocator + aligned_allocator + auto_increment_allocator + zero_size_allocator + conditional_allocator + retry_allocator + buffered_allocator + allocator_strategy + legacy_allocator + ) + +nv_test(allocation_and_eigen_test SRCS allocation_and_eigen_test.cu DEPS allocator_facade) + +cc_test(retry_allocator_test SRCS retry_allocator_test.cc DEPS retry_allocator best_fit_allocator locked_allocator cpu_allocator) + +cc_test(allocator_facade_abs_flags_test SRCS allocator_facade_abs_flags_test.cc DEPS allocator_facade) + +cc_test(allocator_facade_frac_flags_test SRCS allocator_facade_frac_flags_test.cc DEPS allocator_facade) diff --git a/paddle/fluid/memory/allocation/aligned_allocator.cc b/paddle/fluid/memory/allocation/aligned_allocator.cc new file mode 100644 index 0000000000000000000000000000000000000000..efae280dbd47a1db476f9c371ba73eac96c30df6 --- /dev/null +++ b/paddle/fluid/memory/allocation/aligned_allocator.cc @@ -0,0 +1,31 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/memory/allocation/aligned_allocator.h" + +namespace paddle { +namespace memory { +namespace allocation { + +ThinAlignedAllocator::ThinAlignedAllocator( + std::shared_ptr underlyning_allocator) + : underlying_allocator_(std::move(underlyning_allocator)) {} + +bool ThinAlignedAllocator::IsAllocThreadSafe() const { + return underlying_allocator_->IsAllocThreadSafe(); +} + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/aligned_allocator.h b/paddle/fluid/memory/allocation/aligned_allocator.h new file mode 100644 index 0000000000000000000000000000000000000000..064acd06e71da98802126913e0af843cfbf717e7 --- /dev/null +++ b/paddle/fluid/memory/allocation/aligned_allocator.h @@ -0,0 +1,101 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include "paddle/fluid/memory/allocation/allocator.h" + +namespace paddle { +namespace memory { +namespace allocation { + +// The aligned allocation and allocator will wrap a managed allocator, +// and returns the aligned pointer. +// +// NOTE(yy): For speed reason, I just use a template parameter to get +// alignment, however, it can be an private member if necessary. +// +// NOTE(yy): kAlignment must be 2^N. a `static_assert` should be added. +template +class AlignedAllocation : public Allocation { + static_assert(kAlignment > 0 && (kAlignment & (kAlignment - 1)) == 0, + "kAlignment must be 2^N"); + + public: + AlignedAllocation(AllocationPtr&& underlying_allocation, size_t size) + : Allocation(AlignedPtr(underlying_allocation->ptr()), + size + kAlignment - Offset(underlying_allocation->ptr()), + underlying_allocation->place()), + underlying_allocation_(std::move(underlying_allocation)) {} + + private: + static void* AlignedPtr(void* ptr) { + return reinterpret_cast(reinterpret_cast(ptr) + + Offset(ptr)); + } + + // Offset to aligned pointer. + // if ptr is already aligned, returns 0. + static size_t Offset(void* ptr) { + auto ptr_addr = reinterpret_cast(ptr); + intptr_t aligned_addr = (ptr_addr & ~(kAlignment - 1)); + intptr_t diff = aligned_addr - ptr_addr; + if (diff == 0) { + return 0; + } else { + return kAlignment + diff; + } + } + + AllocationPtr underlying_allocation_; +}; + +// Thin aligned allocator is trivial and used to generate a small size binary. +// +// NOTE(yy): This is a trick to make a template class. This class extract the +// common code into a `thin` class. So if there are multiple specification of +// the template class, the binary size will not extended too much. +// +// NOTE(yy): This could be an over design. If it harms readability of code, it +// could be removed later. +class ThinAlignedAllocator : public Allocator { + public: + explicit ThinAlignedAllocator( + std::shared_ptr underlyning_allocator); + + bool IsAllocThreadSafe() const; + + protected: + std::shared_ptr underlying_allocator_; +}; + +// An aligned allocator will allocate `size+kAlignment` allocation and adjust +// the pointer offset. +template +class AlignedAllocator : public ThinAlignedAllocator { + public: + using ThinAlignedAllocator::ThinAlignedAllocator; + + protected: + Allocation* AllocateImpl(size_t size, Allocator::Attr attr) override { + auto raw_allocation = + underlying_allocator_->Allocate(size + kAlignment, attr); + return new AlignedAllocation(std::move(raw_allocation), size); + } +}; + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/allocation_and_eigen_test.cu b/paddle/fluid/memory/allocation/allocation_and_eigen_test.cu new file mode 100644 index 0000000000000000000000000000000000000000..b61649e59d326a64aa806460feffc3a910b1cab8 --- /dev/null +++ b/paddle/fluid/memory/allocation/allocation_and_eigen_test.cu @@ -0,0 +1,48 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "gtest/gtest.h" +#include "paddle/fluid/framework/eigen.h" +#include "paddle/fluid/framework/tensor.h" +#include "paddle/fluid/platform/device_context.h" +#include "paddle/fluid/platform/for_range.h" +#include "unsupported/Eigen/CXX11/Tensor" + +// NOTE(yy): this unittest is not important. It just used for debugging. +// It can be removed later. +struct FillZero { + public: + float* ptr_; + + __device__ void operator()(size_t i) { ptr_[i] = 0.0f; } +}; + +namespace paddle { +TEST(Eigen, main) { + framework::Tensor tensor; + platform::CUDAPlace gpu(0); + float* ptr = tensor.mutable_data({10, 10}, gpu); + auto& dev_ctx = *reinterpret_cast( + platform::DeviceContextPool::Instance().Get(gpu)); + PADDLE_ENFORCE(cudaMemset(ptr, 0, sizeof(float) * 100)); + + platform::ForRange for_range(dev_ctx, 100); + for_range(FillZero{ptr}); + dev_ctx.Wait(); + + auto eigen_vec = framework::EigenVector::Flatten(tensor); + auto& eigen_dev = *dev_ctx.eigen_device(); + eigen_vec.device(eigen_dev) = eigen_vec.constant(0.0f); +} +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/allocation_with_underlying.h b/paddle/fluid/memory/allocation/allocation_with_underlying.h new file mode 100644 index 0000000000000000000000000000000000000000..69f78667d7d33c59245a9890b9a2ce469f629450 --- /dev/null +++ b/paddle/fluid/memory/allocation/allocation_with_underlying.h @@ -0,0 +1,33 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/memory/allocation/allocator.h" + +namespace paddle { +namespace memory { +namespace allocation { + +class AllocationWithUnderlying : public Allocation { + public: + explicit AllocationWithUnderlying(AllocationPtr allocation) + : Allocation(allocation->ptr(), allocation->size(), allocation->place()), + allocation_(std::move(allocation)) {} + AllocationPtr allocation_; +}; + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/allocator.cc b/paddle/fluid/memory/allocation/allocator.cc new file mode 100644 index 0000000000000000000000000000000000000000..8fb8a5fb897a736d7515951ba08c633da9a7706c --- /dev/null +++ b/paddle/fluid/memory/allocation/allocator.cc @@ -0,0 +1,45 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/memory/allocation/allocator.h" + +#include + +namespace paddle { +namespace memory { +namespace allocation { +Allocation::~Allocation() {} + +Allocator::~Allocator() {} + +bool Allocator::IsAllocThreadSafe() const { return false; } + +AllocationPtr Allocator::Allocate(size_t size, Allocator::Attr attr) { + auto ptr = AllocateImpl(size, attr); + ptr->set_allocator(this); + return AllocationPtr(ptr); +} + +void Allocator::Free(Allocation* allocation) { delete allocation; } + +const char* BadAlloc::what() const noexcept { return msg_.c_str(); } + +void AllocationDeleter::operator()(Allocation* allocation) const { + auto* allocator = allocation->allocator(); + allocator->Free(allocation); +} + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/allocator.h b/paddle/fluid/memory/allocation/allocator.h new file mode 100644 index 0000000000000000000000000000000000000000..3465278935f7ce05456e94bb3a7d1ae9f114ff96 --- /dev/null +++ b/paddle/fluid/memory/allocation/allocator.h @@ -0,0 +1,147 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include +#include +#include "paddle/fluid/platform/place.h" + +namespace paddle { +namespace memory { +namespace allocation { + +// Exception when `Alloc`/`AllocShared` failed +class BadAlloc : public std::exception { + public: + explicit BadAlloc(std::string msg) : msg_(std::move(msg)) {} + const char* what() const noexcept override; + + private: + std::string msg_; +}; + +class Allocation; +class AllocationDeleter { + public: + void operator()(Allocation* allocation) const; +}; + +class Allocator; +// Allocation is the object holding the actually pointer. Use +// `Allocation::ptr()` will returns the pointer that allocated. +// +// NOTE: this is the base class of Allocation. Each allocator can use its own +// allocation object. +// NOTE: the `Allocation::ptr()` could be nullptr, if the allocation size is 0 +class Allocation { + public: + Allocation(void* ptr, size_t size, platform::Place place) + : allocator_(nullptr), ptr_(ptr), size_(size), place_(place) {} + + Allocation(const Allocation& o) = delete; + Allocation& operator=(const Allocation& o) = delete; + + // Returns the holding pointer. + // NOTE: For performance consideration, it is better not to make this method + // as a virtual method. If we want to implement a `defragmentation` later, + // we might need to make `ptr_` field as a protected field, and add a virtual + // method like `defragmentation` to change `ptr_`. + void* ptr() const { return ptr_; } + + // Returns the size of this memory buffer, i.e., ptr() + size() - 1 is the + // last valid element. + // + // NOTE: Some allocator might alloc more memory than request. The size + // could larger than its request. For example, + // the AlignedAllocator will always allocate memory as size + kAlignment. + // The raw pointer might not aligned, so an offset might be added to raw + // the pointer. The size of this allocation will be + // `size + kAlignemnt - offset`. + size_t size() const { return size_; } + + const platform::Place& place() const { return place_; } + + Allocator* allocator() { return allocator_; } + + void set_allocator(Allocator* allocator) { allocator_ = allocator; } + + virtual ~Allocation(); + + private: + Allocator* allocator_; + void* ptr_; + size_t size_; + platform::Place place_; +}; + +using AllocationPtr = std::unique_ptr; + +// Base interface class of memory Allocator. +// To allocate a memory, allocator needs two parameters: +// 1. size of bytes. +// 2. Attribute of memory. +// NOTE: the attribute of memory might be ignored if the allocator does not +// care it. +class Allocator { + public: + enum Attr { + kDefault = 0, // Default attribute. Uses the fast or stablest allocation + // algorithm. + + kFixedHuge = 1, // The allocation may not be freed until the program + // ends. e.g., `Parameters` and `Momentum`. + + kFluxHuge = 2, // The allocation may create and freed frequently and the + // allocation is considerable huge. Like `activations` + // and gradients. + + kScratchpad = + 3, // The `Scratchpad` memory is allocated and freed very soon, + // usually within an operator or aux memory. + // Like CUDNN workspace, AUX memory in batch norm, etc. + // + // https://en.wikipedia.org/wiki/Scratchpad_memory + + kCrossDevice = + 4, // The memory used cross-device memory copy/communication. + // For example: + // 1. it can use an `pinned` memory for CPU-GPU + // communication. + // 2. it can use an `registered` memory for RDMA + // communication. + + NumOfAttrs = 5 // The number of all attributes. It is used internally. + }; + + virtual ~Allocator(); + + // Allocate an allocation. + AllocationPtr Allocate(size_t size, Allocator::Attr attr = kDefault); + + // True if the `Allocate` is thread safe. + virtual bool IsAllocThreadSafe() const; + + protected: + virtual void Free(Allocation* allocation); + virtual Allocation* AllocateImpl(size_t size, Allocator::Attr attr) = 0; + + private: + friend class AllocationDeleter; +}; + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/allocator_facade.cc b/paddle/fluid/memory/allocation/allocator_facade.cc new file mode 100644 index 0000000000000000000000000000000000000000..a3b73e3ba31c89c2a94955b0fea64df4ab0ffc26 --- /dev/null +++ b/paddle/fluid/memory/allocation/allocator_facade.cc @@ -0,0 +1,273 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/memory/allocation/allocator.h" +#include +#include +#include +#include +#include +#include +#include "paddle/fluid/memory/allocation/aligned_allocator.h" +#include "paddle/fluid/memory/allocation/allocator_facade.h" +#include "paddle/fluid/memory/allocation/allocator_strategy.h" +#include "paddle/fluid/memory/allocation/auto_increment_allocator.h" +#include "paddle/fluid/memory/allocation/best_fit_allocator.h" +#include "paddle/fluid/memory/allocation/conditional_allocator.h" +#include "paddle/fluid/memory/allocation/cpu_allocator.h" +#include "paddle/fluid/memory/allocation/legacy_allocator.h" +#include "paddle/fluid/memory/allocation/locked_allocator.h" +#include "paddle/fluid/memory/allocation/retry_allocator.h" +#include "paddle/fluid/memory/allocation/zero_size_allocator.h" +#include "paddle/fluid/platform/cpu_info.h" +#include "paddle/fluid/platform/enforce.h" +#include "paddle/fluid/platform/place.h" +#ifdef PADDLE_WITH_CUDA +#include "paddle/fluid/memory/allocation/cuda_allocator.h" +#include "paddle/fluid/memory/allocation/pinned_allocator.h" +#include "paddle/fluid/platform/cuda_device_guard.h" +#include "paddle/fluid/platform/gpu_info.h" +#endif + +DEFINE_int64( + gpu_allocator_retry_time, 0, + "The retry time (milliseconds) when allocator fails " + "to allocate memory. No retry if this value is not greater than 0"); + +namespace paddle { +namespace memory { +namespace allocation { + +// TODO(yy): Dirty code here. This class should be configurable in runtime. +class CPUManagedAllocator : public Allocator { + public: + CPUManagedAllocator() : normal_allocator_(new CPUAllocator()) {} + + bool IsAllocThreadSafe() const override { return true; } + + protected: + Allocation* AllocateImpl(size_t size, Allocator::Attr attr) override { + return normal_allocator_->Allocate(size, attr).release(); + } + + private: + std::shared_ptr normal_allocator_; +}; + +// TODO(yy): Dirty code here. This class should be configurable in runtime. +class ChunkedAllocator : public Allocator { + public: + explicit ChunkedAllocator(std::unique_ptr system_allocator, + size_t max_chunk_size, size_t capacity = 1, + int64_t retry_time = -1) + : max_chunk_size_(max_chunk_size), retry_time_(retry_time) { + raw_allocator_ = std::move(system_allocator); + + if (max_chunk_size_ == 0) { + default_allocator_ = raw_allocator_; + } else { + if (capacity == 1) { + VLOG(1) << "Create BestFitAllocator with chunk_size " + << max_chunk_size_; + default_allocator_ = CreateAllocatorWithChunk(); + } else { + VLOG(1) << "Create AutoIncrementAllocator with chunk_size " + << max_chunk_size_ << " and capacity " << capacity; + default_allocator_ = std::make_shared( + [this] { return CreateAllocatorWithChunk(); }, capacity); + } + } + + auto* cond_allocator = new ConditionalAllocator(); + cond_allocator + ->AddAllocator( + [this](size_t size, Attr attr) { return size < max_chunk_size_; }, + default_allocator_) + .AddAllocator( + [](size_t size, Attr attr) { + return true; // default case + }, + raw_allocator_); + default_allocator_.reset(cond_allocator); + } + + ~ChunkedAllocator() override { + // Specify destruct order. + default_allocator_.reset(); + chunks_.clear(); + raw_allocator_.reset(); + } + + std::shared_ptr CreateAllocatorWithChunk() { + chunks_.emplace_back(raw_allocator_->Allocate(max_chunk_size_)); + auto* allocation = chunks_.back().get(); + std::unique_ptr allocator(new LockedAllocator( + std::unique_ptr(new BestFitAllocator(allocation)))); + + if (retry_time_ > 0) { + auto* retry_allocator = + new RetryAllocator(std::move(allocator), retry_time_); + allocator.reset(retry_allocator); + } + + return std::make_shared>(std::move(allocator)); + } + + bool IsAllocThreadSafe() const override { return true; } + + protected: + Allocation* AllocateImpl(size_t size, Allocator::Attr attr) override { + return default_allocator_->Allocate(size, attr).release(); + } + + protected: + size_t max_chunk_size_; + int64_t retry_time_; + std::vector chunks_; + std::shared_ptr raw_allocator_; + std::shared_ptr default_allocator_; +}; + +#ifdef PADDLE_WITH_CUDA + +class CUDAChunkedAllocator : public ChunkedAllocator { + public: + explicit CUDAChunkedAllocator(int dev_id) + : ChunkedAllocator(std::unique_ptr( + new CUDAAllocator(platform::CUDAPlace(dev_id))), + GetMaxChunkSize(dev_id), GetCapcity(dev_id), + GetRetryTime()) {} + + private: + static size_t GetMaxChunkSize(int dev_id) { + platform::CUDADeviceGuard guard(dev_id); + return platform::GpuMaxChunkSize(); + } + + static size_t GetCapcity(int dev_id) { + platform::CUDADeviceGuard guard(dev_id); + size_t available, total; + platform::GpuMemoryUsage(&available, &total); + size_t max_chunk_size = platform::GpuMaxChunkSize(); + return max_chunk_size == 0 ? 0 : available / max_chunk_size; + } + + static int64_t GetRetryTime() { return FLAGS_gpu_allocator_retry_time; } +}; + +class CUDAPinnedChunkedAllocator : public ChunkedAllocator { + public: + CUDAPinnedChunkedAllocator() + : ChunkedAllocator(std::unique_ptr(new CPUPinnedAllocator()), + platform::CUDAPinnedMaxChunkSize(), GetCapacity(), + -1) {} // never retry + + private: + static size_t GetCapacity() { + size_t total = platform::CpuTotalPhysicalMemory(); + size_t max_chunk_size = platform::CUDAPinnedMaxChunkSize(); + return max_chunk_size == 0 ? 0 : total / max_chunk_size; + } +}; + +#endif + +class AllocatorFacadePrivate { + public: + std::map> allocators_; + + ~AllocatorFacadePrivate() = default; + + AllocatorFacadePrivate() { + if (GetAllocatorStrategy() == AllocatorStrategy::kLegacy) { + InitLegacyAllocator(); + } else { + InitCPUAllocator(); + InitCUDAAllocator(); + InitCUDAPinnedAllocator(); + WrapZeroSizeAllocator(); + } + } + + private: + void InitLegacyAllocator() { + std::vector places{platform::CPUPlace()}; +#ifdef PADDLE_WITH_CUDA + for (int dev_id = 0; dev_id < platform::GetCUDADeviceCount(); ++dev_id) { + places.emplace_back(platform::CUDAPlace(dev_id)); + } + places.emplace_back(platform::CUDAPinnedPlace()); +#endif + for (auto& p : places) { + allocators_[p] = std::make_shared(p); + } + } + + void InitCPUAllocator() { + allocators_[platform::CPUPlace()] = std::make_shared(); + } + + void InitCUDAAllocator() { +#ifdef PADDLE_WITH_CUDA + int device_count = platform::GetCUDADeviceCount(); + for (int dev_id = 0; dev_id < device_count; ++dev_id) { + allocators_[platform::CUDAPlace(dev_id)] = + std::make_shared(dev_id); + } +#endif + } + + void InitCUDAPinnedAllocator() { +#ifdef PADDLE_WITH_CUDA + allocators_[platform::CUDAPinnedPlace()] = + std::make_shared(); +#endif + } + + void WrapZeroSizeAllocator() { + for (auto& pair : allocators_) { + pair.second = + std::make_shared(pair.second, pair.first); + } + } +}; + +// Pimpl. Make interface clean. +AllocatorFacade::AllocatorFacade() : m_(new AllocatorFacadePrivate()) {} +AllocatorFacade::~AllocatorFacade() { delete m_; } + +AllocatorFacade& AllocatorFacade::Instance() { + static AllocatorFacade instance; + return instance; +} + +std::shared_ptr AllocatorFacade::AllocShared( + const platform::Place& place, size_t size, Allocator::Attr attr) { + return std::shared_ptr(Alloc(place, size, attr).release(), + AllocationDeleter()); +} + +AllocationPtr AllocatorFacade::Alloc(const platform::Place& place, size_t size, + Allocator::Attr attr) { + auto it = m_->allocators_.find(place); + if (it == m_->allocators_.end()) { + throw BadAlloc( + string::Sprintf("No such allocator for the place, %s", place)); + } + return m_->allocators_.at(place)->Allocate(size, attr); +} + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/allocator_facade.h b/paddle/fluid/memory/allocation/allocator_facade.h new file mode 100644 index 0000000000000000000000000000000000000000..16da30bec0d9f524bd076fe76d15c2fcfa7edd3a --- /dev/null +++ b/paddle/fluid/memory/allocation/allocator_facade.h @@ -0,0 +1,57 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include "paddle/fluid/memory/allocation/allocator.h" +#include "paddle/fluid/platform/place.h" + +namespace paddle { +namespace memory { +namespace allocation { + +// Allocator Facade is the interface exposed to other modules. +// All the configuration or dirty code under development should +// be hidden behind this facade. +// +// NOTE(yy): This class is a singleton class. +// NOTE(yy): To create a stable ABI and make compilation faster. Here we use +// a Pimpl trick; +class AllocatorFacadePrivate; +class AllocatorFacade { + public: + ~AllocatorFacade(); + AllocatorFacade(const AllocatorFacade& o) = delete; + const AllocatorFacade& operator=(const AllocatorFacade& o) = delete; + + static AllocatorFacade& Instance(); + + // Allocate a shared allocation. + std::shared_ptr AllocShared( + const platform::Place& place, size_t size, + Allocator::Attr attr = Allocator::kDefault); + + // Allocate a unique allocation. + AllocationPtr Alloc(const platform::Place& place, size_t size, + Allocator::Attr attr = Allocator::kDefault); + + // TODO(yy): Allocate a Copy-On-Write allocation? + private: + AllocatorFacade(); + AllocatorFacadePrivate* m_; +}; + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/allocator_facade_abs_flags_test.cc b/paddle/fluid/memory/allocation/allocator_facade_abs_flags_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..67905973ff620a7e0fb863fef80778aceba7aeb2 --- /dev/null +++ b/paddle/fluid/memory/allocation/allocator_facade_abs_flags_test.cc @@ -0,0 +1,100 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/memory/allocation/allocator_facade.h" +#include +#include + +#ifdef PADDLE_WITH_CUDA +DECLARE_double(fraction_of_gpu_memory_to_use); +DECLARE_double(fraction_of_cuda_pinned_memory_to_use); +DECLARE_uint64(initial_gpu_memory_in_mb); +DECLARE_uint64(reallocate_gpu_memory_in_mb); +DECLARE_int64(gpu_allocator_retry_time); +#endif + +namespace paddle { +namespace memory { +namespace allocation { + +//! Run allocate test cases for different places +void AllocateTestCases() { + auto &instance = AllocatorFacade::Instance(); + platform::Place place; + size_t size = 1024; + + { + place = platform::CPUPlace(); + size = 1024; + auto cpu_allocation = instance.Alloc(place, size); + ASSERT_NE(cpu_allocation, nullptr); + ASSERT_NE(cpu_allocation->ptr(), nullptr); + ASSERT_EQ(cpu_allocation->place(), place); + ASSERT_EQ(cpu_allocation->size(), size); + } + +#ifdef PADDLE_WITH_CUDA + { + place = platform::CUDAPlace(0); + size = 1024; + auto gpu_allocation = instance.Alloc(place, size); + ASSERT_NE(gpu_allocation, nullptr); + ASSERT_NE(gpu_allocation->ptr(), nullptr); + ASSERT_EQ(gpu_allocation->place(), place); + ASSERT_GE(gpu_allocation->size(), size); + } + + { + // Allocate 2GB gpu memory + place = platform::CUDAPlace(0); + size = 2 * static_cast(1 << 30); + auto gpu_allocation = instance.Alloc(place, size); + ASSERT_NE(gpu_allocation, nullptr); + ASSERT_NE(gpu_allocation->ptr(), nullptr); + ASSERT_EQ(gpu_allocation->place(), place); + ASSERT_GE(gpu_allocation->size(), size); + } + + { + place = platform::CUDAPinnedPlace(); + size = (1 << 20); + auto cuda_pinned_allocation = + instance.Alloc(platform::CUDAPinnedPlace(), 1 << 20); + ASSERT_NE(cuda_pinned_allocation, nullptr); + ASSERT_NE(cuda_pinned_allocation->ptr(), nullptr); + ASSERT_EQ(cuda_pinned_allocation->place(), place); + ASSERT_GE(cuda_pinned_allocation->size(), size); + } +#endif +} + +TEST(Allocator, SpecifyGpuMemory) { +#ifdef PADDLE_WITH_CUDA + // Set to 0.0 to test FLAGS_initial_gpu_memory_in_mb and + // FLAGS_reallocate_gpu_memory_in_mb + FLAGS_fraction_of_gpu_memory_to_use = 0.0; + // 512 MB + FLAGS_initial_gpu_memory_in_mb = 512; + // 4 MB + FLAGS_reallocate_gpu_memory_in_mb = 4; + FLAGS_gpu_allocator_retry_time = 500; + FLAGS_fraction_of_cuda_pinned_memory_to_use = 0.5; +#endif + + AllocateTestCases(); +} + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/allocator_facade_frac_flags_test.cc b/paddle/fluid/memory/allocation/allocator_facade_frac_flags_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..decdc62f1361a9c159b8ccb09910e0f164b35210 --- /dev/null +++ b/paddle/fluid/memory/allocation/allocator_facade_frac_flags_test.cc @@ -0,0 +1,94 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/memory/allocation/allocator_facade.h" +#include +#include + +#ifdef PADDLE_WITH_CUDA +DECLARE_double(fraction_of_gpu_memory_to_use); +DECLARE_double(fraction_of_cuda_pinned_memory_to_use); +DECLARE_uint64(initial_gpu_memory_in_mb); +DECLARE_uint64(reallocate_gpu_memory_in_mb); +DECLARE_int64(gpu_allocator_retry_time); +#endif + +namespace paddle { +namespace memory { +namespace allocation { + +//! Run allocate test cases for different places +void AllocateTestCases() { + auto &instance = AllocatorFacade::Instance(); + platform::Place place; + size_t size = 1024; + + { + place = platform::CPUPlace(); + size = 1024; + auto cpu_allocation = instance.Alloc(place, size); + ASSERT_NE(cpu_allocation, nullptr); + ASSERT_NE(cpu_allocation->ptr(), nullptr); + ASSERT_EQ(cpu_allocation->place(), place); + ASSERT_EQ(cpu_allocation->size(), size); + } + +#ifdef PADDLE_WITH_CUDA + { + place = platform::CUDAPlace(0); + size = 1024; + auto gpu_allocation = instance.Alloc(place, size); + ASSERT_NE(gpu_allocation, nullptr); + ASSERT_NE(gpu_allocation->ptr(), nullptr); + ASSERT_EQ(gpu_allocation->place(), place); + ASSERT_GE(gpu_allocation->size(), size); + } + + { + // Allocate 2GB gpu memory + place = platform::CUDAPlace(0); + size = 2 * static_cast(1 << 30); + auto gpu_allocation = instance.Alloc(place, size); + ASSERT_NE(gpu_allocation, nullptr); + ASSERT_NE(gpu_allocation->ptr(), nullptr); + ASSERT_EQ(gpu_allocation->place(), place); + ASSERT_GE(gpu_allocation->size(), size); + } + + { + place = platform::CUDAPinnedPlace(); + size = (1 << 20); + auto cuda_pinned_allocation = + instance.Alloc(platform::CUDAPinnedPlace(), 1 << 20); + ASSERT_NE(cuda_pinned_allocation, nullptr); + ASSERT_NE(cuda_pinned_allocation->ptr(), nullptr); + ASSERT_EQ(cuda_pinned_allocation->place(), place); + ASSERT_GE(cuda_pinned_allocation->size(), size); + } +#endif +} + +TEST(Allocator, Allocator) { +#ifdef PADDLE_WITH_CUDA + FLAGS_fraction_of_gpu_memory_to_use = 0.01; + FLAGS_gpu_allocator_retry_time = 500; + FLAGS_fraction_of_cuda_pinned_memory_to_use = 0.5; +#endif + + AllocateTestCases(); +} + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/allocator_strategy.cc b/paddle/fluid/memory/allocation/allocator_strategy.cc new file mode 100644 index 0000000000000000000000000000000000000000..8cebda9005b29b5b3259de0830c42eb10ef90e66 --- /dev/null +++ b/paddle/fluid/memory/allocation/allocator_strategy.cc @@ -0,0 +1,42 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/memory/allocation/allocator_strategy.h" +#include "gflags/gflags.h" +#include "paddle/fluid/platform/enforce.h" + +DEFINE_string( + allocator_strategy, "legacy", + "The allocation strategy. Legacy means the original allocator of Fluid." + "New means the experimental allocators of Fluid. in [legacy, new]"); + +namespace paddle { +namespace memory { +namespace allocation { + +static AllocatorStrategy GetStrategyFromFlag() { + return FLAGS_allocator_strategy == "legacy" + ? AllocatorStrategy::kLegacy + : AllocatorStrategy::kNaiveBestFit; +} + +AllocatorStrategy GetAllocatorStrategy() { + static AllocatorStrategy strategy = GetStrategyFromFlag(); + return strategy; +} + +void UseAllocatorStrategyGFlag() {} +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/allocator_strategy.h b/paddle/fluid/memory/allocation/allocator_strategy.h new file mode 100644 index 0000000000000000000000000000000000000000..9adbd879939c562cf84579a92f21d3b82e69a7e5 --- /dev/null +++ b/paddle/fluid/memory/allocation/allocator_strategy.h @@ -0,0 +1,30 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +namespace paddle { +namespace memory { +namespace allocation { + +enum class AllocatorStrategy { kLegacy, kNaiveBestFit }; + +extern AllocatorStrategy GetAllocatorStrategy(); + +// Do nothing, just make sure linker do not prune this file. +extern void UseAllocatorStrategyGFlag(); + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/auto_increment_allocator.cc b/paddle/fluid/memory/allocation/auto_increment_allocator.cc new file mode 100644 index 0000000000000000000000000000000000000000..c4785d2078601d7f9c5eeb7b902c7d1020340214 --- /dev/null +++ b/paddle/fluid/memory/allocation/auto_increment_allocator.cc @@ -0,0 +1,78 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/memory/allocation/auto_increment_allocator.h" + +namespace paddle { +namespace memory { +namespace allocation { +bool AutoIncrementAllocator::IsAllocThreadSafe() const { return true; } + +std::shared_ptr AutoIncrementAllocator::CreateNewAllocator() { + std::lock_guard guard(mtx_); + auto old_size = allocator_num_.load(); + PADDLE_ENFORCE_LT(old_size, underlying_allocators_.size(), + "Allocator number exceeds capacity %d", + underlying_allocators_.size()); + underlying_allocators_[old_size] = creator_(); + prev_success_allocator_ = old_size; + ++allocator_num_; + PADDLE_ENFORCE( + underlying_allocators_[old_size]->IsAllocThreadSafe(), + "the underlying allocator must be thread safe. This is a program " + "bug."); + return underlying_allocators_[old_size]; +} +Allocation *AutoIncrementAllocator::AllocateImpl(size_t size, + Allocator::Attr attr) { + auto cur = prev_success_allocator_.load(); + size_t retry_count = allocator_num_.load(); + size_t allocator_num = retry_count; + while (retry_count-- > 0) { // until there retry count is zero + try { + auto res = underlying_allocators_[cur]->Allocate(size, attr); + prev_success_allocator_ = cur; + return res.release(); + } catch (BadAlloc &) { + if (++cur >= allocator_num) { + cur = 0; + } + } catch (...) { + // if there is another type of allocation, just rethrow it. + throw; + } + } + + // This happens when the first allocator is exhausted and + // there are more than 1 allocation requests + // In this situation, the first allocation request would success + // and the second allocation request would fail if we do not use + // the newly created allocator by the first allocation request. + for (cur = allocator_num; cur < allocator_num_; ++cur) { + try { + auto ret = underlying_allocators_[cur]->Allocate(size, attr); + prev_success_allocator_ = cur; + return ret.release(); + } catch (BadAlloc &) { + } catch (...) { + throw; + } + } + // No suitable allocator + return CreateNewAllocator()->Allocate(size, attr).release(); +} + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/auto_increment_allocator.h b/paddle/fluid/memory/allocation/auto_increment_allocator.h new file mode 100644 index 0000000000000000000000000000000000000000..382588f17a9748b1b0a356c0469c683f6c904778 --- /dev/null +++ b/paddle/fluid/memory/allocation/auto_increment_allocator.h @@ -0,0 +1,79 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include // NOLINT +#include +#include +#include // NOLINT +#include // NOLINT +#include +#include "paddle/fluid/memory/allocation/allocator.h" + +namespace paddle { +namespace memory { +namespace allocation { + +// The AutoIncrementAllocator manages many underlying allocators. If none of +// them can allocate the request memory, a new allocator will be created and +// invoke its `allocate` method. +// +// NOTE(yy): The AutoIncrementAllocator will prefer to allocate memory from +// the latest successful allocator. +// +// NOTE(yy): We may need to release an underlying allocator if it allocate +// nothing. However, it is generally not useful, since it will make performance +// undetermined. +// +// NOTE(yy): This allocator is only locked when creating new underlying +// allocator. The allocation requests from many threads may be dispatched +// to the same underlying allocator. So the underlying allocator must be +// thread safe. +// +// NOTE(zjl): Add capacity parameters to constructor. A high-performance +// thread-safe std::vector with varying size is hard to implement. +// Fortunately, we can get the total GPU memory and each chunk size. +// Therefore, we can get the suitable capacity of AutoIncrementAllocator. +class AutoIncrementAllocator : public Allocator { + public: + // Creator is the method to create ManagedAllocator + using AllocatorCreator = std::function()>; + + explicit AutoIncrementAllocator(AllocatorCreator&& creator, size_t capacity) + : creator_(std::move(creator)), underlying_allocators_(capacity) {} + + bool IsAllocThreadSafe() const override; + + private: + std::shared_ptr CreateNewAllocator(); + + protected: + Allocation* AllocateImpl(size_t size, Allocator::Attr attr) override; + + private: + AllocatorCreator creator_; + + std::vector underlying_allocators_; + std::atomic allocator_num_{0}; + + // Use std::atomic rather than std::mutex, since std::atomic is usually + // lock-free + std::atomic prev_success_allocator_{0}; + + std::mutex mtx_; +}; +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/best_fit_allocator.cc b/paddle/fluid/memory/allocation/best_fit_allocator.cc new file mode 100644 index 0000000000000000000000000000000000000000..e3d6c2f511ef083ef9ecc1fe8df96051b2b85cc2 --- /dev/null +++ b/paddle/fluid/memory/allocation/best_fit_allocator.cc @@ -0,0 +1,170 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/memory/allocation/best_fit_allocator.h" +#include +#include +#include +#include + +namespace paddle { +namespace memory { +namespace allocation { + +static int HighestBitPos(size_t N) { + if (UNLIKELY(N == 0)) { + return 0; + } else { +#ifdef __GNUCC__ + return sizeof(unsigned int) * 8 - __builtin_clz(N); +#else + return static_cast(std::log2(N) + 1); +#endif + } +} + +BestFitAllocator::BestFitAllocator(Allocation* allocation) + : allocation_(allocation) { + details::Chunk chunk; + chunk.size_ = allocation_->size(); + chunk.offset_ = 0; + chunk.is_free = true; + chunks_.emplace_back(chunk); + free_chunks_[HighestBitPos(chunk.size_)].insert( + {chunk.size_, chunks_.begin()}); +} + +size_t BestFitAllocator::FreeSize() const { + size_t acc = 0; + for (auto& array_item : free_chunks_) { + for (auto& pair : array_item) { + acc += pair.second->size_; + } + } + return acc; +} + +BestFitAllocator::ListIt BestFitAllocator::SplitChunk(size_t request_size, + size_t free_chunk_offset, + MapIt bin_iterator) { + auto to_split_it = bin_iterator->second; + free_chunks_[free_chunk_offset].erase(bin_iterator); + + PADDLE_ENFORCE(to_split_it->is_free); + PADDLE_ENFORCE_GE(to_split_it->size_, request_size); + + auto remaining_size = to_split_it->size_ - request_size; + details::Chunk to_use; + details::Chunk remaining; + to_use.size_ = request_size; + to_use.is_free = false; + remaining.size_ = remaining_size; + remaining.is_free = true; + + // calc offsets + to_use.offset_ = to_split_it->offset_; + remaining.offset_ = to_use.offset_ + to_use.size_; + + // insert to chunk list + auto to_use_it = chunks_.insert(to_split_it, to_use); + if (remaining.size_ != 0) { + auto bit_size = static_cast(HighestBitPos(remaining.size_)); + free_chunks_[bit_size].insert( + {remaining.size_, chunks_.insert(to_split_it, remaining)}); + } + chunks_.erase(to_split_it); + return to_use_it; +} + +void BestFitAllocator::InsertFreeNode(const ListIt& it) { + auto pos = static_cast(HighestBitPos(it->size_)); + auto& free_map = free_chunks_[pos]; + free_map.insert({it->size_, it}); +} +void BestFitAllocator::EraseFreeNode(const ListIt& it) { + size_t pos = static_cast(HighestBitPos(it->size_)); + auto& free_map = free_chunks_[pos]; + auto map_it = free_map.find(it->size_); + while (map_it->second != it && map_it != free_map.end()) { + ++map_it; + } + PADDLE_ENFORCE(map_it != free_map.end()); + free_map.erase(map_it); +} +size_t BestFitAllocator::NumFreeChunks() const { + size_t num = 0; + for (auto& array_item : free_chunks_) { + num += array_item.size(); + } + return num; +} +void BestFitAllocator::Free(Allocation* allocation) { + auto* bf_allocation = dynamic_cast(allocation); + PADDLE_ENFORCE_NOT_NULL(bf_allocation, + "The input allocation is not BestFitAllocation."); + auto chunk_it = bf_allocation->ChunkIterator(); + PADDLE_ENFORCE(!chunk_it->is_free); + chunk_it->is_free = true; + if (chunk_it != chunks_.begin()) { + auto prev_it = chunk_it; + --prev_it; + + if (prev_it->is_free) { + // Merge Left. + EraseFreeNode(prev_it); + prev_it->size_ += chunk_it->size_; + chunks_.erase(chunk_it); + chunk_it = prev_it; + } + } + + auto next_it = chunk_it; + ++next_it; + if (next_it != chunks_.end() && next_it->is_free) { + EraseFreeNode(next_it); + chunk_it->size_ += next_it->size_; + chunks_.erase(next_it); + } + + InsertFreeNode(chunk_it); + delete allocation; +} +Allocation* BestFitAllocator::AllocateImpl(size_t size, Allocator::Attr attr) { + auto highest_set_bit = static_cast(HighestBitPos(size)); + MapIt map_it; + for (; highest_set_bit < free_chunks_.size(); ++highest_set_bit) { + map_it = free_chunks_[highest_set_bit].lower_bound(size); + if (map_it != free_chunks_[highest_set_bit].end()) { + break; + } + } + if (UNLIKELY(highest_set_bit == free_chunks_.size())) { + throw BadAlloc(string::Sprintf( + "Cannot allocate %d, All fragments size is %d", size, FreeSize())); + } + auto chunk_it = SplitChunk(size, highest_set_bit, map_it); + return new BestFitAllocation(this, chunk_it); +} + +BestFitAllocation::BestFitAllocation( + paddle::memory::allocation::BestFitAllocator* allocator, + typename details::ChunkList::iterator chunk_it) + : Allocation(reinterpret_cast( + reinterpret_cast(allocator->BasePtr()) + + chunk_it->offset_), + chunk_it->size_, allocator->Place()), + chunk_it_(chunk_it) {} +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/best_fit_allocator.h b/paddle/fluid/memory/allocation/best_fit_allocator.h new file mode 100644 index 0000000000000000000000000000000000000000..4f10f2b53e8543d4197097f1cae8de765bceeb0f --- /dev/null +++ b/paddle/fluid/memory/allocation/best_fit_allocator.h @@ -0,0 +1,132 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include +#include "paddle/fluid/memory/allocation/allocator.h" + +namespace paddle { +namespace memory { +namespace allocation { +namespace details { +struct Chunk { + bool is_free{true}; + // Offset to the base allocation. + uintptr_t offset_; + size_t size_; +}; + +// Here we use std::list to maintain chunk list. +// NOTE(yy): The traditional implementation of ChunkList is add `prev`/`next` +// pointers in `Chunk`, and split the allocation as `ChunkHeader` and +// `Payload`. Such as +// *-------*---------------*---------------*--------------* +// | Chunk | prev_ pointer | next_ pointer | payload .... | +// *-------*---------------*---------------*--------------* +// This implementation can just return a raw pointer, and we can get the list +// structure by the raw pointer. However, we cannot use the same code on GPU +// since CPU cannot access GPU memory directly. +// +// So we choose to use `std::list` and return an allocation instance, which +// contains the list node iterator, then we can unify CPU/GPU code. +// +// To return an allocation is not a bad idea, since Tensor/Vector should holds +// an allocation instead of raw pointer directly. +using ChunkList = std::list; + +// Here we use a multi-level map of free chunks. +// the map is +// MSB offset --> size --> [ChunkList::iterator] +// +// The time complexities: +// find a free chunk: +// O(logN), +// where N is the number of free nodes with the same MSB offset. +// find the position of a chunk iterator: +// O(logN + K), +// where N is the number of free nodes with the same MSB offset. +// where K is the number of free nodes with the same size. +// insert a free chunk: +// O(logN), +// where N is the number of free nodes with the same MSB offset. +// erase a free chunk: +// O(1) +using FreeChunkBin = + std::array, sizeof(size_t) * 8>; +} // namespace details + +class BestFitAllocator; + +// The BestFitAllocation maintain the List Node iterator. +class BestFitAllocation : public Allocation { + private: + using ListIt = typename details::ChunkList::iterator; + + public: + BestFitAllocation(BestFitAllocator* allocator, ListIt chunk_it); + + const ListIt& ChunkIterator() const { return chunk_it_; } + + private: + typename details::ChunkList::iterator chunk_it_; +}; + +// TODO(yy): Current BestFitAllocator is not thread-safe. To make it thread +// safe, we must wrap a locked_allocator. However, we can implement a thread +// safe allocator by locking each bin and chunks list independently. It will +// make BestFitAllocator faster in multi-thread situation. +// +// This allocator implements a best-fit allocator with merging the free nodes. +// +// To allocate a buffer, it will find the best-fit chunk. If the best-fit chunk +// is larger than request size, the original block will be split into two +// chunks. The first block will be used and the second block will be put into +// free chunks. +// +// To free an allocation, it will set the chunk of allocation to free and merge +// the prev-chunk and the next-chunk when possible. +class BestFitAllocator : public Allocator { + public: + explicit BestFitAllocator(Allocation* allocation); + + void* BasePtr() const { return allocation_->ptr(); } + + const platform::Place& Place() const { return allocation_->place(); } + + size_t NumFreeChunks() const; + + private: + size_t FreeSize() const; + using MapIt = typename details::FreeChunkBin::value_type::iterator; + using ListIt = typename details::ChunkList::iterator; + + ListIt SplitChunk(size_t request_size, size_t free_chunk_offset, + MapIt bin_iterator); + void EraseFreeNode(const ListIt& it); + void InsertFreeNode(const ListIt& it); + + protected: + void Free(Allocation* allocation) override; + Allocation* AllocateImpl(size_t size, Allocator::Attr attr) override; + + private: + Allocation* allocation_; // not owned + details::ChunkList chunks_; + details::FreeChunkBin free_chunks_; +}; +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/best_fit_allocator_test.cc b/paddle/fluid/memory/allocation/best_fit_allocator_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..b274b05562b15856276b1c88d3504fda1ecafacc --- /dev/null +++ b/paddle/fluid/memory/allocation/best_fit_allocator_test.cc @@ -0,0 +1,138 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/memory/allocation/best_fit_allocator.h" +#include +#include // NOLINT +#include +#include "gtest/gtest.h" +#include "paddle/fluid/memory/allocation/cpu_allocator.h" +#include "paddle/fluid/memory/allocation/locked_allocator.h" + +namespace paddle { +namespace memory { +namespace allocation { + +class StubAllocation : public Allocation { + public: + explicit StubAllocation(size_t size) + : Allocation(0, size, platform::CPUPlace()) {} +}; + +TEST(BestFitAllocator, test_allocation) { + StubAllocation stub(4UL * 1024 * 1024 * 1024); + BestFitAllocator allocator(&stub); + { auto allocation = allocator.Allocate(64, allocator.kDefault); } + + { + auto allocation = allocator.Allocate(80, allocator.kDefault); + + { + auto best_fit_allocation = + dynamic_cast(allocation.get()); + ASSERT_NE(best_fit_allocation, nullptr); + ASSERT_FALSE(best_fit_allocation->ChunkIterator()->is_free); + ASSERT_EQ(best_fit_allocation->ChunkIterator()->offset_, 0); + ASSERT_EQ(allocation->size(), 80); + ASSERT_EQ(allocation->ptr(), nullptr); + } + + auto allocation2 = allocator.Allocate(60, allocator.kDefault); + auto allocation3 = allocator.Allocate(90, allocator.kDefault); + allocation2.reset(); + allocation2 = allocator.Allocate(30, allocator.kDefault); + + { + auto best_fit_allocation = + dynamic_cast(allocation2.get()); + ASSERT_EQ(best_fit_allocation->ChunkIterator()->offset_, 80); + } + allocation2.reset(); + allocation2 = allocator.Allocate(60, allocator.kDefault); + + { + auto best_fit_allocation = + dynamic_cast(allocation2.get()); + ASSERT_EQ(best_fit_allocation->ChunkIterator()->offset_, 80); + } + + allocation.reset(); + allocation2.reset(); + + allocation = allocator.Allocate(80 + 60, allocator.kDefault); + { + auto best_fit_allocation = + dynamic_cast(allocation.get()); + ASSERT_EQ(best_fit_allocation->ChunkIterator()->offset_, 0); + } + + allocation.reset(); + + allocation = allocator.Allocate(80, allocator.kDefault); + allocation2 = allocator.Allocate(60, allocator.kDefault); + allocation = nullptr; + allocation2 = nullptr; + allocation3 = nullptr; + + ASSERT_EQ(allocator.NumFreeChunks(), 1U); + } +} + +TEST(BestFitAllocator, test_concurrent_cpu_allocation) { + CPUAllocator allocator; + auto global_allocation = + allocator.Allocate(256UL * 1024 * 1024, allocator.kDefault); + + std::unique_ptr best_fit_allocator( + new BestFitAllocator(global_allocation.get())); + + LockedAllocator locked_allocator(std::move(best_fit_allocator)); + + auto th_main = [&](std::random_device::result_type seed) { + std::default_random_engine engine(seed); + std::uniform_int_distribution dist(1U, 1024U); + + for (size_t i = 0; i < 128; ++i) { + size_t allocate_size = dist(engine); + + auto allocation = locked_allocator.Allocate( + sizeof(size_t) * allocate_size, locked_allocator.kDefault); + + size_t* data = reinterpret_cast(allocation->ptr()); + + for (size_t j = 0; j < allocate_size; ++j) { + data[j] = j; + } + std::this_thread::yield(); + + for (size_t j = 0; j < allocate_size; ++j) { + ASSERT_EQ(data[j], j); + } + } + }; + { + std::vector threads; + for (size_t i = 0; i < 1024; ++i) { + std::random_device dev; + threads.emplace_back(th_main, dev()); + } + for (auto& th : threads) { + th.join(); + } + } +} + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/best_fit_allocator_test.cu b/paddle/fluid/memory/allocation/best_fit_allocator_test.cu new file mode 100644 index 0000000000000000000000000000000000000000..fdd5b43ad4aa8024efee314ca949445fefbef067 --- /dev/null +++ b/paddle/fluid/memory/allocation/best_fit_allocator_test.cu @@ -0,0 +1,88 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include // NOLINT +#include +#include "gtest/gtest.h" +#include "paddle/fluid/memory/allocation/best_fit_allocator.h" +#include "paddle/fluid/memory/allocation/cuda_allocator.h" +#include "paddle/fluid/memory/allocation/locked_allocator.h" +#include "paddle/fluid/memory/memcpy.h" +#include "paddle/fluid/platform/for_range.h" +namespace paddle { +namespace memory { +namespace allocation { + +struct ForEachFill { + size_t* ptr_; + + explicit ForEachFill(size_t* ptr) : ptr_(ptr) {} + + __device__ void operator()(size_t i) { ptr_[i] = i; } +}; + +TEST(BestFitAllocator, concurrent_cuda) { + CUDAAllocator allocator(platform::CUDAPlace(0)); + // 256 MB + auto cuda_allocation = + allocator.Allocate(256U * 1024 * 1024, allocator.kDefault); + LockedAllocator concurrent_allocator( + std::unique_ptr(new BestFitAllocator(cuda_allocation.get()))); + + auto th_main = [&](std::random_device::result_type seed) { + std::default_random_engine engine(seed); + std::uniform_int_distribution dist(1U, 1024U); + platform::CUDAPlace gpu(0); + platform::CUDADeviceContext dev_ctx(gpu); + std::array buf; + for (size_t i = 0; i < 128; ++i) { + size_t allocate_size = dist(engine); + + auto allocation = concurrent_allocator.Allocate( + sizeof(size_t) * allocate_size, concurrent_allocator.kDefault); + + size_t* data = reinterpret_cast(allocation->ptr()); + + ForEachFill fill(data); + platform::ForRange for_range(dev_ctx, + allocate_size); + for_range(fill); + + memory::Copy(platform::CPUPlace(), buf.data(), gpu, data, + sizeof(size_t) * allocate_size, dev_ctx.stream()); + + dev_ctx.Wait(); + for (size_t j = 0; j < allocate_size; ++j) { + ASSERT_EQ(buf[j], j); + } + allocation = nullptr; + } + }; + + { + std::vector threads; + for (size_t i = 0; i < 1024; ++i) { + std::random_device dev; + threads.emplace_back(th_main, dev()); + } + for (auto& th : threads) { + th.join(); + } + } +} + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/buffered_allocator.cc b/paddle/fluid/memory/allocation/buffered_allocator.cc new file mode 100644 index 0000000000000000000000000000000000000000..fc75abc9dfee6c9df5bc87faa493002cc1fe6298 --- /dev/null +++ b/paddle/fluid/memory/allocation/buffered_allocator.cc @@ -0,0 +1,80 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/memory/allocation/buffered_allocator.h" +#include +#include +#include +#include "paddle/fluid/memory/allocation/allocation_with_underlying.h" + +namespace paddle { +namespace memory { +namespace allocation { + +BufferedAllocator::BufferedAllocator(std::unique_ptr &&allocator) + : underlying_allocator_(std::move(allocator)) { + PADDLE_ENFORCE_NOT_NULL( + underlying_allocator_, + "Underlying allocator of BufferedAllocator must be unmanaged"); + if (underlying_allocator_->IsAllocThreadSafe()) { + mtx_.reset(new std::mutex()); + } +} + +BufferedAllocator::~BufferedAllocator() { FreeCache(-1UL); } + +void BufferedAllocator::FreeCache(size_t size) { + platform::LockGuardPtr guard(mtx_); + if (UNLIKELY(size == 0)) return; + size_t cur = 0; + while (!allocations_.empty()) { // free the largest + auto it = --allocations_.end(); + cur += it->second->size(); + delete it->second.release(); + allocations_.erase(it); + if (cur >= size) return; + } +} + +bool BufferedAllocator::IsAllocThreadSafe() const { + return this->underlying_allocator_->IsAllocThreadSafe(); +} +void BufferedAllocator::Free(Allocation *allocation) { + platform::LockGuardPtr guard(mtx_); + allocations_.emplace(allocation->size(), AllocationPtr(allocation)); +} +Allocation *BufferedAllocator::AllocateImpl(size_t size, Allocator::Attr attr) { + { + platform::LockGuardPtr guard(mtx_); + auto it = allocations_.lower_bound(size); + if (it != allocations_.end() && it->first < size * 2) { + AllocationPtr result(std::move(it->second)); + allocations_.erase(it); + return new AllocationWithUnderlying(std::move(result)); + } + } + + try { + return new AllocationWithUnderlying( + underlying_allocator_->Allocate(size, attr)); + } catch (BadAlloc &) { + FreeCache(size); + return new AllocationWithUnderlying( + underlying_allocator_->Allocate(size, attr)); + } +} + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/buffered_allocator.h b/paddle/fluid/memory/allocation/buffered_allocator.h new file mode 100644 index 0000000000000000000000000000000000000000..d44a3f85beba712b1e735ba14008689bce7d0d64 --- /dev/null +++ b/paddle/fluid/memory/allocation/buffered_allocator.h @@ -0,0 +1,58 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include "paddle/fluid/memory/allocation/allocator.h" +#include "paddle/fluid/platform/lock_guard_ptr.h" + +namespace paddle { +namespace memory { +namespace allocation { + +// NOTE(zjl): BufferedAllocator maintains a memory pool to accelerate +// memory allocation and reuse memory. +// BufferedAllocator provides the same thread-safety level as +// underlying_allocator_ +class BufferedAllocator : public Allocator { + public: + explicit BufferedAllocator(std::unique_ptr &&allocator); + + ~BufferedAllocator(); + + bool IsAllocThreadSafe() const override; + + // only used in unittest + inline void ClearCache() { FreeCache(-1UL); } + + private: + void FreeCache(size_t size); + + protected: + void Free(Allocation *allocation) override; + Allocation *AllocateImpl(size_t size, Allocator::Attr attr) override; + + private: + std::unique_ptr underlying_allocator_; + std::multimap allocations_; + std::unique_ptr mtx_; +}; + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/buffered_allocator_test.cc b/paddle/fluid/memory/allocation/buffered_allocator_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..c8bd5292ca0f6c3e7ebdc7f5908523b0b7c8ba3a --- /dev/null +++ b/paddle/fluid/memory/allocation/buffered_allocator_test.cc @@ -0,0 +1,146 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/memory/allocation/buffered_allocator.h" +#include +#include +#include +#include "paddle/fluid/memory/allocation/best_fit_allocator.h" +#include "paddle/fluid/memory/allocation/cpu_allocator.h" +#include "paddle/fluid/memory/allocation/locked_allocator.h" + +namespace paddle { +namespace memory { +namespace allocation { + +inline std::unique_ptr GetBufferedAllocator( + Allocation *allocation, bool thread_safe) { + std::unique_ptr allocator(new BestFitAllocator(allocation)); + if (thread_safe) { + allocator.reset(new LockedAllocator(std::move(allocator))); + } + + return std::unique_ptr( + new BufferedAllocator(std::move(allocator))); +} + +TEST(buffered_allocator, thread_safety) { + std::unique_ptr allocator(new CPUAllocator()); + auto chunk = allocator->Allocate(1 << 20, allocator->kDefault); + { + auto buf_allocator = GetBufferedAllocator(chunk.get(), true); + ASSERT_EQ(buf_allocator->IsAllocThreadSafe(), true); + } + + { + auto buf_allocator = GetBufferedAllocator(chunk.get(), false); + ASSERT_EQ(buf_allocator->IsAllocThreadSafe(), false); + } +} + +class StubAllocation : public Allocation { + public: + using Allocation::Allocation; +}; + +class StubAllocator : public Allocator { + public: + void ResetCounter() { + construct_count_ = 0; + destruct_count_ = 0; + } + + size_t GetAllocCount() const { return construct_count_; } + + size_t GetFreeCount() const { return destruct_count_; } + + protected: + void Free(Allocation *allocation) override { + auto *alloc = dynamic_cast(allocation); + PADDLE_ENFORCE_NOT_NULL(alloc); + if (alloc->ptr()) delete[] static_cast(alloc->ptr()); + ++destruct_count_; + delete allocation; + } + Allocation *AllocateImpl(size_t size, Allocator::Attr attr) override { + ++construct_count_; + if (size == 0) { + return new StubAllocation(nullptr, 0, platform::CPUPlace()); + } else { + return new StubAllocation(new uint8_t[size], size, platform::CPUPlace()); + } + } + + private: + size_t construct_count_ = 0; + size_t destruct_count_ = 0; +}; + +constexpr size_t kZero = 0; +constexpr size_t kOne = 1; +constexpr size_t kTwo = 2; + +TEST(buffered_allocator, lazy_free) { + std::unique_ptr stub_allocator(new StubAllocator()); + auto *underlying_allocator = stub_allocator.get(); + std::unique_ptr allocator( + new BufferedAllocator(std::move(stub_allocator))); + + { + underlying_allocator->ResetCounter(); + auto x = allocator->Allocate(1025, allocator->kDefault); + ASSERT_EQ(underlying_allocator->GetAllocCount(), kOne); + ASSERT_EQ(underlying_allocator->GetFreeCount(), kZero); + x = nullptr; + ASSERT_EQ(underlying_allocator->GetFreeCount(), kZero); + } + + { + underlying_allocator->ResetCounter(); + auto x = allocator->Allocate(900, allocator->kDefault); + ASSERT_EQ(underlying_allocator->GetAllocCount(), kZero); + ASSERT_EQ(underlying_allocator->GetFreeCount(), kZero); + auto y = allocator->Allocate(2048, allocator->kDefault); + ASSERT_EQ(underlying_allocator->GetAllocCount(), kOne); + ASSERT_EQ(underlying_allocator->GetFreeCount(), kZero); + x = nullptr; + ASSERT_EQ(underlying_allocator->GetFreeCount(), kZero); + y = nullptr; + ASSERT_EQ(underlying_allocator->GetFreeCount(), kZero); + } + + { + underlying_allocator->ResetCounter(); + allocator->ClearCache(); + ASSERT_EQ(underlying_allocator->GetAllocCount(), kZero); + ASSERT_EQ(underlying_allocator->GetFreeCount(), kTwo); + } +} + +TEST(buffered_allocator, garbage_collection) { + std::unique_ptr cpu_allocator(new CPUAllocator()); + auto chunk = cpu_allocator->Allocate(2048, cpu_allocator->kDefault); + auto allocator = GetBufferedAllocator(chunk.get(), false); + auto x1 = allocator->Allocate(1600, allocator->kDefault); + auto x2 = allocator->Allocate(400, allocator->kDefault); + x1 = nullptr; + x2 = nullptr; + auto x3 = allocator->Allocate(1600, allocator->kDefault); + ASSERT_NE(x3, nullptr); + ASSERT_NE(x3->ptr(), nullptr); +} + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/conditional_allocator.cc b/paddle/fluid/memory/allocation/conditional_allocator.cc new file mode 100644 index 0000000000000000000000000000000000000000..96a818e03e507c6de720344288312dc2af2ae647 --- /dev/null +++ b/paddle/fluid/memory/allocation/conditional_allocator.cc @@ -0,0 +1,48 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/memory/allocation/conditional_allocator.h" + +namespace paddle { +namespace memory { +namespace allocation { + +ConditionalAllocator& ConditionalAllocator::AddAllocator( + std::function func, + std::shared_ptr allocator) { + underlying_allocators_.emplace_back(std::move(func), std::move(allocator)); + return *this; +} + +bool ConditionalAllocator::IsAllocThreadSafe() const { + return std::all_of(underlying_allocators_.begin(), + underlying_allocators_.end(), + [](const AllocatorWithCond& allocatorWithCond) { + return allocatorWithCond.second->IsAllocThreadSafe(); + }); +} + +Allocation* ConditionalAllocator::AllocateImpl(size_t size, + Allocator::Attr attr) { + for (auto& pair : underlying_allocators_) { + if (pair.first(size, attr)) { + return pair.second->Allocate(size, attr).release(); + } + } + throw BadAlloc("No suitable allocator"); +} + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/conditional_allocator.h b/paddle/fluid/memory/allocation/conditional_allocator.h new file mode 100644 index 0000000000000000000000000000000000000000..94cba4432ed4f72c0a75da9b31d48611a8404ad3 --- /dev/null +++ b/paddle/fluid/memory/allocation/conditional_allocator.h @@ -0,0 +1,61 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include +#include "paddle/fluid/memory/allocation/allocator.h" + +namespace paddle { +namespace memory { +namespace allocation { + +// A composite allocator who will dispatch the allocation request by registered +// condition. +// +// For example: +// +// auto* cond_allocator = new ConditionalAllocator(); +// cond_allocator->AddAllocator([](size_t size, Attr attr){ +// // if size > 10 +// return size > 10; +// }, allocator_a).AddAllocator([](size_t size, Attr attr){ +// // elif attr is kDefault +// return attr == kDefault; +// }, allocator_b).AddAllocator([](size_t size, Attr attr){ +// // else +// return true; +// }, allocator_c); +class ConditionalAllocator : public Allocator { + public: + ConditionalAllocator() = default; + + ConditionalAllocator& AddAllocator(std::function func, + std::shared_ptr allocator); + + bool IsAllocThreadSafe() const override; + + protected: + Allocation* AllocateImpl(size_t size, Allocator::Attr attr) override; + + private: + using AllocatorWithCond = + std::pair, std::shared_ptr>; + std::vector underlying_allocators_; +}; + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/cpu_allocator.cc b/paddle/fluid/memory/allocation/cpu_allocator.cc new file mode 100644 index 0000000000000000000000000000000000000000..cc81a6f7b8b1950b07b6fb1571b53d9b5ddb1b9f --- /dev/null +++ b/paddle/fluid/memory/allocation/cpu_allocator.cc @@ -0,0 +1,45 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/memory/allocation/cpu_allocator.h" +#include +#include + +namespace paddle { +namespace memory { +namespace allocation { + +CPUAllocation::CPUAllocation(void *ptr, size_t size) + : Allocation(ptr, size, platform::CPUPlace()) {} + +bool CPUAllocator::IsAllocThreadSafe() const { return true; } + +void CPUAllocator::Free(Allocation *allocation) { + PADDLE_ENFORCE_NOT_NULL(dynamic_cast(allocation)); + free(allocation->ptr()); + delete allocation; +} + +Allocation *CPUAllocator::AllocateImpl(size_t size, Allocator::Attr attr) { + void *ptr; + auto status = posix_memalign(&ptr, kAlignment, size); + if (UNLIKELY(status) != 0) { + throw BadAlloc(string::Sprintf("Cannot allocate cpu memory %d. Errno is %d", + size, status)); + } + return new CPUAllocation(ptr, size); +} +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/cpu_allocator.h b/paddle/fluid/memory/allocation/cpu_allocator.h new file mode 100644 index 0000000000000000000000000000000000000000..26d3643f4edff1f2d71b1c761e915a6dacb485ad --- /dev/null +++ b/paddle/fluid/memory/allocation/cpu_allocator.h @@ -0,0 +1,51 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include "paddle/fluid/memory/allocation/allocator.h" + +#ifdef _WIN32 +#define posix_memalign_free _aligned_free +#define posix_memalign(p, a, s) \ + (((*(p)) = _aligned_malloc((s), (a))), *(p) ? 0 : errno) +#endif + +namespace paddle { +namespace memory { +namespace allocation { +// CPU system allocator and allocation. +// +// NOTE(yy): Should we just use `malloc` here since there is an +// aligned_allocator. +// +// NOTE(yy): It is no need to use `BestFitAllocator` in CPU. We can import +// an open-sourced allocator into Paddle. +class CPUAllocator; +class CPUAllocation : public Allocation { + public: + CPUAllocation(void* ptr, size_t size); +}; + +class CPUAllocator : public Allocator { + public: + constexpr static size_t kAlignment = 64u; + bool IsAllocThreadSafe() const override; + + protected: + void Free(Allocation* allocation) override; + Allocation* AllocateImpl(size_t size, Allocator::Attr attr) override; +}; +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/cuda_allocator.cc b/paddle/fluid/memory/allocation/cuda_allocator.cc new file mode 100644 index 0000000000000000000000000000000000000000..430bf0be98e08787ac4412a8b6e0fcc310ffe2b4 --- /dev/null +++ b/paddle/fluid/memory/allocation/cuda_allocator.cc @@ -0,0 +1,48 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/memory/allocation/cuda_allocator.h" +#include +#include +#include +#include "paddle/fluid/platform/cuda_device_guard.h" +#include "paddle/fluid/platform/gpu_info.h" + +namespace paddle { +namespace memory { +namespace allocation { +bool CUDAAllocator::IsAllocThreadSafe() const { return true; } +void CUDAAllocator::Free(Allocation* allocation) { + platform::CUDADeviceGuard guard(place_.device); + auto* cuda_allocation = dynamic_cast(allocation); + PADDLE_ENFORCE_NOT_NULL(cuda_allocation); + PADDLE_ENFORCE_EQ(boost::get(cuda_allocation->place()), + place_); + PADDLE_ENFORCE(cudaFree(allocation->ptr())); + delete allocation; +} +Allocation* CUDAAllocator::AllocateImpl(size_t size, Allocator::Attr attr) { + platform::CUDADeviceGuard guard(place_.device); + void* ptr; + auto status = cudaMalloc(&ptr, size); + if (UNLIKELY(status != cudaSuccess)) { + throw BadAlloc(string::Sprintf( + "Cannot allocate %d on GPU %d, cuda status %d, %s", size, place_.device, + status, cudaGetErrorString(status))); + } + return new CUDAAllocation(ptr, size, platform::Place(place_)); +} +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/cuda_allocator.h b/paddle/fluid/memory/allocation/cuda_allocator.h new file mode 100644 index 0000000000000000000000000000000000000000..63726f5820b1c81565117c7a9bf798c17c9681f6 --- /dev/null +++ b/paddle/fluid/memory/allocation/cuda_allocator.h @@ -0,0 +1,47 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include "paddle/fluid/memory/allocation/allocator.h" +#include "paddle/fluid/platform/place.h" + +namespace paddle { +namespace memory { +namespace allocation { + +// CUDA System allocator and allocation. +// Just a flag type. +class CUDAAllocation : public Allocation { + public: + using Allocation::Allocation; +}; + +class CUDAAllocator : public Allocator { + public: + explicit CUDAAllocator(const platform::CUDAPlace& place) : place_(place) {} + explicit CUDAAllocator(const platform::Place& place) + : place_(boost::get(place)) {} + bool IsAllocThreadSafe() const override; + + protected: + void Free(Allocation* allocation) override; + Allocation* AllocateImpl(size_t size, Allocator::Attr attr) override; + + private: + platform::CUDAPlace place_; +}; + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/legacy_allocator.cc b/paddle/fluid/memory/allocation/legacy_allocator.cc new file mode 100644 index 0000000000000000000000000000000000000000..514ac7883ad2effdf3518be8afe3f448a5ac10b2 --- /dev/null +++ b/paddle/fluid/memory/allocation/legacy_allocator.cc @@ -0,0 +1,417 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include +#include +#include + +#ifdef PADDLE_WITH_JEMALLOC +#include +#endif + +#include "glog/logging.h" +#include "paddle/fluid/memory/allocation/legacy_allocator.h" +#include "paddle/fluid/memory/detail/buddy_allocator.h" +#include "paddle/fluid/memory/detail/system_allocator.h" +#include "paddle/fluid/platform/gpu_info.h" +#include "paddle/fluid/platform/profiler.h" +#include "paddle/fluid/string/printf.h" +#include "paddle/fluid/string/split.h" + +DEFINE_bool(init_allocated_mem, false, + "It is a mistake that the values of the memory allocated by " + "BuddyAllocator are always zeroed in some op's implementation. " + "To find this error in time, we use init_allocated_mem to indicate " + "that initializing the allocated memory with a small value " + "during unit testing."); +DECLARE_double(fraction_of_gpu_memory_to_use); +DECLARE_uint64(initial_gpu_memory_in_mb); +DECLARE_uint64(reallocate_gpu_memory_in_mb); +DECLARE_bool(benchmark); + +namespace paddle { +namespace memory { +namespace legacy { +template +void *Alloc(const Place &place, size_t size); + +template +void Free(const Place &place, void *p, size_t size); + +template +size_t Used(const Place &place); + +struct Usage : public boost::static_visitor { + size_t operator()(const platform::CPUPlace &cpu) const; + size_t operator()(const platform::CUDAPlace &gpu) const; + size_t operator()(const platform::CUDAPinnedPlace &cuda_pinned) const; +}; + +size_t memory_usage(const platform::Place &p); + +using BuddyAllocator = detail::BuddyAllocator; + +BuddyAllocator *GetCPUBuddyAllocator() { + // We tried thread_local for inference::RNN1 model, but that not works much + // for multi-thread test. + static std::once_flag init_flag; + static detail::BuddyAllocator *a = nullptr; + + std::call_once(init_flag, []() { + a = new detail::BuddyAllocator( + std::unique_ptr(new detail::CPUAllocator), + platform::CpuMinChunkSize(), platform::CpuMaxChunkSize()); + }); + + return a; +} + +// We compared the NaiveAllocator with BuddyAllocator in CPU memory allocation, +// seems they are almost the same overhead. +struct NaiveAllocator { + void *Alloc(size_t size) { return malloc(size); } + + void Free(void *p) { + PADDLE_ENFORCE(p); + free(p); + } + + static NaiveAllocator *Instance() { + static NaiveAllocator x; + return &x; + } + + private: + std::mutex lock_; +}; + +template <> +void *Alloc(const platform::CPUPlace &place, size_t size) { + VLOG(10) << "Allocate " << size << " bytes on " << platform::Place(place); +#ifdef PADDLE_WITH_JEMALLOC + void *p = malloc(size); +#else + void *p = GetCPUBuddyAllocator()->Alloc(size); +#endif + if (FLAGS_init_allocated_mem) { + memset(p, 0xEF, size); + } + VLOG(10) << " pointer=" << p; + return p; +} + +template <> +void Free(const platform::CPUPlace &place, void *p, + size_t size) { + VLOG(10) << "Free pointer=" << p << " on " << platform::Place(place); +#ifdef PADDLE_WITH_JEMALLOC + free(p); +#else + GetCPUBuddyAllocator()->Free(p); +#endif +} + +template <> +size_t Used(const platform::CPUPlace &place) { +#ifdef PADDLE_WITH_JEMALLOC + // fake the result of used memory when PADDLE_WITH_JEMALLOC is ON + return 0U; +#else + return GetCPUBuddyAllocator()->Used(); +#endif +} + +#ifdef PADDLE_WITH_CUDA +BuddyAllocator *GetGPUBuddyAllocator(int gpu_id) { + static std::once_flag init_flag; + static detail::BuddyAllocator **a_arr = nullptr; + static std::vector devices; + + std::call_once(init_flag, [gpu_id]() { + devices = platform::GetSelectedDevices(); + int gpu_num = devices.size(); + + allocation::GPUMemMonitor.Initialize(devices.size()); + + a_arr = new BuddyAllocator *[gpu_num]; + for (size_t i = 0; i < devices.size(); ++i) { + int dev_id = devices[i]; + a_arr[i] = nullptr; + platform::SetDeviceId(dev_id); + a_arr[i] = new BuddyAllocator(std::unique_ptr( + new detail::GPUAllocator(dev_id)), + platform::GpuMinChunkSize(), + platform::GpuMaxChunkSize()); + + VLOG(10) << "\n\nNOTE:\n" + << "You can set GFlags environment variable " + << "'FLAGS_fraction_of_gpu_memory_to_use' " + << "or 'FLAGS_initial_gpu_memory_in_mb' " + << "or 'FLAGS_reallocate_gpu_memory_in_mb' " + << "to change the memory size for GPU usage.\n" + << "Current 'FLAGS_fraction_of_gpu_memory_to_use' value is " + << FLAGS_fraction_of_gpu_memory_to_use + << ". Current 'FLAGS_initial_gpu_memory_in_mb' value is " + << FLAGS_initial_gpu_memory_in_mb + << ". Current 'FLAGS_reallocate_gpu_memory_in_mb' value is " + << FLAGS_reallocate_gpu_memory_in_mb << "\n\n"; + } + }); + + platform::SetDeviceId(gpu_id); + auto pos = std::distance(devices.begin(), + std::find(devices.begin(), devices.end(), gpu_id)); + return a_arr[pos]; +} +#endif + +template <> +size_t Used(const platform::CUDAPlace &place) { +#ifdef PADDLE_WITH_CUDA + return GetGPUBuddyAllocator(place.device)->Used(); +#else + PADDLE_THROW("'CUDAPlace' is not supported in CPU only device."); +#endif +} + +template <> +void *Alloc(const platform::CUDAPlace &place, + size_t size) { +#ifdef PADDLE_WITH_CUDA + auto *buddy_allocator = GetGPUBuddyAllocator(place.device); + auto *ptr = buddy_allocator->Alloc(size); + if (ptr == nullptr) { + int cur_dev = platform::GetCurrentDeviceId(); + platform::SetDeviceId(place.device); + size_t avail, total; + platform::GpuMemoryUsage(&avail, &total); + LOG(FATAL) << "Cannot allocate " << string::HumanReadableSize(size) + << " in GPU " << place.device << ", available " + << string::HumanReadableSize(avail) << "total " << total + << "GpuMinChunkSize " + << string::HumanReadableSize(buddy_allocator->GetMinChunkSize()) + << "GpuMaxChunkSize " + << string::HumanReadableSize(buddy_allocator->GetMaxChunkSize()) + << "GPU memory used: " + << string::HumanReadableSize(Used(place)); + platform::SetDeviceId(cur_dev); + } else { + if (FLAGS_benchmark) { + allocation::GPUMemMonitor.Add(place.device, size); + } + if (FLAGS_init_allocated_mem) { + cudaMemset(ptr, 0xEF, size); + } + } + return ptr; +#else + PADDLE_THROW("'CUDAPlace' is not supported in CPU only device."); +#endif +} + +template <> +void Free(const platform::CUDAPlace &place, void *p, + size_t size) { +#ifdef PADDLE_WITH_CUDA + GetGPUBuddyAllocator(place.device)->Free(p); + if (FLAGS_benchmark) { + allocation::GPUMemMonitor.Minus(place.device, size); + } +#else + PADDLE_THROW("'CUDAPlace' is not supported in CPU only device."); +#endif +} + +#ifdef PADDLE_WITH_CUDA +BuddyAllocator *GetCUDAPinnedBuddyAllocator() { + static std::once_flag init_flag; + static BuddyAllocator *ba = nullptr; + + std::call_once(init_flag, []() { + ba = new BuddyAllocator(std::unique_ptr( + new detail::CUDAPinnedAllocator), + platform::CUDAPinnedMinChunkSize(), + platform::CUDAPinnedMaxChunkSize()); + }); + + return ba; +} +#endif + +template <> +size_t Used(const platform::CUDAPinnedPlace &place) { +#ifdef PADDLE_WITH_CUDA + return GetCUDAPinnedBuddyAllocator()->Used(); +#else + PADDLE_THROW("'CUDAPinnedPlace' is not supported in CPU only device."); +#endif +} + +template <> +void *Alloc(const platform::CUDAPinnedPlace &place, + size_t size) { +#ifdef PADDLE_WITH_CUDA + auto *buddy_allocator = GetCUDAPinnedBuddyAllocator(); + void *ptr = buddy_allocator->Alloc(size); + + if (ptr == nullptr) { + LOG(WARNING) << "cudaHostAlloc Cannot allocate " << size + << " bytes in CUDAPinnedPlace"; + } + if (FLAGS_init_allocated_mem) { + memset(ptr, 0xEF, size); + } + return ptr; +#else + PADDLE_THROW("'CUDAPinnedPlace' is not supported in CPU only device."); +#endif +} + +template <> +void Free(const platform::CUDAPinnedPlace &place, + void *p, size_t size) { +#ifdef PADDLE_WITH_CUDA + GetCUDAPinnedBuddyAllocator()->Free(p); +#else + PADDLE_THROW("'CUDAPinnedPlace' is not supported in CPU only device."); +#endif +} + +struct AllocVisitor : public boost::static_visitor { + inline explicit AllocVisitor(size_t size) : size_(size) {} + + template + inline void *operator()(const Place &place) const { + return Alloc(place, size_); + } + + private: + size_t size_; +}; + +struct FreeVisitor : public boost::static_visitor { + inline explicit FreeVisitor(void *ptr, size_t size) + : ptr_(ptr), size_(size) {} + + template + inline void operator()(const Place &place) const { + Free(place, ptr_, size_); + } + + private: + void *ptr_; + size_t size_; +}; + +size_t Usage::operator()(const platform::CPUPlace &cpu) const { + return Used(cpu); +} + +size_t Usage::operator()(const platform::CUDAPlace &gpu) const { +#ifdef PADDLE_WITH_CUDA + return Used(gpu); +#else + PADDLE_THROW("'CUDAPlace' is not supported in CPU only device."); +#endif +} + +size_t Usage::operator()(const platform::CUDAPinnedPlace &cuda_pinned) const { +#ifdef PADDLE_WITH_CUDA + return Used(cuda_pinned); +#else + PADDLE_THROW("'CUDAPinnedPlace' is not supported in CPU only device."); +#endif +} +} // namespace legacy + +namespace allocation { +LegacyMemMonitor GPUMemMonitor; + +Allocation *LegacyAllocator::AllocateImpl(size_t size, Allocator::Attr attr) { + void *ptr = boost::apply_visitor(legacy::AllocVisitor(size), place_); + auto *tmp_alloc = new Allocation(ptr, size, place_); + platform::MemEvenRecorder::Instance().PushMemRecord( + static_cast(tmp_alloc), place_, size); + return tmp_alloc; +} + +void LegacyAllocator::Free(Allocation *allocation) { + boost::apply_visitor( + legacy::FreeVisitor(allocation->ptr(), allocation->size()), + allocation->place()); + platform::MemEvenRecorder::Instance().PopMemRecord( + static_cast(allocation), place_); + delete allocation; +} + +bool MemInfo::Add(const size_t &size) { + std::lock_guard lock(mutex_); + usage_ += size; + bool peak_point = usage_ > peak_usage_; + if (peak_point) peak_usage_ = usage_; + return peak_point; +} + +void MemInfo::Minus(const size_t &size) { + std::lock_guard lock(mutex_); + usage_ -= size; +} + +uint64_t MemInfo::GetPeakUsage() const { return peak_usage_; } + +LegacyMemMonitor::~LegacyMemMonitor() { + for (auto &item : gpu_mem_info_) delete item.second; +} + +void LegacyMemMonitor::Initialize(const int &device_num) { + for (auto i = 0; i < device_num; ++i) { + gpu_mem_info_[i] = new MemInfo(); + } +} + +void LegacyMemMonitor::Add(const int &device, const size_t &size) { + if (gpu_mem_info_[device]->Add(size)) { + VLOG(3) << "#LegacyMemMonitor# device: " << device + << " peak memory usage : " + << (gpu_mem_info_[device]->GetPeakUsage() >> 20) << " MiB"; + } +} + +void LegacyMemMonitor::Minus(const int &device, const size_t &size) { + gpu_mem_info_[device]->Minus(size); +} + +uint64_t LegacyMemMonitor::GetMemUsage(const int &device) const { + return gpu_mem_info_.find(device) == gpu_mem_info_.end() + ? 0 + : gpu_mem_info_.at(device)->GetPeakUsage(); +} + +void LegacyMemMonitor::PrintMemUsage() { + std::vector devices; + for (const auto &item : gpu_mem_info_) { + devices.emplace_back(item.first); + } + std::sort(devices.begin(), devices.end()); + for (const auto &device : devices) { + std::cout << "Device : " << device << " Peak Memory Usage : " + << (gpu_mem_info_[device]->GetPeakUsage() >> 20) << " MiB" + << std::endl; + } +} + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/legacy_allocator.h b/paddle/fluid/memory/allocation/legacy_allocator.h new file mode 100644 index 0000000000000000000000000000000000000000..d9bdae153da6439598f76f5cac226897e6e0c596 --- /dev/null +++ b/paddle/fluid/memory/allocation/legacy_allocator.h @@ -0,0 +1,84 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include // NOLINT +#include +#include +#include +#include "paddle/fluid/memory/allocation/allocator.h" +#include "paddle/fluid/platform/place.h" +namespace paddle { +namespace memory { +namespace allocation { + +class MemInfo { + public: + MemInfo() : usage_(0), peak_usage_(0) {} + + // return a flag to indicate current operation will create a peak point or not + bool Add(const size_t &); + void Minus(const size_t &); + + uint64_t GetPeakUsage() const; + + private: + /* current memory usage*/ + uint64_t usage_; + uint64_t peak_usage_; + std::mutex mutex_; + + DISABLE_COPY_AND_ASSIGN(MemInfo); +}; + +class LegacyMemMonitor { + public: + // used to store the GPU memory usage of each devices + using MemUsage = std::unordered_map; + + MemUsage GetMemUsageInfo() { return gpu_mem_info_; } + ~LegacyMemMonitor(); + + void Initialize(const int &); + void Add(const int &, const size_t &); + void Minus(const int &, const size_t &); + + uint64_t GetMemUsage(const int &) const; + + void PrintMemUsage(); + + private: + MemUsage gpu_mem_info_; +}; + +extern LegacyMemMonitor GPUMemMonitor; + +class LegacyAllocatorPrivate; +class LegacyAllocator : public Allocator { + public: + explicit LegacyAllocator(const platform::Place &p) : place_(p) {} + + protected: + Allocation *AllocateImpl(size_t size, Allocator::Attr attr) override; + void Free(Allocation *allocation) override; + + private: + platform::Place place_; +}; + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/locked_allocator.cc b/paddle/fluid/memory/allocation/locked_allocator.cc new file mode 100644 index 0000000000000000000000000000000000000000..62d768c580607f32db8c49eb3d62f0f32c9dbeeb --- /dev/null +++ b/paddle/fluid/memory/allocation/locked_allocator.cc @@ -0,0 +1,49 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/memory/allocation/locked_allocator.h" +#include // NOLINT +#include +#include "paddle/fluid/memory/allocation/allocation_with_underlying.h" +#include "paddle/fluid/platform/lock_guard_ptr.h" +namespace paddle { +namespace memory { +namespace allocation { + +bool LockedAllocator::IsAllocThreadSafe() const { return true; } + +LockedAllocator::LockedAllocator( + std::unique_ptr &&underlying_allocator) + : underlying_allocator_(std::move(underlying_allocator)) { + PADDLE_ENFORCE_NOT_NULL(underlying_allocator_); + if (!underlying_allocator_->IsAllocThreadSafe()) { + mtx_.reset(new std::mutex()); + } +} +void LockedAllocator::Free(Allocation *allocation) { + { + platform::LockGuardPtr guard(mtx_); + reinterpret_cast(allocation) + ->allocation_.reset(); // Destroy inner allocation + } + delete allocation; +} +Allocation *LockedAllocator::AllocateImpl(size_t size, Allocator::Attr attr) { + platform::LockGuardPtr guard(mtx_); + return new AllocationWithUnderlying( + underlying_allocator_->Allocate(size, attr)); +} +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/locked_allocator.h b/paddle/fluid/memory/allocation/locked_allocator.h new file mode 100644 index 0000000000000000000000000000000000000000..4967b9bb8d3ad101cff4657b0a45b49b76e2deb2 --- /dev/null +++ b/paddle/fluid/memory/allocation/locked_allocator.h @@ -0,0 +1,41 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. +#pragma once +#include +#include // NOLINT +#include // NOLINT +#include "paddle/fluid/memory/allocation/allocator.h" + +namespace paddle { +namespace memory { +namespace allocation { + +// A allocator to make underlying allocator thread safe. +class LockedAllocator : public Allocator { + public: + explicit LockedAllocator(std::unique_ptr &&underlying_allocator); + bool IsAllocThreadSafe() const override; + + protected: + void Free(Allocation *allocation) override; + Allocation *AllocateImpl(size_t size, Allocator::Attr attr) override; + + private: + std::unique_ptr underlying_allocator_; + std::unique_ptr mtx_; +}; + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/pinned_allocator.cc b/paddle/fluid/memory/allocation/pinned_allocator.cc new file mode 100644 index 0000000000000000000000000000000000000000..de81d12cca6ca280289371abdec225c9e2b8f4d0 --- /dev/null +++ b/paddle/fluid/memory/allocation/pinned_allocator.cc @@ -0,0 +1,40 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/memory/allocation/pinned_allocator.h" +#include +#include + +namespace paddle { +namespace memory { +namespace allocation { +bool CPUPinnedAllocator::IsAllocThreadSafe() const { return true; } +void CPUPinnedAllocator::Free(Allocation *allocation) { + PADDLE_ENFORCE_NOT_NULL(dynamic_cast(allocation)); + PADDLE_ENFORCE(cudaFreeHost(allocation->ptr())); + delete allocation; +} +Allocation *CPUPinnedAllocator::AllocateImpl(size_t size, + Allocator::Attr attr) { + // PADDLE_ENFORCE_EQ( + // attr, kCrossDevice, + // "CPUPinnedAllocator should be used for Cross-Device Communication"); + + void *ptr; + PADDLE_ENFORCE(cudaHostAlloc(&ptr, size, cudaHostAllocPortable)); + return new CPUPinnedAllocation(ptr, size); +} +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/pinned_allocator.h b/paddle/fluid/memory/allocation/pinned_allocator.h new file mode 100644 index 0000000000000000000000000000000000000000..42d0938f2afbb1efca8bfdd7035bc0eada30f06b --- /dev/null +++ b/paddle/fluid/memory/allocation/pinned_allocator.h @@ -0,0 +1,40 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include "paddle/fluid/memory/allocation/allocator.h" + +namespace paddle { +namespace memory { +namespace allocation { + +// Allocator uses `cudaHostAlloc` +class CPUPinnedAllocation : public Allocation { + public: + CPUPinnedAllocation(void *ptr, size_t size) + : Allocation(ptr, size, platform::CUDAPinnedPlace()) {} +}; + +class CPUPinnedAllocator : public Allocator { + public: + bool IsAllocThreadSafe() const override; + + protected: + void Free(Allocation *allocation) override; + Allocation *AllocateImpl(size_t size, Allocator::Attr attr) override; +}; + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/retry_allocator.cc b/paddle/fluid/memory/allocation/retry_allocator.cc new file mode 100644 index 0000000000000000000000000000000000000000..981705051b449e6a35c2dcce9138dc2efae52920 --- /dev/null +++ b/paddle/fluid/memory/allocation/retry_allocator.cc @@ -0,0 +1,75 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/memory/allocation/retry_allocator.h" +#include "paddle/fluid/memory/allocation/allocation_with_underlying.h" +namespace paddle { +namespace memory { +namespace allocation { + +bool RetryAllocator::IsAllocThreadSafe() const { + return underlying_allocator_->IsAllocThreadSafe(); +} + +void RetryAllocator::Free(Allocation* allocation) { + // Delete underlying allocation first. + reinterpret_cast(allocation)->allocation_.reset(); + { + // notify all waited allocators, they can try to allocate memory after free. + std::lock_guard lock(mutex_); + cv_.notify_all(); + } + delete allocation; +} + +Allocation* RetryAllocator::AllocateImpl(size_t size, Allocator::Attr attr) { + auto alloc_func = [&, this]() { + return new AllocationWithUnderlying( + underlying_allocator_->Allocate(size, attr)); + }; + // In fact, we can unify the code of allocation success and failure + // But it would add lock even when allocation success at the first time + try { + return alloc_func(); + } catch (BadAlloc& bad_alloc) { + { + // We can just write allocation retry inside the predicate function of + // wait_until + // But it needs to acquire the lock when executing predicate function + // For better performance, we use loop here + auto end_time = std::chrono::high_resolution_clock::now() + retry_time_; + auto wait_until = [&, this] { + std::unique_lock lock(mutex_); + return cv_.wait_until(lock, end_time); + }; + while (wait_until() != std::cv_status::timeout) { + try { + return alloc_func(); + } catch (BadAlloc& ex) { + bad_alloc = ex; + } catch (...) { + throw; + } + } + + throw; // rethrow the original exception or throw the internal bad_alloc + } + } catch (...) { + throw; + } +} + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/retry_allocator.h b/paddle/fluid/memory/allocation/retry_allocator.h new file mode 100644 index 0000000000000000000000000000000000000000..6ab8ca8fbec0077b2c95cf727731ca0095716197 --- /dev/null +++ b/paddle/fluid/memory/allocation/retry_allocator.h @@ -0,0 +1,67 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include // NOLINT +#include // NOLINT +#include +#include // NOLINT +#include +#include "paddle/fluid/memory/allocation/allocator.h" + +namespace paddle { +namespace memory { +namespace allocation { + +class RetryAllocator; + +class RetryAllocator : public Allocator { + public: + RetryAllocator(std::unique_ptr&& allocator, size_t retry_ms) + : underlying_allocator_(std::move(allocator)), retry_time_(retry_ms) { + EnforceCheck(); + } + + bool IsAllocThreadSafe() const override; + + private: + void EnforceCheck() { + PADDLE_ENFORCE_NOT_NULL( + underlying_allocator_.get(), + "UnderlyingAllocator of RetryAllocator must be UnmanagedAllocator"); + PADDLE_ENFORCE(underlying_allocator_->IsAllocThreadSafe(), + "UnderlyingAllocator of RetryAllocator must be thread-safe"); + } + + protected: + void Free(Allocation* allocation) override; + Allocation* AllocateImpl(size_t size, Allocator::Attr attr) override; + + private: + std::unique_ptr underlying_allocator_; + std::chrono::milliseconds retry_time_; + std::mutex mutex_; + std::condition_variable cv_; + + // For debug, We can add an atomic integer to record how many memory sizes are + // waited to allocate + // std::atomic waited_allocate_size_{0}; + + friend class RetryAllocation; +}; + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/retry_allocator_test.cc b/paddle/fluid/memory/allocation/retry_allocator_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..f0b215dac252475217a403e680a23559280b0e8d --- /dev/null +++ b/paddle/fluid/memory/allocation/retry_allocator_test.cc @@ -0,0 +1,98 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/memory/allocation/retry_allocator.h" +#include +#include // NOLINT +#include // NOLINT +#include // NOLINT +#include // NOLINT +#include +#include "gtest/gtest.h" +#include "paddle/fluid/memory/allocation/best_fit_allocator.h" +#include "paddle/fluid/memory/allocation/cpu_allocator.h" +#include "paddle/fluid/memory/allocation/locked_allocator.h" + +namespace paddle { +namespace memory { +namespace allocation { + +TEST(RetryAllocator, RetryAllocator) { + CPUAllocator cpu_allocator; + + size_t size = (1 << 20); + auto cpu_allocation = cpu_allocator.Allocate(size, cpu_allocator.kDefault); + + std::unique_ptr best_fit_allocator( + new BestFitAllocator(cpu_allocation.get())); + std::unique_ptr locked_allocator( + new LockedAllocator(std::move(best_fit_allocator))); + + size_t thread_num = 32; + size_t sleep_time = 40; + size_t extra_time = 10; + + // Reserve to perform more tests in the future + std::vector> allocators; + { + std::unique_ptr best_fit_allocator( + new BestFitAllocator(cpu_allocation.get())); + std::unique_ptr locked_allocator( + new LockedAllocator(std::move(best_fit_allocator))); + allocators.push_back(std::make_shared( + std::move(locked_allocator), + (thread_num - 1) * (sleep_time + extra_time))); + } + + for (auto &allocator : allocators) { + std::vector threads(thread_num); + std::vector addresses(threads.size(), nullptr); + + std::mutex mutex; + std::condition_variable cv; + bool flag = false; + + for (size_t i = 0; i < threads.size(); ++i) { + threads[i] = std::thread([&, i]() { + { + std::unique_lock lock(mutex); + cv.wait(lock, [&] { return flag; }); + } + + auto ret = allocator->Allocate(size - 1); + addresses[i] = ret->ptr(); + std::this_thread::sleep_for(std::chrono::milliseconds(sleep_time)); + }); + } + + { + std::lock_guard lock(mutex); + flag = true; + cv.notify_all(); + } + + for (auto &th : threads) { + th.join(); + } + + void *val = cpu_allocation->ptr(); + bool is_all_equal = std::all_of(addresses.begin(), addresses.end(), + [val](void *p) { return p == val; }); + ASSERT_TRUE(is_all_equal); + } +} + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/zero_size_allocator.cc b/paddle/fluid/memory/allocation/zero_size_allocator.cc new file mode 100644 index 0000000000000000000000000000000000000000..cb2df1a029815478bbc9d3b09425f3ef145c5fb3 --- /dev/null +++ b/paddle/fluid/memory/allocation/zero_size_allocator.cc @@ -0,0 +1,34 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/memory/allocation/zero_size_allocator.h" + +namespace paddle { +namespace memory { +namespace allocation { + +bool ZeroSizeAllocator::IsAllocThreadSafe() const { + return underlying_allocator_->IsAllocThreadSafe(); +} + +Allocation *ZeroSizeAllocator::AllocateImpl(size_t size, Allocator::Attr attr) { + if (size == 0) { + return new ZeroSizeAllocation(place_); + } else { + return underlying_allocator_->Allocate(size, attr).release(); + } +} +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/zero_size_allocator.h b/paddle/fluid/memory/allocation/zero_size_allocator.h new file mode 100644 index 0000000000000000000000000000000000000000..0f01dfcdf5b1179c52d8c0204b655cab10770d95 --- /dev/null +++ b/paddle/fluid/memory/allocation/zero_size_allocator.h @@ -0,0 +1,51 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include "paddle/fluid/memory/allocation/allocator.h" + +namespace paddle { +namespace memory { +namespace allocation { + +// The allocator handles the request's size is zero. Allocator will always +// return an allocation even the request size is zero. However, the +// allocation.ptr() is nullptr +class ZeroSizeAllocation : public Allocation { + public: + explicit ZeroSizeAllocation(const platform::Place& p) + : Allocation(nullptr, 0, p) {} +}; + +class ZeroSizeAllocator : public Allocator { + public: + ZeroSizeAllocator(std::shared_ptr underlying_allocator, + const platform::Place& p) + : underlying_allocator_(std::move(underlying_allocator)), place_(p) {} + + bool IsAllocThreadSafe() const override; + + protected: + Allocation* AllocateImpl(size_t size, Allocator::Attr attr) override; + + private: + std::shared_ptr underlying_allocator_; + const platform::Place& place_; +}; + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/detail/CMakeLists.txt b/paddle/fluid/memory/detail/CMakeLists.txt index c725dba5e98c200c2542d97cb8f53a938f6b614a..a555b6b299228720c7559e610f4d6f31167e1555 100644 --- a/paddle/fluid/memory/detail/CMakeLists.txt +++ b/paddle/fluid/memory/detail/CMakeLists.txt @@ -9,3 +9,5 @@ endif(${WITH_GPU}) cc_test(system_allocator_test SRCS system_allocator_test.cc DEPS system_allocator) cc_library(buddy_allocator SRCS buddy_allocator.cc DEPS memory_block system_allocator glog) + +cc_test(buddy_allocator_test SRCS buddy_allocator_test.cc DEPS buddy_allocator) diff --git a/paddle/fluid/memory/detail/buddy_allocator.cc b/paddle/fluid/memory/detail/buddy_allocator.cc index 26ef27c3caafadb4801b0ae52133f6175655ce0a..edd6ea4adec2e080d294fdb207d8dd4880fdcf79 100644 --- a/paddle/fluid/memory/detail/buddy_allocator.cc +++ b/paddle/fluid/memory/detail/buddy_allocator.cc @@ -13,6 +13,10 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/memory/detail/buddy_allocator.h" + +#include +#include + #include "glog/logging.h" DEFINE_bool(free_idle_memory, false, @@ -36,9 +40,10 @@ BuddyAllocator::~BuddyAllocator() { "have actually been freed"; while (!pool_.empty()) { auto block = static_cast(std::get<2>(*pool_.begin())); - VLOG(10) << "Free from block (" << block << ", " << max_chunk_size_ << ")"; + VLOG(10) << "Free from block (" << block << ", " << block->size(cache_) + << ")"; - system_allocator_->Free(block, max_chunk_size_, block->index(cache_)); + system_allocator_->Free(block, block->size(cache_), block->index(cache_)); cache_.invalidate(block); pool_.erase(pool_.begin()); } @@ -71,7 +76,7 @@ void* BuddyAllocator::Alloc(size_t unaligned_size) { // refill the pool if failure if (it == pool_.end()) { - it = RefillPool(); + it = RefillPool(size); // if still failure, fail fatally if (it == pool_.end()) { return nullptr; @@ -184,19 +189,28 @@ void* BuddyAllocator::SystemAlloc(size_t size) { return static_cast(p)->data(); } -BuddyAllocator::PoolSet::iterator BuddyAllocator::RefillPool() { +BuddyAllocator::PoolSet::iterator BuddyAllocator::RefillPool( + size_t request_bytes) { + size_t allocate_bytes = max_chunk_size_; + size_t index = 0; + #ifdef PADDLE_WITH_CUDA if (system_allocator_->UseGpu()) { if ((total_used_ + total_free_) == 0) { - // Compute the maximum allocation size for the first allocation. - max_chunk_size_ = platform::GpuMaxChunkSize(); + // Compute the allocation size for gpu for the first allocation. + allocate_bytes = std::max(platform::GpuInitAllocSize(), request_bytes); + } else { + // Reallocation size + if (realloc_size_ == 0) { + realloc_size_ = platform::GpuReallocSize(); + } + allocate_bytes = std::max(realloc_size_, request_bytes); } } #endif - // Allocate a new maximum sized block - size_t index = 0; - void* p = system_allocator_->Alloc(&index, max_chunk_size_); + // Allocate a new block + void* p = system_allocator_->Alloc(&index, allocate_bytes); if (p == nullptr) return pool_.end(); @@ -204,7 +218,7 @@ BuddyAllocator::PoolSet::iterator BuddyAllocator::RefillPool() { << " from system allocator"; static_cast(p)->init(&cache_, MemoryBlock::FREE_CHUNK, index, - max_chunk_size_, nullptr, nullptr); + allocate_bytes, nullptr, nullptr); // gpu fallback allocation if (system_allocator_->UseGpu() && @@ -212,10 +226,10 @@ BuddyAllocator::PoolSet::iterator BuddyAllocator::RefillPool() { fallback_alloc_count_++; } - total_free_ += max_chunk_size_; + total_free_ += allocate_bytes; // dump the block into pool - return pool_.insert(IndexSizeAddress(index, max_chunk_size_, p)).first; + return pool_.insert(IndexSizeAddress(index, allocate_bytes, p)).first; } BuddyAllocator::PoolSet::iterator BuddyAllocator::FindExistChunk(size_t size) { @@ -286,12 +300,12 @@ void BuddyAllocator::CleanIdleFallBackAlloc() { VLOG(10) << "Return block " << block << " to fallback allocator."; - system_allocator_->Free(block, max_chunk_size_, block->index(cache_)); + system_allocator_->Free(block, block->size(cache_), block->index(cache_)); cache_.invalidate(block); pool = PoolSet::reverse_iterator(pool_.erase(std::next(pool).base())); - total_free_ -= max_chunk_size_; + total_free_ -= block->size(cache_); fallback_alloc_count_--; // If no fall allocation exists, return directly @@ -322,12 +336,12 @@ void BuddyAllocator::CleanIdleNormalAlloc() { VLOG(10) << "Return block " << block << " to base allocator."; - system_allocator_->Free(block, max_chunk_size_, block->index(cache_)); + system_allocator_->Free(block, block->size(cache_), block->index(cache_)); cache_.invalidate(block); pool = PoolSet::reverse_iterator(pool_.erase(std::next(pool).base())); - total_free_ -= max_chunk_size_; + total_free_ -= block->size(cache_); if (!shall_free_alloc()) return; } diff --git a/paddle/fluid/memory/detail/buddy_allocator.h b/paddle/fluid/memory/detail/buddy_allocator.h index 3f86a51f0d0b8504bbc4b0477f123093b343e9cf..bdc8cca4b55e6fe67618fb13cd8bf40c2c24858b 100644 --- a/paddle/fluid/memory/detail/buddy_allocator.h +++ b/paddle/fluid/memory/detail/buddy_allocator.h @@ -60,7 +60,7 @@ class BuddyAllocator { void* SystemAlloc(size_t size); /*! \brief If existing chunks are not suitable, refill pool */ - PoolSet::iterator RefillPool(); + PoolSet::iterator RefillPool(size_t request_bytes); /** * \brief Find the suitable chunk from existing pool and split @@ -89,6 +89,8 @@ class BuddyAllocator { size_t min_chunk_size_; // the minimum size of each chunk size_t max_chunk_size_; // the maximum size of each chunk + size_t realloc_size_ = 0; // the size of re-allocated chunk + private: /** * \brief A list of free allocation diff --git a/paddle/fluid/memory/detail/buddy_allocator_test.cc b/paddle/fluid/memory/detail/buddy_allocator_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..1edc9f2034c87d4dbd655135c557bdb86ec4354d --- /dev/null +++ b/paddle/fluid/memory/detail/buddy_allocator_test.cc @@ -0,0 +1,133 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/memory/detail/buddy_allocator.h" + +#include + +#include "gflags/gflags.h" +#include "gtest/gtest.h" +#include "paddle/fluid/memory/detail/system_allocator.h" +#include "paddle/fluid/platform/gpu_info.h" + +#ifdef PADDLE_WITH_CUDA +DECLARE_double(fraction_of_gpu_memory_to_use); +DECLARE_uint64(initial_gpu_memory_in_mb); +DECLARE_uint64(reallocate_gpu_memory_in_mb); +#endif + +namespace paddle { +namespace memory { +namespace detail { + +constexpr static int test_gpu_id = 0; + +void TestBuddyAllocator(BuddyAllocator* allocator, size_t size_bytes) { + bool freed = false; + size_t used_bytes = allocator->Used(); + + if (size_bytes > 0) { + void* p = allocator->Alloc(size_bytes); + + EXPECT_NE(p, nullptr); +#ifdef PADDLE_WITH_CUDA + if (size_bytes < platform::GpuMaxChunkSize()) { +#else + if (size_bytes < platform::CpuMaxChunkSize()) { +#endif + // Not allocate from SystemAllocator + EXPECT_GE(allocator->Used(), used_bytes + size_bytes); + } else { + // Allocate from SystemAllocator doesn't count in Used() + EXPECT_EQ(allocator->Used(), used_bytes); + } + + int* intp = static_cast(p); + std::shared_ptr ptr(intp, [&](void* p) { + allocator->Free(intp); + freed = true; + }); + } else { + freed = true; + } + + EXPECT_EQ(used_bytes, allocator->Used()); + EXPECT_TRUE(freed); +} + +#ifdef PADDLE_WITH_CUDA +TEST(BuddyAllocator, GpuFraction) { + FLAGS_fraction_of_gpu_memory_to_use = 0.01; + + BuddyAllocator buddy_allocator( + std::unique_ptr(new GPUAllocator(test_gpu_id)), + platform::GpuMinChunkSize(), platform::GpuMaxChunkSize()); + + TestBuddyAllocator(&buddy_allocator, 10); + TestBuddyAllocator(&buddy_allocator, 10 << 10); + TestBuddyAllocator(&buddy_allocator, 10 << 20); + TestBuddyAllocator(&buddy_allocator, 2 * static_cast(1 << 30)); +} + +TEST(BuddyAllocator, InitRealloc) { + FLAGS_initial_gpu_memory_in_mb = 100; + FLAGS_reallocate_gpu_memory_in_mb = 50; + + EXPECT_EQ(platform::GpuMaxChunkSize(), static_cast(100 << 20)); + + BuddyAllocator buddy_allocator( + std::unique_ptr(new GPUAllocator(test_gpu_id)), + platform::GpuMinChunkSize(), platform::GpuMaxChunkSize()); + + // Less then initial size and reallocate size + TestBuddyAllocator(&buddy_allocator, 10 << 20); + // Between initial size and reallocate size and not exceed pool + TestBuddyAllocator(&buddy_allocator, 80 << 20); + // Less then reallocate size and exceed pool + TestBuddyAllocator(&buddy_allocator, 40 << 20); + // Greater then reallocate size and exceed pool + TestBuddyAllocator(&buddy_allocator, 80 << 20); + // Greater then initial size and reallocate size + TestBuddyAllocator(&buddy_allocator, 2 * static_cast(1 << 30)); +} + +TEST(BuddyAllocator, ReallocSizeGreaterThanInit) { + FLAGS_initial_gpu_memory_in_mb = 5; + FLAGS_reallocate_gpu_memory_in_mb = 10; + + EXPECT_EQ(platform::GpuMaxChunkSize(), static_cast(10 << 20)); + + BuddyAllocator buddy_allocator( + std::unique_ptr(new GPUAllocator(test_gpu_id)), + platform::GpuMinChunkSize(), platform::GpuMaxChunkSize()); + + // Less then initial size and reallocate size + TestBuddyAllocator(&buddy_allocator, 1 << 20); + // Between initial size and reallocate size and not exceed pool + TestBuddyAllocator(&buddy_allocator, 3 << 20); + // Less then initial size and exceed pool + TestBuddyAllocator(&buddy_allocator, 3 << 20); + // Less then reallocate size and not exceed pool (now pool is 15 MB, used 7 + // MB) + TestBuddyAllocator(&buddy_allocator, 7 << 20); + // Less then reallocate size and exceed pool + TestBuddyAllocator(&buddy_allocator, 8 << 20); + // Greater then initial size and reallocate size + TestBuddyAllocator(&buddy_allocator, 2 * static_cast(1 << 30)); +} +#endif + +} // namespace detail +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/detail/system_allocator.cc b/paddle/fluid/memory/detail/system_allocator.cc index 1b96798d23cec34a1863f56c1e4027ce32b2eec5..41d79c5beb1367907a401b572d3d0eaf3a8ac67b 100644 --- a/paddle/fluid/memory/detail/system_allocator.cc +++ b/paddle/fluid/memory/detail/system_allocator.cc @@ -30,13 +30,11 @@ limitations under the License. */ #include "paddle/fluid/platform/enforce.h" #include "paddle/fluid/platform/gpu_info.h" -// If use_pinned_memory is true, CPUAllocator calls mlock, which -// returns pinned and locked memory as staging areas for data exchange -// between host and device. Allocates too much would reduce the amount -// of memory available to the system for paging. So, by default, we -// should set false to use_pinned_memory. -DEFINE_bool(use_pinned_memory, true, "If set, allocate cpu pinned memory."); +DECLARE_bool(use_pinned_memory); DECLARE_double(fraction_of_gpu_memory_to_use); +DECLARE_uint64(initial_gpu_memory_in_mb); +DECLARE_uint64(reallocate_gpu_memory_in_mb); + namespace paddle { namespace memory { namespace detail { @@ -91,7 +89,11 @@ void CPUAllocator::Free(void* p, size_t size, size_t index) { munlock(p, size); #endif } +#ifdef _WIN32 + _aligned_free(p); +#else free(p); +#endif } bool CPUAllocator::UseGpu() const { return false; } @@ -120,11 +122,18 @@ void* GPUAllocator::Alloc(size_t* index, size_t size) { gpu_alloc_size_ += size; return p; } else { - LOG(WARNING) - << "Cannot malloc " << size / 1024.0 / 1024.0 - << " MB GPU memory. Please shrink FLAGS_fraction_of_gpu_memory_to_use " - "environment variable to a lower value. Current value is " - << FLAGS_fraction_of_gpu_memory_to_use; + LOG(WARNING) << "Cannot malloc " << size / 1024.0 / 1024.0 + << " MB GPU memory. Please shrink " + "FLAGS_fraction_of_gpu_memory_to_use or " + "FLAGS_initial_gpu_memory_in_mb or " + "FLAGS_reallocate_gpu_memory_in_mb" + "environment variable to a lower value. " + << "Current FLAGS_fraction_of_gpu_memory_to_use value is " + << FLAGS_fraction_of_gpu_memory_to_use + << ". Current FLAGS_initial_gpu_memory_in_mb value is " + << FLAGS_initial_gpu_memory_in_mb + << ". Current FLAGS_reallocate_gpu_memory_in_mb value is " + << FLAGS_reallocate_gpu_memory_in_mb; return nullptr; } } @@ -174,14 +183,14 @@ void* CUDAPinnedAllocator::Alloc(size_t* index, size_t size) { void* p; // PINNED memory is visible to all CUDA contexts. - cudaError_t result = cudaMallocHost(&p, size); + cudaError_t result = cudaHostAlloc(&p, size, cudaHostAllocPortable); if (result == cudaSuccess) { *index = 1; // PINNED memory cuda_pinnd_alloc_size_ += size; return p; } else { - LOG(WARNING) << "cudaMallocHost failed."; + LOG(WARNING) << "cudaHostAlloc failed."; return nullptr; } diff --git a/paddle/fluid/memory/malloc.cc b/paddle/fluid/memory/malloc.cc index 0f13a4ea9c1af175771f5cc201ea5c0a8a0f7555..e414ad657a9447142d6e3a42fc7efc86f01e9c9f 100644 --- a/paddle/fluid/memory/malloc.cc +++ b/paddle/fluid/memory/malloc.cc @@ -12,213 +12,22 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include - #include "paddle/fluid/memory/malloc.h" - -#include "glog/logging.h" - -#include "paddle/fluid/memory/detail/buddy_allocator.h" -#include "paddle/fluid/memory/detail/system_allocator.h" -#include "paddle/fluid/platform/gpu_info.h" - -DEFINE_bool(init_allocated_mem, false, - "It is a mistake that the values of the memory allocated by " - "BuddyAllocator are always zeroed in some op's implementation. " - "To find this error in time, we use init_allocated_mem to indicate " - "that initializing the allocated memory with a small value " - "during unit testing."); -DECLARE_double(fraction_of_gpu_memory_to_use); - +#include +#include +#include "paddle/fluid/memory/allocation/allocator_facade.h" +#include "paddle/fluid/memory/allocation/allocator_strategy.h" +#include "paddle/fluid/platform/place.h" namespace paddle { namespace memory { - -using BuddyAllocator = detail::BuddyAllocator; - -BuddyAllocator* GetCPUBuddyAllocator() { - // We tried thread_local for inference::RNN1 model, but that not works much - // for multi-thread test. - static std::once_flag init_flag; - static detail::BuddyAllocator* a = nullptr; - - std::call_once(init_flag, []() { - a = new detail::BuddyAllocator( - std::unique_ptr(new detail::CPUAllocator), - platform::CpuMinChunkSize(), platform::CpuMaxChunkSize()); - }); - - return a; -} - -// We compared the NaiveAllocator with BuddyAllocator in CPU memory allocation, -// seems they are almost the same overhead. -struct NaiveAllocator { - void* Alloc(size_t size) { return malloc(size); } - - void Free(void* p) { - PADDLE_ENFORCE(p); - free(p); - } - - static NaiveAllocator* Instance() { - static NaiveAllocator x; - return &x; - } - - private: - std::mutex lock_; -}; - -template <> -void* Alloc(platform::CPUPlace place, size_t size) { - VLOG(10) << "Allocate " << size << " bytes on " << platform::Place(place); - void* p = GetCPUBuddyAllocator()->Alloc(size); - if (FLAGS_init_allocated_mem) { - memset(p, 0xEF, size); - } - VLOG(10) << " pointer=" << p; - return p; -} - -template <> -void Free(platform::CPUPlace place, void* p) { - VLOG(10) << "Free pointer=" << p << " on " << platform::Place(place); - GetCPUBuddyAllocator()->Free(p); -} - -template <> -size_t Used(platform::CPUPlace place) { - return GetCPUBuddyAllocator()->Used(); -} - -#ifdef PADDLE_WITH_CUDA - -BuddyAllocator* GetGPUBuddyAllocator(int gpu_id) { - static std::once_flag init_flag; - static detail::BuddyAllocator** a_arr = nullptr; - - std::call_once(init_flag, [gpu_id]() { - int gpu_num = platform::GetCUDADeviceCount(); - PADDLE_ENFORCE(gpu_id < gpu_num, "gpu_id:%d should < gpu_num:%d", gpu_id, - gpu_num); - - a_arr = new BuddyAllocator*[gpu_num]; - for (int i = 0; i < gpu_num; i++) { - a_arr[i] = nullptr; - platform::SetDeviceId(i); - a_arr[i] = new BuddyAllocator( - std::unique_ptr(new detail::GPUAllocator(i)), - platform::GpuMinChunkSize(), platform::GpuMaxChunkSize()); - - VLOG(10) << "\n\nNOTE: each GPU device use " - << FLAGS_fraction_of_gpu_memory_to_use * 100 - << "% of GPU memory.\n" - << "You can set GFlags environment variable '" - << "FLAGS_fraction_of_gpu_memory_to_use" - << "' to change the fraction of GPU usage.\n\n"; - } - }); - - platform::SetDeviceId(gpu_id); - return a_arr[gpu_id]; -} - -template <> -size_t Used(platform::CUDAPlace place) { - return GetGPUBuddyAllocator(place.device)->Used(); -} - -template <> -void* Alloc(platform::CUDAPlace place, size_t size) { - auto* buddy_allocator = GetGPUBuddyAllocator(place.device); - auto* ptr = buddy_allocator->Alloc(size); - if (ptr == nullptr) { - int cur_dev = platform::GetCurrentDeviceId(); - platform::SetDeviceId(place.device); - size_t avail, total; - platform::GpuMemoryUsage(&avail, &total); - LOG(WARNING) << "Cannot allocate " << size << " bytes in GPU " - << place.device << ", available " << avail << " bytes"; - LOG(WARNING) << "total " << total; - LOG(WARNING) << "GpuMinChunkSize " << buddy_allocator->GetMinChunkSize(); - LOG(WARNING) << "GpuMaxChunkSize " << buddy_allocator->GetMaxChunkSize(); - LOG(WARNING) << "GPU memory used: " << Used(place); - platform::SetDeviceId(cur_dev); - } - if (FLAGS_init_allocated_mem) { - cudaMemset(ptr, 0xEF, size); - } - return ptr; -} - -template <> -void Free(platform::CUDAPlace place, void* p) { - GetGPUBuddyAllocator(place.device)->Free(p); -} - -BuddyAllocator* GetCUDAPinnedBuddyAllocator() { - static std::once_flag init_flag; - static BuddyAllocator* ba = nullptr; - - std::call_once(init_flag, []() { - ba = new BuddyAllocator(std::unique_ptr( - new detail::CUDAPinnedAllocator), - platform::CUDAPinnedMinChunkSize(), - platform::CUDAPinnedMaxChunkSize()); - }); - - return ba; -} - -template <> -size_t Used(platform::CUDAPinnedPlace place) { - return GetCUDAPinnedBuddyAllocator()->Used(); -} - -template <> -void* Alloc(platform::CUDAPinnedPlace place, - size_t size) { - auto* buddy_allocator = GetCUDAPinnedBuddyAllocator(); - void* ptr = buddy_allocator->Alloc(size); - - if (ptr == nullptr) { - LOG(WARNING) << "cudaMallocHost Cannot allocate " << size - << " bytes in CUDAPinnedPlace"; - } - if (FLAGS_init_allocated_mem) { - memset(ptr, 0xEF, size); - } - return ptr; -} - -template <> -void Free(platform::CUDAPinnedPlace place, void* p) { - GetCUDAPinnedBuddyAllocator()->Free(p); -} -#endif - -size_t Usage::operator()(const platform::CPUPlace& cpu) const { - return Used(cpu); -} - -size_t Usage::operator()(const platform::CUDAPlace& gpu) const { -#ifdef PADDLE_WITH_CUDA - return Used(gpu); -#else - PADDLE_THROW("'CUDAPlace' is not supported in CPU only device."); -#endif -} - -size_t Usage::operator()(const platform::CUDAPinnedPlace& cuda_pinned) const { -#ifdef PADDLE_WITH_CUDA - return Used(cuda_pinned); -#else - PADDLE_THROW("'CUDAPinnedPlace' is not supported in CPU only device."); -#endif +std::shared_ptr AllocShared(const platform::Place& place, + size_t size, Allocator::Attr attr) { + return allocation::AllocatorFacade::Instance().AllocShared(place, size, attr); } -size_t memory_usage(const platform::Place& p) { - return boost::apply_visitor(Usage(), p); +AllocationPtr Alloc(const platform::Place& place, size_t size, + Allocator::Attr attr) { + return allocation::AllocatorFacade::Instance().Alloc(place, size, attr); } } // namespace memory diff --git a/paddle/fluid/memory/malloc.h b/paddle/fluid/memory/malloc.h index 3e6bfddd69cb16edf323d040ea5369cd551f299e..916538b2a659d7d9503fdc337a4ba84fa21f77f9 100644 --- a/paddle/fluid/memory/malloc.h +++ b/paddle/fluid/memory/malloc.h @@ -14,91 +14,21 @@ limitations under the License. */ #pragma once +#include +#include "paddle/fluid/memory/allocation/allocator.h" #include "paddle/fluid/platform/place.h" - namespace paddle { namespace memory { +using allocation::Allocation; +using allocation::Allocator; +using allocation::AllocationPtr; -/** - * \brief Allocate memory block in one place. - * - * \param[in] place Allocation place (CPU or GPU). - * \param[in] size Allocation size. - * - * \return Allocated memory block address. - * - * \note If return nullptr, it indicates memory allocation failed - * because insufficient memory in current system. When Alloc - * function is invoked, you must check the returned memory - * address is valid or not. - */ -template -void* Alloc(Place place, size_t size); - -/** - * \brief Free memory block in one place. - * - * \param[in] place Allocation place (CPU or GPU). - * \param[in] ptr Memory block address to free. - * - */ -template -void Free(Place place, void* ptr); - -/** - * \brief Total size of used memory in one place. - * - * \param[in] place Allocation place (CPU or GPU). - * - */ -template -size_t Used(Place place); - -struct Usage : public boost::static_visitor { - size_t operator()(const platform::CPUPlace& cpu) const; - size_t operator()(const platform::CUDAPlace& gpu) const; - size_t operator()(const platform::CUDAPinnedPlace& cuda_pinned) const; -}; - -size_t memory_usage(const platform::Place& p); - -/** - * \brief Free memory block in one place. - * - * \note In some cases, custom deleter is used to - * deallocate the memory automatically for - * std::unique_ptr in tensor.h. - * - */ -template -class PODDeleter { - static_assert(std::is_pod::value, "T must be POD"); - - public: - explicit PODDeleter(Place place) : place_(place) {} - void operator()(T* ptr) { Free(place_, static_cast(ptr)); } - - private: - Place place_; -}; - -/** - * \brief Free memory block in one place does not meet POD - * - * \note In some cases, custom deleter is used to - * deallocate the memory automatically for - * std::unique_ptr in tensor.h. - * - */ -template -class PlainDeleter { - public: - explicit PlainDeleter(Place place) : place_(place) {} - void operator()(T* ptr) { Free(place_, reinterpret_cast(ptr)); } +extern std::shared_ptr AllocShared( + const platform::Place& place, size_t size, + Allocator::Attr attr = Allocator::kDefault); - private: - Place place_; -}; +extern AllocationPtr Alloc(const platform::Place& place, size_t size, + Allocator::Attr attr = Allocator::kDefault); } // namespace memory } // namespace paddle diff --git a/paddle/fluid/memory/malloc_test.cc b/paddle/fluid/memory/malloc_test.cc deleted file mode 100644 index d39466ef60c3750600dea726a6570397423d42f6..0000000000000000000000000000000000000000 --- a/paddle/fluid/memory/malloc_test.cc +++ /dev/null @@ -1,198 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/memory/malloc.h" - -#include - -#include "gtest/gtest.h" -#include "paddle/fluid/memory/detail/memory_block.h" -#include "paddle/fluid/platform/cpu_info.h" -#include "paddle/fluid/platform/gpu_info.h" -#include "paddle/fluid/platform/place.h" - -inline bool is_aligned(void const *p) { - return 0 == (reinterpret_cast(p) & 0x3); -} - -size_t align(size_t size, paddle::platform::CPUPlace place) { - size += sizeof(paddle::memory::detail::MemoryBlock::Desc); - size_t alignment = paddle::platform::CpuMinChunkSize(); - size_t remaining = size % alignment; - return remaining == 0 ? size : size + (alignment - remaining); -} - -TEST(BuddyAllocator, CPUAllocation) { - void *p = nullptr; - - EXPECT_EQ(p, nullptr); - - paddle::platform::CPUPlace cpu; - p = paddle::memory::Alloc(cpu, 4096); - - EXPECT_NE(p, nullptr); - - paddle::platform::Place place = cpu; - EXPECT_EQ(paddle::memory::Used(cpu), paddle::memory::memory_usage(place)); - - paddle::memory::Free(cpu, p); -} - -TEST(BuddyAllocator, CPUMultAlloc) { - paddle::platform::CPUPlace cpu; - - std::unordered_map ps; - - size_t total_size = paddle::memory::Used(cpu); - EXPECT_EQ(total_size, 0UL); - - for (auto size : - {0, 128, 256, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304}) { - ps[paddle::memory::Alloc(cpu, size)] = size; - - // Buddy Allocator doesn't manage too large memory chunk - if (paddle::memory::Used(cpu) == total_size) continue; - - size_t aligned_size = align(size, cpu); - total_size += aligned_size; - EXPECT_EQ(total_size, paddle::memory::Used(cpu)); - } - - for (auto p : ps) { - EXPECT_EQ(is_aligned(p.first), true); - paddle::memory::Free(cpu, p.first); - - // Buddy Allocator doesn't manage too large memory chunk - if (paddle::memory::Used(cpu) == total_size) continue; - - size_t aligned_size = align(p.second, cpu); - total_size -= aligned_size; - EXPECT_EQ(total_size, paddle::memory::Used(cpu)); - } -} - -#ifdef PADDLE_WITH_CUDA - -size_t align(size_t size, paddle::platform::CUDAPlace place) { - size += sizeof(paddle::memory::detail::MemoryBlock::Desc); - size_t alignment = paddle::platform::GpuMinChunkSize(); - size_t remaining = size % alignment; - return remaining == 0 ? size : size + (alignment - remaining); -} - -TEST(BuddyAllocator, GPUAllocation) { - void *p = nullptr; - - EXPECT_EQ(p, nullptr); - - paddle::platform::CUDAPlace gpu(0); - p = paddle::memory::Alloc(gpu, 4096); - - EXPECT_NE(p, nullptr); - - paddle::platform::Place place = gpu; - EXPECT_EQ(paddle::memory::Used(gpu), paddle::memory::memory_usage(place)); - - paddle::memory::Free(gpu, p); -} - -TEST(BuddyAllocator, GPUMultAlloc) { - paddle::platform::CUDAPlace gpu; - - std::unordered_map ps; - - size_t total_size = paddle::memory::Used(gpu); - EXPECT_EQ(total_size, 0UL); - - for (auto size : - {0, 128, 256, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304}) { - ps[paddle::memory::Alloc(gpu, size)] = size; - - // Buddy Allocator doesn't manage too large memory chunk - if (paddle::memory::Used(gpu) == total_size) continue; - - size_t aligned_size = align(size, gpu); - total_size += aligned_size; - EXPECT_EQ(total_size, paddle::memory::Used(gpu)); - } - - for (auto p : ps) { - EXPECT_EQ(is_aligned(p.first), true); - paddle::memory::Free(gpu, p.first); - - // Buddy Allocator doesn't manage too large memory chunk - if (paddle::memory::Used(gpu) == total_size) continue; - - size_t aligned_size = align(p.second, gpu); - total_size -= aligned_size; - EXPECT_EQ(total_size, paddle::memory::Used(gpu)); - } -} - -size_t align(size_t size, paddle::platform::CUDAPinnedPlace place) { - size += sizeof(paddle::memory::detail::MemoryBlock::Desc); - size_t alignment = paddle::platform::CUDAPinnedMinChunkSize(); - size_t remaining = size % alignment; - return remaining == 0 ? size : size + (alignment - remaining); -} - -TEST(BuddyAllocator, CUDAPinnedAllocator) { - void *p = nullptr; - - EXPECT_EQ(p, nullptr); - - paddle::platform::CUDAPinnedPlace cpu; - p = paddle::memory::Alloc(cpu, 4096); - - EXPECT_NE(p, nullptr); - - paddle::platform::Place place = cpu; - EXPECT_EQ(paddle::memory::Used(cpu), paddle::memory::memory_usage(place)); - - paddle::memory::Free(cpu, p); -} - -TEST(BuddyAllocator, CUDAPinnedMultAllocator) { - paddle::platform::CUDAPinnedPlace cpu; - - std::unordered_map ps; - - size_t total_size = paddle::memory::Used(cpu); - EXPECT_EQ(total_size, 0UL); - - for (auto size : - {0, 128, 256, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304}) { - ps[paddle::memory::Alloc(cpu, size)] = size; - - // Buddy Allocator doesn't manage too large memory chunk - if (paddle::memory::Used(cpu) == total_size) continue; - - size_t aligned_size = align(size, cpu); - total_size += aligned_size; - EXPECT_EQ(total_size, paddle::memory::Used(cpu)); - } - - for (auto p : ps) { - EXPECT_EQ(is_aligned(p.first), true); - paddle::memory::Free(cpu, p.first); - - // Buddy Allocator doesn't manage too large memory chunk - if (paddle::memory::Used(cpu) == total_size) continue; - - size_t aligned_size = align(p.second, cpu); - total_size -= aligned_size; - EXPECT_EQ(total_size, paddle::memory::Used(cpu)); - } -} -#endif diff --git a/paddle/fluid/memory/memcpy.cc b/paddle/fluid/memory/memcpy.cc index a177d4985fd0e2cca983b6873af89c60f526b811..1408163e4b5278ddcd65eb4f2900109d772a589a 100644 --- a/paddle/fluid/memory/memcpy.cc +++ b/paddle/fluid/memory/memcpy.cc @@ -15,6 +15,7 @@ limitations under the License. */ #include "paddle/fluid/memory/memcpy.h" #include // for memcpy +#include "paddle/fluid/platform/profiler.h" namespace paddle { namespace memory { @@ -27,15 +28,30 @@ void Copy(platform::CPUPlace, void* dst, } #ifdef PADDLE_WITH_CUDA +static constexpr size_t kMaxGpuAsyncCopyBytes = 64 * 1024; // 64K + +// NOTE(zcd): Do not use GpuMemcpySync as much as possible. +// because GpuMemcpySync issues the copying command to the default stream, +// which will make two commands from different streams cannot run concurrently. +// Reference: +// https://devblogs.nvidia.com/gpu-pro-tip-cuda-7-streams-simplify-concurrency/ + template <> void Copy( platform::CPUPlace dst_place, void* dst, platform::CUDAPlace src_place, const void* src, size_t num, cudaStream_t stream) { platform::SetDeviceId(src_place.device); + if (stream) { + platform::RecordEvent record_event("GpuMemcpyAsync:GPU->CPU"); platform::GpuMemcpyAsync(dst, src, num, cudaMemcpyDeviceToHost, stream); } else { + platform::RecordEvent record_event("GpuMemcpySync:GPU->CPU"); platform::GpuMemcpySync(dst, src, num, cudaMemcpyDeviceToHost); + // FIXME(zjl): do we really need it? + if (num <= kMaxGpuAsyncCopyBytes) { + cudaStreamSynchronize(0); + } } } @@ -45,9 +61,15 @@ void Copy( const void* src, size_t num, cudaStream_t stream) { platform::SetDeviceId(dst_place.device); if (stream) { + platform::RecordEvent record_event("GpuMemcpyAsync:CPU->GPU"); platform::GpuMemcpyAsync(dst, src, num, cudaMemcpyHostToDevice, stream); } else { + platform::RecordEvent record_event("GpuMemcpySync:CPU->GPU"); platform::GpuMemcpySync(dst, src, num, cudaMemcpyHostToDevice); + // FIXME(zjl): do we really need it? + if (num <= kMaxGpuAsyncCopyBytes) { + cudaStreamSynchronize(0); + } } } @@ -58,15 +80,19 @@ void Copy( if (dst_place == src_place) { platform::SetDeviceId(src_place.device); if (stream) { + platform::RecordEvent record_event("GpuMemcpyAsync(same_gpu):GPU->GPU"); platform::GpuMemcpyAsync(dst, src, num, cudaMemcpyDeviceToDevice, stream); } else { + platform::RecordEvent record_event("GpuMemcpySync(same_gpu):GPU->GPU"); platform::GpuMemcpySync(dst, src, num, cudaMemcpyDeviceToDevice); } } else { if (stream) { + platform::RecordEvent record_event("GpuMemcpyPeerAsync:GPU->GPU"); platform::GpuMemcpyPeerAsync(dst, dst_place.device, src, src_place.device, num, stream); } else { + platform::RecordEvent record_event("GpuMemcpyPeerSync:GPU->GPU"); platform::GpuMemcpyPeerSync(dst, dst_place.device, src, src_place.device, num); } @@ -101,8 +127,10 @@ void Copy( cudaStream_t stream) { platform::SetDeviceId(src_place.device); if (stream) { + platform::RecordEvent record_event("GpuMemcpyAsync:GPU->CUDAPinned"); platform::GpuMemcpyAsync(dst, src, num, cudaMemcpyDeviceToHost, stream); } else { + platform::RecordEvent record_event("GpuMemcpySync:GPU->CUDAPinned"); platform::GpuMemcpySync(dst, src, num, cudaMemcpyDeviceToHost); } } @@ -114,8 +142,10 @@ void Copy( cudaStream_t stream) { platform::SetDeviceId(dst_place.device); if (stream) { + platform::RecordEvent record_event("GpuMemcpyAsync:CUDAPinned->GPU"); platform::GpuMemcpyAsync(dst, src, num, cudaMemcpyHostToDevice, stream); } else { + platform::RecordEvent record_event("GpuMemcpySync:CUDAPinned->GPU"); platform::GpuMemcpySync(dst, src, num, cudaMemcpyHostToDevice); } } diff --git a/paddle/fluid/op_use_default_grad_op_maker.spec b/paddle/fluid/op_use_default_grad_op_maker.spec new file mode 100644 index 0000000000000000000000000000000000000000..21a25ce7d5e2bad172cf50cee6138ef4b44b07c1 --- /dev/null +++ b/paddle/fluid/op_use_default_grad_op_maker.spec @@ -0,0 +1,46 @@ +attention_lstm +conv_shift +cos_sim +dequantize +fc +flatten +fsp +fused_embedding_fc_lstm +fused_embedding_seq_pool +fusion_gru +fusion_lstm +fusion_repeated_fc_relu +fusion_seqconv_eltadd_relu +fusion_seqexpand_concat_fc +fusion_seqpool_concat +fusion_squared_mat_sub +gru +hierarchical_sigmoid +lrn +lstm_unit +lstmp +max_pool2d_with_index +max_pool3d_with_index +maxout +modified_huber_loss +nce +pool2d +pool3d +prelu +quantize +rank_loss +reduce_max +reduce_mean +reduce_min +reduce_prod +reduce_sum +requantize +reshape +rnn_memory_helper +sequence_softmax +spp +squeeze +tensor_array_to_tensor +transpose +unpool +unsqueeze diff --git a/paddle/fluid/operators/CMakeLists.txt b/paddle/fluid/operators/CMakeLists.txt index c97225669a572cd62250729a9e4e9f7b674816e4..e52e83673fe1c9ad2426e45f233c5e62f5c2f06e 100644 --- a/paddle/fluid/operators/CMakeLists.txt +++ b/paddle/fluid/operators/CMakeLists.txt @@ -1,353 +1,114 @@ -file(GLOB GENERAL_OPS RELATIVE "${CMAKE_CURRENT_SOURCE_DIR}" "*_op.cc") -string(REPLACE "_mkldnn" "" GENERAL_OPS "${GENERAL_OPS}") -string(REPLACE ".cc" "" GENERAL_OPS "${GENERAL_OPS}") -list(REMOVE_DUPLICATES GENERAL_OPS) -set(DEPS_OPS "") -set(pybind_file ${PADDLE_BINARY_DIR}/paddle/fluid/pybind/pybind.h) -file(WRITE ${pybind_file} "// Generated by the paddle/fluid/operator/CMakeLists.txt. DO NOT EDIT!\n\n") -function(op_library TARGET) - # op_library is a function to create op library. The interface is same as - # cc_library. But it handle split GPU/CPU code and link some common library - # for ops. - set(cc_srcs) - set(cu_srcs) - set(hip_cu_srcs) - set(miopen_hip_cc_srcs) - set(cu_cc_srcs) - set(cudnn_cu_cc_srcs) - set(CUDNN_FILE) - set(mkldnn_cc_srcs) - set(MKLDNN_FILE) - set(op_common_deps operator op_registry math_function) - set(options "") - set(oneValueArgs "") - set(multiValueArgs SRCS DEPS) - set(pybind_flag 0) - cmake_parse_arguments(op_library "${options}" "${oneValueArgs}" - "${multiValueArgs}" ${ARGN}) - - list(LENGTH op_library_SRCS op_library_SRCS_len) - if (${op_library_SRCS_len} EQUAL 0) - if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${TARGET}.cc) - list(APPEND cc_srcs ${TARGET}.cc) - endif() - if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${TARGET}.cu.cc) - list(APPEND cu_cc_srcs ${TARGET}.cu.cc) - endif() - if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${TARGET}.cu) - list(APPEND cu_srcs ${TARGET}.cu) - endif() - if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${TARGET}.hip.cu) - list(APPEND hip_cu_srcs ${TARGET}.hip.cu) - endif() - string(REPLACE "_op" "_cudnn_op" CUDNN_FILE "${TARGET}") - if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${CUDNN_FILE}.cu.cc) - list(APPEND cudnn_cu_cc_srcs ${CUDNN_FILE}.cu.cc) - endif() - if(WITH_AMD_GPU) - string(REPLACE "_op" "_miopen_op" MIOPEN_FILE "${TARGET}") - if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${MIOPEN_FILE}.hip.cc) - list(APPEND miopen_hip_cc_srcs ${MIOPEN_FILE}.hip.cc) - endif() - endif() - if(WITH_MKLDNN) - string(REPLACE "_op" "_mkldnn_op" MKLDNN_FILE "${TARGET}") - if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${MKLDNN_FILE}.cc) - list(APPEND mkldnn_cc_srcs ${MKLDNN_FILE}.cc) - endif() - endif() - else() - foreach(src ${op_library_SRCS}) - if (${src} MATCHES ".*\\.hip.cu$") - list(APPEND hip_cu_srcs ${src}) - elseif (${src} MATCHES ".*\\.cu$") - list(APPEND cu_srcs ${src}) - elseif(${src} MATCHES ".*_cudnn_op.cu.cc$") - list(APPEND cudnn_cu_cc_srcs ${src}) - elseif(WITH_AMD_GPU AND ${src} MATCHES ".*_miopen_op.hip.cc$") - list(APPEND miopen_hip_cc_srcs ${src}) - elseif(WITH_MKLDNN AND ${src} MATCHES ".*_mkldnn_op.cc$") - list(APPEND mkldnn_cc_srcs ${src}) - elseif(${src} MATCHES ".*\\.cu.cc$") - list(APPEND cu_cc_srcs ${src}) - elseif(${src} MATCHES ".*\\.cc$") - list(APPEND cc_srcs ${src}) - else() - message(FATAL_ERROR "${TARGET} Source file ${src} should only be .cc or .cu") - endif() - endforeach() - endif() - - list(LENGTH cc_srcs cc_srcs_len) - if (${cc_srcs_len} EQUAL 0) - message(FATAL_ERROR "The op library ${TARGET} should contains at least one .cc file") - endif() - if (WIN32) - # remove windows unsupported op, because windows has no nccl, no warpctc such ops. - foreach(windows_unsupport_op "nccl_op" "gen_nccl_id_op" "warpctc_op" "hierarchical_sigmoid_op" - "crf_decoding_op" "select_op" "lstmp_op" "gru_op" "fusion_gru_op" "lstm_op" "fusion_lstm_op" "cumsum_op" - "channel_send_op" "channel_create_op" "channel_close_op" "channel_recv_op") - if ("${TARGET}" STREQUAL "${windows_unsupport_op}") - return() - endif() - endforeach() - endif(WIN32) - set(OP_LIBRARY ${TARGET} ${OP_LIBRARY} PARENT_SCOPE) - - list(LENGTH op_library_DEPS op_library_DEPS_len) - if (${op_library_DEPS_len} GREATER 0) - set(DEPS_OPS ${TARGET} ${DEPS_OPS} PARENT_SCOPE) - endif() - if (WITH_GPU) - nv_library(${TARGET} SRCS ${cc_srcs} ${cu_cc_srcs} ${cudnn_cu_cc_srcs} ${mkldnn_cc_srcs} ${cu_srcs} DEPS ${op_library_DEPS} - ${op_common_deps}) - elseif (WITH_AMD_GPU) - hip_library(${TARGET} SRCS ${cc_srcs} ${hip_cu_srcs} ${miopen_hip_cc_srcs} ${mkldnn_cc_srcs} DEPS ${op_library_DEPS} - ${op_common_deps}) - else() - cc_library(${TARGET} SRCS ${cc_srcs} ${mkldnn_cc_srcs} DEPS ${op_library_DEPS} - ${op_common_deps}) - endif() - - # Define operators that don't need pybind here. - foreach(manual_pybind_op "compare_op" "logical_op" "nccl_op" -"tensor_array_read_write_op" "tensorrt_engine_op") - if ("${TARGET}" STREQUAL "${manual_pybind_op}") - set(pybind_flag 1) - endif() - endforeach() - - # The registration of USE_OP, please refer to paddle/fluid/framework/op_registry.h. - # Note that it's enough to just adding one operator to pybind in a *_op.cc file. - # And for detail pybind information, please see generated paddle/pybind/pybind.h. - file(READ ${TARGET}.cc TARGET_CONTENT) - string(REGEX MATCH "REGISTER_OPERATOR\\(.*REGISTER_OPERATOR\\(" multi_register "${TARGET_CONTENT}") - string(REGEX MATCH "REGISTER_OPERATOR\\([a-z0-9_]*," one_register "${multi_register}") - if (one_register STREQUAL "") - string(REPLACE "_op" "" TARGET "${TARGET}") - else () - string(REPLACE "REGISTER_OPERATOR(" "" TARGET "${one_register}") - string(REPLACE "," "" TARGET "${TARGET}") - endif() - - # pybind USE_NO_KERNEL_OP - # HACK: if REGISTER_OP_CPU_KERNEL presents the operator must have kernel - string(REGEX MATCH "REGISTER_OP_CPU_KERNEL" regex_result "${TARGET_CONTENT}") - string(REPLACE "_op" "" TARGET "${TARGET}") - if (${pybind_flag} EQUAL 0 AND regex_result STREQUAL "") - file(APPEND ${pybind_file} "USE_NO_KERNEL_OP(${TARGET});\n") - set(pybind_flag 1) - endif() - - # pybind USE_CPU_ONLY_OP - list(LENGTH cu_srcs cu_srcs_len) - list(LENGTH cu_cc_srcs cu_cc_srcs_len) - list(LENGTH mkldnn_cc_srcs mkldnn_cc_srcs_len) - list(LENGTH hip_cu_srcs hip_cu_srcs_len) - list(LENGTH miopen_hip_cc_srcs miopen_hip_cc_srcs_len) - if (${pybind_flag} EQUAL 0 AND ${mkldnn_cc_srcs_len} EQUAL 0 AND ${cu_srcs_len} EQUAL 0 AND ${cu_cc_srcs_len} EQUAL 0 AND - ${hip_cu_srcs_len} EQUAL 0 AND ${miopen_hip_cc_srcs_len} EQUAL 0) - file(APPEND ${pybind_file} "USE_CPU_ONLY_OP(${TARGET});\n") - set(pybind_flag 1) - endif() - - # pybind USE_OP_DEVICE_KERNEL for CUDNN - list(LENGTH cudnn_cu_cc_srcs cudnn_cu_cc_srcs_len) - if (WITH_GPU AND ${cudnn_cu_cc_srcs_len} GREATER 0) - file(APPEND ${pybind_file} "USE_OP_DEVICE_KERNEL(${TARGET}, CUDNN);\n") - endif() - - # pybind USE_OP_DEVICE_KERNEL for MIOPEN - if (WITH_AMD_GPU AND ${miopen_hip_cc_srcs_len} GREATER 0) - file(APPEND ${pybind_file} "USE_OP_DEVICE_KERNEL(${TARGET}, MIOPEN);\n") - endif() - - # pybind USE_OP_DEVICE_KERNEL for MKLDNN - if (WITH_MKLDNN AND ${mkldnn_cc_srcs_len} GREATER 0) - # Append first implemented MKLDNN activation operator - if (${MKLDNN_FILE} STREQUAL "activation_mkldnn_op") - file(APPEND ${pybind_file} "USE_OP_DEVICE_KERNEL(relu, MKLDNN);\n") - else() - file(APPEND ${pybind_file} "USE_OP_DEVICE_KERNEL(${TARGET}, MKLDNN);\n") - endif() - endif() +include(operators) - # pybind USE_OP - if (${pybind_flag} EQUAL 0) - # NOTE(*): activation use macro to regist the kernels, set use_op manually. - if(${TARGET} STREQUAL "activation") - file(APPEND ${pybind_file} "USE_OP(relu);\n") - elseif(${TARGET} STREQUAL "fake_dequantize") - file(APPEND ${pybind_file} "USE_OP(fake_dequantize_max_abs);\n") - elseif(${TARGET} STREQUAL "fake_quantize") - file(APPEND ${pybind_file} "USE_OP(fake_quantize_abs_max);\n") - elseif(${TARGET} STREQUAL "tensorrt_engine_op") - message(STATUS "Pybind skips [tensorrt_engine_op], for this OP is only used in inference") - elseif(${TARGET} STREQUAL "fc") - # HACK: fc only have mkldnn and cpu, which would mismatch the cpu only condition - file(APPEND ${pybind_file} "USE_CPU_ONLY_OP(${TARGET});\n") - else() - file(APPEND ${pybind_file} "USE_OP(${TARGET});\n") - endif() - endif() -endfunction() +# clean cache and pybind_file content first when rebuild +unset(GLOB_OP_LIB CACHE) +unset(OP_LIBRARY CACHE) +set(pybind_file ${PADDLE_BINARY_DIR}/paddle/fluid/pybind/pybind.h CACHE INTERNAL "pybind.h file") +file(WRITE ${pybind_file} "// Generated by the paddle/fluid/operator/CMakeLists.txt. DO NOT EDIT!\n\n") add_subdirectory(math) -if (NOT WIN32) -add_subdirectory(nccl) -if(WITH_GPU) - op_library(nccl_op DEPS nccl_common) - file(APPEND ${pybind_file} "USE_CUDA_ONLY_OP(ncclAllReduce);\n") -else() - set(DEPS_OPS ${DEPS_OPS} nccl_op) -endif() -endif() # NOT WIN32 +add_subdirectory(controlflow) +add_subdirectory(csp) +add_subdirectory(detection) +add_subdirectory(elementwise) +add_subdirectory(fused) +add_subdirectory(metrics) +add_subdirectory(ngraph) +add_subdirectory(optimizers) +add_subdirectory(reduce_ops) +add_subdirectory(sequence_ops) +add_subdirectory(jit) -set(DISTRIBUTE_DEPS "") if(WITH_DISTRIBUTE) add_subdirectory(distributed) - set(DISTRIBUTE_DEPS "") - if(WITH_GRPC) - set(DISTRIBUTE_DEPS sendrecvop_grpc grpc++_unsecure grpc_unsecure gpr cares zlib protobuf node) - else() - set(DISTRIBUTE_DEPS sendrecvop_brpc brpc leveldb snappystream snappy protobuf ssl crypto zlib node) - if(WITH_BRPC_RDMA) - find_library(IBVERBS_LIBRARY NAMES ibverbs) - ADD_LIBRARY(ibverbs SHARED IMPORTED GLOBAL) - SET_PROPERTY(TARGET ibverbs PROPERTY IMPORTED_LOCATION ${IBVERBS_LIBRARY}) - + add_subdirectory(distributed_ops) +endif() - find_library(RDMACM_LIBRARY NAMES rdmacm) - ADD_LIBRARY(rdmacm SHARED IMPORTED GLOBAL) - SET_PROPERTY(TARGET rdmacm PROPERTY IMPORTED_LOCATION ${RDMACM_LIBRARY}) +add_subdirectory(reader) - set(DISTRIBUTE_DEPS ${DISTRIBUTE_DEPS} ibverbs rdmacm) - endif() - endif() +if (NOT WIN32) + add_subdirectory(nccl) +endif() - set(DISTRIBUTE_COMPILE_FLAGS "-Wno-non-virtual-dtor -Wno-error=non-virtual-dtor -Wno-error=delete-non-virtual-dtor") - foreach(dist_op "prefetch_op" "checkpoint_notify_op" "listen_and_serv_op" "send_op" "recv_op" "send_barrier_op" "fetch_barrier_op") - op_library(${dist_op} DEPS ${DISTRIBUTE_DEPS}) - set_source_files_properties(${dist_op}.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) - endforeach() +if (WITH_GPU AND TENSORRT_FOUND) + add_subdirectory(tensorrt) +endif() - #set_source_files_properties(send_recv_op_test.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) - #cc_test(test_send_recv SRCS send_recv_op_test.cc DEPS prefetch_op send_op - # listen_and_serv_op sum_op executor SERIAL) - if(WITH_GPU AND NOT WIN32) - set_source_files_properties(test_send_nccl_id.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) - cc_test(test_send_nccl_id SRCS test_send_nccl_id.cc DEPS listen_and_serv_op ${DISTRIBUTE_DEPS} executor SERIAL) - if(WITH_GRPC) - op_library(gen_nccl_id_op DEPS nccl_common sendrecvop_grpc) - else() - op_library(gen_nccl_id_op DEPS nccl_common sendrecvop_brpc) - endif() - set_source_files_properties(gen_nccl_id_op.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) - else() - set(DEPS_OPS ${DEPS_OPS} gen_nccl_id_op) - endif() # WITH_GPU AND NOT WIN32 -else() - set(DEPS_OPS ${DEPS_OPS} checkpoint_notify_op prefetch_op recv_op listen_and_serv_op send_op send_barrier_op fetch_barrier_op gen_nccl_id_op) +if (ANAKIN_FOUND) + add_subdirectory(anakin) endif() -op_library(cross_entropy_op DEPS cross_entropy) -if(WITH_GPU) - op_library(softmax_with_cross_entropy_op DEPS cross_entropy softmax cub) - op_library(sequence_softmax_op DEPS cub) -else() - op_library(softmax_with_cross_entropy_op DEPS cross_entropy softmax) +SET(OP_HEADER_DEPS xxhash) +if (WITH_GPU) + SET(OP_HEADER_DEPS ${OP_HEADER_DEPS} cub) endif() -op_library(softmax_op DEPS softmax) -if (WITH_GPU AND TENSORRT_FOUND) - op_library(tensorrt_engine_op DEPS tensorrt_engine tensorrt_converter) - file(APPEND ${pybind_file} "USE_CUDA_ONLY_OP(tensorrt_engine);\n") - nv_test(test_tensorrt_engine_op SRCS tensorrt_engine_op_test.cc - DEPS tensorrt_engine_op - analysis) -else() - set(DEPS_OPS ${DEPS_OPS} tensorrt_engine_op) +SET(OP_PREFETCH_DEPS "") +if (WITH_DISTRIBUTE) + SET(OP_PREFETCH_DEPS ${OP_PREFETCH_DEPS} parameter_prefetch) endif() -op_library(clip_by_norm_op DEPS selected_rows_functor selected_rows) -op_library(sum_op DEPS selected_rows_functor) -op_library(sgd_op DEPS selected_rows_functor) -op_library(print_op DEPS lod_tensor) -op_library(adagrad_op DEPS selected_rows_functor) -op_library(maxout_op DEPS maxouting) -op_library(unpool_op DEPS unpooling) -op_library(pool_op DEPS pooling) -op_library(pool_with_index_op DEPS pooling) -op_library(lod_rank_table_op DEPS lod_rank_table) -op_library(lod_tensor_to_array_op DEPS lod_rank_table_op) -op_library(array_to_lod_tensor_op DEPS lod_rank_table_op) -op_library(max_sequence_len_op DEPS lod_rank_table) -op_library(sequence_conv_op DEPS context_project) -op_library(sequence_pool_op DEPS sequence_pooling) -if (NOT WIN32) -op_library(lstm_op DEPS sequence2batch lstm_compute) -op_library(hierarchical_sigmoid_op DEPS matrix_bit_code) -op_library(lstmp_op DEPS sequence2batch lstm_compute) -op_library(gru_op DEPS sequence2batch gru_compute) -endif(NOT WIN32) -op_library(recurrent_op DEPS executor) -op_library(warpctc_op DEPS dynload_warpctc sequence_padding sequence_scale) -op_library(cos_sim_op DEPS cos_sim_functor) -op_library(parallel_do_op DEPS executor) -op_library(unsqueeze_op DEPS reshape_op) -op_library(squeeze_op DEPS reshape_op) -op_library(extract_rows_op DEPS memory) -op_library(flatten_op DEPS reshape_op) -op_library(sequence_pad_op DEPS sequence_padding) -op_library(unstack_op DEPS stack_op) -op_library(fake_quantize_op DEPS memory) -op_library(fusion_lstm_op DEPS jit_kernel) + +register_operators(EXCLUDES py_func_op warpctc_op dgc_op conv_fusion_op sync_batch_norm_op DEPS ${OP_HEADER_DEPS} ${OP_PREFETCH_DEPS}) + if (WITH_GPU) - op_library(conv_op DEPS vol2col depthwise_conv im2col) - op_library(layer_norm_op DEPS cub) - op_library(reduce_mean_op DEPS cub) - op_library(affine_channel_op DEPS cub) + # warpctc_op needs cudnn 7 above + if (${CUDNN_MAJOR_VERSION} VERSION_LESS 7) + op_library(warpctc_op DEPS dynload_warpctc sequence_padding sequence_scale SRCS warpctc_op.cc warpctc_op.cu.cc) + else() + op_library(warpctc_op DEPS dynload_warpctc sequence_padding sequence_scale) + endif() + # conv_fusion_op needs cudnn 7 above + if (NOT ${CUDNN_VERSION} VERSION_LESS 7100) + op_library(conv_fusion_op) + file(APPEND ${pybind_file} "USE_CUDA_ONLY_OP(conv2d_fusion);\n") + endif() + if (NOT WIN32) + op_library(sync_batch_norm_op) + file(APPEND ${pybind_file} "USE_CUDA_ONLY_OP(sync_batch_norm);\n") + endif() else() - op_library(conv_op DEPS vol2col im2col) + op_library(warpctc_op DEPS dynload_warpctc sequence_padding sequence_scale) endif() -op_library(conv_transpose_op DEPS vol2col im2col) - -# FIXME(typhoonzero): save/load depends lodtensor serialization functions -op_library(save_op DEPS lod_tensor) -op_library(load_op DEPS lod_tensor) -op_library(save_combine_op DEPS lod_tensor) -op_library(load_combine_op DEPS lod_tensor) -op_library(concat_op DEPS concat) -list(REMOVE_ITEM GENERAL_OPS ${DEPS_OPS}) +set(COMMON_OP_DEPS ${OP_HEADER_DEPS}) -foreach(src ${GENERAL_OPS}) - op_library(${src}) -endforeach() +if (WITH_GPU AND NOT WIN32) + op_library(dgc_op DEPS dgc) + file(APPEND ${pybind_file} "USE_CUDA_ONLY_OP(dgc);\n") + set(COMMON_OP_DEPS ${COMMON_OP_DEPS} dgc) +endif() -file(APPEND ${pybind_file} "USE_OP(less_than);\nUSE_OP(logical_and);\nUSE_NO_KERNEL_OP(read_from_array);\n") -if (NOT WIN32) -add_subdirectory(reader) -endif(NOT WIN32) -foreach(src ${READER_LIBRARY}) - set(OP_LIBRARY ${src} ${OP_LIBRARY}) -endforeach() +set(COMMON_OP_DEPS ${COMMON_OP_DEPS} selected_rows_functor selected_rows lod_tensor maxouting unpooling pooling lod_rank_table context_project sequence_pooling executor) +set(COMMON_OP_DEPS ${COMMON_OP_DEPS} dynload_warpctc) +set(COMMON_OP_DEPS ${COMMON_OP_DEPS} sequence_padding sequence_scale cos_sim_functor memory jit_kernel_helper concat_and_split cross_entropy softmax vol2col im2col sampler sample_prob tree2col) +set(COMMON_OP_DEPS ${COMMON_OP_DEPS} sequence2batch lstm_compute matrix_bit_code gru_compute activation_functions beam_search) +if (WITH_GPU) + set(COMMON_OP_DEPS ${COMMON_OP_DEPS} depthwise_conv prelu) +endif() -add_subdirectory(detection) -foreach(src ${DETECTION_LIBRARY}) - set(OP_LIBRARY ${src} ${OP_LIBRARY}) -endforeach() +# FIXME(typhoonzero): operator deps may not needed. +# op_library(lod_tensor_to_array_op DEPS lod_rank_table_op) +# op_library(array_to_lod_tensor_op DEPS lod_rank_table_op) +# op_library(unsqueeze_op DEPS reshape_op) +# op_library(squeeze_op DEPS reshape_op) +# op_library(flatten_op DEPS reshape_op) +# op_library(unstack_op DEPS stack_op) +# op_library(tensor_array_to_tensor_op DEPS concat_op) -set(GLOB_OP_LIB ${OP_LIBRARY} CACHE INTERNAL "Global OP library") -set(GLOB_DISTRIBUTE_DEPS ${DISTRIBUTE_DEPS} CACHE INTERNAL "distributed dependency") +set(OPERATOR_DEPS ${OPERATOR_DEPS} ${COMMON_OP_DEPS}) +set(GLOB_OPERATOR_DEPS ${OPERATOR_DEPS} CACHE INTERNAL "Global Op dependencies") cc_test(gather_test SRCS gather_test.cc DEPS tensor) -cc_test(scatter_test SRCS scatter_test.cc DEPS tensor) +cc_test(scatter_test SRCS scatter_test.cc DEPS tensor math_function) cc_test(beam_search_decode_op_test SRCS beam_search_decode_op_test.cc DEPS lod_tensor) -cc_test(beam_search_op_test SRCS beam_search_op_test.cc DEPS lod_tensor beam_search_op) cc_test(strided_memcpy_test SRCS strided_memcpy_test.cc DEPS tensor memory) cc_test(save_load_op_test SRCS save_load_op_test.cc DEPS save_op load_op) cc_test(save_load_combine_op_test SRCS save_load_combine_op_test.cc DEPS save_combine_op load_combine_op) -if(NOT WIN32) -nv_test(nccl_op_test SRCS nccl_op_test.cu.cc DEPS nccl_op gpu_info device_context) -endif() nv_test(dropout_op_test SRCS dropout_op_test.cc DEPS dropout_op tensor) + +if (WITH_PYTHON) + cc_library(py_func_op SRCS py_func_op.cc DEPS op_registry python pybind) +endif() + +set(GLOB_OP_LIB ${OP_LIBRARY} CACHE INTERNAL "Global OP library") +add_subdirectory(benchmark) diff --git a/paddle/fluid/operators/accuracy_op.cc b/paddle/fluid/operators/accuracy_op.cc deleted file mode 100644 index 42fcace17926641b5caf677eb3c8ba5222e37190..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/accuracy_op.cc +++ /dev/null @@ -1,102 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/operators/accuracy_op.h" - -namespace paddle { -namespace operators { - -class AccuracyOp : public framework::OperatorWithKernel { - public: - using framework::OperatorWithKernel::OperatorWithKernel; - - void InferShape(framework::InferShapeContext *ctx) const override { - PADDLE_ENFORCE(ctx->HasInput("Out"), - "Input (Out) of accuracy op should not be null."); - PADDLE_ENFORCE(ctx->HasInput("Indices"), - "Input (Indices) of accuracy op should not be null."); - PADDLE_ENFORCE(ctx->HasInput("Label"), - "Input (Label) of accuracy op should not be null."); - PADDLE_ENFORCE(ctx->HasOutput("Accuracy"), - "Output (Accuracy) of AccuracyOp should not be null."); - PADDLE_ENFORCE(ctx->HasOutput("Correct"), - "Output (Correct) of AccuracyOp should not be null."); - PADDLE_ENFORCE(ctx->HasOutput("Total"), - "Output (Total) of AccuracyOp should not be null."); - - auto inference_dim = ctx->GetInputDim("Out"); - auto label_dim = ctx->GetInputDim("Label"); - // Assume indices has same shape as inference, because - // it's the output of topk. - - PADDLE_ENFORCE_EQ(label_dim.size(), 2, "label's rank must be 2."); - PADDLE_ENFORCE_EQ(label_dim[1], 1, "label's second dimension must be 1"); - PADDLE_ENFORCE_EQ(inference_dim[0], label_dim[0], - "the inference tensor's num_rows must be" - " the same as label."); - - ctx->SetOutputDim("Accuracy", {1}); - ctx->SetOutputDim("Correct", {1}); - ctx->SetOutputDim("Total", {1}); - ctx->ShareLoD("Out", /*->*/ "Accuracy"); - } - - protected: - framework::OpKernelType GetExpectedKernelType( - const framework::ExecutionContext &ctx) const override { - return framework::OpKernelType( - framework::ToDataType(ctx.Input("Out")->type()), - ctx.GetPlace()); - } -}; - -class AccuracyOpMaker : public framework::OpProtoAndCheckerMaker { - public: - void Make() override { - // TODO(typhoonzero): support both inference value and indices. - AddInput("Out", "The network output of topk (inferences)"); - AddInput("Indices", "The the network output of topk (indices)"); - AddInput("Label", "Label of the training data"); - // TODO(typhoonzero): AddInput("Weight", ... - AddOutput("Accuracy", "The accuracy of current batch"); - AddOutput("Correct", "The correct samples count of current batch"); - AddOutput("Total", "The samples count of current batch"); - - AddComment(R"DOC( -Accuracy Operator. - -It will print accuracy rate for classification. -The accuracy is calculated as follows: - -$$accuracy = \frac{NumOfCorrectPredicts}{NumOfAllSamples}$$ - -Both the input Out and Label can carry the LoD (Level of Details) -information, or not. But the output only shares the LoD information -with the input Out(Inference). - -)DOC"); - } -}; - -} // namespace operators -} // namespace paddle - -namespace ops = paddle::operators; -REGISTER_OPERATOR(accuracy, ops::AccuracyOp, ops::AccuracyOpMaker, - paddle::framework::EmptyGradOpMaker); -// FIXME(typhoonzero): types of T is for infernece data. -// label data is always int. -REGISTER_OP_CPU_KERNEL(accuracy, - ops::AccuracyKernel, - ops::AccuracyKernel); diff --git a/paddle/fluid/operators/accuracy_op.cu b/paddle/fluid/operators/accuracy_op.cu deleted file mode 100644 index 23b48c6fdf427348879de07c671c65327d6436d7..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/accuracy_op.cu +++ /dev/null @@ -1,99 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include -#include -#include "paddle/fluid/operators/accuracy_op.h" -#include "paddle/fluid/platform/cuda_primitives.h" -#include "paddle/fluid/platform/gpu_info.h" - -namespace paddle { -namespace operators { -using platform::PADDLE_CUDA_NUM_THREADS; - -template -__global__ void AccuracyCudaKernel(const int N, const int D, - const int64_t* Xdata, - const int64_t* labeldata, int* correct_data, - float* accuracy, int* total_data) { - int count = 0; - __shared__ int total[BlockSize]; - - // support only 1 block - for (int i = threadIdx.x; i < (N); i += BlockSize) { - for (int j = 0; j < D; ++j) { - if (Xdata[i * D + j] == labeldata[i]) { - ++count; - break; - } - } - } - total[threadIdx.x] = count; - __syncthreads(); - - // reduce the count with init value 0, and output accuracy. - int result = thrust::reduce(thrust::device, total, total + BlockSize, 0); - if (threadIdx.x == 0) { - *correct_data = result; - *accuracy = static_cast(result) / static_cast(N); - *total_data = N; - } -} - -template -class AccuracyOpCUDAKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& ctx) const override { - PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), - "It must use CUDAPlace."); - auto* inference = ctx.Input("Out"); - auto* indices = ctx.Input("Indices"); - auto* label = ctx.Input("Label"); - - auto* accuracy = ctx.Output("Accuracy"); - auto* correct = ctx.Output("Correct"); - auto* total = ctx.Output("Total"); - // FIXME(typhoonzero): only support indices currently - // if add support for output values, how to detect the data type? - const int64_t* indices_data = indices->data(); - const int64_t* label_data = label->data(); - - int* correct_data = correct->mutable_data(ctx.GetPlace()); - int* total_data = total->mutable_data(ctx.GetPlace()); - float* accuracy_data = accuracy->mutable_data(ctx.GetPlace()); - - int num_samples = static_cast(inference->dims()[0]); - size_t infer_width = inference->dims()[1]; - auto stream = ctx.cuda_device_context().stream(); - platform::GpuMemsetAsync(accuracy_data, 0, sizeof(float), stream); - - if (num_samples == 0) { - return; - } - - AccuracyCudaKernel< - PADDLE_CUDA_NUM_THREADS><<<1, PADDLE_CUDA_NUM_THREADS, 0, stream>>>( - num_samples, infer_width, indices_data, label_data, correct_data, - accuracy_data, total_data); - } -}; - -} // namespace operators -} // namespace paddle - -// FIXME(typhoonzero): types of T is for inference data. -// label data is always int64 -REGISTER_OP_CUDA_KERNEL(accuracy, - paddle::operators::AccuracyOpCUDAKernel, - paddle::operators::AccuracyOpCUDAKernel); diff --git a/paddle/fluid/operators/activation_cudnn.cu.cc b/paddle/fluid/operators/activation_cudnn.cu.cc new file mode 100644 index 0000000000000000000000000000000000000000..494c02374a9faa22486644c9b9c7d586c86d41b0 --- /dev/null +++ b/paddle/fluid/operators/activation_cudnn.cu.cc @@ -0,0 +1,40 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/activation_op.h" +#include "paddle/fluid/platform/cudnn_desc.h" + +namespace paddle { +namespace operators { +using framework::Tensor; +using platform::ActivationDescriptor; +using platform::TensorDescriptor; + +template +class CudnnActivationKernel + : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& context) const override { + framework::Tensor *X, *Out; + ExtractActivationTensor(context, X, Out); + ActivationDescriptor act_desc; + TensorDescriptor x_desc, out_desc; + x_desc.set(detail::Ref(X)); + out_desc.set(detail::Ref(Out)); + } +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/activation_cudnn_op.cu.cc b/paddle/fluid/operators/activation_cudnn_op.cu.cc new file mode 100644 index 0000000000000000000000000000000000000000..f03355eb441f99b54d78fe90bcb3bea116db58f1 --- /dev/null +++ b/paddle/fluid/operators/activation_cudnn_op.cu.cc @@ -0,0 +1,189 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include +#include +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/activation_op.h" +#include "paddle/fluid/platform/cudnn_desc.h" + +namespace paddle { +namespace operators { +using framework::Tensor; +using platform::ActivationDescriptor; +using platform::TensorDescriptor; +using platform::CUDADeviceContext; + +template +struct CudnnActivationFunctor { + using ELEMENT_TYPE = T; + CudnnActivationFunctor(const CUDADeviceContext& ctx, const T& c, + const cudnnActivationMode_t& m) + : ctx_(ctx), coef_(c), mode_(m) {} + void operator()(const Tensor& x, Tensor* out) { + ActivationDescriptor act_desc; + act_desc.set(mode_, coef_); + TensorDescriptor x_desc, out_desc; + x_desc.set(x); + out_desc.set(detail::Ref(out)); + PADDLE_ENFORCE(platform::dynload::cudnnActivationForward( + ctx_.cudnn_handle(), act_desc.desc(), + platform::CudnnDataType::kOne(), x_desc.desc(), x.data(), + platform::CudnnDataType::kZero(), out_desc.desc(), + out->mutable_data(ctx_.GetPlace()))); + } + const CUDADeviceContext& ctx_; + const T coef_; + const cudnnActivationMode_t mode_; +}; + +template +struct CudnnActivationGradFunctor { + using ELEMENT_TYPE = T; + CudnnActivationGradFunctor(const CUDADeviceContext& ctx, const T& c, + const cudnnActivationMode_t& m) + : ctx_(ctx), coef_(c), mode_(m) {} + void operator()(const Tensor& x, const Tensor& out, const Tensor dout, + Tensor* dx) { + ActivationDescriptor act_desc; + act_desc.set(mode_, coef_); + TensorDescriptor x_desc, out_desc, dout_desc, dx_desc; + x_desc.set(x); + out_desc.set(out); + dout_desc.set(dout); + dx_desc.set(detail::Ref(dx)); + PADDLE_ENFORCE(platform::dynload::cudnnActivationBackward( + ctx_.cudnn_handle(), act_desc.desc(), + platform::CudnnDataType::kOne(), out_desc.desc(), out.data(), + dout_desc.desc(), dout.data(), x_desc.desc(), x.data(), + platform::CudnnDataType::kZero(), dx_desc.desc(), + dx->mutable_data(ctx_.GetPlace()))); + } + const CUDADeviceContext& ctx_; + const T coef_; + const cudnnActivationMode_t mode_; +}; + +template +struct CudnnReluFunctor : public CudnnActivationFunctor { + explicit CudnnReluFunctor(const CUDADeviceContext& ctx) + : CudnnActivationFunctor(ctx, 0.0, CUDNN_ACTIVATION_RELU) {} +}; +template +struct CudnnReluGradFunctor : public CudnnActivationGradFunctor { + explicit CudnnReluGradFunctor(const CUDADeviceContext& ctx) + : CudnnActivationGradFunctor(ctx, 0.0, CUDNN_ACTIVATION_RELU) {} + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepOut; } +}; + +template +struct CudnnRelu6Functor : public CudnnActivationFunctor { + explicit CudnnRelu6Functor(const CUDADeviceContext& ctx) + : CudnnActivationFunctor(ctx, 6.0, CUDNN_ACTIVATION_CLIPPED_RELU) {} +}; +template +struct CudnnRelu6GradFunctor : public CudnnActivationGradFunctor { + explicit CudnnRelu6GradFunctor(const CUDADeviceContext& ctx) + : CudnnActivationGradFunctor(ctx, 6.0, CUDNN_ACTIVATION_CLIPPED_RELU) { + } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepOut; } +}; + +template +struct CudnnSigmoidFunctor : public CudnnActivationFunctor { + explicit CudnnSigmoidFunctor(const CUDADeviceContext& ctx) + : CudnnActivationFunctor(ctx, 0.0, CUDNN_ACTIVATION_SIGMOID) {} +}; +template +struct CudnnSigmoidGradFunctor : public CudnnActivationGradFunctor { + explicit CudnnSigmoidGradFunctor(const CUDADeviceContext& ctx) + : CudnnActivationGradFunctor(ctx, 0.0, CUDNN_ACTIVATION_SIGMOID) {} + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepOut; } +}; + +template +struct CudnnTanhFunctor : public CudnnActivationFunctor { + explicit CudnnTanhFunctor(const CUDADeviceContext& ctx) + : CudnnActivationFunctor(ctx, 0.0, CUDNN_ACTIVATION_TANH) {} +}; +template +struct CudnnTanhGradFunctor : public CudnnActivationGradFunctor { + explicit CudnnTanhGradFunctor(const CUDADeviceContext& ctx) + : CudnnActivationGradFunctor(ctx, 0.0, CUDNN_ACTIVATION_TANH) {} + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepOut; } +}; + +template +class CudnnActivationKernel + : public framework::OpKernel { + public: + using T = typename Functor::ELEMENT_TYPE; + void Compute(const framework::ExecutionContext& context) const override { + const framework::Tensor* X = nullptr; + framework::Tensor* Out = nullptr; + ExtractActivationTensor(context, &X, &Out); + Out->mutable_data(context.GetPlace()); + auto& dev_ctx = context.template device_context(); + Functor functor(dev_ctx); + functor(detail::Ref(X), Out); + } +}; + +template +class CudnnActivationGradKernel + : public framework::OpKernel { + public: + using T = typename Functor::ELEMENT_TYPE; + void Compute(const framework::ExecutionContext& context) const override { + static_assert(Functor::FwdDeps() == kDepOut, "Forward deps must be Out."); + + const framework::Tensor *X, *Out, *dOut; + X = Out = dOut = nullptr; + framework::Tensor* dX = nullptr; + ExtractActivationGradTensor(context, &X, &Out, &dOut, + &dX); + dX->mutable_data(context.GetPlace()); + auto& dev_ctx = context.template device_context(); + Functor functor(dev_ctx); + functor(detail::Ref(X), detail::Ref(Out), detail::Ref(dOut), dX); + } +}; + +} // namespace operators +} // namespace paddle + +namespace plat = paddle::platform; +namespace ops = paddle::operators; + +#define FOR_EACH_CUDNN_OP_FUNCTOR(__macro) \ + __macro(relu, CudnnReluFunctor, CudnnReluGradFunctor); \ + __macro(relu6, CudnnRelu6Functor, CudnnRelu6GradFunctor); \ + __macro(sigmoid, CudnnTanhFunctor, CudnnTanhGradFunctor); \ + __macro(tanh, CudnnTanhFunctor, CudnnTanhGradFunctor) + +#define REGISTER_ACTIVATION_CUDNN_KERNEL(act_type, functor, grad_functor) \ + REGISTER_OP_KERNEL(act_type, CUDNN, plat::CUDAPlace, \ + ops::CudnnActivationKernel>, \ + ops::CudnnActivationKernel>); \ + REGISTER_OP_KERNEL( \ + act_type##_grad, CUDNN, plat::CUDAPlace, \ + ops::CudnnActivationGradKernel>, \ + ops::CudnnActivationGradKernel>); + +FOR_EACH_CUDNN_OP_FUNCTOR(REGISTER_ACTIVATION_CUDNN_KERNEL); diff --git a/paddle/fluid/operators/activation_mkldnn_op.cc b/paddle/fluid/operators/activation_mkldnn_op.cc deleted file mode 100644 index 137bca5e2b8e2754aed274970e08b03ee816a7f2..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/activation_mkldnn_op.cc +++ /dev/null @@ -1,349 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve. - - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. */ - -#include "paddle/fluid/operators/activation_op.h" -#include "paddle/fluid/platform/mkldnn_helper.h" - -namespace paddle { -namespace operators { - -using framework::DataLayout; -using framework::Tensor; -using mkldnn::memory; -using mkldnn::primitive; -using mkldnn::stream; -using platform::GetMKLDNNFormat; -using platform::MKLDNNDeviceContext; -using platform::to_void_cast; - -namespace { -std::string gethash(const mkldnn::memory::dims &operand_dims, - const mkldnn::algorithm algorithm) { - auto dim2str = [](const mkldnn::memory::dims &operand_dims) { - std::string dstr = ""; - for (size_t i = 0; i < operand_dims.size(); ++i) { - dstr += std::to_string(operand_dims[i]) + "-"; - } - return dstr; - }; - return dim2str(operand_dims) + std::to_string(algorithm); -} -} // namespace - -template -class MKLDNNActivationKernel - : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext &ctx) const override { - const auto *x = ctx.Input("X"); - PADDLE_ENFORCE(x->layout() == DataLayout::kMKLDNN && - x->format() != memory::format::format_undef, - "Wrong layout/format set for Input x tensor"); - - Functor functor; - - auto attrs = functor.GetAttrs(); - for (auto &attr : attrs) { - *attr.second = ctx.Attr(attr.first); - } - functor(ctx); - } -}; - -template -class MKLDNNActivationGradKernel - : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext &ctx) const override { - const auto *diff_y = ctx.Input(framework::GradVarName("Out")); - PADDLE_ENFORCE(diff_y->layout() == DataLayout::kMKLDNN && - diff_y->format() != memory::format::format_undef, - "Wrong layout/format set for Input OutGrad tensor"); - - Functor functor; - - auto attrs = functor.GetAttrs(); - for (auto &attr : attrs) { - *attr.second = ctx.Attr(attr.first); - } - functor(ctx); - } -}; - -template -void eltwise_forward(const framework::ExecutionContext &ctx, - mkldnn::algorithm algorithm, const T alpha = 0, - const T beta = 0) { - PADDLE_ENFORCE(paddle::platform::is_cpu_place(ctx.GetPlace()), - "It must use CPUPlace."); - auto &dev_ctx = ctx.template device_context(); - const auto &mkldnn_engine = dev_ctx.GetEngine(); - - const auto *x = ctx.Input("X"); - auto *y = ctx.Output("Out"); - - const T *x_data = x->data(); - T *y_data = y->mutable_data(ctx.GetPlace()); - - PADDLE_ENFORCE(x->dims().size() == 2 || x->dims().size() == 4, - "Input dim must be with 2 or 4"); - - std::vector src_tz = framework::vectorize2int(x->dims()); - - auto src_format = - src_tz.size() == 2 ? mkldnn::memory::format::nc : x->format(); - - const std::string key = gethash(src_tz, algorithm); - const std::string key_src_data = - key + ctx.op().Output("Out") + "@eltwise_fwd_src_data"; - const std::string key_src_layout = - key + ctx.op().Output("Out") + "@eltwise_fwd_src_layout"; - const std::string key_with_layout = key + std::to_string(src_format); - const std::string key_src_mem = key_with_layout + "@eltwise_fwd_src_mem"; - const std::string key_dst_mem = key_with_layout + "@eltwise_fwd_dst_mem"; - const std::string key_fwd = key_with_layout + "@eltwise_fwd"; - const std::string key_fwd_pd = key_with_layout + "@eltwise_fwd_pd"; - - // save input data and layout to be referred in backward path - auto p_src_data = std::make_shared(x_data); - dev_ctx.SetBlob(key_src_data, p_src_data); - auto p_src_layout = std::make_shared(src_format); - dev_ctx.SetBlob(key_src_layout, p_src_layout); - - auto p_fwd = std::static_pointer_cast( - dev_ctx.GetBlob(key_fwd)); - - std::shared_ptr dst_memory; - - if (p_fwd == nullptr) { - // create mkldnn memory for input X - auto src_md = platform::MKLDNNMemDesc( - src_tz, platform::MKLDNNGetDataType(), src_format); - auto src_memory = std::shared_ptr( - new memory({src_md, mkldnn_engine}, to_void_cast(x_data))); - // save src_memory to be referred in backward path - dev_ctx.SetBlob(key_src_mem, src_memory); - - // create primitive descriptor for activation forward and save it - auto forward_desc = mkldnn::eltwise_forward::desc( - mkldnn::prop_kind::forward_training, algorithm, - src_memory->get_primitive_desc().desc(), alpha, beta); - auto forward_pd = std::make_shared( - forward_desc, mkldnn_engine); - - // save prim desc into global device context to be referred in backward path - dev_ctx.SetBlob(key_fwd_pd, forward_pd); - - // create mkldnn memory for output y - dst_memory = - std::make_shared(forward_pd->dst_primitive_desc(), y_data); - - dev_ctx.SetBlob(key_dst_mem, dst_memory); - - // create activation primitive - p_fwd = std::make_shared(*forward_pd, *src_memory, - *dst_memory); - dev_ctx.SetBlob(key_fwd, p_fwd); - } else { - // primitives already exist - auto src_memory = - std::static_pointer_cast(dev_ctx.GetBlob(key_src_mem)); - PADDLE_ENFORCE(src_memory != nullptr, - "Fail to find eltwise src_memory in device context."); - dst_memory = - std::static_pointer_cast(dev_ctx.GetBlob(key_dst_mem)); - PADDLE_ENFORCE(dst_memory != nullptr, - "Fail to find eltwise dst_memory in device context."); - - src_memory->set_data_handle(platform::to_void_cast(x_data)); - dst_memory->set_data_handle(y_data); - } - - // push primitive to stream and wait until it's executed - std::vector pipeline; - pipeline.push_back(*p_fwd); - stream(stream::kind::eager).submit(pipeline).wait(); - - y->set_layout(DataLayout::kMKLDNN); - y->set_format(GetMKLDNNFormat(*dst_memory)); -} - -template -void eltwise_grad(const framework::ExecutionContext &ctx, - mkldnn::algorithm algorithm, const T alpha = 0, - const T beta = 0) { - auto &dev_ctx = ctx.template device_context(); - const auto &mkldnn_engine = dev_ctx.GetEngine(); - - const auto *diff_y = ctx.Input(framework::GradVarName("Out")); - auto *diff_x = ctx.Output(framework::GradVarName("X")); - - const T *diff_y_data = diff_y->data(); - T *diff_x_data = diff_x->mutable_data(ctx.GetPlace()); - - std::vector diff_dst_tz = framework::vectorize2int(diff_y->dims()); - - auto diff_y_format = - diff_dst_tz.size() == 2 ? mkldnn::memory::format::nc : diff_y->format(); - - const std::string key = gethash(diff_dst_tz, algorithm); - const std::string key_src_data = - key + ctx.op().Input("Out") + "@eltwise_fwd_src_data"; - const std::string key_src_layout = - key + ctx.op().Input("Out") + "@eltwise_fwd_src_layout"; - const auto p_src_layout = - std::static_pointer_cast(dev_ctx.GetBlob(key_src_layout)); - const std::string key_src_mem = - key + std::to_string(*p_src_layout) + "@eltwise_fwd_src_mem"; - const std::string key_fwd_pd = - key + std::to_string(*p_src_layout) + "@eltwise_fwd_pd"; - const std::string key_with_layouts = - key + std::to_string(*p_src_layout) + "-" + std::to_string(diff_y_format); - const std::string key_diff_src_mem = - key_with_layouts + "@eltwise_diff_src_mem"; - const std::string key_diff_dst_mem = - key_with_layouts + "@eltwise_diff_dst_mem"; - const std::string key_grad = key_with_layouts + "@eltwise_grad"; - - const auto p_src_data = - std::static_pointer_cast(dev_ctx.GetBlob(key_src_data)); - - auto src_memory = - std::static_pointer_cast(dev_ctx.GetBlob(key_src_mem)); - PADDLE_ENFORCE(src_memory != nullptr, - "Fail to find src_memory in device context"); - src_memory->set_data_handle(*p_src_data.get()); - - std::shared_ptr diff_src_memory; - - auto p_grad = std::static_pointer_cast( - dev_ctx.GetBlob(key_grad)); - - if (p_grad == nullptr) { - // create mkldnn memory for input diff_y - auto diff_dst_md = platform::MKLDNNMemDesc( - diff_dst_tz, platform::MKLDNNGetDataType(), diff_y_format); - auto diff_dst_memory = std::shared_ptr( - new memory({diff_dst_md, mkldnn_engine}, to_void_cast(diff_y_data))); - dev_ctx.SetBlob(key_diff_dst_mem, diff_dst_memory); - - // retrieve eltwise primitive desc from device context - auto forward_pd = - std::static_pointer_cast( - dev_ctx.GetBlob(key_fwd_pd)); - PADDLE_ENFORCE(forward_pd != nullptr, - "Fail to find eltwise_fwd_pd in device context"); - - // ceate primitive descriptor for activation backward - auto backward_desc = mkldnn::eltwise_backward::desc( - algorithm, diff_dst_memory->get_primitive_desc().desc(), - src_memory->get_primitive_desc().desc(), alpha, beta); - auto backward_pd = mkldnn::eltwise_backward::primitive_desc( - backward_desc, mkldnn_engine, *forward_pd); - - // create mkldnn memory for output diff_src - diff_src_memory = std::make_shared( - backward_pd.diff_src_primitive_desc(), diff_x_data); - dev_ctx.SetBlob(key_diff_src_mem, diff_src_memory); - - // create activation backward primitive - p_grad = std::make_shared( - backward_pd, *src_memory, *diff_dst_memory, *diff_src_memory); - dev_ctx.SetBlob(key_grad, p_grad); - } else { - // primitives already exist - diff_src_memory = std::static_pointer_cast( - dev_ctx.GetBlob(key_diff_src_mem)); - auto diff_dst_memory = std::static_pointer_cast( - dev_ctx.GetBlob(key_diff_dst_mem)); - - diff_src_memory->set_data_handle( - platform::to_void_reinterpret_cast(diff_x_data)); - diff_dst_memory->set_data_handle( - platform::to_void_reinterpret_cast(diff_y_data)); - } - - // push primitive to stream and wait until it's executed - std::vector pipeline; - pipeline.push_back(*p_grad); - stream(stream::kind::eager).submit(pipeline).wait(); - - diff_x->set_layout(DataLayout::kMKLDNN); - diff_x->set_format(GetMKLDNNFormat(*diff_src_memory)); -} - -template -struct MKLDNNActivationFunc : public BaseActivationFunctor { - void operator()(const framework::ExecutionContext &ctx) const { - eltwise_forward(ctx, algorithm); - } -}; - -template -struct MKLDNNActivationGradFunc : public BaseActivationFunctor { - void operator()(const framework::ExecutionContext &ctx) const { - eltwise_grad(ctx, algorithm); - } -}; - -template -using ReluMKLDNNFunctor = - MKLDNNActivationFunc; - -template -using TanhMKLDNNFunctor = - MKLDNNActivationFunc; - -template -using SqrtMKLDNNFunctor = - MKLDNNActivationFunc; - -template -using AbsMKLDNNFunctor = - MKLDNNActivationFunc; - -template -using ReluMKLDNNGradFunctor = - MKLDNNActivationGradFunc; - -template -using TanhMKLDNNGradFunctor = - MKLDNNActivationGradFunc; - -template -using SqrtMKLDNNGradFunctor = - MKLDNNActivationGradFunc; - -template -using AbsMKLDNNGradFunctor = - MKLDNNActivationGradFunc; -} // namespace operators -} // namespace paddle - -namespace ops = paddle::operators; - -#define REGISTER_ACTIVATION_MKLDNN_KERNEL(act_type, functor, grad_functor) \ - REGISTER_OP_KERNEL(act_type, MKLDNN, ::paddle::platform::CPUPlace, \ - ops::MKLDNNActivationKernel>); \ - REGISTER_OP_KERNEL( \ - act_type##_grad, MKLDNN, ::paddle::platform::CPUPlace, \ - ops::MKLDNNActivationGradKernel>); - -#define FOR_EACH_MKLDNN_KERNEL_FUNCTOR(__macro) \ - __macro(relu, ReluMKLDNNFunctor, ReluMKLDNNGradFunctor); \ - __macro(tanh, TanhMKLDNNFunctor, TanhMKLDNNGradFunctor); \ - __macro(sqrt, SqrtMKLDNNFunctor, SqrtMKLDNNGradFunctor); \ - __macro(abs, AbsMKLDNNFunctor, AbsMKLDNNGradFunctor); - -FOR_EACH_MKLDNN_KERNEL_FUNCTOR(REGISTER_ACTIVATION_MKLDNN_KERNEL); diff --git a/paddle/fluid/operators/activation_op.cc b/paddle/fluid/operators/activation_op.cc index bbf52bea1358c32596ab6f14eeaa419735d19fc6..1e5d63fc11d1d81350525e2b3390a3ae44f00f8d 100644 --- a/paddle/fluid/operators/activation_op.cc +++ b/paddle/fluid/operators/activation_op.cc @@ -13,55 +13,106 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/operators/activation_op.h" +#include #include -#include "paddle/fluid/operators/mkldnn_activation_op.h" +#include +#include +#include +#include "paddle/fluid/operators/mkldnn/mkldnn_activation_op.h" #include "paddle/fluid/platform/port.h" +#ifdef PADDLE_WITH_CUDA +#include "paddle/fluid/platform/cudnn_helper.h" +#endif namespace paddle { namespace operators { using paddle::framework::Tensor; -#define REGISTER_ACTIVATION_OP_MAKER(OP_NAME, OP_COMMENT) \ - class OP_NAME##OpMaker \ - : public ::paddle::framework::OpProtoAndCheckerMaker { \ - public: \ - void Make() override { \ - AddInput("X", "Input of " #OP_NAME " operator"); \ - AddOutput("Out", "Output of " #OP_NAME " operator").Reuse("X"); \ - AddAttr("use_mkldnn", \ - "(bool, default false) Only used in mkldnn kernel") \ - .SetDefault(false); \ - AddComment(#OP_COMMENT); \ - } \ +template +static constexpr bool CanInplaceAct() { + return GradFunctor::FwdDeps() == kDepOut || GradFunctor::FwdDeps() == kNoDeps; +} + +std::unique_ptr> GetInplaceOpSet() { + std::unique_ptr> ret( + new std::unordered_set()); +#define INSERT_INTO_INPLACE_OP_SET(op_type, __omitted, fwd_functor, \ + bwd_functor) \ + if (CanInplaceAct>()) { \ + ret->insert(#op_type); \ } -#define REGISTER_ACTIVATION_OP_GRAD_MAKER(OP_NAME, KERNEL_TYPE) \ - class OP_NAME##GradMaker \ - : public ::paddle::framework::SingleGradOpDescMaker { \ + FOR_EACH_ACTIVATION_OP(INSERT_INTO_INPLACE_OP_SET); +#undef INSERT_INTO_INPLACE_OP_SET + return ret; +} + +#define REGISTER_ACTIVATION_OP_MAKER(OP_NAME, OP_COMMENT) \ + class OP_NAME##OpMaker \ + : public ::paddle::framework::OpProtoAndCheckerMaker { \ public: \ - using ::paddle::framework::SingleGradOpDescMaker::SingleGradOpDescMaker; \ - \ - protected: \ - std::unique_ptr<::paddle::framework::OpDesc> Apply() const override { \ - auto* op = new ::paddle::framework::OpDesc(); \ - op->SetType(#KERNEL_TYPE "_grad"); \ - op->SetInput("Out", Output("Out")); \ - op->SetInput(::paddle::framework::GradVarName("Out"), \ - OutputGrad("Out")); \ - \ - op->SetAttrMap(Attrs()); \ - \ - op->SetOutput(::paddle::framework::GradVarName("X"), InputGrad("X")); \ - return std::unique_ptr<::paddle::framework::OpDesc>(op); \ + void Make() override { \ + AddInput("X", "Input of " #OP_NAME " operator"); \ + AddOutput("Out", "Output of " #OP_NAME " operator"); \ + AddAttr("use_mkldnn", \ + "(bool, default false) Only used in mkldnn kernel") \ + .SetDefault(false); \ + AddAttr("use_cudnn", \ + "(bool, default false) Only used in cudnn kernel, need " \ + "install cudnn") \ + .SetDefault(false); \ + AddAttr( \ + "is_test", \ + "(bool, default false) Set to true for inference only, false " \ + "for training. Some layers may run faster when this is true.") \ + .SetDefault(false); \ + AddComment(OP_COMMENT); \ } \ } +template +class ActivationGradOpDescMaker : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + protected: + std::unique_ptr Apply() const override { + std::unique_ptr op(new framework::OpDesc()); + op->SetType(ForwardOpType() + "_grad"); + op->SetInput(framework::GradVarName("Out"), OutputGrad("Out")); + op->SetOutput(framework::GradVarName("X"), InputGrad("X")); + op->SetAttrMap(Attrs()); + + if (static_cast(kDepValue) & + static_cast(ActBwdOpFwdDeps::kDepX)) { + op->SetInput("X", Input("X")); + } + + if (static_cast(kDepValue) & + static_cast(ActBwdOpFwdDeps::kDepOut)) { + op->SetInput("Out", Output("Out")); + } + + return op; + } +}; + framework::OpKernelType GetKernelType(const framework::ExecutionContext& ctx, const framework::OperatorWithKernel& oper, const std::string& name) { framework::LibraryType library{framework::LibraryType::kPlain}; framework::DataLayout layout = framework::DataLayout::kAnyLayout; +// FIXME(liuwei1031) temporarily disable the code to unblock users +// TODO(liuwei1031) figure out the reason behind +// https://github.com/PaddlePaddle/Paddle/issues/16096 +// and re-enable this in the future +// #ifdef PADDLE_WITH_CUDA +// auto it1 = oper.Attrs().find("use_cudnn"); +// if (it1 != oper.Attrs().end() && platform::CanCUDNNBeUsed(ctx)) { +// library = framework::LibraryType::kCUDNN; +// } +// #endif #ifdef PADDLE_WITH_MKLDNN auto it = oper.Attrs().find("use_mkldnn"); if (library == framework::LibraryType::kPlain && it != oper.Attrs().end() && @@ -71,8 +122,8 @@ framework::OpKernelType GetKernelType(const framework::ExecutionContext& ctx, } #endif return framework::OpKernelType( - framework::ToDataType(ctx.Input(name)->type()), - ctx.GetPlace(), layout, library); + framework::GetDataTypeOfVar(ctx.InputVar(name)), ctx.GetPlace(), layout, + library); } class ActivationOp : public framework::OperatorWithKernel { @@ -91,16 +142,12 @@ class ActivationOp : public framework::OperatorWithKernel { } }; -class ActivationOpInferVarType : public framework::VarTypeInference { - public: - void operator()(const framework::OpDesc& op_desc, - framework::BlockDesc* block) const override { - auto x_name = op_desc.Input("X")[0]; - auto out_name = op_desc.Output("Out")[0]; - auto& x = block->FindRecursiveOrCreateVar(x_name); - auto& out = block->FindRecursiveOrCreateVar(out_name); - out.SetType(x.GetType()); - out.SetDataType(x.GetDataType()); +class ActivationOpInferVarType + : public framework::PassInDtypeAndVarTypeToOutput { + protected: + std::unordered_map GetInputOutputWithSameType() + const override { + return std::unordered_map{{"X", /*->*/ "Out"}}; } }; @@ -109,21 +156,22 @@ class ActivationOpGrad : public framework::OperatorWithKernel { using framework::OperatorWithKernel::OperatorWithKernel; void InferShape(framework::InferShapeContext* ctx) const override { - ctx->ShareDim("Out", framework::GradVarName("X")); - ctx->ShareLoD("Out", framework::GradVarName("X")); + auto out_grad_name = framework::GradVarName("Out"); + ctx->ShareDim(out_grad_name, framework::GradVarName("X")); + ctx->ShareLoD(out_grad_name, framework::GradVarName("X")); } protected: framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext& ctx) const override { - return GetKernelType(ctx, *this, "Out"); + return GetKernelType(ctx, *this, framework::GradVarName("Out")); } }; UNUSED constexpr char SigmoidDoc[] = R"DOC( Sigmoid Activation Operator -$$out = \frac{1}{1 + e^{-x}}$$ +$$out = \\frac{1}{1 + e^{-x}}$$ )DOC"; @@ -148,6 +196,13 @@ $out = \max(x, 0)$ )DOC"; +UNUSED constexpr char GeluDoc[] = R"DOC( +Gelu Activation Operator. + +$out = \\frac{1 + erf(\\frac{x}{\\sqrt{2}})}{2} x$ + +)DOC"; + UNUSED constexpr char TanhDoc[] = R"DOC( Tanh Activation Operator. @@ -165,6 +220,9 @@ $$out = x - \\frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$ UNUSED constexpr char SqrtDoc[] = R"DOC( Sqrt Activation Operator. +Please make sure legal input, when input a negative value closed to zero, +you should add a small epsilon(1e-12) to avoid negative number caused by numerical errors. + $out = \sqrt{x}$ )DOC"; @@ -179,14 +237,14 @@ $out = |x|$ UNUSED constexpr char CeilDoc[] = R"DOC( Ceil Activation Operator. -$out = ceil(x)$ +$out = \left \lceil x \right \rceil$ )DOC"; UNUSED constexpr char FloorDoc[] = R"DOC( Floor Activation Operator. -$out = floor(x)$ +$out = \left \lfloor x \right \rfloor$ )DOC"; @@ -244,10 +302,52 @@ $out = \ln(1 + e^{x})$ UNUSED constexpr char SoftsignDoc[] = R"DOC( Softsign Activation Operator. -$$out = \frac{x}{1 + |x|}$$ +$$out = \\frac{x}{1 + \|x\|}$$ )DOC"; +class AcosOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("X", "Input of acos operator"); + AddOutput("Out", "Output of acos operator"); + AddComment(R"DOC( +Arccosine Activation Operator. + +$$out = \cos^{-1}(x)$$ + +)DOC"); + } +}; + +class AsinOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("X", "Input of asin operator"); + AddOutput("Out", "Output of asin operator"); + AddComment(R"DOC( +Arcsine Activation Operator. + +$$out = \sin^{-1}(x)$$ + +)DOC"); + } +}; + +class AtanOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("X", "Input of atan operator"); + AddOutput("Out", "Output of atan operator"); + AddComment(R"DOC( +Arctanh Activation Operator. + +$$out = \tanh^{-1}(x)$$ + +)DOC"); + } +}; + class LeakyReluOpMaker : public framework::OpProtoAndCheckerMaker { public: void Make() override { @@ -273,7 +373,7 @@ class SoftShrinkOpMaker : public framework::OpProtoAndCheckerMaker { :strong:`Softshrink Activation Operator` .. math:: - out = \begin{cases} + out = \begin{cases} x - \lambda, \text{if } x > \lambda \\ x + \lambda, \text{if } x < -\lambda \\ 0, \text{otherwise} @@ -439,7 +539,7 @@ class HardSigmoidOpMaker : public framework::OpProtoAndCheckerMaker { AddComment(R"DOC( HardSigmoid Activation Operator. -Segment-wise linear approximation of sigmoid(https://arxiv.org/abs/1603.00391), +Segment-wise linear approximation of sigmoid(https://arxiv.org/abs/1603.00391), which is much faster than sigmoid. $out = \max(0, \min(1, slope * x + shift))$ @@ -471,6 +571,7 @@ REGISTER_ACTIVATION_OP_MAKER(Sigmoid, SigmoidDoc); REGISTER_ACTIVATION_OP_MAKER(LogSigmoid, LogSigmoidDoc); REGISTER_ACTIVATION_OP_MAKER(Exp, ExpDoc); REGISTER_ACTIVATION_OP_MAKER(Relu, ReluDoc); +REGISTER_ACTIVATION_OP_MAKER(Gelu, GeluDoc); REGISTER_ACTIVATION_OP_MAKER(Tanh, TanhDoc); REGISTER_ACTIVATION_OP_MAKER(TanhShrink, TanhShrinkDoc); REGISTER_ACTIVATION_OP_MAKER(Sqrt, SqrtDoc); @@ -485,72 +586,27 @@ REGISTER_ACTIVATION_OP_MAKER(Log, LogDoc); REGISTER_ACTIVATION_OP_MAKER(Square, SquareDoc); REGISTER_ACTIVATION_OP_MAKER(Softplus, SoftplusDoc); REGISTER_ACTIVATION_OP_MAKER(Softsign, SoftsignDoc); - -REGISTER_ACTIVATION_OP_GRAD_MAKER(Sigmoid, sigmoid); -REGISTER_ACTIVATION_OP_GRAD_MAKER(Relu, relu); -REGISTER_ACTIVATION_OP_GRAD_MAKER(Exp, exp); -REGISTER_ACTIVATION_OP_GRAD_MAKER(Tanh, tanh); -REGISTER_ACTIVATION_OP_GRAD_MAKER(Ceil, ceil); -REGISTER_ACTIVATION_OP_GRAD_MAKER(Floor, floor); -REGISTER_ACTIVATION_OP_GRAD_MAKER(Sqrt, sqrt); -REGISTER_ACTIVATION_OP_GRAD_MAKER(SoftRelu, soft_relu); -REGISTER_ACTIVATION_OP_GRAD_MAKER(Relu6, relu6); -REGISTER_ACTIVATION_OP_GRAD_MAKER(Reciprocal, reciprocal); -REGISTER_ACTIVATION_OP_GRAD_MAKER(HardSigmoid, hard_sigmoid); } // namespace operators } // namespace paddle namespace ops = paddle::operators; -#define FOR_EACH_INPLACE_OP_FUNCTOR(__macro) \ - __macro(Sigmoid, sigmoid); \ - __macro(Relu, relu); \ - __macro(Exp, exp); \ - __macro(Tanh, tanh); \ - __macro(Ceil, ceil); \ - __macro(Floor, floor); \ - __macro(Sqrt, sqrt); \ - __macro(SoftRelu, soft_relu); \ - __macro(Relu6, relu6); \ - __macro(Reciprocal, reciprocal); \ - __macro(HardSigmoid, hard_sigmoid); - -#define FOR_EACH_OP_FUNCTOR(__macro) \ - __macro(LogSigmoid, logsigmoid); \ - __macro(SoftShrink, softshrink); \ - __macro(Abs, abs); \ - __macro(Cos, cos); \ - __macro(Sin, sin); \ - __macro(Round, round); \ - __macro(Log, log); \ - __macro(Square, square); \ - __macro(BRelu, brelu); \ - __macro(Pow, pow); \ - __macro(STanh, stanh); \ - __macro(Softplus, softplus); \ - __macro(Softsign, softsign); \ - __macro(LeakyRelu, leaky_relu); \ - __macro(TanhShrink, tanh_shrink); \ - __macro(ELU, elu); \ - __macro(HardShrink, hard_shrink); \ - __macro(Swish, swish); \ - __macro(ThresholdedRelu, thresholded_relu); - -#define REGISTER_INPLACE_ACTIVATION_OP(OP_NAME, KERNEL_TYPE) \ - REGISTER_OPERATOR(KERNEL_TYPE, ::paddle::operators::ActivationOp, \ - ::paddle::operators::OP_NAME##OpMaker, \ - ::paddle::operators::ActivationOpInferVarType, \ - ::paddle::operators::OP_NAME##GradMaker); \ - REGISTER_OPERATOR(KERNEL_TYPE##_grad, ::paddle::operators::ActivationOpGrad) - -#define REGISTER_ACTIVATION_OP(OP_NAME, KERNEL_TYPE) \ - REGISTER_OPERATOR(KERNEL_TYPE, ::paddle::operators::ActivationOp, \ - ::paddle::operators::OP_NAME##OpMaker, \ - ::paddle::operators::ActivationOpInferVarType, \ - ::paddle::framework::DefaultGradOpDescMaker); \ - REGISTER_OPERATOR(KERNEL_TYPE##_grad, ::paddle::operators::ActivationOpGrad) - -#define REGISTER_ACTIVATION_CPU_KERNEL(act_type, functor, grad_functor) \ +#define REGISTER_ACTIVATION_OP(KERNEL_TYPE, OP_NAME, functor, grad_functor) \ + REGISTER_OPERATOR( \ + KERNEL_TYPE, ops::ActivationOp, ops::OP_NAME##OpMaker, \ + ops::ActivationOpInferVarType, \ + ops::ActivationGradOpDescMaker::FwdDeps()>, \ + std::conditional>(), \ + ::paddle::framework::SingleOpInplaceInToOut, \ + void>::type); \ + REGISTER_OPERATOR( \ + KERNEL_TYPE##_grad, ops::ActivationOpGrad, \ + std::conditional>(), \ + ::paddle::framework::SingleOpInplaceInToOut, \ + void>::type) + +#define REGISTER_ACTIVATION_CPU_KERNEL(act_type, op_name, functor, \ + grad_functor) \ REGISTER_OP_CPU_KERNEL( \ act_type, ops::ActivationKernel>, \ @@ -563,6 +619,5 @@ namespace ops = paddle::operators; ops::ActivationGradKernel>); -FOR_EACH_OP_FUNCTOR(REGISTER_ACTIVATION_OP); -FOR_EACH_INPLACE_OP_FUNCTOR(REGISTER_INPLACE_ACTIVATION_OP); -FOR_EACH_KERNEL_FUNCTOR(REGISTER_ACTIVATION_CPU_KERNEL); +FOR_EACH_ACTIVATION_OP(REGISTER_ACTIVATION_OP); +FOR_EACH_ACTIVATION_OP(REGISTER_ACTIVATION_CPU_KERNEL); diff --git a/paddle/fluid/operators/activation_op.cu b/paddle/fluid/operators/activation_op.cu index 27487b396ccf63d962defa6b270063ccb409164e..9c7a8d8971cba4090db1bbc32c7eabf2285e7eff 100644 --- a/paddle/fluid/operators/activation_op.cu +++ b/paddle/fluid/operators/activation_op.cu @@ -15,7 +15,8 @@ limitations under the License. */ namespace ops = paddle::operators; namespace plat = paddle::platform; -#define REGISTER_ACTIVATION_CUDA_KERNEL(act_type, functor, grad_functor) \ +#define REGISTER_ACTIVATION_CUDA_KERNEL(act_type, op_name, functor, \ + grad_functor) \ REGISTER_OP_CUDA_KERNEL( \ act_type, \ ops::ActivationKernel>, \ @@ -26,6 +27,8 @@ namespace plat = paddle::platform; act_type##_grad, ops::ActivationGradKernel>, \ ops::ActivationGradKernel>); + ops::grad_functor>, \ + ops::ActivationGradKernel>); -FOR_EACH_KERNEL_FUNCTOR(REGISTER_ACTIVATION_CUDA_KERNEL); +FOR_EACH_ACTIVATION_OP(REGISTER_ACTIVATION_CUDA_KERNEL); diff --git a/paddle/fluid/operators/activation_op.h b/paddle/fluid/operators/activation_op.h index 2e31d1c9c708225135e27c93ba94722794c4b282..915632a328feb99c021ec062a9b22a04623eff4a 100644 --- a/paddle/fluid/operators/activation_op.h +++ b/paddle/fluid/operators/activation_op.h @@ -11,14 +11,22 @@ limitations under the License. */ #pragma once #include +#include +#include #include #include #include #include +#include +#ifndef _USE_MATH_DEFINES +#define _USE_MATH_DEFINES +#endif + #include "paddle/fluid/framework/eigen.h" #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/operators/detail/safe_ref.h" +#include "paddle/fluid/operators/math/blas.h" #include "paddle/fluid/platform/float16.h" #ifdef PADDLE_WITH_MKLDNN @@ -28,15 +36,133 @@ limitations under the License. */ namespace paddle { namespace operators { -/* Use ugly global variable, for the using in python layer side - Please refer to the layer_helper.py and get the details. - */ -static std::unordered_set InplaceOpSet = { - "sigmoid", "exp", "relu", "tanh", "sqrt", "ceil", - "floor", "reciprocal", "relu6", "soft_relu", "hard_sigmoid", +enum ActBwdOpFwdDeps { + kNoDeps = 0x00, // Do not need any forward input/output + kDepX = 0x01, // Only need forward input X + kDepOut = 0x02, // Only need forward output Out + + // Never add kDepXOut, because Out can be always calculated + // by forward input X in backward part. + // FIXME(zjl): but in MKLDNN abs, X and Out are all needed... + // Developers should not rely on this enum value! + kDepXOut = 0x03 }; -static bool IsInplace(std::string op) { return InplaceOpSet.count(op); } +std::unique_ptr> GetInplaceOpSet(); + +static bool IsInplace(const std::string& op) { + static auto InplaceOpSet = GetInplaceOpSet(); + bool inplace = InplaceOpSet->count(op); + // for op_grad + const int kGradSuffixLen = 4; + if (op.size() > kGradSuffixLen && + op.compare(op.size() - kGradSuffixLen - 1, kGradSuffixLen, "grad")) { + inplace = + InplaceOpSet->count(op.substr(0, op.size() - (kGradSuffixLen + 1))); + } + return inplace; +} + +/* The following operator can be used to process SelectedRows, because the + * output of those operator for zero is zero too. + */ +static std::unordered_set CanBeUsedBySelectedRows = { + "abs", "abs_grad", "square", "square_grad", "sqrt", "sqrt_grad"}; + +inline void ExtractActivationTensor(const framework::ExecutionContext& context, + const framework::Tensor** X, + framework::Tensor** Out) { + auto x_var = context.InputVar("X"); + auto out_var = context.OutputVar("Out"); + PADDLE_ENFORCE(x_var != nullptr, + "Cannot get input Variable X, variable name = %s", + context.op().Input("X")); + PADDLE_ENFORCE(out_var != nullptr, + "Cannot get output Variable Out, variable name = %s", + context.op().Output("Out")); + if (CanBeUsedBySelectedRows.count(context.op().Type())) { + *X = paddle::framework::GetLoDTensorOrSelectedRowsValueFromVar(*x_var); + *Out = paddle::framework::GetMutableLoDTensorOrSelectedRowsValueFromVar( + out_var); + } else { + *X = context.Input("X"); + *Out = context.Output("Out"); + } + + PADDLE_ENFORCE(*Out != nullptr, + "Cannot get output tensor Out, variable name = %s", + context.op().Output("Out")); +} + +template +inline void ExtractActivationGradTensor( + const framework::ExecutionContext& context, const framework::Tensor** X, + const framework::Tensor** Out, const framework::Tensor** dOut, + framework::Tensor** dX) { + auto out_grad_var = context.InputVar(framework::GradVarName("Out")); + auto x_grad_var = context.OutputVar(framework::GradVarName("X")); + const framework::Variable* out_var = nullptr; + + if (static_cast(kDepValue) & static_cast(kDepOut)) { + out_var = context.InputVar("Out"); + PADDLE_ENFORCE(out_var != nullptr, + "Cannot get input Variable Out, variable name = %s", + context.op().Input("Out")); + } + PADDLE_ENFORCE(out_grad_var != nullptr, + "Cannot get input Variable %s, variable name = %s", + framework::GradVarName("Out"), + context.op().Input(framework::GradVarName("Out"))); + PADDLE_ENFORCE(x_grad_var != nullptr, + "Cannot get output Variable %s, variable name = %s", + framework::GradVarName("X"), + context.op().Output(framework::GradVarName("X"))); + + if (CanBeUsedBySelectedRows.count(context.op().Type())) { + *dOut = paddle::framework::GetLoDTensorOrSelectedRowsValueFromVar( + *out_grad_var); + *dX = paddle::framework::GetMutableLoDTensorOrSelectedRowsValueFromVar( + x_grad_var); + + if (out_var) { + *Out = + paddle::framework::GetLoDTensorOrSelectedRowsValueFromVar(*out_var); + } else { + *Out = *dOut; // fake out + } + + } else { + *Out = context.Input("Out"); + *dOut = context.Input(framework::GradVarName("Out")); + *dX = context.Output(framework::GradVarName("X")); + + if (out_var) { + *Out = &(out_var->Get()); + } else { + *Out = *dOut; // fake out + } + } + + PADDLE_ENFORCE(*dX != nullptr, + "Cannot get output tensor %s, variable name = %s", + framework::GradVarName("X"), + context.op().Output(framework::GradVarName("X"))); + + if (static_cast(kDepValue) & static_cast(kDepX)) { + auto x_var = context.InputVar("X"); + PADDLE_ENFORCE(x_var != nullptr, + "Cannot get input tensor X, variable name = %s", + context.op().Input("X")); + if (CanBeUsedBySelectedRows.count(context.op().Type())) { + *X = paddle::framework::GetLoDTensorOrSelectedRowsValueFromVar(*x_var); + } else { + *X = context.Input("X"); + } + } else { + VLOG(10) << " Inplace activation of Op : " << context.op().Type(); + *X = *dX; + } +} template class ActivationKernel @@ -45,16 +171,13 @@ class ActivationKernel using T = typename Functor::ELEMENT_TYPE; void Compute(const framework::ExecutionContext& context) const override { - auto& X = detail::Ref(context.Input("X"), - "Cannot get input tensor X, variable name = %s", - context.op().Input("X")); - - auto& Out = detail::Ref(context.Output("Out"), - "Cannot get output tensor Out, variable name = %s", - context.op().Output("Out")); - Out.mutable_data(context.GetPlace()); - auto x = framework::EigenVector::Flatten(X); - auto out = framework::EigenVector::Flatten(Out); + const framework::Tensor* X = nullptr; + framework::Tensor* Out = nullptr; + ExtractActivationTensor(context, &X, &Out); + Out->mutable_data(context.GetPlace()); + + auto x = framework::EigenVector::Flatten(detail::Ref(X)); + auto out = framework::EigenVector::Flatten(detail::Ref(Out)); auto* place = context.template device_context().eigen_device(); Functor functor; @@ -73,15 +196,16 @@ class ActivationGradKernel public: using T = typename Functor::ELEMENT_TYPE; void Compute(const framework::ExecutionContext& context) const override { - auto* Out = context.Input("Out"); - auto* dOut = - context.Input(framework::GradVarName("Out")); - auto* dX = context.Output(framework::GradVarName("X")); + const framework::Tensor *X, *Out, *dOut; + framework::Tensor* dX = nullptr; + X = Out = dOut = nullptr; + ExtractActivationGradTensor(context, &X, &Out, &dOut, + &dX); dX->mutable_data(context.GetPlace()); - - auto dout = framework::EigenVector::Flatten(*dOut); - auto out = framework::EigenVector::Flatten(*Out); - auto dx = framework::EigenVector::Flatten(*dX); + auto dout = framework::EigenVector::Flatten(detail::Ref(dOut)); + auto out = framework::EigenVector::Flatten(detail::Ref(Out)); + auto dx = framework::EigenVector::Flatten(detail::Ref(dX)); + auto x = framework::EigenVector::Flatten(detail::Ref(X)); auto* place = context.template device_context().eigen_device(); Functor functor; @@ -89,16 +213,7 @@ class ActivationGradKernel for (auto& attr : attrs) { *attr.second = context.Attr(attr.first); } - bool inplace = functor.Inplace(); - if (!inplace) { - auto* X = context.Input("X"); - auto x = framework::EigenVector::Flatten(*X); - functor(*place, x, out, dout, dx); - } else { - VLOG(10) << " Inplace activation "; - auto x = framework::EigenVector::Flatten(*dX); - functor(*place, x, out, dout, dx); - } + functor(*place, x, out, dout, dx); } }; @@ -130,12 +245,13 @@ struct SigmoidFunctor : public BaseActivationFunctor { template struct SigmoidGradFunctor : public BaseActivationFunctor { - bool Inplace() const { return IsInplace("sigmoid"); } template void operator()(Device d, X x, Out out, dOut dout, dX dx) const { dx.device(d) = dout * out * (static_cast(1) - out); } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepOut; } }; // Originally: logsigmoid(x) = -log (1 + exp(-x)) @@ -172,6 +288,8 @@ struct LogSigmoidGradFunctor : public BaseActivationFunctor { dx.device(d) = dout * ((-x - temp).exp() / ((-temp).exp() + (-x - temp).exp())); } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepX; } }; // exp(x) = e^x @@ -185,12 +303,13 @@ struct ExpFunctor : public BaseActivationFunctor { template struct ExpGradFunctor : public BaseActivationFunctor { - bool Inplace() const { return IsInplace("exp"); } template void operator()(Device d, X x, Out out, dOut dout, dX dx) const { dx.device(d) = dout * out; } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepOut; } }; // relu(x) = max(x, 0) @@ -204,12 +323,59 @@ struct ReluFunctor : public BaseActivationFunctor { template struct ReluGradFunctor : public BaseActivationFunctor { - bool Inplace() const { return IsInplace("relu"); } template void operator()(Device d, X x, Out out, dOut dout, dX dx) const { dx.device(d) = dout * (out > static_cast(0)).template cast(); } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepOut; } +}; + +// gelu(x) = 0.5 * x * (1 + erf(x / sqrt(2))) +template +struct GeluFunctor : public BaseActivationFunctor { + template + void operator()(Device d, X x, Out out) const { +// Because the execute or device context can not be deliver here, it keep the +// marco for NVCC. +#if defined(PADDLE_WITH_MKLML) && !defined(_WIN32) && !defined(__APPLE__) && \ + !defined(__OSX__) && !defined(PADDLE_WITH_CUDA) + auto x_data = x.data(); + auto out_data = out.data(); + int n = std::min(x.size(), out.size()); + + std::memset(out_data, 0, n * sizeof(T)); + math::CBlas::AXPY(n, static_cast(M_SQRT1_2), x_data, 1, out_data, 1); + math::CBlas::VMERF(n, out_data, out_data, VML_LA); + for (int i = 0; i < n; i++) { + out_data[i] += static_cast(1); + } + math::CBlas::VMUL(n, x_data, out_data, out_data); + for (int i = 0; i < n; i++) { + out_data[i] *= static_cast(0.5); + } +#else + auto temp = (x * static_cast(M_SQRT1_2)).erf(); + out.device(d) = x * static_cast(0.5) * (static_cast(1) + temp); +#endif + } +}; + +template +struct GeluGradFunctor : BaseActivationFunctor { + template + void operator()(Device d, X x, Out out, dOut dout, dX dx) const { + auto first = static_cast(0.5) * + (static_cast(1) + ((x * static_cast(M_SQRT1_2)).erf())); + + auto second = static_cast(0.5 * M_2_SQRTPI * M_SQRT1_2) * x * + (-static_cast(0.5) * x.square()).exp(); + dx.device(d) = dout * (first + second); + } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepX; } }; // tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x)) @@ -223,12 +389,13 @@ struct TanhFunctor : public BaseActivationFunctor { template struct TanhGradFunctor : public BaseActivationFunctor { - bool Inplace() const { return IsInplace("tanh"); } template void operator()(Device d, X x, Out out, dOut dout, dX dx) const { dx.device(d) = dout * (static_cast(1) - out * out); } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepOut; } }; // tanhshrink(x) = x - tanh(x) @@ -248,6 +415,8 @@ struct TanhShrinkGradFunctor : public BaseActivationFunctor { void operator()(Device d, X x, Out out, dOut dout, dX dx) const { dx.device(d) = dout * (x.tanh() * x.tanh()); } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepX; } }; // tanhshrink(x) = x - tanh(x) @@ -282,6 +451,8 @@ struct HardShrinkGradFunctor : public BaseActivationFunctor { auto temp2 = (x > static_cast(threshold)).template cast().eval(); dx.device(d) = dout * (temp1 + temp2).template cast(); } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepX; } }; // softshrink(x) = x - lambda, if x > lambda; x + lambda, if x < -lambda; 0 @@ -316,6 +487,8 @@ struct SoftShrinkGradFunctor : public BaseActivationFunctor { auto temp2 = (x < -lambdaT).template cast().eval(); dx.device(d) = dout * (temp1 + temp2).template cast(); } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepX; } }; // sqrt(x) = x^(1/2) @@ -329,13 +502,13 @@ struct SqrtFunctor : public BaseActivationFunctor { template struct SqrtGradFunctor : public BaseActivationFunctor { - bool Inplace() const { return IsInplace("sqrt"); } template void operator()(Device d, X x, Out out, dOut dout, dX dx) const { - const Out out_conj = Eigen::numext::conj(out); - dx.device(d) = static_cast(0.5) * dout / out_conj; + dx.device(d) = static_cast(0.5) * dout / out; } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepOut; } }; // ceil(x) = ceiling(x) @@ -349,12 +522,13 @@ struct CeilFunctor : public BaseActivationFunctor { template struct ZeroGradFunctor : public BaseActivationFunctor { - bool Inplace() const { return IsInplace("ceil"); } template void operator()(Device d, X x, Out out, dOut dout, dX dx) const { dx.device(d) = static_cast(0) / out; } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kNoDeps; } }; // floor(x) = flooring(x) @@ -398,6 +572,8 @@ struct CosGradFunctor : public BaseActivationFunctor { void operator()(Device d, X x, Out out, dOut dout, dX dx) const { dx.device(d) = -dout * x.unaryExpr(Sine()); } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepX; } }; // cosine(x) = cos(x) @@ -417,6 +593,8 @@ struct SinGradFunctor : public BaseActivationFunctor { void operator()(Device d, X x, Out out, dOut dout, dX dx) const { dx.device(d) = dout * x.unaryExpr(Cosine()); } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepX; } }; // sine(x) = sin(x) @@ -428,6 +606,107 @@ struct SinFunctor : public BaseActivationFunctor { } }; +template +struct Acos { + HOSTDEVICE T operator()(const T& val) const { return acos(val); } +}; + +template <> +struct Acos { + HOSTDEVICE platform::float16 operator()(const platform::float16& val) const { + return platform::float16(acos(static_cast(val))); + } +}; + +// Acos(x) = acos(x) +template +struct AcosFunctor : public BaseActivationFunctor { + template + void operator()(Device d, X x, Out out) const { + out.device(d) = x.unaryExpr(Acos()); + } +}; + +// acos'(x) = -1/sqrt(1-x^2) +template +struct AcosGradFunctor : public BaseActivationFunctor { + template + void operator()(Device d, X x, Out out, dOut dout, dX dx) const { + dx.device(d) = + -dout * static_cast(1) / (static_cast(1) - x.square()).sqrt(); + } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepX; } +}; + +template +struct Asin { + HOSTDEVICE T operator()(const T& val) const { return asin(val); } +}; + +template <> +struct Asin { + HOSTDEVICE platform::float16 operator()(const platform::float16& val) const { + return platform::float16(asin(static_cast(val))); + } +}; + +// Asin(x) = asin(x) +template +struct AsinFunctor : public BaseActivationFunctor { + template + void operator()(Device d, X x, Out out) const { + out.device(d) = x.unaryExpr(Asin()); + } +}; + +// asin'(x) = 1/sqrt(1-x^2) +template +struct AsinGradFunctor : public BaseActivationFunctor { + template + void operator()(Device d, X x, Out out, dOut dout, dX dx) const { + dx.device(d) = + dout * static_cast(1) / (static_cast(1) - x.square()).sqrt(); + } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepX; } +}; + +template +struct Atan { + HOSTDEVICE T operator()(const T& val) const { return atan(val); } +}; + +template <> +struct Atan { + HOSTDEVICE platform::float16 operator()(const platform::float16& val) const { + return platform::float16(atan(static_cast(val))); + } +}; + +// Atan(x) = atan(x) +template +struct AtanFunctor : public BaseActivationFunctor { + template + void operator()(Device d, X x, Out out) const { + out.device(d) = x.unaryExpr(Atan()); + } +}; + +// atan'(x) = 1 / (1 + x^2) +template +struct AtanGradFunctor : public BaseActivationFunctor { + template + void operator()(Device d, X x, Out out, dOut dout, dX dx) const { + dx.device(d) = dout * static_cast(1) / (static_cast(1) + x.square()); + } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepX; } +}; + // round(x) = [x] template struct RoundFunctor : public BaseActivationFunctor { @@ -453,6 +732,8 @@ struct AbsGradFunctor : public BaseActivationFunctor { void operator()(Device d, X x, Out out, dOut dout, dX dx) const { dx.device(d) = dout * x.sign(); } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepXOut; } }; // reciprocal(x) = 1 / x @@ -466,12 +747,13 @@ struct ReciprocalFunctor : public BaseActivationFunctor { template struct ReciprocalGradFunctor : public BaseActivationFunctor { - bool Inplace() const { return IsInplace("reciprocal"); } template void operator()(Device d, X x, Out out, dOut dout, dX dx) const { dx.device(d) = dout * static_cast(-1) * out * out; } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepOut; } }; // log(x) = natural logarithm of x @@ -490,6 +772,8 @@ struct LogGradFunctor : public BaseActivationFunctor { void operator()(Device d, X x, Out out, dOut dout, dX dx) const { dx.device(d) = dout * (static_cast(1) / x); } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepX; } }; // square(x) = x^2 @@ -508,6 +792,8 @@ struct SquareGradFunctor : public BaseActivationFunctor { void operator()(Device d, X x, Out out, dOut dout, dX dx) const { dx.device(d) = dout * static_cast(2) * x; } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepX; } }; template @@ -542,6 +828,8 @@ struct BReluGradFunctor : public BaseActivationFunctor { ((x > static_cast(t_min)) * (x < static_cast(t_max))) .template cast(); } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepX; } }; // relu6(x) = min(max(0, x), 6) @@ -566,7 +854,6 @@ struct Relu6GradFunctor : public BaseActivationFunctor { typename BaseActivationFunctor::AttrPair GetAttrs() { return {{"threshold", &threshold}}; } - bool Inplace() const { return IsInplace("relu6"); } template void operator()(Device d, X x, Out out, dOut dout, dX dx) const { @@ -575,6 +862,8 @@ struct Relu6GradFunctor : public BaseActivationFunctor { ((out > static_cast(0)) * (out < static_cast(threshold))) .template cast(); } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepOut; } }; // softplus(x) = log(1 + exp(x)) @@ -604,6 +893,8 @@ struct SoftplusGradFunctor : public BaseActivationFunctor { dx.device(d) = dout * ((x - temp).exp() / ((-temp).exp() + (x - temp).exp())); } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepX; } }; // softsign(x) = x / (1 + |x|) @@ -625,6 +916,8 @@ struct SoftsignGradFunctor : public BaseActivationFunctor { dx.device(d) = dout * (static_cast(1) / (static_cast(1) + x.abs()).square()); } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepX; } }; template @@ -648,7 +941,6 @@ struct SoftReluGradFunctor : public BaseActivationFunctor { typename BaseActivationFunctor::AttrPair GetAttrs() { return {{"threshold", &threshold}}; } - bool Inplace() const { return IsInplace("soft_relu"); } template void operator()(Device d, X x, Out out, dOut dout, dX dx) const { @@ -656,6 +948,8 @@ struct SoftReluGradFunctor : public BaseActivationFunctor { auto temp = ((out > -tmp) * (out < tmp)).template cast().eval(); dx.device(d) = dout * (static_cast(1) - (-out).exp()) * temp; } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepOut; } }; template @@ -685,6 +979,8 @@ struct LeakyReluGradFunctor : public BaseActivationFunctor { auto temp2 = (x >= static_cast(0)).template cast().eval(); dx.device(d) = dout * (temp1 + temp2).template cast(); } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepX; } }; template @@ -712,9 +1008,11 @@ struct ELUGradFunctor : public BaseActivationFunctor { typename dX> void operator()(Device d, X x, Out out, dOut dout, dX dx) const { dx.device(d) = dout * (x > static_cast(0)).template cast() + - dout * (out + static_cast(alpha)) * + dout * static_cast(alpha) * x.exp() * (x < static_cast(0)).template cast(); } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepX; } }; // FIXME(qijun) https://github.com/PaddlePaddle/Paddle/issues/5198 @@ -740,8 +1038,10 @@ struct PowGradFunctor : public BaseActivationFunctor { typename dX> void operator()(Device d, X x, Out out, dOut dout, dX dx) const { dx.device(d) = dout * static_cast(factor) * - x.pow(static_cast(factor - static_cast(1))); + x.pow(static_cast(factor) - static_cast(1)); } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepX; } }; template @@ -775,6 +1075,8 @@ struct STanhGradFunctor : public BaseActivationFunctor { auto temp = (a * x).tanh() * (a * x).tanh(); dx.device(d) = dout * a * b * (static_cast(1) - temp); } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepX; } }; template @@ -804,6 +1106,8 @@ struct ThresholdedReluGradFunctor : public BaseActivationFunctor { auto th = static_cast(threshold); dx.device(d) = dout * (x > th).template cast(); } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepX; } }; template @@ -829,7 +1133,6 @@ struct HardSigmoidGradFunctor : public BaseActivationFunctor { typename BaseActivationFunctor::AttrPair GetAttrs() { return {{"slope", &slope}, {"offset", &offset}}; } - bool Inplace() { return IsInplace("hard_sigmoid"); } template void operator()(Device d, X x, Out out, dOut dout, dX dx) const { @@ -838,6 +1141,8 @@ struct HardSigmoidGradFunctor : public BaseActivationFunctor { .template cast() * static_cast(slope); } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepOut; } }; template @@ -862,45 +1167,54 @@ struct SwishGradFunctor : public BaseActivationFunctor { template - void operator()(Device d, X x, Out out, dOut dout, dX dx) const { + void operator()(Device d, X x, Out fake_out, dOut dout, dX dx) const { auto temp1 = static_cast(1) / (static_cast(1) + (static_cast(-beta) * x).exp()); + auto out = x * temp1; auto temp2 = temp1 * (static_cast(1) - (static_cast(beta) * out)); dx.device(d) = dout * ((static_cast(beta) * out) + temp2); } + + static constexpr ActBwdOpFwdDeps FwdDeps() { return kDepX; } }; } // namespace operators } // namespace paddle -#define FOR_EACH_KERNEL_FUNCTOR(__macro) \ - __macro(sigmoid, SigmoidFunctor, SigmoidGradFunctor); \ - __macro(logsigmoid, LogSigmoidFunctor, LogSigmoidGradFunctor); \ - __macro(exp, ExpFunctor, ExpGradFunctor); \ - __macro(relu, ReluFunctor, ReluGradFunctor); \ - __macro(tanh, TanhFunctor, TanhGradFunctor); \ - __macro(softshrink, SoftShrinkFunctor, SoftShrinkGradFunctor); \ - __macro(sqrt, SqrtFunctor, SqrtGradFunctor); \ - __macro(abs, AbsFunctor, AbsGradFunctor); \ - __macro(ceil, CeilFunctor, ZeroGradFunctor); \ - __macro(floor, FloorFunctor, ZeroGradFunctor); \ - __macro(cos, CosFunctor, CosGradFunctor); \ - __macro(sin, SinFunctor, SinGradFunctor); \ - __macro(round, RoundFunctor, ZeroGradFunctor); \ - __macro(reciprocal, ReciprocalFunctor, ReciprocalGradFunctor); \ - __macro(log, LogFunctor, LogGradFunctor); \ - __macro(square, SquareFunctor, SquareGradFunctor); \ - __macro(brelu, BReluFunctor, BReluGradFunctor); \ - __macro(soft_relu, SoftReluFunctor, SoftReluGradFunctor); \ - __macro(pow, PowFunctor, PowGradFunctor); \ - __macro(stanh, STanhFunctor, STanhGradFunctor); \ - __macro(softplus, SoftplusFunctor, SoftplusGradFunctor); \ - __macro(softsign, SoftsignFunctor, SoftsignGradFunctor); \ - __macro(relu6, Relu6Functor, Relu6GradFunctor); \ - __macro(leaky_relu, LeakyReluFunctor, LeakyReluGradFunctor); \ - __macro(tanh_shrink, TanhShrinkFunctor, TanhShrinkGradFunctor); \ - __macro(elu, ELUFunctor, ELUGradFunctor); \ - __macro(hard_shrink, HardShrinkFunctor, HardShrinkGradFunctor); \ - __macro(hard_sigmoid, HardSigmoidFunctor, HardSigmoidGradFunctor); \ - __macro(swish, SwishFunctor, SwishGradFunctor); \ - __macro(thresholded_relu, ThresholdedReluFunctor, ThresholdedReluGradFunctor); +#define FOR_EACH_ACTIVATION_OP(__macro) \ + __macro(sigmoid, Sigmoid, SigmoidFunctor, SigmoidGradFunctor); \ + __macro(logsigmoid, LogSigmoid, LogSigmoidFunctor, LogSigmoidGradFunctor); \ + __macro(exp, Exp, ExpFunctor, ExpGradFunctor); \ + __macro(relu, Relu, ReluFunctor, ReluGradFunctor); \ + __macro(gelu, Gelu, GeluFunctor, GeluGradFunctor); \ + __macro(tanh, Tanh, TanhFunctor, TanhGradFunctor); \ + __macro(atan, Atan, AtanFunctor, AtanGradFunctor); \ + __macro(softshrink, SoftShrink, SoftShrinkFunctor, SoftShrinkGradFunctor); \ + __macro(sqrt, Sqrt, SqrtFunctor, SqrtGradFunctor); \ + __macro(abs, Abs, AbsFunctor, AbsGradFunctor); \ + __macro(ceil, Ceil, CeilFunctor, ZeroGradFunctor); \ + __macro(floor, Floor, FloorFunctor, ZeroGradFunctor); \ + __macro(cos, Cos, CosFunctor, CosGradFunctor); \ + __macro(acos, Acos, AcosFunctor, AcosGradFunctor); \ + __macro(sin, Sin, SinFunctor, SinGradFunctor); \ + __macro(asin, Asin, AsinFunctor, AsinGradFunctor); \ + __macro(round, Round, RoundFunctor, ZeroGradFunctor); \ + __macro(reciprocal, Reciprocal, ReciprocalFunctor, ReciprocalGradFunctor); \ + __macro(log, Log, LogFunctor, LogGradFunctor); \ + __macro(square, Square, SquareFunctor, SquareGradFunctor); \ + __macro(brelu, BRelu, BReluFunctor, BReluGradFunctor); \ + __macro(soft_relu, SoftRelu, SoftReluFunctor, SoftReluGradFunctor); \ + __macro(pow, Pow, PowFunctor, PowGradFunctor); \ + __macro(stanh, STanh, STanhFunctor, STanhGradFunctor); \ + __macro(softplus, Softplus, SoftplusFunctor, SoftplusGradFunctor); \ + __macro(softsign, Softsign, SoftsignFunctor, SoftsignGradFunctor); \ + __macro(relu6, Relu6, Relu6Functor, Relu6GradFunctor); \ + __macro(leaky_relu, LeakyRelu, LeakyReluFunctor, LeakyReluGradFunctor); \ + __macro(tanh_shrink, TanhShrink, TanhShrinkFunctor, TanhShrinkGradFunctor); \ + __macro(elu, ELU, ELUFunctor, ELUGradFunctor); \ + __macro(hard_shrink, HardShrink, HardShrinkFunctor, HardShrinkGradFunctor); \ + __macro(hard_sigmoid, HardSigmoid, HardSigmoidFunctor, \ + HardSigmoidGradFunctor); \ + __macro(swish, Swish, SwishFunctor, SwishGradFunctor); \ + __macro(thresholded_relu, ThresholdedRelu, ThresholdedReluFunctor, \ + ThresholdedReluGradFunctor); diff --git a/paddle/fluid/operators/adadelta_op.cc b/paddle/fluid/operators/adadelta_op.cc deleted file mode 100644 index 89a7a49e0fa8427826f5d91274912a68f2316b61..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/adadelta_op.cc +++ /dev/null @@ -1,130 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/operators/adadelta_op.h" - -namespace paddle { -namespace operators { - -using Tensor = framework::Tensor; - -class AdadeltaOp : public framework::OperatorWithKernel { - public: - using framework::OperatorWithKernel::OperatorWithKernel; - - void InferShape(framework::InferShapeContext *ctx) const override { - PADDLE_ENFORCE(ctx->HasInput("Param"), - "Input(Param) of AdadeltaOp should not be null."); - PADDLE_ENFORCE(ctx->HasInput("Grad"), - "Input(Grad) of AdadeltaOp should not be null."); - PADDLE_ENFORCE(ctx->HasInput("AvgSquaredGrad"), - "Input(AvgSquaredGrad) of AdadeltaOp should not be null."); - PADDLE_ENFORCE(ctx->HasInput("AvgSquaredUpdate"), - "Input(AvgSquaredUpdate) of AdadeltaOp should not be null."); - PADDLE_ENFORCE( - ctx->GetInputsVarType("Param").front() == - framework::proto::VarType::LOD_TENSOR, - "The input var's type should be LoDTensor, but the received is %s", - ctx->Inputs("Param").front(), ctx->GetInputsVarType("Param").front()); - PADDLE_ENFORCE( - ctx->GetInputsVarType("Grad").front() == - framework::proto::VarType::LOD_TENSOR, - "The input var's type should be LoDTensor, but the received is %s", - ctx->Inputs("Grad").front(), ctx->GetInputsVarType("Grad").front()); - - PADDLE_ENFORCE(ctx->HasOutput("ParamOut"), - "Output(ParamOut) of AdadeltaOp should not be null."); - PADDLE_ENFORCE( - ctx->HasOutput("AvgSquaredGradOut"), - "Output(AvgSquaredGradOut) of AdadeltaOp should not be null."); - PADDLE_ENFORCE( - ctx->HasOutput("AvgSquaredUpdateOut"), - "Output(AvgSquaredUpdateOut) of AdadeltaOp should not be null."); - - auto param_dim = ctx->GetInputDim("Param"); - PADDLE_ENFORCE_EQ( - param_dim, ctx->GetInputDim("Grad"), - "param and grad input of AdadeltaOp should have same dimension"); - PADDLE_ENFORCE_EQ(param_dim, ctx->GetInputDim("AvgSquaredGrad"), - "Param and AvgSquaredGrad input of AdadeltaOp " - "should have same dimension"); - PADDLE_ENFORCE_EQ(param_dim, ctx->GetInputDim("AvgSquaredUpdate"), - "Param and AvgSquaredUpdate input of AdadeltaOp " - "should have same dimension"); - - ctx->SetOutputDim("ParamOut", param_dim); - ctx->SetOutputDim("AvgSquaredGradOut", param_dim); - ctx->SetOutputDim("AvgSquaredUpdateOut", param_dim); - } - - framework::OpKernelType GetExpectedKernelType( - const framework::ExecutionContext &ctx) const override { - auto input_data_type = - framework::ToDataType(ctx.Input("Param")->type()); - return framework::OpKernelType(input_data_type, ctx.GetPlace()); - } -}; - -class AdadeltaOpMaker : public framework::OpProtoAndCheckerMaker { - public: - void Make() override { - AddInput("Param", "(Tensor) Input parameter"); - AddInput("Grad", "(Tensor) Input gradient"); - AddInput("AvgSquaredGrad", "(Tensor) Input average of squared gradient"); - AddInput("AvgSquaredUpdate", - "(Tensor) Input average of squared parameter updates"); - - AddOutput("ParamOut", "(Tensor) Output parameter"); - AddOutput("AvgSquaredGradOut", - "(Tensor) Output average of squared gradient"); - AddOutput("AvgSquaredUpdateOut", - "(Tensor) Output average of squared parameter updates"); - - AddAttr("rho", - "(float, default 0.95) Exponential decay rate " - "for squared gradients.") - .SetDefault(0.95f); - AddAttr("epsilon", - "(float, default 1.0e-6) Constant for " - "numerical stability") - .SetDefault(1.0e-6f); - AddComment(R"DOC( -Adadelta Optimizer. - -Adadelta optimizer is implemented as explained in: -https://arxiv.org/abs/1212.5701 -Adadelta is a per-dimension adaptive learning rate method used -for gradient descent. - -Adadelta updates are as follows: - -$$ -avg\_squared\_grad\_out = \rho * avg\_squared\_grad + (1 - \rho) * grad * grad \\ -param\_update = - \sqrt{\frac{avg\_squared\_update + \epsilon}{avg\_squared\_grad\_out + \epsilon}} * grad \\ -avg\_squared\_update\_out = \rho * avg\_squared\_update + (1 - \rho) * {param\_update}^2 \\ -param\_out = param + param\_update -$$ - -)DOC"); - } -}; - -} // namespace operators -} // namespace paddle - -namespace ops = paddle::operators; -REGISTER_OP_WITHOUT_GRADIENT(adadelta, ops::AdadeltaOp, ops::AdadeltaOpMaker); -REGISTER_OP_CPU_KERNEL( - adadelta, ops::AdadeltaOpKernel, - ops::AdadeltaOpKernel); diff --git a/paddle/fluid/operators/adadelta_op.cu b/paddle/fluid/operators/adadelta_op.cu deleted file mode 100644 index fc10c6657476e7f87b2f703a1d0cb88eeebc35cf..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/adadelta_op.cu +++ /dev/null @@ -1,21 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#define EIGEN_USE_GPU -#include "paddle/fluid/operators/adadelta_op.h" - -namespace ops = paddle::operators; -REGISTER_OP_CUDA_KERNEL( - adadelta, ops::AdadeltaOpKernel, - ops::AdadeltaOpKernel); diff --git a/paddle/fluid/operators/adadelta_op.h b/paddle/fluid/operators/adadelta_op.h deleted file mode 100644 index 6c616aa03d9809e9b7725a700c7edd5ff5d6dc42..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/adadelta_op.h +++ /dev/null @@ -1,80 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#pragma once -#include "paddle/fluid/framework/eigen.h" -#include "paddle/fluid/framework/op_registry.h" - -namespace paddle { -namespace operators { - -template -class AdadeltaOpKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& ctx) const override { - const auto* param_var = ctx.InputVar("Param"); - PADDLE_ENFORCE(param_var->IsType(), - "The Var(%s)'s type should be LoDTensor, " - "but the received is %s", - ctx.Inputs("Param").front(), param_var->Type().name()); - const auto* grad_var = ctx.InputVar("Grad"); - PADDLE_ENFORCE(grad_var->IsType(), - "The Var(%s)'s type should be LoDTensor, " - "but the received is %s", - ctx.Inputs("Grad").front(), grad_var->Type().name()); - - auto param_out_tensor = ctx.Output("ParamOut"); - auto avg_squared_grad_out_tensor = - ctx.Output("AvgSquaredGradOut"); - auto avg_squared_update_out_tensor = - ctx.Output("AvgSquaredUpdateOut"); - - param_out_tensor->mutable_data(ctx.GetPlace()); - avg_squared_grad_out_tensor->mutable_data(ctx.GetPlace()); - avg_squared_update_out_tensor->mutable_data(ctx.GetPlace()); - - T rho = static_cast(ctx.Attr("rho")); - T epsilon = static_cast(ctx.Attr("epsilon")); - - auto param = framework::EigenVector::Flatten( - *ctx.Input("Param")); - auto grad = framework::EigenVector::Flatten( - *ctx.Input("Grad")); - // Squared gradient accumulator - auto avg_squared_grad = framework::EigenVector::Flatten( - *ctx.Input("AvgSquaredGrad")); - // Squared updates accumulator - auto avg_squared_update = framework::EigenVector::Flatten( - *ctx.Input("AvgSquaredUpdate")); - auto param_out = framework::EigenVector::Flatten(*param_out_tensor); - auto avg_squared_grad_out = - framework::EigenVector::Flatten(*avg_squared_grad_out_tensor); - auto avg_squared_update_out = - framework::EigenVector::Flatten(*avg_squared_update_out_tensor); - auto& place = *ctx.template device_context().eigen_device(); - - avg_squared_grad_out.device(place) = - rho * avg_squared_grad + (1 - rho) * grad.square(); - auto update = - -((avg_squared_update + epsilon) / (avg_squared_grad_out + epsilon)) - .sqrt() * - grad; - avg_squared_update_out.device(place) = - rho * avg_squared_update + (1 - rho) * update.square(); - param_out.device(place) = param + update; - } -}; - -} // namespace operators -} // namespace paddle diff --git a/paddle/fluid/operators/adagrad_op.cc b/paddle/fluid/operators/adagrad_op.cc deleted file mode 100644 index a3ef9ad9f91f1f626bd33876693ecc17ad76b96b..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/adagrad_op.cc +++ /dev/null @@ -1,152 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/operators/adagrad_op.h" -#include - -#include - -#include "paddle/fluid/operators/math/math_function.h" -#include "paddle/fluid/operators/math/selected_rows_functor.h" - -namespace paddle { -namespace operators { - -using Tensor = framework::Tensor; -class AdagradOp : public framework::OperatorWithKernel { - public: - using framework::OperatorWithKernel::OperatorWithKernel; - - void InferShape(framework::InferShapeContext* ctx) const override { - PADDLE_ENFORCE(ctx->HasInput("Param"), - "Input(Param) of AdagradOp should not be null."); - PADDLE_ENFORCE(ctx->HasInput("Grad"), - "Input(Grad) of AdagradOp should not be null."); - PADDLE_ENFORCE(ctx->HasInput("Moment"), - "Input(Moment) of AdagradOp should not be null."); - PADDLE_ENFORCE(ctx->HasInput("LearningRate"), - "Input(LearningRate) of AdagradOp should not be null."); - - PADDLE_ENFORCE(ctx->HasOutput("ParamOut"), - "Output(ParamOut) of AdagradOp should not be null."); - PADDLE_ENFORCE(ctx->HasOutput("MomentOut"), - "Output(MomentOut) of AdagradOp should not be null."); - - auto lr_dims = ctx->GetInputDim("LearningRate"); - PADDLE_ENFORCE_EQ(framework::product(lr_dims), 1, - "LearningRate should have one element"); - auto param_dims = ctx->GetInputDim("Param"); - PADDLE_ENFORCE_EQ( - param_dims, ctx->GetInputDim("Grad"), - "Param and Grad input of AdagradOp should have the same dimension."); - PADDLE_ENFORCE_EQ( - param_dims, ctx->GetInputDim("Moment"), - "Param and Moment input of AdagradOp should have the same dimension."); - - ctx->SetOutputDim("ParamOut", param_dims); - ctx->SetOutputDim("MomentOut", param_dims); - } - framework::OpKernelType GetExpectedKernelType( - const framework::ExecutionContext& ctx) const override { - auto input_data_type = - framework::ToDataType(ctx.Input("Param")->type()); - return framework::OpKernelType(input_data_type, ctx.GetPlace()); - } -}; - -class AdagradOpMaker : public framework::OpProtoAndCheckerMaker { - public: - void Make() override { - AddInput("Param", "(Tensor) Input parameter"); - AddInput("Grad", "(Tensor) Input gradient"); - AddInput("Moment", "(Tensor) Second moment"); - AddInput("LearningRate", "(Tensor) Learning rate"); - - AddOutput("ParamOut", "(Tensor) Output parameter"); - AddOutput("MomentOut", "(Tensor) Output second moment"); - - AddAttr("epsilon", - "(float, default 1.0e-6) " - "Constant for numerical stability") - .SetDefault(1.0e-6f); - AddComment(R"DOC( - -Adaptive Gradient Algorithm (Adagrad). - -The update is done as follows: - -$$moment\_out = moment + grad * grad \\ -param\_out = param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + \epsilon} -$$ - -The original paper(http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) -does not have the epsilon attribute. It is added here in our implementation -as also proposed here: http://cs231n.github.io/neural-networks-3/#ada -for numerical stability to avoid the division by zero error. - -)DOC"); - } -}; - -namespace { -size_t FindPos(const std::vector& rows, int64_t value) { - return std::find(rows.begin(), rows.end(), value) - rows.begin(); -} -} // namespace - -template -struct SparseAdagradFunctor { - void operator()(const platform::CPUDeviceContext& context, - const framework::SelectedRows& grad, - const framework::Tensor& learning_rate, T epsilon, - framework::Tensor* moment, framework::Tensor* param) { - // 1. g_m.rows = set(g.rows) - auto grad_width = grad.value().dims()[1]; - math::scatter::MergeAdd merge_func; - auto grad_merge = merge_func(context, grad); - auto& merge_rows = grad_merge.rows(); - auto* grad_merge_data = grad_merge.mutable_value()->template data(); - - // 2. m += g_m * g_m - math::scatter::Mul sqare_func; - auto grad_square = sqare_func(context, grad_merge, grad_merge); - - math::SelectedRowsAddToTensor functor; - functor(context, grad_square, moment); - - // 3. update parameter - auto* lr = learning_rate.data(); - auto* param_data = param->data(); - auto* moment_data = moment->data(); - - for (size_t i = 0; i < merge_rows.size(); i++) { - for (int64_t j = 0; j < grad_width; j++) { - param_data[merge_rows[i] * grad_width + j] -= - lr[0] * grad_merge_data[i * grad_width + j] / - (std::sqrt(moment_data[merge_rows[i] * grad_width + j]) + epsilon); - } - } - } -}; - -template struct SparseAdagradFunctor; -template struct SparseAdagradFunctor; -} // namespace operators -} // namespace paddle - -namespace ops = paddle::operators; -REGISTER_OP_WITHOUT_GRADIENT(adagrad, ops::AdagradOp, ops::AdagradOpMaker); -REGISTER_OP_CPU_KERNEL( - adagrad, ops::AdagradOpKernel, - ops::AdagradOpKernel); diff --git a/paddle/fluid/operators/adagrad_op.cu b/paddle/fluid/operators/adagrad_op.cu deleted file mode 100644 index b25268786d622bc7a94117849763833e528bef48..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/adagrad_op.cu +++ /dev/null @@ -1,119 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#define EIGEN_USE_GPU -#include "paddle/fluid/operators/adagrad_op.h" -#include "paddle/fluid/operators/math/math_function.h" -#include "paddle/fluid/operators/math/selected_rows_functor.h" -#include "paddle/fluid/platform/cuda_primitives.h" - -namespace paddle { -namespace operators { - -namespace { - -template -__global__ void MergeGradKernel(const T* grad, const int64_t* grad_rows, - T* grad_merge, const int64_t* grad_merge_rows, - size_t grad_merge_rows_size, - int64_t row_numel) { - const int ty = blockIdx.y; - int tid = threadIdx.x; - __shared__ size_t grad_merge_idx; - - if (tid == 0) { - for (size_t i = 0; i < grad_merge_rows_size; i++) { - if (grad_rows[ty] == grad_merge_rows[i]) { - grad_merge_idx = i; - } - } - } - - __syncthreads(); - - grad += ty * row_numel; - grad_merge += grad_merge_idx * row_numel; - for (int index = tid; index < row_numel; index += block_size) { - paddle::platform::CudaAtomicAdd(grad_merge + index, grad[index]); - } -} - -template -__global__ void SparseAdagradFunctorKernel(const T* grad, const int64_t* rows, - const T* learning_rate, T* param, - T* moment, int64_t row_numel, - T epsilon) { - const int ty = blockIdx.y; - int tid = threadIdx.x; - - grad += ty * row_numel; - param += rows[ty] * row_numel; - moment += rows[ty] * row_numel; - - for (int index = tid; index < row_numel; index += block_size) { - // Since index in rows of SelectedRows can be duplicate, we have to use - // Atomic Operation to avoid concurrent write error. - paddle::platform::CudaAtomicAdd(param + index, - -1.0 * learning_rate[0] * grad[index] / - (sqrt(moment[index]) + epsilon)); - } -} -} // namespace - -template -struct SparseAdagradFunctor { - void operator()(const platform::CUDADeviceContext& context, - const framework::SelectedRows& grad, - const framework::Tensor& learning_rate, T epsilon, - framework::Tensor* moment, framework::Tensor* param) { - // 1. g_m.rows = set(g.rows) - auto grad_width = grad.value().dims()[1]; - math::scatter::MergeAdd merge_func; - auto grad_merge = merge_func(context, grad); - auto* grad_merge_data = grad_merge.mutable_value()->template data(); - framework::Vector merge_rows(grad_merge.rows()); - // 2. m += g_m * g_m - math::scatter::Mul sqare_func; - auto grad_square = sqare_func(context, grad_merge, grad_merge); - - math::SelectedRowsAddToTensor functor; - functor(context, grad_square, moment); - - // 3. update parameter - auto* lr = learning_rate.data(); - auto* param_data = param->data(); - auto* moment_data = moment->data(); - - const int block_size = 256; - dim3 threads(block_size, 1); - dim3 grid2(1, merge_rows.size()); - SparseAdagradFunctorKernel< - T, 256><<(context) - .stream()>>>( - grad_merge_data, merge_rows.CUDAMutableData(context.GetPlace()), lr, - param_data, moment_data, grad_width, epsilon); - } -}; - -template struct SparseAdagradFunctor; -template struct SparseAdagradFunctor; - -} // namespace operators -} // namespace paddle - -namespace ops = paddle::operators; -REGISTER_OP_CUDA_KERNEL( - adagrad, ops::AdagradOpKernel, - ops::AdagradOpKernel); diff --git a/paddle/fluid/operators/adagrad_op.h b/paddle/fluid/operators/adagrad_op.h deleted file mode 100644 index 0a16ce00f71586ef55007c3753e024be29d0ed56..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/adagrad_op.h +++ /dev/null @@ -1,94 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#pragma once - -#include "paddle/fluid/framework/eigen.h" -#include "paddle/fluid/framework/op_registry.h" - -namespace paddle { -namespace operators { - -template -struct SparseAdagradFunctor { - void operator()(const DeviceContext &context, - const framework::SelectedRows &grad, - const framework::Tensor &learning_rate, T epsilon, - framework::Tensor *moment, framework::Tensor *param); -}; - -template -class AdagradOpKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext &ctx) const override { - const auto *param_var = ctx.InputVar("Param"); - PADDLE_ENFORCE(param_var->IsType(), - "The Var(%s)'s type should be LoDTensor, " - "but the received is %s", - ctx.Inputs("Param").front(), param_var->Type().name()); - - auto *param_out_tensor = ctx.Output("ParamOut"); - auto *moment_out_tensor = ctx.Output("MomentOut"); - - param_out_tensor->mutable_data(ctx.GetPlace()); - moment_out_tensor->mutable_data(ctx.GetPlace()); - - T epsilon = static_cast(ctx.Attr("epsilon")); - - auto *grad_var = ctx.InputVar("Grad"); - if (grad_var->IsType()) { - auto param = framework::EigenVector::Flatten( - *ctx.Input("Param")); - auto grad = framework::EigenVector::Flatten( - *ctx.Input("Grad")); - auto moment = framework::EigenVector::Flatten( - *ctx.Input("Moment")); - auto *learning_rate = ctx.Input("LearningRate"); - - auto param_out = framework::EigenVector::Flatten(*param_out_tensor); - auto moment_out = framework::EigenVector::Flatten(*moment_out_tensor); - auto *place = ctx.template device_context().eigen_device(); - - moment_out.device(*place) = moment + grad * grad; - Eigen::DSizes m_dsize(moment_out_tensor->numel()); - if (platform::is_cpu_place(ctx.GetPlace())) { - auto *lr = learning_rate->data(); - param_out.device(*place) = - param - lr[0] * grad / (moment_out.sqrt() + epsilon); - } else { - auto lr = framework::EigenVector::Flatten(*learning_rate); - param_out.device(*place) = - param - - lr.broadcast(m_dsize) * grad / (moment_out.sqrt() + epsilon); - } - } else if (grad_var->IsType()) { - auto *param_tensor = ctx.Input("Param"); - PADDLE_ENFORCE_EQ(param_tensor, param_out_tensor); - - auto *moment_tensor = ctx.Input("Moment"); - PADDLE_ENFORCE_EQ(moment_tensor, moment_out_tensor); - - SparseAdagradFunctor functor; - functor(ctx.template device_context(), - *ctx.Input("Grad"), - *ctx.Input("LearningRate"), epsilon, - moment_out_tensor, param_out_tensor); - } else { - PADDLE_THROW("Unsupported Variable Type of Grad"); - } - } -}; - -} // namespace operators -} // namespace paddle diff --git a/paddle/fluid/operators/adam_op.cc b/paddle/fluid/operators/adam_op.cc deleted file mode 100644 index 5d670fe3b9d99a31a628ff707ff860564eca952e..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/adam_op.cc +++ /dev/null @@ -1,142 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/operators/adam_op.h" - -namespace paddle { -namespace operators { - -using Tensor = framework::Tensor; -class AdamOp : public framework::OperatorWithKernel { - public: - using framework::OperatorWithKernel::OperatorWithKernel; - - void InferShape(framework::InferShapeContext *ctx) const override { - PADDLE_ENFORCE(ctx->HasInput("Param"), - "Input(Param) of AdamOp should not be null."); - PADDLE_ENFORCE(ctx->HasInput("Grad"), - "Input(Grad) of AdamOp should not be null."); - PADDLE_ENFORCE(ctx->HasInput("Moment1"), - "Input(Moment1) of AdamOp should not be null."); - PADDLE_ENFORCE(ctx->HasInput("Moment2"), - "Input(Moment2) of AdamOp should not be null."); - PADDLE_ENFORCE(ctx->HasInput("LearningRate"), - "Input(LearningRate) of AdamOp should not be null."); - PADDLE_ENFORCE(ctx->HasInput("Beta1Pow"), - "Input(Beta1Pow) of AdamOp should not be null."); - PADDLE_ENFORCE(ctx->HasInput("Beta2Pow"), - "Input(Beta2Pow) of AdamOp should not be null."); - - PADDLE_ENFORCE(ctx->HasOutput("ParamOut"), - "Output(ParamOut) of AdamOp should not be null."); - PADDLE_ENFORCE(ctx->HasOutput("Moment1Out"), - "Output(Moment1Out) of AdamOp should not be null."); - PADDLE_ENFORCE(ctx->HasOutput("Moment2Out"), - "Output(Moment2Out) of AdamOp should not be null."); - - auto lr_dims = ctx->GetInputDim("LearningRate"); - PADDLE_ENFORCE_EQ(framework::product(lr_dims), 1, - "Learning rate should have 1 dimension"); - auto beta1_pow_dims = ctx->GetInputDim("Beta1Pow"); - PADDLE_ENFORCE_EQ(framework::product(beta1_pow_dims), 1, - "Beta1 power accumulator should have 1 dimension"); - auto beta2_pow_dims = ctx->GetInputDim("Beta2Pow"); - PADDLE_ENFORCE_EQ(framework::product(beta2_pow_dims), 1, - "Beta2 power accumulator should have 1 dimension"); - - auto param_dims = ctx->GetInputDim("Param"); - if (ctx->GetInputsVarType("Grad")[0] == - framework::proto::VarType::LOD_TENSOR) { - PADDLE_ENFORCE_EQ( - param_dims, ctx->GetInputDim("Grad"), - "Param and Grad input of AdamOp should have same dimension"); - } - PADDLE_ENFORCE_EQ( - param_dims, ctx->GetInputDim("Moment1"), - "Param and Moment1 input of AdamOp should have same dimension"); - PADDLE_ENFORCE_EQ( - param_dims, ctx->GetInputDim("Moment2"), - "Param and Moment2 input of AdamOp should have same dimension"); - - ctx->SetOutputDim("ParamOut", param_dims); - ctx->SetOutputDim("Moment1Out", param_dims); - ctx->SetOutputDim("Moment2Out", param_dims); - } - framework::OpKernelType GetExpectedKernelType( - const framework::ExecutionContext &ctx) const override { - auto input_data_type = - framework::ToDataType(ctx.Input("Param")->type()); - return framework::OpKernelType(input_data_type, ctx.GetPlace()); - } -}; - -class AdamOpMaker : public framework::OpProtoAndCheckerMaker { - public: - void Make() override { - AddInput("Param", "(Tensor) Input parameter"); - AddInput("Grad", "(Tensor) Input gradient"); - AddInput("LearningRate", "(Tensor) Learning rate"); - AddInput("Moment1", "(Tensor) Input first moment"); - AddInput("Moment2", "(Tensor) Input second moment"); - AddInput("Beta1Pow", "(Tensor) Input beta1 power accumulator"); - AddInput("Beta2Pow", "(Tensor) Input beta2 power accumulator"); - - AddOutput("ParamOut", "(Tensor) Output parameter").Reuse("Param"); - AddOutput("Moment1Out", "(Tensor) Output first moment").Reuse("Moment1"); - AddOutput("Moment2Out", "(Tensor) Output second moment").Reuse("Moment2"); - - AddAttr("beta1", - "(float, default 0.9) " - "Exponential decay rate for the " - "first moment estimates.") - .SetDefault(0.9f); - AddAttr("beta2", - "(float, default 0.999) " - "exponential decay rate for the " - "second moment estimates.") - .SetDefault(0.999f); - AddAttr("epsilon", - "(float, default 1.0e-8) " - "Constant for numerical stability") - .SetDefault(1.0e-8f); - - AddComment(R"DOC( -Adam Optimizer. - -This implements the Adam optimizer from Section 2 of the Adam -paper : https://arxiv.org/abs/1412.6980. -Adam is a first-order gradient-based optimization method based on -adaptive estimates of lower-order moments. - -Adam updates: - -$$ -moment\_1\_out = \beta_1 * moment\_1 + (1 - \beta_1) * grad \\ -moment\_2_\out = \beta_2 * moment\_2 + (1 - \beta_2) * grad * grad \\ -learning\_rate = learning\_rate * - \frac{\sqrt{1 - \beta_{2\_pow}}}{1 - \beta_{1\_pow}} \\ -param\_out = param - learning\_rate * \frac{moment\_1}{\sqrt{moment\_2} + \epsilon} -$$ - -)DOC"); - } -}; -} // namespace operators -} // namespace paddle - -namespace ops = paddle::operators; -REGISTER_OP_WITHOUT_GRADIENT(adam, ops::AdamOp, ops::AdamOpMaker); -REGISTER_OP_CPU_KERNEL( - adam, ops::AdamOpKernel, - ops::AdamOpKernel); diff --git a/paddle/fluid/operators/adam_op.cu b/paddle/fluid/operators/adam_op.cu deleted file mode 100644 index 77f1991002e6007e8b8dff4746739a90e836145d..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/adam_op.cu +++ /dev/null @@ -1,21 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#define EIGEN_USE_GPU -#include "paddle/fluid/operators/adam_op.h" - -namespace ops = paddle::operators; -REGISTER_OP_CUDA_KERNEL( - adam, ops::AdamOpKernel, - ops::AdamOpKernel); diff --git a/paddle/fluid/operators/adam_op.h b/paddle/fluid/operators/adam_op.h deleted file mode 100644 index 3455d1ee54e8e6e498d0b0e6932ec099af9c0b30..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/adam_op.h +++ /dev/null @@ -1,366 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#pragma once -#include // for sqrt in CPU and CUDA -#include -#include -#include "paddle/fluid/framework/op_registry.h" -#include "paddle/fluid/operators/detail/safe_ref.h" -#include "paddle/fluid/operators/math/algorithm.h" -#include "paddle/fluid/operators/math/selected_rows_functor.h" -#include "paddle/fluid/platform/for_range.h" - -namespace paddle { -namespace operators { - -namespace scatter = paddle::operators::math::scatter; - -struct GPUAdam; -struct CPUAdam; - -template -struct AdamFunctor; - -template -struct AdamFunctor { - T beta1_; - T beta2_; - T epsilon_; - - const T* beta1_pow_; - const T* beta2_pow_; - const T* moment1_; - T* moment1_out_; - const T* moment2_; - T* moment2_out_; - const T* lr_; - const T* grad_; - const T* param_; - T* param_out_; - - AdamFunctor(T beta1, T beta2, T epsilon, const T* beta1_pow, - const T* beta2_pow, const T* mom1, T* mom1_out, const T* mom2, - T* mom2_out, const T* lr, const T* grad, const T* param, - T* param_out) - : beta1_(beta1), - beta2_(beta2), - epsilon_(epsilon), - beta1_pow_(beta1_pow), - beta2_pow_(beta2_pow), - moment1_(mom1), - moment1_out_(mom1_out), - moment2_(mom2), - moment2_out_(mom2_out), - lr_(lr), - grad_(grad), - param_(param), - param_out_(param_out) {} - - inline HOSTDEVICE void operator()(size_t i) const { - // Merge all memory access together. - T g = grad_[i]; - T mom1 = moment1_[i]; - T mom2 = moment2_[i]; - T lr = *lr_; - T beta1_pow = *beta1_pow_; - T beta2_pow = *beta2_pow_; - T p = param_[i]; - - // Calculation - lr *= sqrt(1 - beta2_pow) / (1 - beta1_pow); - - mom1 = beta1_ * mom1 + (1 - beta1_) * g; - mom2 = beta2_ * mom2 + (1 - beta2_) * g * g; - p -= lr * (mom1 / (sqrt(mom2) + epsilon_)); - - // Write back to global memory - moment1_out_[i] = mom1; - moment2_out_[i] = mom2; - param_out_[i] = p; - } -}; - -template -struct AdamFunctor { - T beta1_; - T beta2_; - T epsilon_; - - const T* beta1_pow_; - const T* beta2_pow_; - const T* moment1_; - T* moment1_out_; - const T* moment2_; - T* moment2_out_; - const T* lr_; - const T* grad_; - const T* param_; - T* param_out_; - - AdamFunctor(T beta1, T beta2, T epsilon, const T* beta1_pow, - const T* beta2_pow, const T* mom1, T* mom1_out, const T* mom2, - T* mom2_out, const T* lr, const T* grad, const T* param, - T* param_out) - : beta1_(beta1), - beta2_(beta2), - epsilon_(epsilon), - beta1_pow_(beta1_pow), - beta2_pow_(beta2_pow), - moment1_(mom1), - moment1_out_(mom1_out), - moment2_(mom2), - moment2_out_(mom2_out), - lr_(lr), - grad_(grad), - param_(param), - param_out_(param_out) {} - - void operator()(size_t numel) const { - Eigen::Map> g{ - grad_, static_cast(numel)}; - Eigen::Map> mom1{ - moment1_, static_cast(numel)}; - Eigen::Map> mom2{ - moment2_, static_cast(numel)}; - Eigen::Map> param{ - param_, static_cast(numel)}; - - Eigen::Map> param_out{ - param_out_, static_cast(numel)}; - Eigen::Map> moment1_out{ - moment1_out_, static_cast(numel)}; - Eigen::Map> moment2_out{ - moment2_out_, static_cast(numel)}; - - T lr = *lr_; - T beta1_pow = *beta1_pow_; - T beta2_pow = *beta2_pow_; - - // Calculation - lr *= sqrt(1 - beta2_pow) / (1 - beta1_pow); - - moment1_out = beta1_ * mom1 + (1 - beta1_) * g; - moment2_out = beta2_ * mom2 + (1 - beta2_) * g * g; - param_out = param - lr * (moment1_out / (moment2_out.sqrt() + epsilon_)); - } -}; - -template -struct SparseAdamFunctor { - T beta1_; - T beta2_; - T epsilon_; - - const T* beta1_pow_; - const T* beta2_pow_; - const T* moment1_; - T* moment1_out_; - const T* moment2_; - T* moment2_out_; - const T* lr_; - const T* grad_; - const T* param_; - T* param_out_; - - const int64_t* rows_; - int64_t row_numel_; - int64_t row_count_; - - SparseAdamFunctor(T beta1, T beta2, T epsilon, const T* beta1_pow, - const T* beta2_pow, const T* mom1, T* mom1_out, - const T* mom2, T* mom2_out, const T* lr, const T* grad, - const T* param, T* param_out, const int64_t* rows, - int64_t row_numel, int64_t row_count) - : beta1_(beta1), - beta2_(beta2), - epsilon_(epsilon), - beta1_pow_(beta1_pow), - beta2_pow_(beta2_pow), - moment1_(mom1), - moment1_out_(mom1_out), - moment2_(mom2), - moment2_out_(mom2_out), - lr_(lr), - grad_(grad), - param_(param), - param_out_(param_out), - rows_(rows), - row_numel_(row_numel), - row_count_(row_count) {} - - inline HOSTDEVICE void operator()(size_t i) const { - auto row_idx = - math::BinarySearch(rows_, row_count_, i / row_numel_); - T g = row_idx >= 0 ? grad_[row_idx * row_numel_ + i % row_numel_] : 0; - - // The following code is the same as dense - T mom1 = moment1_[i]; - T mom2 = moment2_[i]; - T lr = *lr_; - T beta1_pow = *beta1_pow_; - T beta2_pow = *beta2_pow_; - T p = param_[i]; - - // Calculation - lr *= sqrt(1 - beta2_pow) / (1 - beta1_pow); - - mom1 = beta1_ * mom1 + (1 - beta1_) * g; - mom2 = beta2_ * mom2 + (1 - beta2_) * g * g; - p -= lr * (mom1 / (sqrt(mom2) + epsilon_)); - - // Write back to global memory - moment1_out_[i] = mom1; - moment2_out_[i] = mom2; - param_out_[i] = p; - } -}; - -template -class AdamOpKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& ctx) const override { - const auto* param_var = ctx.InputVar("Param"); - PADDLE_ENFORCE(param_var->IsType(), - "The Var(%s)'s type should be LoDTensor, " - "but the received is %s", - ctx.Inputs("Param").front(), param_var->Type().name()); - - using paddle::framework::LoDTensor; - using paddle::operators::detail::Ref; - - T beta1 = static_cast(ctx.Attr("beta1")); - T beta2 = static_cast(ctx.Attr("beta2")); - T epsilon = static_cast(ctx.Attr("epsilon")); - auto& param = Ref(ctx.Input("Param"), "Must set Param"); - // auto& grad = Ref(ctx.Input("Grad"), "Must set Grad"); - auto* grad_var = ctx.InputVar("Grad"); - auto& mom1 = Ref(ctx.Input("Moment1"), "Must set Moment1"); - auto& mom2 = Ref(ctx.Input("Moment2"), "Must set Moment2"); - auto& lr = - Ref(ctx.Input("LearningRate"), "Must set LearningRate"); - - auto& beta1_pow = - Ref(ctx.Input("Beta1Pow"), "Must set Beta1Pow"); - auto& beta2_pow = - Ref(ctx.Input("Beta2Pow"), "Must set Beta2Pow"); - - auto& param_out = - Ref(ctx.Output("ParamOut"), "Must set ParamOut"); - auto& mom1_out = - Ref(ctx.Output("Moment1Out"), "Must set Moment1Out"); - auto& mom2_out = - Ref(ctx.Output("Moment2Out"), "Must set Moment1Out"); - - if (grad_var->IsType()) { - auto& grad = Ref(ctx.Input("Grad"), "Must set Grad"); - - if (platform::is_cpu_place(ctx.GetPlace())) { - AdamFunctor functor( - beta1, beta2, epsilon, beta1_pow.template data(), - beta2_pow.template data(), mom1.template data(), - mom1_out.template mutable_data(ctx.GetPlace()), - mom2.template data(), - mom2_out.template mutable_data(ctx.GetPlace()), - lr.template data(), grad.template data(), - param.template data(), - param_out.template mutable_data(ctx.GetPlace())); - functor(param.numel()); - } else if (platform::is_gpu_place(ctx.GetPlace())) { - AdamFunctor functor( - beta1, beta2, epsilon, beta1_pow.template data(), - beta2_pow.template data(), mom1.template data(), - mom1_out.template mutable_data(ctx.GetPlace()), - mom2.template data(), - mom2_out.template mutable_data(ctx.GetPlace()), - lr.template data(), grad.template data(), - param.template data(), - param_out.template mutable_data(ctx.GetPlace())); - - platform::ForRange for_range( - static_cast(ctx.device_context()), - param.numel()); - for_range(functor); - } - } else if (grad_var->IsType()) { - auto& grad = - Ref(ctx.Input("Grad"), "Must set Grad"); - if (grad.rows().size() == 0) { - VLOG(3) << "grad row size is 0!!"; - return; - } - - std::vector cpu_rows(grad.rows().begin(), grad.rows().end()); - bool is_strict_sorted = true; - for (size_t i = 1; i < cpu_rows.size(); ++i) { - if (cpu_rows[i - 1] >= cpu_rows[i]) { - is_strict_sorted = false; - break; - } - } - - const framework::SelectedRows* grad_merge_ptr; - if (is_strict_sorted) { - grad_merge_ptr = &grad; - } else { - // merge duplicated rows if any. - // The rows of grad_merge have been sorted inside MergeAdd functor - scatter::MergeAdd merge_func; - auto* grad_merge_var = const_cast(ctx.scope()) - .Var() - ->GetMutable(); - merge_func(ctx.template device_context(), grad, - grad_merge_var); - grad_merge_ptr = grad_merge_var; - } - - auto& grad_merge = *grad_merge_ptr; - auto& grad_tensor = grad_merge.value(); - const T* grad_data = grad_tensor.template data(); - const int64_t* rows = nullptr; -// When compiled without CUDA, the CUDAData() interface should not be -// provided. -#if defined(PADDLE_WITH_CUDA) - if (platform::is_gpu_place(ctx.GetPlace())) { - rows = grad_merge.rows().CUDAData(ctx.GetPlace()); - } else { -#endif - rows = grad_merge.rows().data(); - -#if defined(PADDLE_WITH_CUDA) - } -#endif - auto row_numel = grad_tensor.numel() / grad_merge.rows().size(); - - SparseAdamFunctor functor( - beta1, beta2, epsilon, beta1_pow.template data(), - beta2_pow.template data(), mom1.template data(), - mom1_out.template mutable_data(ctx.GetPlace()), - mom2.template data(), - mom2_out.template mutable_data(ctx.GetPlace()), - lr.template data(), grad_data, param.template data(), - param_out.template mutable_data(ctx.GetPlace()), rows, row_numel, - grad_merge.rows().size()); - platform::ForRange for_range( - static_cast(ctx.device_context()), - param.numel()); - for_range(functor); - } else { - PADDLE_THROW("Variable type not supported by adam_op"); - } - } -}; - -} // namespace operators -} // namespace paddle diff --git a/paddle/fluid/operators/adamax_op.cc b/paddle/fluid/operators/adamax_op.cc deleted file mode 100644 index d4aa4d338a2379adf985ba7f89b528bc402eda06..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/adamax_op.cc +++ /dev/null @@ -1,148 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/operators/adamax_op.h" - -namespace paddle { -namespace operators { - -using Tensor = framework::Tensor; -class AdamaxOp : public framework::OperatorWithKernel { - public: - using framework::OperatorWithKernel::OperatorWithKernel; - - void InferShape(framework::InferShapeContext *ctx) const override { - PADDLE_ENFORCE(ctx->HasInput("Param"), - "Input(Param) of AdamaxOp should not be null."); - PADDLE_ENFORCE(ctx->HasInput("Grad"), - "Input(Grad) of AdamaxOp should not be null."); - PADDLE_ENFORCE(ctx->HasInput("Moment"), - "Input(Moment) of AdamaxOp should not be null."); - PADDLE_ENFORCE(ctx->HasInput("InfNorm"), - "Input(InfNorm) of AdamaxOp should not be null."); - PADDLE_ENFORCE(ctx->HasInput("LearningRate"), - "Input(LearningRate) of AdamaxOp should not be null."); - PADDLE_ENFORCE(ctx->HasInput("Beta1Pow"), - "Input(Beta1Pow) of AdamaxOp should not be null."); - PADDLE_ENFORCE( - ctx->GetInputsVarType("Param").front() == - framework::proto::VarType::LOD_TENSOR, - "The input var's type should be LoDTensor, but the received is %s", - ctx->Inputs("Param").front(), ctx->GetInputsVarType("Param").front()); - PADDLE_ENFORCE( - ctx->GetInputsVarType("Grad").front() == - framework::proto::VarType::LOD_TENSOR, - "The input var's type should be LoDTensor, but the received is %s", - ctx->Inputs("Grad").front(), ctx->GetInputsVarType("Grad").front()); - - PADDLE_ENFORCE(ctx->HasOutput("ParamOut"), - "Output(ParamOut) of AdamaxOp should not be null."); - PADDLE_ENFORCE(ctx->HasOutput("MomentOut"), - "Output(MomentOut) of AdamaxOp should not be null."); - PADDLE_ENFORCE(ctx->HasOutput("InfNormOut"), - "Output(InfNormOut) of AdamaxOp should not be null."); - - auto lr_dims = ctx->GetInputDim("LearningRate"); - PADDLE_ENFORCE_EQ(framework::product(lr_dims), 1, - "Learning rate should have 1 dimension"); - auto beta1_pow_dims = ctx->GetInputDim("Beta1Pow"); - PADDLE_ENFORCE_EQ(framework::product(beta1_pow_dims), 1, - "Beta1 power accumulator should have 1 dimension"); - auto param_dims = ctx->GetInputDim("Param"); - PADDLE_ENFORCE_EQ( - param_dims, ctx->GetInputDim("Grad"), - "Param and Grad input of AdamaxOp should have same dimension"); - PADDLE_ENFORCE_EQ( - param_dims, ctx->GetInputDim("Moment"), - "Param and Moment input of AdamaxOp should have same dimension"); - PADDLE_ENFORCE_EQ( - param_dims, ctx->GetInputDim("InfNorm"), - "Param and InfNorm input of AdamaxOp should have same dimension"); - - ctx->SetOutputDim("ParamOut", param_dims); - ctx->SetOutputDim("MomentOut", param_dims); - ctx->SetOutputDim("InfNormOut", param_dims); - } - framework::OpKernelType GetExpectedKernelType( - const framework::ExecutionContext &ctx) const override { - auto input_data_type = - framework::ToDataType(ctx.Input("Param")->type()); - return framework::OpKernelType(input_data_type, ctx.GetPlace()); - } -}; - -class AdamaxOpMaker : public framework::OpProtoAndCheckerMaker { - public: - void Make() override { - AddInput("Param", "(Tensor) Input parameter"); - AddInput("Grad", "(Tensor) Input gradient"); - AddInput("LearningRate", "(Tensor) Learning rate"); - AddInput("Moment", "(Tensor) First moment"); - AddInput("InfNorm", - "(Tensor) " - "Input exponentially weighted infinity norm"); - AddInput("Beta1Pow", "(Tensor) Input beta1 power accumulator"); - - AddOutput("ParamOut", "(Tensor) Output parameter"); - AddOutput("MomentOut", "(Tensor) Output first moment"); - AddOutput("InfNormOut", - "(Tensor) " - "Output exponentially weighted infinity norm"); - - AddAttr("beta1", - "(float, default 0.9) " - "Exponential decay rate for the " - "1st moment estimates.") - .SetDefault(0.9f); - AddAttr("beta2", - "(float, default 0.999) " - "exponential decay rate for the weighted " - "infinity norm estimates.") - .SetDefault(0.999f); - AddAttr("epsilon", - "(float, default 1.0e-8) " - "Constant for numerical stability") - .SetDefault(1.0e-8f); - AddComment(R"DOC( -Adamax Optimizer. - -We implement the Adamax optimizer from Section 7 of the Adam -paper: https://arxiv.org/abs/1412.6980. Adamax is a variant of the -Adam algorithm based on the infinity norm. - -Adamax updates: - -$$ -moment\_out = \beta_1 * moment + (1 - \beta_1) * grad \\ -inf\_norm\_out = max(\beta_2 * inf\_norm + \epsilon, |grad|) \\ -learning\_rate = \frac{learning\_rate}{1 - \beta_{1\_pow}} \\ -param\_out = param - learning\_rate * \frac{moment\_out}{inf\_norm\_out} -$$ - -The original paper does not have an epsilon attribute. -However, it is added here for numerical stability to prevent the -division by 0 error. - -)DOC"); - } -}; - -} // namespace operators -} // namespace paddle - -namespace ops = paddle::operators; -REGISTER_OP_WITHOUT_GRADIENT(adamax, ops::AdamaxOp, ops::AdamaxOpMaker); -REGISTER_OP_CPU_KERNEL( - adamax, ops::AdamaxOpKernel, - ops::AdamaxOpKernel); diff --git a/paddle/fluid/operators/adamax_op.cu b/paddle/fluid/operators/adamax_op.cu deleted file mode 100644 index 05cafd7a8eef79588d1d5724084586cb9b51d3d4..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/adamax_op.cu +++ /dev/null @@ -1,21 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#define EIGEN_USE_GPU -#include "paddle/fluid/operators/adamax_op.h" - -namespace ops = paddle::operators; -REGISTER_OP_CUDA_KERNEL( - adamax, ops::AdamaxOpKernel, - ops::AdamaxOpKernel); diff --git a/paddle/fluid/operators/adamax_op.h b/paddle/fluid/operators/adamax_op.h deleted file mode 100644 index 7137fbd9651b4523f6d1609a0595b30758aa40df..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/adamax_op.h +++ /dev/null @@ -1,78 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#pragma once -#include "paddle/fluid/framework/eigen.h" -#include "paddle/fluid/framework/op_registry.h" - -namespace paddle { -namespace operators { - -template -class AdamaxOpKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& ctx) const override { - const auto* param_var = ctx.InputVar("Param"); - PADDLE_ENFORCE(param_var->IsType(), - "The Var(%s)'s type should be LoDTensor, " - "but the received is %s", - ctx.Inputs("Param").front(), param_var->Type().name()); - const auto* grad_var = ctx.InputVar("Grad"); - PADDLE_ENFORCE(grad_var->IsType(), - "The Var(%s)'s type should be LoDTensor, " - "but the received is %s", - ctx.Inputs("Grad").front(), grad_var->Type().name()); - - auto param_out_tensor = ctx.Output("ParamOut"); - auto moment_out_tensor = ctx.Output("MomentOut"); - auto inf_norm_out_tensor = ctx.Output("InfNormOut"); - - param_out_tensor->mutable_data(ctx.GetPlace()); - moment_out_tensor->mutable_data(ctx.GetPlace()); - inf_norm_out_tensor->mutable_data(ctx.GetPlace()); - - T beta1 = static_cast(ctx.Attr("beta1")); - T beta2 = static_cast(ctx.Attr("beta2")); - T epsilon = static_cast(ctx.Attr("epsilon")); - - auto param = framework::EigenVector::Flatten( - *ctx.Input("Param")); - auto grad = framework::EigenVector::Flatten( - *ctx.Input("Grad")); - auto moment = framework::EigenVector::Flatten( - *ctx.Input("Moment")); - auto inf_norm = framework::EigenVector::Flatten( - *ctx.Input("InfNorm")); - auto lr = framework::EigenVector::Flatten( - *ctx.Input("LearningRate")); - auto beta1_pow = framework::EigenVector::Flatten( - *ctx.Input("Beta1Pow")); - auto param_out = framework::EigenVector::Flatten(*param_out_tensor); - auto moment_out = framework::EigenVector::Flatten(*moment_out_tensor); - auto inf_norm_out = - framework::EigenVector::Flatten(*inf_norm_out_tensor); - auto* place = ctx.template device_context().eigen_device(); - - moment_out.device(*place) = beta1 * moment + (1 - beta1) * grad; - inf_norm_out.device(*place) = - grad.abs().cwiseMax((beta2 * inf_norm) + epsilon); - auto lr_t = lr / (1 - beta1_pow); - Eigen::DSizes m_dsize(moment_out_tensor->numel()); - param_out.device(*place) = - param - lr_t.broadcast(m_dsize) * (moment_out / inf_norm_out); - } -}; - -} // namespace operators -} // namespace paddle diff --git a/paddle/fluid/operators/add_position_encoding_op.cc b/paddle/fluid/operators/add_position_encoding_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..3882bbedaa0be0ba14bca9c4fcb626d5ecaab129 --- /dev/null +++ b/paddle/fluid/operators/add_position_encoding_op.cc @@ -0,0 +1,109 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/add_position_encoding_op.h" +#include + +namespace paddle { +namespace operators { + +class AddPositionEncodingOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), + "X(Input) of add_position_encoding_op should not be null."); + PADDLE_ENFORCE( + ctx->HasOutput("Out"), + "Out(Output) of add_position_encoding_op should not be null."); + + auto x_dims = ctx->GetInputDim("X"); + ctx->SetOutputDim("Out", x_dims); + ctx->ShareLoD("X", /*->*/ "Out"); + } +}; + +class AddPositionEncodingOpGrad : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override { + if (ctx->HasOutput(framework::GradVarName("X"))) { + auto out_dims = ctx->GetInputDim(framework::GradVarName("Out")); + ctx->SetOutputDim(framework::GradVarName("X"), out_dims); + } + } +}; + +class AddPositionEncodingOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("X", "Input of AddPositionEncoding operator"); + AddOutput("Out", "Output of AddPositionEncoding operator"); + AddAttr("alpha", "The scale of Original Embedding.") + .SetDefault(1.0f) + .AddCustomChecker([](const float& alpha) { + PADDLE_ENFORCE(alpha >= 0.0f, "'alpha' must be above 0.0."); + }); + AddAttr("beta", "The scale of Position Embedding.") + .SetDefault(1.0f) + .AddCustomChecker([](const float& beta) { + PADDLE_ENFORCE(beta >= 0.0f, "'beta' must be between 0.0."); + }); + AddComment(R"DOC( + Add Position Encoding Operator. + + The add position encoding calculates the output based on the input, alpha, beta. + The size of each dimension of the parameters checked in the infer-shape. + )DOC"); + } +}; + +class AddPositionEncodingGradOpDescMaker + : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + protected: + std::unique_ptr Apply() const override { + std::unique_ptr op(new framework::OpDesc()); + op->SetType("add_position_encoding_grad"); + op->SetInput(framework::GradVarName("Out"), OutputGrad("Out")); + op->SetOutput(framework::GradVarName("X"), InputGrad("X")); + op->SetAttrMap(Attrs()); + return op; + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +namespace plt = paddle::platform; + +REGISTER_OPERATOR(add_position_encoding, ops::AddPositionEncodingOp, + ops::AddPositionEncodingOpMaker, + ops::AddPositionEncodingGradOpDescMaker); +REGISTER_OPERATOR(add_position_encoding_grad, ops::AddPositionEncodingOpGrad); + +REGISTER_OP_CPU_KERNEL( + add_position_encoding, + ops::AddPositionEncodingKernel, + ops::AddPositionEncodingKernel); + +REGISTER_OP_CPU_KERNEL( + add_position_encoding_grad, + ops::AddPositionEncodingGradKernel, + ops::AddPositionEncodingGradKernel); diff --git a/paddle/fluid/operators/add_position_encoding_op.h b/paddle/fluid/operators/add_position_encoding_op.h new file mode 100644 index 0000000000000000000000000000000000000000..0b40d3de890a02a9dbec2328f9f6388ffa35561b --- /dev/null +++ b/paddle/fluid/operators/add_position_encoding_op.h @@ -0,0 +1,106 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include "paddle/fluid/framework/eigen.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/detail/safe_ref.h" + +namespace paddle { +namespace operators { + +template +class AddPositionEncodingKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& context) const override { + auto* X = context.Input("X"); + auto& x_lod = X->lod(); + auto* src_ptr = X->data(); + + auto* Out = context.Output("Out"); + auto* dst_ptr = Out->mutable_data(context.GetPlace()); + + float alpha = context.Attr("alpha"); + float beta = context.Attr("beta"); + + auto x_dim = X->dims(); + int batch_size = 0; + int max_seq_len = 0; + int enc_size = 0; + + if (x_lod.empty()) { + PADDLE_ENFORCE( + x_dim.size() == 3UL, + "The input X of Add Position Encoding should be 3-D Tensor!"); + batch_size = x_dim[0]; + max_seq_len = x_dim[1]; + enc_size = x_dim[2]; + } else { + PADDLE_ENFORCE( + x_dim.size() == 2UL, + "The input X of Add Position Encoding should be 2-D LoDTensor!"); + PADDLE_ENFORCE( + x_lod.size() == 1UL, + "The Add Position Encoding Op only supports lod_level == 1!"); + batch_size = x_lod[0].size() - 1; + max_seq_len = -1; + enc_size = x_dim[1]; + } + + PADDLE_ENFORCE(enc_size % 2 == 0, "Only support even encode size!"); + + const int half_size = enc_size / 2; + for (int i = 0; i < batch_size; ++i) { + const int max_length = + x_lod.empty() ? max_seq_len : x_lod[0][i + 1] - x_lod[0][i]; + for (int j = 0; j < max_length; ++j) { + for (int k = 0; k < half_size; ++k) { + const double val = + (half_size > 1) + ? j / pow(10000.0, static_cast(k) / (half_size - 1)) + : j / 10000.0; + dst_ptr[k] = src_ptr[k] * alpha + sin(val) * beta; + dst_ptr[half_size + k] = + src_ptr[half_size + k] * alpha + cos(val) * beta; + } + src_ptr += enc_size; + dst_ptr += enc_size; + } + } + } +}; + +template +class AddPositionEncodingGradKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& context) const override { + auto* dOut = + context.Input(framework::GradVarName("Out")); + auto dout = framework::EigenVector::Flatten(*dOut); + + auto* dX = + context.Output(framework::GradVarName("X")); + dX->mutable_data(context.GetPlace()); + auto dx = framework::EigenVector::Flatten(*dX); + + float alpha = context.Attr("alpha"); + + auto* place = + context.template device_context().eigen_device(); + dx.device(*place) = dout * static_cast(alpha); + } +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/affine_channel_op.cc b/paddle/fluid/operators/affine_channel_op.cc index 8944a749674c3ba6c83526e4d66f449075716f43..268a5b894a95df8e27730879473b457a31e18cd6 100644 --- a/paddle/fluid/operators/affine_channel_op.cc +++ b/paddle/fluid/operators/affine_channel_op.cc @@ -67,6 +67,22 @@ class AffineChannelOp : public framework::OperatorWithKernel { "Input(Bias) of AffineChannelOp should not be null."); PADDLE_ENFORCE(ctx->HasOutput("Out"), "Output(Out) of AffineChannelOp should not be null."); + + auto x_dims = ctx->GetInputDim("X"); + auto scale_dims = ctx->GetInputDim("Scale"); + auto b_dims = ctx->GetInputDim("Bias"); + const framework::DataLayout data_layout = framework::StringToDataLayout( + ctx->Attrs().Get("data_layout")); + + const int64_t C = (data_layout == framework::DataLayout::kNCHW + ? x_dims[1] + : x_dims[x_dims.size() - 1]); + + PADDLE_ENFORCE_EQ(scale_dims.size(), 1UL); + PADDLE_ENFORCE_EQ(scale_dims[0], C); + PADDLE_ENFORCE_EQ(b_dims.size(), 1UL); + PADDLE_ENFORCE_EQ(b_dims[0], C); + ctx->SetOutputDim("Out", ctx->GetInputDim("X")); ctx->ShareLoD("X", "Out"); } @@ -97,6 +113,27 @@ class AffineChannelOpGrad : public framework::OperatorWithKernel { } }; +class AffineChannelGradMaker : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + std::unique_ptr Apply() const override { + auto* op = new framework::OpDesc(); + op->SetType("affine_channel_grad"); + op->SetInput("X", Input("X")); + op->SetInput(framework::GradVarName("Out"), OutputGrad("Out")); + op->SetInput("Scale", Input("Scale")); + + op->SetAttrMap(Attrs()); + + op->SetOutput(framework::GradVarName("X"), InputGrad("X")); + op->SetOutput(framework::GradVarName("Scale"), InputGrad("Scale")); + op->SetOutput(framework::GradVarName("Bias"), InputGrad("Bias")); + + return std::unique_ptr(op); + } +}; + template using EigenArrayMap = Eigen::Map>; @@ -244,8 +281,7 @@ namespace ops = paddle::operators; using CPU = paddle::platform::CPUDeviceContext; REGISTER_OPERATOR(affine_channel, ops::AffineChannelOp, - ops::AffineChannelOpMaker, - paddle::framework::DefaultGradOpDescMaker); + ops::AffineChannelOpMaker, ops::AffineChannelGradMaker); REGISTER_OPERATOR(affine_channel_grad, ops::AffineChannelOpGrad); REGISTER_OP_CPU_KERNEL(affine_channel, ops::AffineChannelKernel, diff --git a/paddle/fluid/operators/affine_channel_op.cu b/paddle/fluid/operators/affine_channel_op.cu index 2bebdb345ab324eb0a2dafd54c74833dd21bdb6d..c054fdb1ba6e5ae5970a51ac9f071f6ef535a4b5 100644 --- a/paddle/fluid/operators/affine_channel_op.cu +++ b/paddle/fluid/operators/affine_channel_op.cu @@ -83,7 +83,7 @@ __global__ void AffineChannelScaleBiasGradientCUDAKernel( T* dbias) { const int outer_size = C; const int inner_size = N * HxW; - typedef cub::BlockReduce BlockReduce; + typedef cub::BlockReduce BlockReduce; __shared__ typename BlockReduce::TempStorage ds_storage; __shared__ typename BlockReduce::TempStorage db_storage; @@ -97,13 +97,16 @@ __global__ void AffineChannelScaleBiasGradientCUDAKernel( ds_sum += dy[index] * x[index]; db_sum += dy[index]; } - ds_sum = BlockReduce(ds_storage).Reduce(ds_sum, cub::Sum()); - db_sum = BlockReduce(db_storage).Reduce(db_sum, cub::Sum()); + __syncthreads(); + auto ds_out = + BlockReduce(ds_storage).Reduce(static_cast(ds_sum), cub::Sum()); + auto db_out = + BlockReduce(db_storage).Reduce(static_cast(db_sum), cub::Sum()); + __syncthreads(); if (threadIdx.x == 0) { - dscale[i] = ds_sum; - dbias[i] = db_sum; + dscale[i] = ds_out; + dbias[i] = db_out; } - __syncthreads(); } } diff --git a/paddle/fluid/operators/affine_grid_cudnn_op.cu.cc b/paddle/fluid/operators/affine_grid_cudnn_op.cu.cc new file mode 100644 index 0000000000000000000000000000000000000000..ed71594ba5781590f3291d56c4ba1a4443003bd5 --- /dev/null +++ b/paddle/fluid/operators/affine_grid_cudnn_op.cu.cc @@ -0,0 +1,112 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/platform/cudnn_helper.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; +using ScopedSpatialTransformerDescriptor = + platform::ScopedSpatialTransformerDescriptor; + +template +class CUDNNAffineGridOpKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), + "It must use CUDAPlace."); + auto& dev_ctx = ctx.template device_context(); + auto handle = dev_ctx.cudnn_handle(); + auto* theta = ctx.Input("Theta"); + auto* output = ctx.Output("Output"); + const T* theta_data = theta->data(); + + int n = theta->dims()[0]; + auto size_attr = ctx.Attr>("output_shape"); + Tensor h_sizes; + int* h_size_data; + if (size_attr.size() == 0) { + auto* output_shape = ctx.Input("OutputShape"); + framework::TensorCopy(*output_shape, platform::CPUPlace(), &h_sizes); + h_size_data = h_sizes.data(); + } else { + h_size_data = h_sizes.mutable_data({4}, platform::CPUPlace()); + h_size_data[0] = n; + h_size_data[1] = size_attr[1]; + h_size_data[2] = size_attr[2]; + h_size_data[3] = size_attr[3]; + } + + T* output_data = output->mutable_data( + {n, h_size_data[2], h_size_data[3], 2}, ctx.GetPlace()); + ScopedSpatialTransformerDescriptor st_desc; + cudnnSpatialTransformerDescriptor_t cudnn_st_desc = + st_desc.descriptor(4, h_size_data); + + PADDLE_ENFORCE(platform::dynload::cudnnSpatialTfGridGeneratorForward( + handle, cudnn_st_desc, theta_data, output_data)); + } +}; + +template +class CUDNNAffineGridGradOpKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), + "It must use CUDAPlace."); + auto& dev_ctx = ctx.template device_context(); + auto handle = dev_ctx.cudnn_handle(); + auto output_grad = ctx.Input(framework::GradVarName("Output")); + auto theta_grad = ctx.Output(framework::GradVarName("Theta")); + + int n = output_grad->dims()[0]; + auto size_attr = ctx.Attr>("output_shape"); + Tensor h_sizes; + int* h_size_data; + if (size_attr.size() == 0) { + auto* output_shape = ctx.Input("OutputShape"); + framework::TensorCopy(*output_shape, platform::CPUPlace(), &h_sizes); + h_size_data = h_sizes.data(); + } else { + h_size_data = h_sizes.mutable_data({4}, platform::CPUPlace()); + h_size_data[0] = n; + h_size_data[1] = size_attr[1]; + h_size_data[2] = size_attr[2]; + h_size_data[3] = size_attr[3]; + } + + ScopedSpatialTransformerDescriptor st_desc; + cudnnSpatialTransformerDescriptor_t cudnn_st_desc = + st_desc.descriptor(4, h_size_data); + + const T* output_grad_data = output_grad->data(); + T* theta_grad_data = theta_grad->mutable_data(ctx.GetPlace()); + + PADDLE_ENFORCE(platform::dynload::cudnnSpatialTfGridGeneratorBackward( + handle, cudnn_st_desc, output_grad_data, theta_grad_data)); + } +}; + +} // namespace operators +} // namespace paddle + +namespace plat = paddle::platform; +REGISTER_OP_KERNEL(affine_grid, CUDNN, plat::CUDAPlace, + paddle::operators::CUDNNAffineGridOpKernel, + paddle::operators::CUDNNAffineGridOpKernel); +REGISTER_OP_KERNEL(affine_grid_grad, CUDNN, plat::CUDAPlace, + paddle::operators::CUDNNAffineGridGradOpKernel, + paddle::operators::CUDNNAffineGridGradOpKernel); diff --git a/paddle/fluid/operators/affine_grid_op.cc b/paddle/fluid/operators/affine_grid_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..9d7100cc3db91f5bf7dbd993c9f9ba5d4fc98ea6 --- /dev/null +++ b/paddle/fluid/operators/affine_grid_op.cc @@ -0,0 +1,234 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/affine_grid_op.h" +#include +#include +#include +#include "paddle/fluid/framework/op_registry.h" +#ifdef PADDLE_WITH_CUDA +#include "paddle/fluid/platform/cudnn_helper.h" +#endif + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; + +template +struct Linspace { + void operator()(T start, T end, int count, framework::Tensor* numbers, + const framework::ExecutionContext& ctx) { + T* number_data = numbers->mutable_data({count}, platform::CPUPlace()); + T slice = (end - start) / (T)(count - 1); + for (int i = 0; i < count; ++i) { + number_data[i] = start + (T)i * slice; + } + } +}; + +class AffineGridOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("Theta"), + "Input(Theta) of AffineGridOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("Output"), + "Output(Output) of AffineGridOp should not be null."); + auto theta_dims = ctx->GetInputDim("Theta"); + PADDLE_ENFORCE(theta_dims.size() == 3, + "AffineGrid's Input(Theta) should be 3-D tensor."); + + auto output_shape = ctx->Attrs().Get>("output_shape"); + if (output_shape.size() == 0) { + PADDLE_ENFORCE(ctx->HasInput("OutputShape"), + "Input(OutputShape) of AffineGridOp should not be null if " + "attr(output_shape) is not configured."); + auto output_shape_dims = ctx->GetInputDim("OutputShape"); + PADDLE_ENFORCE(output_shape_dims.size() == 1, + "AffineGrid's Input(OutputShape) should be 1-D tensor."); + } else { + PADDLE_ENFORCE(output_shape.size() == 4, + "The size of attr(output_shape) should be 4."); + } + + PADDLE_ENFORCE(theta_dims[1] == 2, "Input(theta) dims[1] should be 2."); + PADDLE_ENFORCE(theta_dims[2] == 3, "Input(theta) dims[2] should be 3."); + // N * H * W * 2 + ctx->SetOutputDim("Output", + framework::make_ddim({theta_dims[0], -1, -1, 2})); + ctx->ShareLoD("Theta", "Output"); + } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + framework::LibraryType library{framework::LibraryType::kPlain}; +#ifdef PADDLE_WITH_CUDA + if (platform::CanCUDNNBeUsed(ctx)) { + library = framework::LibraryType::kCUDNN; + } +#endif + auto data_type = ctx.Input("Theta")->type(); + return framework::OpKernelType(data_type, ctx.GetPlace(), + framework::DataLayout::kAnyLayout, library); + } +}; + +class AffineGridOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput( + "Theta", + "(Tensor) A batch of affine transform parameters with shape [N, 2, 3]. " + "It is used to transform coordinate (x_0, y_0) to coordinate (x_1, " + "y_1)."); + AddInput("OutputShape", + "(Tensor) The shape of target image with format [N, C, H, W].") + .AsDispensable(); + AddOutput("Output", "(Tensor) Output Tensor with shape [N, H, W, 2]."); + AddAttr( + "use_cudnn", + "(bool, default false) Only used in cudnn kernel, need install cudnn") + .SetDefault(true); + AddAttr>( + "output_shape", + "The target output image shape with format [N, C, H, W].") + .SetDefault(std::vector()); + + AddComment(R"DOC( + It generates a grid of (x,y) coordinates using the parameters of the + affine transformation that correspond to a set of points where the input + feature map should be sampled to produce the transformed output feature map. + + Given: + Theta = [[[x_11, x_12, x_13] + [x_14, x_15, x_16]] + [[x_21, x_22, x_23] + [x_24, x_25, x_26]]] + + OutputShape = [2, 3, 5, 5] + + Step 1: + + Generate relative coordinates according to OutputShape. + The values of relative coordinates are in the interval between -1 and 1. + The shape of the relative coordinates is [2, H, W] as below: + + C = [[[-1. -1. -1. -1. -1. ] + [-0.5 -0.5 -0.5 -0.5 -0.5] + [ 0. 0. 0. 0. 0. ] + [ 0.5 0.5 0.5 0.5 0.5] + [ 1. 1. 1. 1. 1. ]] + [[-1. -0.5 0. 0.5 1. ] + [-1. -0.5 0. 0.5 1. ] + [-1. -0.5 0. 0.5 1. ] + [-1. -0.5 0. 0.5 1. ] + [-1. -0.5 0. 0.5 1. ]]] + C[0] is the coordinates in height axis and C[1] is the coordinates in width axis. + + Step2: + Tanspose and reshape C to shape [H * W, 2] and append ones to last dimension. The we get: + C_ = [[-1. -1. 1. ] + [-0.5 -1. 1. ] + [ 0. -1. 1. ] + [ 0.5 -1. 1. ] + [ 1. -1. 1. ] + [-1. -0.5 1. ] + [-0.5 -0.5 1. ] + [ 0. -0.5 1. ] + [ 0.5 -0.5 1. ] + [ 1. -0.5 1. ] + [-1. 0. 1. ] + [-0.5 0. 1. ] + [ 0. 0. 1. ] + [ 0.5 0. 1. ] + [ 1. 0. 1. ] + [-1. 0.5 1. ] + [-0.5 0.5 1. ] + [ 0. 0.5 1. ] + [ 0.5 0.5 1. ] + [ 1. 0.5 1. ] + [-1. 1. 1. ] + [-0.5 1. 1. ] + [ 0. 1. 1. ] + [ 0.5 1. 1. ] + [ 1. 1. 1. ]] + Step3: + Compute output by equation $$Output[i] = C_ * Theta[i]^T$$ + )DOC"); + } +}; + +class AffineGridOpGrad : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + void InferShape(framework::InferShapeContext* ctx) const override { + if (ctx->HasOutput(framework::GradVarName("Theta"))) { + auto output_dims = ctx->GetInputDim(framework::GradVarName("Output")); + ctx->SetOutputDim(framework::GradVarName("Theta"), + {output_dims[0], 2, 3}); + } + } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + framework::LibraryType library_{framework::LibraryType::kPlain}; +#ifdef PADDLE_WITH_CUDA + if (platform::CanCUDNNBeUsed(ctx)) { + library_ = framework::LibraryType::kCUDNN; + } +#endif + return framework::OpKernelType(ctx.Input("Theta")->type(), + ctx.GetPlace(), + framework::DataLayout::kAnyLayout, library_); + } +}; + +class AffineGridGradMaker : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + protected: + std::unique_ptr Apply() const override { + auto* op = new framework::OpDesc(); + op->SetType("affine_grid_grad"); + op->SetInput("Theta", Input("Theta")); + op->SetInput("OutputShape", Input("OutputShape")); + op->SetInput(framework::GradVarName("Output"), OutputGrad("Output")); + + op->SetAttrMap(Attrs()); + + op->SetOutput(framework::GradVarName("Theta"), InputGrad("Theta")); + return std::unique_ptr(op); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(affine_grid, ops::AffineGridOp, ops::AffineGridOpMaker, + ops::AffineGridGradMaker); +REGISTER_OPERATOR(affine_grid_grad, ops::AffineGridOpGrad); + +REGISTER_OP_CPU_KERNEL( + affine_grid, + ops::AffineGridOpKernel, + ops::AffineGridOpKernel); +REGISTER_OP_CPU_KERNEL( + affine_grid_grad, + ops::AffineGridGradOpKernel, + ops::AffineGridGradOpKernel); diff --git a/paddle/fluid/operators/affine_grid_op.h b/paddle/fluid/operators/affine_grid_op.h new file mode 100644 index 0000000000000000000000000000000000000000..87d23831486e658374d4c011412fdef57be1b994 --- /dev/null +++ b/paddle/fluid/operators/affine_grid_op.h @@ -0,0 +1,174 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include "paddle/fluid/framework/eigen.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/math/blas.h" +#include "paddle/fluid/operators/math/math_function.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; +template +using EigenTensor = framework::EigenTensor; + +using Array1 = Eigen::DSizes; +using Array2 = Eigen::DSizes; +using Array3 = Eigen::DSizes; +using Array4 = Eigen::DSizes; + +/** + *Return a tensor with evenly spaced numbers over a specified interval. + */ +template +struct Linspace { + void operator()(T start, T end, int count, framework::Tensor* numbers, + const framework::ExecutionContext& ctx); +}; + +template +inline void GetIdxMap(int n, int h, int w, Tensor* grid, + const framework::ExecutionContext& ctx) { + auto& place = *ctx.template device_context().eigen_device(); + grid->mutable_data({n, h, w, 3}, ctx.GetPlace()); + auto grid_t = EigenTensor::From(*grid); + // Get indexes of height with shape [height, width, 1] + Tensor h_idx; + Linspace linspace; + linspace((T)-1, (T)1, h, &h_idx, ctx); + auto h_idx_t = EigenTensor::From(h_idx); + // Get indexes of width with shape [height, width, 1] + Tensor w_idx; + linspace((T)-1, (T)1, w, &w_idx, ctx); + auto w_idx_t = EigenTensor::From(w_idx); + // Get constant ones tensor with shape [height, width, 1] + Tensor ones; + ones.mutable_data({h, w, 1}, ctx.GetPlace()); + auto ones_t = EigenTensor::From(ones).setConstant((T)1); + // Get grid tensor with shape [n, h, w, 3] by concatenating h_idx, w_idx and + // ones + Tensor w_idx_map; + w_idx_map.mutable_data({h, w, 1}, ctx.GetPlace()); + auto w_idx_map_t = EigenTensor::From(w_idx_map); + Tensor h_idx_map; + h_idx_map.mutable_data({h, w, 1}, ctx.GetPlace()); + auto h_idx_map_t = EigenTensor::From(h_idx_map); + Tensor w_h_idx_map; + w_h_idx_map.mutable_data({h, w, 2}, ctx.GetPlace()); + auto w_h_idx_map_t = EigenTensor::From(w_h_idx_map); + Tensor w_h_one_idx_map; + w_h_one_idx_map.mutable_data({h, w, 3}, ctx.GetPlace()); + auto w_h_one_idx_map_t = EigenTensor::From(w_h_one_idx_map); + + w_idx_map_t.device(place) = w_idx_t.reshape(Array2(1, w)) + .broadcast(Array2(h, 1)) + .reshape(Array3(h, w, 1)); + + h_idx_map_t.device(place) = h_idx_t.reshape(Array2(1, h)) + .broadcast(Array2(w, 1)) + .shuffle(Array2(1, 0)) + .reshape(Array3(h, w, 1)); + + w_h_idx_map_t.device(place) = w_idx_map_t.concatenate(h_idx_map_t, 2); + w_h_one_idx_map_t.device(place) = w_h_idx_map_t.concatenate(ones_t, 2); + grid_t.device(place) = w_h_one_idx_map_t.reshape(Array4(1, h, w, 3)) + .broadcast(Array4(n, 1, 1, 1)); +} + +template +class AffineGridOpKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* theta = ctx.Input("Theta"); + int n = theta->dims()[0]; + auto size_attr = ctx.Attr>("output_shape"); + int h = 0; + int w = 0; + if (size_attr.size() == 0) { + auto* output_shape = ctx.Input("OutputShape"); + Tensor h_sizes; + framework::TensorCopy(*output_shape, platform::CPUPlace(), &h_sizes); + const int* h_size_data = h_sizes.data(); + h = h_size_data[2]; + w = h_size_data[3]; + } else { + h = size_attr[2]; + w = size_attr[3]; + } + auto* output = ctx.Output("Output"); + output->mutable_data({n, h, w, 2}, ctx.GetPlace()); + math::SetConstant()( + ctx.template device_context(), output, + static_cast(0)); + Tensor grid; + GetIdxMap(n, h, w, &grid, ctx); + // output = grid * theta.T + // TODO(wanghaoshuang): Refine batched matrix multiply + auto blas = math::GetBlas(ctx); + for (int i = 0; i < n; ++i) { + Tensor sliced_grid = grid.Slice(i, i + 1).Resize({h * w, 3}); + Tensor sliced_theta = theta->Slice(i, i + 1).Resize({2, 3}); + Tensor sliced_out = output->Slice(i, i + 1).Resize({h * w, 2}); + blas.MatMul(sliced_grid, false, sliced_theta, true, T(1), &sliced_out, + T(0)); + } + } +}; + +template +class AffineGridGradOpKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto output_grad = ctx.Input(framework::GradVarName("Output")); + auto theta_grad = ctx.Output(framework::GradVarName("Theta")); + int n = output_grad->dims()[0]; + auto size_attr = ctx.Attr>("output_shape"); + int h = 0; + int w = 0; + if (size_attr.size() == 0) { + auto* output_shape = ctx.Input("OutputShape"); + Tensor h_sizes; + framework::TensorCopy(*output_shape, platform::CPUPlace(), &h_sizes); + const int* h_size_data = h_sizes.data(); + h = h_size_data[2]; + w = h_size_data[3]; + } else { + h = size_attr[2]; + w = size_attr[3]; + } + theta_grad->mutable_data({n, 2, 3}, ctx.GetPlace()); + math::SetConstant()( + ctx.template device_context(), theta_grad, + static_cast(0)); + Tensor grid; + GetIdxMap(n, h, w, &grid, ctx); + // output = grid * theta.T + // TODO(wanghaoshuang): Refine batched matrix multiply + auto blas = math::GetBlas(ctx); + for (int i = 0; i < n; ++i) { + Tensor sliced_grid = grid.Slice(i, i + 1).Resize({h * w, 3}); + Tensor sliced_out_grad = output_grad->Slice(i, i + 1).Resize({h * w, 2}); + Tensor sliced_theta_grad = theta_grad->Slice(i, i + 1).Resize({2, 3}); + blas.MatMul(sliced_out_grad, true, sliced_grid, false, T(1), + &sliced_theta_grad, T(0)); + } + } +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/alloc_continuous_space_op.cc b/paddle/fluid/operators/alloc_continuous_space_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..d4bdecff62c016a31011266a0f066076d85fcdef --- /dev/null +++ b/paddle/fluid/operators/alloc_continuous_space_op.cc @@ -0,0 +1,236 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/framework/var_type.h" +#include "paddle/fluid/operators/math/math_function.h" + +namespace paddle { +namespace operators { + +static framework::proto::VarType::Type kDefaultDtype = + framework::proto::VarType::Type::VarType_Type_BOOL; + +template +class AllocContinuousSpaceKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext &context) const override { + auto &in_var_names = context.Inputs("Input"); + auto &out_var_names = context.Outputs("Output"); + auto &in_vars = context.MultiInputVar("Input"); + auto out_vars = context.MultiOutputVar("Output"); + + PADDLE_ENFORCE_GT(in_var_names.size(), static_cast(0)); + PADDLE_ENFORCE_EQ(in_var_names.size(), out_var_names.size()); + + for (size_t i = 0; i < in_var_names.size(); ++i) { + // Only support LoDTensor + PADDLE_ENFORCE_NOT_NULL(in_vars[i], "%s should not be nullptr,", + in_var_names[i]); + PADDLE_ENFORCE_NOT_NULL(out_vars[i], "%s should not be nullptr,", + out_var_names[i]); + PADDLE_ENFORCE(in_vars[i]->IsType()); + PADDLE_ENFORCE(out_vars[i]->IsType()); + } + + auto in_tensors = context.MultiInput("Input"); + + if (context.Attr("check_name")) { + for (size_t i = 0; i < in_var_names.size(); ++i) { + PADDLE_ENFORCE_EQ(in_var_names[i], out_var_names[i]); + } + } else { + // Init the output as input + for (size_t i = 0; i < in_tensors.size(); ++i) { + out_vars[i]->GetMutable()->Resize( + in_tensors[i]->dims()); + } + } + + auto &dev_ctx = context.template device_context(); + + // Get numel and dtype + size_t numel = 0; + auto dtype = kDefaultDtype; + GetMemSizeAndDtype(in_tensors, in_var_names, &numel, &dtype, + context.GetPlace()); + + // Alloc the continuous space + auto fused_tensor = context.Output("FusedOutput"); + fused_tensor->Resize(framework::make_ddim({static_cast(numel)})) + .mutable_data(context.GetPlace(), dtype); + + // Init the continuous space + auto out_tensors = context.MultiOutput("Output"); + size_t offset = 0; + size_t size_of_dtype = framework::SizeOfType(dtype); + if (context.Attr("copy_data")) { + for (size_t i = 0; i < in_var_names.size(); ++i) { + size_t len = static_cast(in_tensors[i]->numel()); + auto sub_tensor = fused_tensor->Slice( + static_cast(offset), static_cast(offset + len)); + framework::TensorCopy(*in_tensors[i], context.GetPlace(), dev_ctx, + &sub_tensor); + + offset += + Alignment(len * size_of_dtype, context.GetPlace()) / size_of_dtype; + } + } else if (context.Attr("set_constant")) { + math::SetConstant set_constant; + set_constant(dev_ctx, fused_tensor, + static_cast(context.Attr("constant"))); + } + + // Make the outputs point to the continuous space. + offset = 0; + for (size_t i = 0; i < out_tensors.size(); ++i) { + size_t len = static_cast(out_tensors[i]->numel()); + auto dim = out_tensors[i]->dims(); + out_tensors[i] + ->ShareDataWith(fused_tensor->Slice( + static_cast(offset), static_cast(offset + len))) + .Resize(dim); + len = Alignment(len * size_of_dtype, context.GetPlace()) / size_of_dtype; + offset += len; + VLOG(10) << "alloc_space_for_vars: output(" << out_var_names[i] + << ") ,dim:(" << dim << ")" + << " Address: " << out_tensors[i]->data(); + } + } + + private: + // Note(zcd): Addresses should be aligned, otherwise, the results may have + // diff. + size_t Alignment(size_t size, const platform::Place &place) const { + // Allow to allocate the minimum chunk size is 4 KB. + size_t alignment = 1 << 12; + if (platform::is_gpu_place(place)) { + // Allow to allocate the minimum chunk size is 256 B. + alignment = 1 << 8; + } + size_t remaining = size % alignment; + return remaining == 0 ? size : size + (alignment - remaining); + } + + void GetMemSizeAndDtype( + const std::vector &lod_tensors, + const std::vector var_names, size_t *numel, + framework::proto::VarType::Type *dtype, + const platform::Place &place) const { + PADDLE_ENFORCE_EQ(lod_tensors.size(), var_names.size()); + *numel = 0; + size_t size_of_dtype = 0; + for (size_t i = 0; i < var_names.size(); ++i) { + PADDLE_ENFORCE(lod_tensors[i]->IsInitialized(), "%s is not initialized.", + var_names[i]); + + auto p_dtype = lod_tensors[i]->type(); + if (*dtype == kDefaultDtype) { + PADDLE_ENFORCE_NE(p_dtype, kDefaultDtype, "%s's type should not be %s.", + var_names[i], kDefaultDtype); + *dtype = p_dtype; + size_of_dtype = framework::SizeOfType(p_dtype); + } + PADDLE_ENFORCE_EQ(p_dtype, *dtype, "Input vars is not equal."); + + auto size = lod_tensors[i]->numel(); + PADDLE_ENFORCE_GT(size, 0); + VLOG(10) << "alloc_space_for_vars: input(" << var_names[i] << ") ,dim:(" + << lod_tensors[i]->dims() << ")"; + *numel += Alignment(static_cast(size) * size_of_dtype, place) / + size_of_dtype; + } + } +}; + +class AllocContinuousSpaceOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext *ctx) const override {} +}; + +class AllocContinuousSpaceOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("Input", + "(vector) The input tensors of" + " alloc_continuous_space operator.") + .AsDuplicable(); + AddOutput("Output", + "(vector) The output " + "tensors of alloc_continuous_space operator. And the address " + "of output tensors are continuous, they are sliced from the " + "tensor of FusedOutput.") + .AsDuplicable(); + AddOutput("FusedOutput", + "(LoDTensor) The output tensor " + "of alloc_continuous_space operator. And the tensors of" + " Output is sliced from the tensor of FusedOutput."); + AddAttr("copy_data", "Whether to copy the Input value to Output.") + .SetDefault(false); + AddAttr("set_constant", + "Whether to set the Output with a constant value.") + .SetDefault(false); + AddAttr("constant", + "If set_constant is true, the constant value will be used " + "to set the Output.") + .SetDefault(0.0); + AddAttr("check_name", + "Whether to check the name of Input and Output to ensure " + "they are the same separately.") + .SetDefault(false); + AddComment(R"DOC( +AllocContinuousSpace Operator. + +alloc_continuous_space is used to make the address of Output +continuous according to the Input. This Op will alloc a big tensor +according to the tensors of Input, the dtype is the same with those input tensors, +the size is the sum of those input tensors' numel, and the dim of the big +tensor is {sum(numel)}. And the big tensor is stored in FusedOutput. +The tensors of Output are sliced from the tensor of FusedOutput. +Note that, the dtype of Input should be the same, and the dim of Input +and Output should equal. +The tensors of Input and Output could be the same or different. And +alloc_continuous_space allows copying the value of Input to Output, or +setting the Output with a constant value. + +)DOC"); + } +}; + +} // namespace operators +} // namespace paddle + +REGISTER_OPERATOR(alloc_continuous_space, + paddle::operators::AllocContinuousSpaceOp, + paddle::operators::AllocContinuousSpaceOpMaker); +namespace ops = paddle::operators; +REGISTER_OP_CPU_KERNEL( + alloc_continuous_space, + ops::AllocContinuousSpaceKernel, + ops::AllocContinuousSpaceKernel, + ops::AllocContinuousSpaceKernel); + +#ifdef PADDLE_WITH_CUDA +REGISTER_OP_CUDA_KERNEL( + alloc_continuous_space, + ops::AllocContinuousSpaceKernel, + ops::AllocContinuousSpaceKernel, + ops::AllocContinuousSpaceKernel); +#endif diff --git a/paddle/fluid/operators/anakin/CMakeLists.txt b/paddle/fluid/operators/anakin/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..5eacefc645bab288da7c289a5d7701abbcbef03d --- /dev/null +++ b/paddle/fluid/operators/anakin/CMakeLists.txt @@ -0,0 +1,2 @@ +op_library(anakin_engine_op DEPS anakin_engine anakin_op_converter) +# file(APPEND ${pybind_file} "USE_NO_KERNEL_OP(anakin_engine);\n") diff --git a/paddle/fluid/operators/anakin/anakin_engine_op.cc b/paddle/fluid/operators/anakin/anakin_engine_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..58db16ea0c1347a366a4d5927e414d76864cb6ab --- /dev/null +++ b/paddle/fluid/operators/anakin/anakin_engine_op.cc @@ -0,0 +1,53 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#ifdef PADDLE_WITH_CUDA + +#include +#include + +#include "paddle/fluid/operators/anakin/anakin_engine_op.h" + +namespace paddle { + +namespace operators { + +class AnakinEngineOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("Xs", "A list of inputs.").AsDuplicable(); + AddOutput("Ys", "A list of outputs").AsDuplicable(); + AddAttr("subgraph", "the subgraph."); + AddAttr( + "engine_key", + "The engine_key here is used to distinguish different TRT Engines"); + AddAttr("sub_block", "the trt block"); + AddComment("Anakin engine operator."); + } +}; + +class AnakinEngineInferVarType : public framework::VarTypeInference { + public: + void operator()(framework::InferVarTypeContext *ctx) const override {} +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; + +REGISTER_OPERATOR(anakin_engine, ops::AnakinEngineOp, ops::AnakinEngineOpMaker, + ops::AnakinEngineOpMaker); + +#endif // PADDLE_WITH_CUDA diff --git a/paddle/fluid/operators/anakin/anakin_engine_op.h b/paddle/fluid/operators/anakin/anakin_engine_op.h new file mode 100644 index 0000000000000000000000000000000000000000..e4feb14b2271a50c8e8fb7ce4c81dd6c99042e21 --- /dev/null +++ b/paddle/fluid/operators/anakin/anakin_engine_op.h @@ -0,0 +1,130 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#ifdef PADDLE_WITH_CUDA + +#include +#include +#include +#include +#include +#include +#include + +#include "paddle/fluid/framework/executor.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/inference/anakin/convert/op_converter.h" +#include "paddle/fluid/inference/anakin/engine.h" +#include "paddle/fluid/inference/analysis/helper.h" + +namespace paddle { +namespace operators { + +using FluidDT = framework::proto::VarType_Type; +using inference::Singleton; + +using anakin::graph::GraphGlobalMem; +using anakin::AK_FLOAT; +using anakin::Precision; +using anakin::saber::NV; +using anakin::saber::X86; +using anakin::saber::Shape; +using anakin::PBlock; +using anakin::PTuple; +using inference::anakin::AnakinEngine; + +class AnakinEngineOp : public framework::OperatorBase { + using AnakinNvEngineT = AnakinEngine; + + private: + std::vector input_names_; + std::unordered_set param_names_; + mutable AnakinNvEngineT *anakin_engine_; + std::string engine_key_; + std::string engine_serialized_data_; + + public: + AnakinEngineOp(const std::string &type, + const framework::VariableNameMap &inputs, + const framework::VariableNameMap &outputs, + const framework::AttributeMap &attrs) + : framework::OperatorBase(type, inputs, outputs, attrs) { + input_names_ = Inputs("Xs"); + engine_key_ = Attr("engine_key"); + auto params = Attr>("parameters"); + for (const auto ¶m : params) { + param_names_.insert(param); + } + anakin_engine_ = nullptr; + } + + protected: + void RunImpl(const framework::Scope &scope, + const platform::Place &dev_place) const override { + RunAnakin(scope, dev_place); + } + + void RunAnakin(const framework::Scope &scope, + const platform::Place &dev_place) const { + auto *engine = GetEngine(scope, dev_place); + platform::DeviceContextPool &pool = platform::DeviceContextPool::Instance(); + auto &dev_ctx = *pool.Get(dev_place); + auto stream = + reinterpret_cast(dev_ctx).stream(); + + PADDLE_ENFORCE(!input_names_.empty(), "should pass more than one inputs"); + + std::vector output_maps = + Attr>("output_name_mapping"); + + std::map inputs; + // Convert input tensor from fluid to engine. + for (const auto &x : Inputs("Xs")) { + if (param_names_.count(x)) continue; + auto &t = + inference::analysis::GetFromScope(scope, x); + + inputs.insert({x, &t}); + } + + std::map outputs; + int output_index = 0; + for (const auto &y : Outputs("Ys")) { + auto *fluid_v = scope.FindVar(y); + PADDLE_ENFORCE_NOT_NULL(fluid_v, "no output variable called %s", y); + auto *fluid_t = fluid_v->GetMutable(); + outputs.insert({output_maps[output_index], fluid_t}); + output_index += 1; + } + engine->Execute(inputs, outputs, stream); + } + + AnakinNvEngineT *GetEngine(const framework::Scope &scope, + const platform::Place &dev_place) const { + if (anakin_engine_ == nullptr) { + anakin_engine_ = + inference::Singleton::Global() + .Get(engine_key_); + } + return anakin_engine_; + } +}; + +} // namespace operators +} // namespace paddle + +#endif // PADDLE_WITH_CUDA diff --git a/paddle/fluid/operators/arg_max_op.cc b/paddle/fluid/operators/arg_max_op.cc index 8174d3735859b1fac40cd4c07545f34874d31ab7..7fe9a0df7467970286fb0efc7c5ce7aaf01ac28b 100644 --- a/paddle/fluid/operators/arg_max_op.cc +++ b/paddle/fluid/operators/arg_max_op.cc @@ -28,6 +28,5 @@ REGISTER_OP_CPU_KERNEL( int32_t>, paddle::operators::ArgMaxKernel, - paddle::operators::ArgMaxKernel, paddle::operators::ArgMaxKernel); diff --git a/paddle/fluid/operators/arg_max_op.cu b/paddle/fluid/operators/arg_max_op.cu index a147d77a9e9c577984028e1a6ed9582dda622069..85e4f98173511435a52b32e506afc8d5b772f74f 100644 --- a/paddle/fluid/operators/arg_max_op.cu +++ b/paddle/fluid/operators/arg_max_op.cu @@ -25,7 +25,5 @@ REGISTER_OP_CUDA_KERNEL( int32_t>, paddle::operators::ArgMaxKernel, - paddle::operators::ArgMaxKernel, paddle::operators::ArgMaxKernel); diff --git a/paddle/fluid/operators/arg_min_max_op_base.h b/paddle/fluid/operators/arg_min_max_op_base.h index 6cbdaefeda099c36a864289ef8195c20d09c55e6..bf7b83bb7a7d4f4861276a228389e87a42a39ef7 100644 --- a/paddle/fluid/operators/arg_min_max_op_base.h +++ b/paddle/fluid/operators/arg_min_max_op_base.h @@ -58,6 +58,8 @@ class ArgMinMaxKernel : public framework::OpKernel { auto& out = *(ctx.Output("Out")); out.mutable_data(ctx.GetPlace()); auto axis = ctx.Attr("axis"); + auto x_rank = x.dims().size(); + if (axis < 0) axis += x_rank; auto& dev_ctx = ctx.template device_context(); #define CALL_ARG_MINMAX_FUNCTOR(rank) \ diff --git a/paddle/fluid/operators/arg_min_op.cc b/paddle/fluid/operators/arg_min_op.cc index 41f188029f17dbe8717afc0ca0760a39edc24b54..23b24735cd0ba17afd30b95c329cb0530a1f0104 100644 --- a/paddle/fluid/operators/arg_min_op.cc +++ b/paddle/fluid/operators/arg_min_op.cc @@ -28,6 +28,5 @@ REGISTER_OP_CPU_KERNEL( int32_t>, paddle::operators::ArgMinKernel, - paddle::operators::ArgMinKernel, paddle::operators::ArgMinKernel); diff --git a/paddle/fluid/operators/arg_min_op.cu b/paddle/fluid/operators/arg_min_op.cu index 4d020508505a6ebac8be41ce1e4f99d436b67ab5..47d7c8b12243c6c5c501188af7f48f125c266009 100644 --- a/paddle/fluid/operators/arg_min_op.cu +++ b/paddle/fluid/operators/arg_min_op.cu @@ -25,7 +25,5 @@ REGISTER_OP_CUDA_KERNEL( int32_t>, paddle::operators::ArgMinKernel, - paddle::operators::ArgMinKernel, paddle::operators::ArgMinKernel); diff --git a/paddle/fluid/operators/array_to_lod_tensor_op.cc b/paddle/fluid/operators/array_to_lod_tensor_op.cc index b8b8b2290a0f002fd379032e28590b84a1da38e9..d942391b8644959f63ac58f6a7122bbd3c0ddf84 100644 --- a/paddle/fluid/operators/array_to_lod_tensor_op.cc +++ b/paddle/fluid/operators/array_to_lod_tensor_op.cc @@ -11,7 +11,7 @@ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include +#include #include #include "paddle/fluid/framework/lod_rank_table.h" @@ -58,7 +58,7 @@ struct ArrayToLoDFunctor : public boost::static_visitor { ArrayToLoDFunctorImpl functor; functor.dev_ctx_ = dev_ctx; functor.prev_functor_ = this; - framework::VisitDataType(framework::ToDataType(out->type()), functor); + framework::VisitDataType(out->type(), functor); } }; @@ -91,7 +91,7 @@ class ArrayToLoDTensorOp : public framework::OperatorBase { PADDLE_ENFORCE(!x.empty(), "There's no element in the input array."); int rank = x[0].dims().size(); platform::Place place = x[0].place(); - std::type_index data_type = x[0].type(); + auto data_type = x[0].type(); int64_t batch_size = x[0].dims()[0]; framework::DDim ins_dims = rank > 1 ? framework::slice_ddim(x[0].dims(), 1, rank) diff --git a/paddle/fluid/operators/attention_lstm_op.cc b/paddle/fluid/operators/attention_lstm_op.cc index 9b943440a869e213db4ed761cfe7c508bc5e94ae..912ec79910301b67bc520b1aa78d3fa1fd165d1f 100644 --- a/paddle/fluid/operators/attention_lstm_op.cc +++ b/paddle/fluid/operators/attention_lstm_op.cc @@ -121,9 +121,8 @@ void AttentionLSTMOp::InferShape(framework::InferShapeContext* ctx) const { framework::OpKernelType AttentionLSTMOp::GetExpectedKernelType( const framework::ExecutionContext& ctx) const { - return framework::OpKernelType( - framework::ToDataType(ctx.Input("X")->type()), - ctx.device_context()); + return framework::OpKernelType(ctx.Input("X")->type(), + ctx.device_context()); } void AttentionLSTMOpMaker::Make() { @@ -231,10 +230,10 @@ use lstm_x_t as input and compute as standard LSTM. template inline void bias_relu(const int n, const T* x, const T* bias, T* y) { if (bias) { - math::vec_add_bias(n, *bias, x, y); - math::vec_relu(n, y, y); + math::vec_add_bias(n, *bias, x, y); + math::vec_relu(n, y, y); } else { - math::vec_relu(n, x, y); + math::vec_relu(n, x, y); } } @@ -245,8 +244,8 @@ inline void vec_softmax(const int n, const T* x, T* y) { for (int i = 1; i < n; ++i) { scalar = scalar < x[i] ? x[i] : scalar; } - math::vec_add_bias(n, -scalar, x, y); // sub - math::vec_exp(n, y, y); // exp + math::vec_add_bias(n, -scalar, x, y); // sub + math::vec_exp(n, y, y); // exp // sum scalar = T(0); for (int i = 0; i < n; ++i) { @@ -294,7 +293,7 @@ class AttentionLSTMKernel : public framework::OpKernel { int len = x_lod[0][i + 1] - x_lod[0][i]; max_seq_len = max_seq_len < len ? len : max_seq_len; } - PADDLE_ENFORCE_EQ(x_lod.size(), 1, "Input(X)'s lod size must be 1."); + PADDLE_ENFORCE_EQ(x_lod.size(), 1UL, "Input(X)'s lod size must be 1."); PADDLE_ENFORCE_EQ(c0->dims()[0], N, "C0 dims should be %d x %d.", N, D); fc_out->Resize({max_seq_len, 1}); @@ -302,13 +301,13 @@ class AttentionLSTMKernel : public framework::OpKernel { auto& act_gate_str = ctx.Attr("gate_activation"); auto& act_cell_str = ctx.Attr("cell_activation"); auto& act_cand_str = ctx.Attr("candidate_activation"); - if (platform::jit::MayIUse(platform::jit::avx)) { - math::VecActivations act_functor; + if (platform::MayIUse(platform::avx)) { + math::VecActivations act_functor; act_gate = act_functor(act_gate_str); act_cell = act_functor(act_cell_str); act_cand = act_functor(act_cand_str); } else { - math::VecActivations act_functor; + math::VecActivations act_functor; act_gate = act_functor(act_gate_str); act_cell = act_functor(act_cell_str); act_cand = act_functor(act_cand_str); diff --git a/paddle/fluid/operators/auc_op.cc b/paddle/fluid/operators/auc_op.cc deleted file mode 100644 index 0784920064a879963cd9725cd9acf4cec7b874ce..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/auc_op.cc +++ /dev/null @@ -1,112 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/operators/auc_op.h" - -namespace paddle { -namespace operators { - -class AucOp : public framework::OperatorWithKernel { - public: - using framework::OperatorWithKernel::OperatorWithKernel; - - protected: - void InferShape(framework::InferShapeContext *ctx) const override { - PADDLE_ENFORCE(ctx->HasInput("Predict"), - "Input of Out should not be null."); - PADDLE_ENFORCE(ctx->HasInput("Label"), - "Input of Label should not be null."); - auto predict_width = ctx->GetInputDim("Predict")[1]; - PADDLE_ENFORCE_EQ(predict_width, 2, "Only support binary classification"); - auto predict_height = ctx->GetInputDim("Predict")[0]; - auto label_height = ctx->GetInputDim("Label")[0]; - - PADDLE_ENFORCE_EQ(predict_height, label_height, - "Out and Label should have same height."); - - int num_pred_buckets = ctx->Attrs().Get("num_thresholds") + 1; - int slide_steps = ctx->Attrs().Get("slide_steps"); - - PADDLE_ENFORCE_GE(num_pred_buckets, 1, "num_thresholds must larger than 1"); - PADDLE_ENFORCE_GE(slide_steps, 0, "slide_steps must be natural number"); - - ctx->SetOutputDim("AUC", {1}); - - slide_steps = slide_steps == 0 ? 1 : slide_steps; - ctx->SetOutputDim("StatPosOut", {slide_steps, num_pred_buckets}); - ctx->SetOutputDim("StatNegOut", {slide_steps, num_pred_buckets}); - } - - protected: - framework::OpKernelType GetExpectedKernelType( - const framework::ExecutionContext &ctx) const override { - return framework::OpKernelType( - framework::ToDataType(ctx.Input("Predict")->type()), - ctx.device_context()); - } -}; - -class AucOpMaker : public framework::OpProtoAndCheckerMaker { - public: - void Make() override { - AddInput("Predict", - "A floating point 2D tensor with shape [batch_size, 2], values " - "are in the range [0, 1]." - "Typically, this tensor indicates the probability of each label"); - AddInput("Label", - "A 2D int tensor indicating the label of the training data. " - "shape: [batch_size, 1]"); - - // TODO(typhoonzero): support weight input - AddInput("StatPos", "Statistic value when label = 1"); - AddInput("StatNeg", "Statistic value when label = 0"); - - AddOutput("AUC", - "A scalar representing the " - "current area-under-the-curve."); - - AddOutput("StatPosOut", "Statistic value when label = 1"); - AddOutput("StatNegOut", "Statistic value when label = 0"); - - AddAttr("curve", "Curve type, can be 'ROC' or 'PR'.") - .SetDefault("ROC"); - - AddAttr( - "num_thresholds", - "The number of thresholds to use when discretizing the roc curve.") - .SetDefault((2 << 12) - 1); - AddAttr("slide_steps", "Use slide steps to calc batch auc.") - .SetDefault(1); - AddComment(R"DOC( -Area Under The Curve (AUC) Operator. - -This implementation computes the AUC according to forward output and label. -It is used very widely in binary classification evaluation. As a note: -If input label contains values other than 0 and 1, it will be cast -to bool. You can find the relevant definitions here: -https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve - -There are two types of possible curves: -1. ROC: Receiver operating characteristic -2. PR: Precision Recall -)DOC"); - } -}; - -} // namespace operators -} // namespace paddle - -namespace ops = paddle::operators; -REGISTER_OP_WITHOUT_GRADIENT(auc, ops::AucOp, ops::AucOpMaker); -REGISTER_OP_CPU_KERNEL(auc, ops::AucKernel); diff --git a/paddle/fluid/operators/auc_op.h b/paddle/fluid/operators/auc_op.h deleted file mode 100644 index fb370842d1942c3b3eebecb1fe5e8ffb845cb34b..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/auc_op.h +++ /dev/null @@ -1,161 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#pragma once - -#include -#include -#include "paddle/fluid/framework/op_registry.h" - -namespace paddle { -namespace operators { - -using Tensor = framework::Tensor; - -template -class AucKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext &ctx) const override { - auto *predict = ctx.Input("Predict"); - auto *label = ctx.Input("Label"); - - std::string curve = ctx.Attr("curve"); - int num_thresholds = ctx.Attr("num_thresholds"); - // buckets contain numbers from 0 to num_thresholds - int num_pred_buckets = num_thresholds + 1; - int slide_steps = ctx.Attr("slide_steps"); - - // Only use output var for now, make sure it's persistable and - // not cleaned up for each batch. - auto *auc = ctx.Output("AUC"); - auto *stat_pos = ctx.Output("StatPosOut"); - auto *stat_neg = ctx.Output("StatNegOut"); - - auto *origin_stat_pos = stat_pos->mutable_data(ctx.GetPlace()); - auto *origin_stat_neg = stat_neg->mutable_data(ctx.GetPlace()); - - std::vector stat_pos_data(num_pred_buckets, 0); - std::vector stat_neg_data(num_pred_buckets, 0); - - auto stat_pos_calc = stat_pos_data.data(); - auto stat_neg_calc = stat_neg_data.data(); - - statAuc(label, predict, num_pred_buckets, num_thresholds, slide_steps, - origin_stat_pos, origin_stat_neg, &stat_pos_calc, &stat_neg_calc); - - calcAuc(ctx, stat_pos_calc, stat_neg_calc, num_thresholds, auc); - } - - private: - inline static double trapezoidArea(double X1, double X2, double Y1, - double Y2) { - return (X1 > X2 ? (X1 - X2) : (X2 - X1)) * (Y1 + Y2) / 2.0; - } - - inline static void statAuc(const framework::Tensor *label, - const framework::Tensor *predict, - const int num_pred_buckets, - const int num_thresholds, const int slide_steps, - int64_t *origin_stat_pos, int64_t *origin_stat_neg, - int64_t **stat_pos, int64_t **stat_neg) { - size_t batch_size = predict->dims()[0]; - size_t inference_width = predict->dims()[1]; - const T *inference_data = predict->data(); - const auto *label_data = label->data(); - - for (size_t i = 0; i < batch_size; i++) { - uint32_t binIdx = static_cast( - inference_data[i * inference_width + 1] * num_thresholds); - if (label_data[i]) { - (*stat_pos)[binIdx] += 1.0; - } else { - (*stat_neg)[binIdx] += 1.0; - } - } - - int bucket_length = num_pred_buckets * sizeof(int64_t); - - // will stat auc unlimited. - if (slide_steps == 0) { - for (int slide = 0; slide < num_pred_buckets; ++slide) { - origin_stat_pos[slide] += (*stat_pos)[slide]; - origin_stat_neg[slide] += (*stat_neg)[slide]; - } - - *stat_pos = origin_stat_pos; - *stat_neg = origin_stat_neg; - - } else { - for (int slide = 1; slide < slide_steps; ++slide) { - int dst_idx = (slide - 1) * num_pred_buckets; - int src_inx = slide * num_pred_buckets; - std::memcpy(origin_stat_pos + dst_idx, origin_stat_pos + src_inx, - bucket_length); - std::memcpy(origin_stat_neg + dst_idx, origin_stat_neg + src_inx, - bucket_length); - } - - std::memcpy(origin_stat_pos + (slide_steps - 1) * num_pred_buckets, - *stat_pos, bucket_length); - std::memcpy(origin_stat_neg + (slide_steps - 1) * num_pred_buckets, - *stat_neg, bucket_length); - - std::memset(*stat_pos, 0, bucket_length); - std::memset(*stat_neg, 0, bucket_length); - - for (int slide = 0; slide < num_pred_buckets; ++slide) { - int stat_pos_steps = 0; - int stat_neg_steps = 0; - for (int step = 0; step < slide_steps; ++step) { - stat_pos_steps += origin_stat_pos[slide + step * num_pred_buckets]; - stat_neg_steps += origin_stat_neg[slide + step * num_pred_buckets]; - } - (*stat_pos)[slide] += stat_pos_steps; - (*stat_neg)[slide] += stat_neg_steps; - } - } - } - - inline static void calcAuc(const framework::ExecutionContext &ctx, - int64_t *stat_pos, int64_t *stat_neg, - int num_thresholds, - framework::Tensor *auc_tensor) { - auto *auc = auc_tensor->mutable_data(ctx.GetPlace()); - - *auc = 0.0f; - - double totPos = 0.0; - double totNeg = 0.0; - double totPosPrev = 0.0; - double totNegPrev = 0.0; - - int idx = num_thresholds; - - while (idx >= 0) { - totPosPrev = totPos; - totNegPrev = totNeg; - totPos += stat_pos[idx]; - totNeg += stat_neg[idx]; - *auc += trapezoidArea(totNeg, totNegPrev, totPos, totPosPrev); - --idx; - } - - if (totPos > 0.0 && totNeg > 0.0) { - *auc = *auc / totPos / totNeg; - } - } -}; - -} // namespace operators -} // namespace paddle diff --git a/paddle/fluid/operators/average_accumulates_op.cc b/paddle/fluid/operators/average_accumulates_op.cc index f389eab605e087c535b9918264e6502217062505..0922b03b5f5fbd2a7a62b0a325ebed9600767497 100644 --- a/paddle/fluid/operators/average_accumulates_op.cc +++ b/paddle/fluid/operators/average_accumulates_op.cc @@ -103,9 +103,8 @@ class AverageAccumulatesOp : public framework::OperatorWithKernel { protected: framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext& ctx) const override { - return framework::OpKernelType( - framework::ToDataType(ctx.Input("param")->type()), - ctx.GetPlace()); + return framework::OpKernelType(ctx.Input("param")->type(), + ctx.GetPlace()); } }; diff --git a/paddle/fluid/operators/batch_norm_mkldnn_op.cc b/paddle/fluid/operators/batch_norm_mkldnn_op.cc deleted file mode 100644 index de641cb08e4cc3322cc8387d873f2aaab279e1dd..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/batch_norm_mkldnn_op.cc +++ /dev/null @@ -1,514 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "mkldnn.hpp" -#include "paddle/fluid/operators/batch_norm_op.h" -#include "paddle/fluid/platform/mkldnn_helper.h" - -namespace paddle { -namespace operators { - -using batch_norm_bwd = mkldnn::batch_normalization_backward; -using batch_norm_fwd = mkldnn::batch_normalization_forward; -using mkldnn::memory; -using mkldnn::primitive; -using mkldnn::reorder; -using mkldnn::stream; -using paddle::platform::MKLDNNDeviceContext; -using paddle::platform::MKLDNNMemDesc; -using platform::to_void_cast; - -namespace { -template -struct bn_type_traits { - using op_type = T; - using op_desc = typename op_type::desc; - using op_prim = typename op_type::primitive_desc; -}; - -class BatchNormMKLDNNHandler : public platform::MKLDNNHandler { - public: - BatchNormMKLDNNHandler( - std::shared_ptr batch_norm_pd, - const platform::MKLDNNDeviceContext &dev_ctx, mkldnn::engine engine, - const std::string &base_key) - : platform::MKLDNNHandler(dev_ctx, engine, base_key) { - batch_norm_pd_ = batch_norm_pd; - } - - std::shared_ptr AcquireScaleshiftMemoryFromPrimitive(void *ptr) { - return this->AcquireMemoryFromPrimitive( - batch_norm_pd_->weights_primitive_desc(), ptr, "@scaleshift_mem_p"); - } - - std::shared_ptr AcquireMeanMemoryFromPrimitive(void *ptr) { - return this->AcquireMemoryFromPrimitive( - batch_norm_pd_->mean_primitive_desc(), ptr, "@mean_mem_p"); - } - - std::shared_ptr AcquireVarianceMemoryFromPrimitive(void *ptr) { - return this->AcquireMemoryFromPrimitive( - batch_norm_pd_->variance_primitive_desc(), ptr, "@variance_mem_p"); - } - - std::shared_ptr AcquireTestTrainingBatchNormFwd( - std::shared_ptr src_memory, - std::shared_ptr scaleshift_memory, - std::shared_ptr dst_memory, std::shared_ptr mean_memory, - std::shared_ptr variance_memory, bool is_test) { - auto prim_key = key_ + "@batch_norm_p"; - auto batch_norm_p = - std::static_pointer_cast(dev_ctx_.GetBlob(prim_key)); - - PADDLE_ENFORCE((batch_norm_p != nullptr) || !is_reusing_, - "Fail to find batch norm primitive in device context"); - - if (batch_norm_p == nullptr) { - if (is_test) { - batch_norm_p = std::make_shared( - *batch_norm_pd_, *src_memory, - (const mkldnn::primitive::at &)*mean_memory, - (const mkldnn::primitive::at &)*variance_memory, *scaleshift_memory, - *dst_memory); - } else { - batch_norm_p = std::make_shared( - *batch_norm_pd_, *src_memory, *scaleshift_memory, *dst_memory, - *mean_memory, *variance_memory); - } - - dev_ctx_.SetBlob(prim_key, batch_norm_p); - } else { - is_reusing_ = true; - } - - return batch_norm_p; - } - - static std::string GetHash(const memory::dims &input_dims, float epsilon, - unsigned flag, bool is_test, memory::format format, - const std::string &suffix = "") { - auto dims2str = [](const memory::dims &operand_dims) { - std::string dstr = ""; - for (size_t i = 0; i < operand_dims.size(); ++i) { - dstr += std::to_string(operand_dims[i]) + "-"; - } - return dstr; - }; - return dims2str(input_dims) + std::to_string(epsilon) + - std::to_string(flag) + std::to_string(is_test) + - std::to_string(format) + suffix; - } - - private: - std::shared_ptr batch_norm_pd_; -}; - -std::shared_ptr UpdateMemoryData( - const platform::MKLDNNDeviceContext &dev_ctx, const std::string &key, - void *new_ptr) { - auto mem = std::static_pointer_cast(dev_ctx.GetBlob(key)); - PADDLE_ENFORCE( - mem != nullptr, - (std::string("Fail to find memory in device context [key: ") + key + "]") - .c_str()); - mem->set_data_handle(new_ptr); - return mem; -} - -template -void copy_to_weights(T scale_begin, T scale_end, T shift_begin, T shift_end, - Container *c) { - auto it = std::begin(*c); - - std::copy(scale_begin, scale_end, std::inserter(*c, it)); - std::copy( - shift_begin, shift_end, - std::inserter(*c, std::next(it, std::distance(scale_begin, scale_end)))); -} - -} // namespace - -template -class BatchNormMKLDNNOpKernel : public paddle::framework::OpKernel { - public: - void Compute(const framework::ExecutionContext &ctx) const override { - const float epsilon = ctx.Attr("epsilon"); - const float momentum = ctx.Attr("momentum"); - const bool is_test = ctx.Attr("is_test"); - const bool fuse_with_relu = ctx.Attr("fuse_with_relu"); - - const auto *x = ctx.Input("X"); - const auto *mean = ctx.Input("Mean"); - const auto *variance = ctx.Input("Variance"); - - auto &dev_ctx = ctx.template device_context(); - auto mkldnn_engine = dev_ctx.GetEngine(); - - auto *y = ctx.Output("Y"); - auto *mean_out = ctx.Output("MeanOut"); - auto *variance_out = ctx.Output("VarianceOut"); - auto *batch_mean = ctx.Output("SavedMean"); - auto *batch_variance = ctx.Output("SavedVariance"); - - const auto *scale = ctx.Input("Scale"); - const auto *shift = ctx.Input("Bias"); - - PADDLE_ENFORCE(x->layout() == DataLayout::kMKLDNN && - x->format() != memory::format::format_undef, - "Wrong layout/format set for Input x tensor"); - - const T *x_data = x->data(); - const T *mean_data = mean->data(); - const T *variance_data = variance->data(); - T *y_data = y->mutable_data(ctx.GetPlace()); - T *mean_out_data = mean_out->mutable_data(ctx.GetPlace()); - T *variance_out_data = variance_out->mutable_data(ctx.GetPlace()); - T *batch_mean_data = nullptr; - T *batch_variance_data = nullptr; - - if (!is_test) { - batch_mean_data = batch_mean->mutable_data(ctx.GetPlace()); - batch_variance_data = batch_variance->mutable_data(ctx.GetPlace()); - } - - auto propagation = is_test == true ? mkldnn::prop_kind::forward_scoring - : mkldnn::prop_kind::forward_training; - - auto src_tz = paddle::framework::vectorize2int(x->dims()); - auto scale_tz = paddle::framework::vectorize2int(scale->dims()); - PADDLE_ENFORCE(scale_tz.size() == 1, "Dims of scale tensor is NOT 1"); - const unsigned int ic = scale_tz[0]; - - // MKLDNN requires a single piece of memory for scale and shift/bias data - const size_t scaleshift_size = 2 * ic; - std::vector scaleshift_data; - scaleshift_data.reserve(scaleshift_size); - - copy_to_weights(scale->data(), scale->data() + ic, shift->data(), - shift->data() + ic, &scaleshift_data); - - unsigned flags = mkldnn::use_scale_shift; - if (is_test) flags |= mkldnn::use_global_stats; - if (fuse_with_relu) flags |= mkldnn::fuse_bn_relu; - - // create mkldnn memory from input x tensor - mkldnn::memory::format input_format = - platform::MKLDNNFormatForSize(src_tz.size(), x->format()); - - // keys for backward pass - const std::string key = BatchNormMKLDNNHandler::GetHash( - src_tz, epsilon, flags, is_test, input_format, - ctx.op().Output("SavedMean")); - const std::string key_batch_norm_fwd_pd = key + "@bn_fwd_pd"; - - auto user_src_md = platform::MKLDNNMemDesc( - {src_tz}, platform::MKLDNNGetDataType(), input_format); - - // create primitive descriptor for batch norm forward - using bn_fwd_types = bn_type_traits; - auto batch_norm_fwd_desc = - bn_fwd_types::op_desc{propagation, user_src_md, epsilon, flags}; - auto batch_norm_fwd_pd = std::make_shared( - batch_norm_fwd_desc, mkldnn_engine); - // Save conv_pd/src_memory/weights_memory for backward pass - dev_ctx.SetBlob(key_batch_norm_fwd_pd, batch_norm_fwd_pd); - - BatchNormMKLDNNHandler handler(batch_norm_fwd_pd, dev_ctx, mkldnn_engine, - key); - - auto src_memory = - handler.AcquireSrcMemory(user_src_md, to_void_cast(x_data)); - - // crate mkldnn memory for weights(scale/shift) - auto scaleshift_memory = - handler.AcquireScaleshiftMemoryFromPrimitive(scaleshift_data.data()); - - // create mkldnn memory for output y tensor - auto dst_memory = handler.AcquireDstMemory( - batch_norm_fwd_pd->dst_primitive_desc().desc(), y_data); - - std::shared_ptr batch_norm_p; - if (is_test) { - // create mkldnn memory for stats (as input) - std::shared_ptr mean_memory = - handler.AcquireMeanMemoryFromPrimitive(to_void_cast(mean_data)); - std::shared_ptr variance_memory = - handler.AcquireVarianceMemoryFromPrimitive( - to_void_cast(variance_data)); - - batch_norm_p = handler.AcquireTestTrainingBatchNormFwd( - src_memory, scaleshift_memory, dst_memory, mean_memory, - variance_memory, true); - } else { - // create mkldnn memory for stats (as output) - std::shared_ptr mean_memory = - handler.AcquireMeanMemoryFromPrimitive(batch_mean_data); - std::shared_ptr variance_memory = - handler.AcquireVarianceMemoryFromPrimitive(batch_variance_data); - - batch_norm_p = handler.AcquireTestTrainingBatchNormFwd( - src_memory, scaleshift_memory, dst_memory, mean_memory, - variance_memory, false); - } - - y->set_layout(DataLayout::kMKLDNN); - y->set_format(platform::GetMKLDNNFormat(*dst_memory)); - - std::vector pipeline; - pipeline.push_back(*batch_norm_p); - mkldnn::stream(mkldnn::stream::kind::eager).submit(pipeline).wait(); - - if (!is_test) { - // mkldnn only compute stats for current batch - // so we need compute momentum stats via Eigen lib - EigenVectorArrayMap batch_mean_e(batch_mean_data, ic); - EigenVectorArrayMap batch_variance_e(batch_variance_data, ic); - ConstEigenVectorArrayMap mean_e(mean_data, ic); - ConstEigenVectorArrayMap variance_e{variance_data, ic}; - - EigenVectorArrayMap running_mean_e(mean_out_data, ic); - EigenVectorArrayMap running_variance_e(variance_out_data, ic); - - auto one_minus_momentum = 1. - momentum; - running_mean_e = mean_e * momentum + batch_mean_e * one_minus_momentum; - running_variance_e = - variance_e * momentum + batch_variance_e * one_minus_momentum; - } - } -}; - -template -class BatchNormMKLDNNGradOpKernel : public paddle::framework::OpKernel { - public: - void Compute(const paddle::framework::ExecutionContext &ctx) const override { - auto &dev_ctx = ctx.template device_context(); - auto mkldnn_engine = dev_ctx.GetEngine(); - - const float epsilon = ctx.Attr("epsilon"); - - const auto *x = ctx.Input("X"); - const auto *scale = ctx.Input("Scale"); - const auto *shift = ctx.Input("Bias"); - const auto *batch_mean = ctx.Input("SavedMean"); - const auto *batch_variance = ctx.Input("SavedVariance"); - - const auto *diff_y = ctx.Input(framework::GradVarName("Y")); - auto *diff_x = ctx.Output(framework::GradVarName("X")); - auto *diff_scale = ctx.Output(framework::GradVarName("Scale")); - auto *diff_shift = ctx.Output(framework::GradVarName("Bias")); - - PADDLE_ENFORCE(diff_y->layout() == DataLayout::kMKLDNN && - diff_y->format() != memory::format::format_undef, - "Wrong layout/format set for Input diff_y tensor"); - - const T *x_data = x->data(); - const T *diff_y_data = diff_y->data(); - const T *batch_mean_data = batch_mean->data(); - const T *batch_variance_data = batch_variance->data(); - const T *scale_data = scale->data(); - const T *shift_data = shift->data(); - T *diff_x_data = diff_x->mutable_data(ctx.GetPlace()); - T *diff_scale_data = diff_scale->mutable_data(ctx.GetPlace()); - T *diff_shift_data = diff_shift->mutable_data(ctx.GetPlace()); - - auto src_tz = paddle::framework::vectorize2int(x->dims()); - auto diff_src_tz = src_tz; - auto dst_tz = src_tz; - auto diff_dst_tz = dst_tz; - auto scale_tz = paddle::framework::vectorize2int(scale->dims()); - PADDLE_ENFORCE(scale_tz.size() == 1, "Dims of scale tensor is NOT 1"); - - const unsigned int ic = scale_tz[0]; - - using bn_bwd_types = bn_type_traits; - - mkldnn::memory::format dst_format = - platform::MKLDNNFormatForSize(src_tz.size(), diff_y->format()); - - mkldnn::memory::format input_format = - platform::MKLDNNFormatForSize(src_tz.size(), x->format()); - - unsigned flags = mkldnn::use_scale_shift; - - // keys from forward pass - const std::string key = BatchNormMKLDNNHandler::GetHash( - src_tz, epsilon, flags, false, input_format, - ctx.op().Input("SavedMean")); - const std::string key_batch_norm_fwd_pd = key + "@bn_fwd_pd"; - - // keys for primitives reuse - const std::string key_with_hash = - key + BatchNormMKLDNNHandler::GetHash(src_tz, epsilon, flags, false, - input_format); - const std::string key_batch_norm_bwd_p = - key_with_hash + "@batch_norm_bwd_p"; - const std::string key_batch_norm_src_mem_p = - key_with_hash + "@batch_norm_bwd_src_mem_p"; - const std::string key_batch_norm_mean_mem_p = - key_with_hash + "@batch_norm_bwd_mean_mem_p"; - const std::string key_batch_norm_variance_mem_p = - key_with_hash + "@batch_norm_bwd_variance_mem_p"; - const std::string key_batch_norm_scaleshift_mem_p = - key_with_hash + "@batch_norm_bwd_scaleshift_mem_p"; - const std::string key_batch_norm_diff_scaleshift_mem_p = - key_with_hash + "@batch_norm_bwd_diff_scaleshift_mem_p"; - const std::string key_batch_norm_diff_src_mem_p = - key_with_hash + "@batch_norm_bwd_diff_src_mem_p"; - const std::string key_batch_norm_diff_dst_mem_p = - key_with_hash + "@batch_norm_bwd_diff_dst_mem_p"; - - primitive reorder_diff_dst; - bool is_diff_dst_reordered = false; - auto user_diff_dst_memory = memory( - {{{diff_dst_tz}, memory::data_type::f32, dst_format}, mkldnn_engine}, - to_void_cast(diff_y_data)); - - // MKLDNN requires a single piece of memory for scale and shift/bias data - const size_t scaleshift_size = 2 * ic; - - std::vector scaleshift_data; - scaleshift_data.reserve(scaleshift_size); - copy_to_weights(scale_data, scale_data + ic, shift_data, shift_data + ic, - &scaleshift_data); - - std::vector diff_scaleshift_data; - diff_scaleshift_data.reserve(scaleshift_size); - - auto batch_norm_fwd_pd = - std::static_pointer_cast( - dev_ctx.GetBlob(key_batch_norm_fwd_pd)); - PADDLE_ENFORCE(batch_norm_fwd_pd != nullptr, - "Fail to find batch_norm_fwd_pd in device context"); - - auto batch_norm_bwd_p = std::static_pointer_cast( - dev_ctx.GetBlob(key_batch_norm_bwd_p)); - - if (batch_norm_bwd_p == nullptr) { - auto src_memory = std::shared_ptr(new memory( - {{{src_tz}, memory::data_type::f32, input_format}, mkldnn_engine}, - to_void_cast(x_data))); - - // for diff_dst, try to use same format as dst in forward pass - auto diff_dst_pd = batch_norm_fwd_pd.get()->dst_primitive_desc(); - auto diff_dst_md = diff_dst_pd.desc(); - - // create primitive descriptor for batch norm backward - auto batch_norm_bwd_desc = bn_bwd_types::op_desc{ - mkldnn::prop_kind::backward, diff_dst_md, - src_memory->get_primitive_desc().desc(), epsilon, flags}; - auto batch_norm_bwd_pd = bn_bwd_types::op_prim{ - batch_norm_bwd_desc, mkldnn_engine, *batch_norm_fwd_pd}; - - // reorder user_diff_dst if it's not in preferred format - auto diff_dst_memory = std::make_shared(user_diff_dst_memory); - if (diff_dst_pd != user_diff_dst_memory.get_primitive_desc()) { - diff_dst_memory = std::make_shared(diff_dst_pd); - reorder_diff_dst = reorder(user_diff_dst_memory, *diff_dst_memory); - is_diff_dst_reordered = true; - } - - // create mkldnn memory for input tensors (src/mean/variance) - auto mean_memory = - std::make_shared(batch_norm_bwd_pd.mean_primitive_desc(), - to_void_cast(batch_mean_data)); - auto variance_memory = - std::make_shared(batch_norm_bwd_pd.variance_primitive_desc(), - to_void_cast(batch_variance_data)); - - // create mkldnn memory for input tensors (scale/shift) - auto scaleshift_memory = std::make_shared( - batch_norm_bwd_pd.weights_primitive_desc(), scaleshift_data.data()); - - // create mkldnn memory for output diff weights (combined scale/shift) - auto diff_scaleshift_memory = std::make_shared( - batch_norm_bwd_pd.diff_weights_primitive_desc(), - diff_scaleshift_data.data()); - - // here assume diff_src is in the same format of src - auto diff_src_memory = std::make_shared( - src_memory->get_primitive_desc(), diff_x_data); - - // finally create batch_norm backward primitive - batch_norm_bwd_p = std::make_shared( - batch_norm_bwd_pd, *src_memory, *mean_memory, *variance_memory, - *diff_dst_memory, *scaleshift_memory, *diff_src_memory, - *diff_scaleshift_memory); - - dev_ctx.SetBlob(key_batch_norm_bwd_p, batch_norm_bwd_p); - dev_ctx.SetBlob(key_batch_norm_src_mem_p, src_memory); - dev_ctx.SetBlob(key_batch_norm_mean_mem_p, mean_memory); - dev_ctx.SetBlob(key_batch_norm_variance_mem_p, variance_memory); - dev_ctx.SetBlob(key_batch_norm_scaleshift_mem_p, scaleshift_memory); - dev_ctx.SetBlob(key_batch_norm_diff_scaleshift_mem_p, - diff_scaleshift_memory); - dev_ctx.SetBlob(key_batch_norm_diff_src_mem_p, diff_src_memory); - dev_ctx.SetBlob(key_batch_norm_diff_dst_mem_p, diff_dst_memory); - - // set layout/format of output tensors - diff_x->set_layout(DataLayout::kMKLDNN); - diff_x->set_format((memory::format)diff_src_memory->get_primitive_desc() - .desc() - .data.format); - } else { - // primitives already exist - UpdateMemoryData(dev_ctx, key_batch_norm_src_mem_p, to_void_cast(x_data)); - UpdateMemoryData(dev_ctx, key_batch_norm_mean_mem_p, - to_void_cast(batch_mean_data)); - UpdateMemoryData(dev_ctx, key_batch_norm_variance_mem_p, - to_void_cast(batch_variance_data)); - UpdateMemoryData(dev_ctx, key_batch_norm_scaleshift_mem_p, - scaleshift_data.data()); - UpdateMemoryData(dev_ctx, key_batch_norm_diff_scaleshift_mem_p, - diff_scaleshift_data.data()); - auto diff_src_memory = UpdateMemoryData( - dev_ctx, key_batch_norm_diff_src_mem_p, to_void_cast(diff_x_data)); - auto diff_dst_memory = UpdateMemoryData( - dev_ctx, key_batch_norm_diff_dst_mem_p, to_void_cast(diff_y_data)); - - // reorder user_diff_dst if it's not in preferred format - if (diff_dst_memory->get_primitive_desc() != - user_diff_dst_memory.get_primitive_desc()) { - reorder_diff_dst = reorder(user_diff_dst_memory, *diff_dst_memory); - is_diff_dst_reordered = true; - } - - // set layout/format of output tensors - diff_x->set_layout(DataLayout::kMKLDNN); - diff_x->set_format((memory::format)diff_src_memory->get_primitive_desc() - .desc() - .data.format); - } - - // execute optional reorder and batch_norm backward primitive - std::vector pipeline; - if (is_diff_dst_reordered) pipeline.push_back(reorder_diff_dst); - pipeline.push_back(*batch_norm_bwd_p); - stream(stream::kind::eager).submit(pipeline).wait(); - - // copy back diff sacle/shift to output tensors (diff scale/shift) - diff_scaleshift_data.resize(scaleshift_size); - auto it = std::begin(diff_scaleshift_data); - std::copy(it, std::next(it, ic), diff_scale_data); - std::copy(std::next(it, ic), std::end(diff_scaleshift_data), - diff_shift_data); - } -}; -} // namespace operators -} // namespace paddle - -namespace ops = paddle::operators; -REGISTER_OP_KERNEL(batch_norm, MKLDNN, ::paddle::platform::CPUPlace, - ops::BatchNormMKLDNNOpKernel); -REGISTER_OP_KERNEL(batch_norm_grad, MKLDNN, ::paddle::platform::CPUPlace, - ops::BatchNormMKLDNNGradOpKernel); diff --git a/paddle/fluid/operators/batch_norm_op.cc b/paddle/fluid/operators/batch_norm_op.cc index 5912a1a17cbd29c3ebd83f37133c044f0905c8bd..494d26f58f23ad1e445bbe8d7f8ce1037e5aa598 100644 --- a/paddle/fluid/operators/batch_norm_op.cc +++ b/paddle/fluid/operators/batch_norm_op.cc @@ -13,7 +13,9 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/operators/batch_norm_op.h" +#include #include +#include #include "paddle/fluid/framework/data_layout.h" #ifdef PADDLE_WITH_MKLDNN #include "paddle/fluid/platform/mkldnn_helper.h" @@ -22,143 +24,150 @@ limitations under the License. */ namespace paddle { namespace operators { -class BatchNormOp : public framework::OperatorWithKernel { - public: - using framework::OperatorWithKernel::OperatorWithKernel; - - void InferShape(framework::InferShapeContext *ctx) const override { - PADDLE_ENFORCE(ctx->HasInput("X"), ""); - PADDLE_ENFORCE(ctx->HasInput("Scale"), ""); - PADDLE_ENFORCE(ctx->HasInput("Bias"), ""); - PADDLE_ENFORCE(ctx->HasInput("Mean"), ""); - PADDLE_ENFORCE(ctx->HasInput("Variance"), ""); - PADDLE_ENFORCE(ctx->HasOutput("Y"), ""); - PADDLE_ENFORCE(ctx->HasOutput("MeanOut"), ""); - PADDLE_ENFORCE(ctx->HasOutput("VarianceOut"), ""); - PADDLE_ENFORCE(ctx->HasOutput("SavedMean"), ""); - PADDLE_ENFORCE(ctx->HasOutput("SavedVariance"), ""); - - // make sure Mean/MeanOut and Variance/VarianceOut share memory in Python - PADDLE_ENFORCE_EQ(ctx->Inputs("Mean")[0], ctx->Outputs("MeanOut")[0], - "Mean and MeanOut should share the same memory"); - PADDLE_ENFORCE_EQ(ctx->Inputs("Variance")[0], - ctx->Outputs("VarianceOut")[0], - "Variance and VarianceOut should share the same memory"); - - const auto x_dims = ctx->GetInputDim("X"); - const DataLayout data_layout = framework::StringToDataLayout( - ctx->Attrs().Get("data_layout")); - - PADDLE_ENFORCE(x_dims.size() >= 2 && x_dims.size() <= 5, - "Input X must have 2 to 5 dimensions."); - - const int64_t C = - (data_layout == DataLayout::kNCHW ? x_dims[1] - : x_dims[x_dims.size() - 1]); - - PADDLE_ENFORCE_EQ(ctx->GetInputDim("Scale").size(), 1UL); - PADDLE_ENFORCE_EQ(ctx->GetInputDim("Scale")[0], C); - PADDLE_ENFORCE_EQ(ctx->GetInputDim("Bias").size(), 1UL); - PADDLE_ENFORCE_EQ(ctx->GetInputDim("Bias")[0], C); - - ctx->SetOutputDim("Y", x_dims); - ctx->SetOutputDim("MeanOut", {C}); - ctx->SetOutputDim("VarianceOut", {C}); - ctx->SetOutputDim("SavedMean", {C}); - ctx->SetOutputDim("SavedVariance", {C}); - ctx->ShareLoD("X", "Y"); +void BatchNormOp::InferShape(framework::InferShapeContext *ctx) const { + PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) of ConvOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput("Scale"), + "Input(Scale) of ConvOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput("Bias"), + "Input(Bias) of ConvOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput("Mean"), + "Input(Mean) of ConvOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput("Variance"), + "Input(Variance) of ConvOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("Y"), + "Output(Y) of ConvOp should not be null."); + bool is_test = ctx->Attrs().Get("is_test"); + if (!is_test) { + PADDLE_ENFORCE(ctx->HasOutput("MeanOut"), + "Output(MeanOut) of ConvOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("VarianceOut"), + "Output(VarianceOut) of ConvOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("SavedMean"), + "Output(SavedMean) of ConvOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("SavedVariance"), + "Output(SavedVariance) of ConvOp should not be null."); } - protected: - framework::OpKernelType GetExpectedKernelType( - const framework::ExecutionContext &ctx) const override { - auto input_data_type = - framework::ToDataType(ctx.Input("X")->type()); - // By default, the type of the scale, bias, mean, - // and var tensors should both be float. (For float or float16 input tensor) - // or double (For double input tensor). - auto bn_param_type = framework::proto::VarType::FP32; - if (input_data_type == framework::proto::VarType::FP64) { - bn_param_type = framework::proto::VarType::FP64; - } - PADDLE_ENFORCE_EQ(bn_param_type, - framework::ToDataType(ctx.Input("Scale")->type()), - "Scale input should be of float type"); - PADDLE_ENFORCE_EQ(bn_param_type, - framework::ToDataType(ctx.Input("Bias")->type()), - "Bias input should be of float type"); - PADDLE_ENFORCE_EQ(bn_param_type, - framework::ToDataType(ctx.Input("Mean")->type()), - "Mean input should be of float type"); - PADDLE_ENFORCE_EQ(bn_param_type, framework::ToDataType( - ctx.Input("Variance")->type()), - "Variance input should be of float type"); - - // TODO(pzelazko-intel): enable MKLDNN layout when it's ready - framework::LibraryType library = framework::LibraryType::kPlain; - framework::DataLayout layout = framework::DataLayout::kAnyLayout; + // make sure Mean/MeanOut and Variance/VarianceOut share memory in Python + PADDLE_ENFORCE_EQ(ctx->Inputs("Mean")[0], ctx->Outputs("MeanOut")[0], + "Mean and MeanOut should share the same memory"); + PADDLE_ENFORCE_EQ(ctx->Inputs("Variance")[0], ctx->Outputs("VarianceOut")[0], + "Variance and VarianceOut should share the same memory"); + + const auto x_dims = ctx->GetInputDim("X"); + const DataLayout data_layout = framework::StringToDataLayout( + ctx->Attrs().Get("data_layout")); + + PADDLE_ENFORCE(x_dims.size() >= 2 && x_dims.size() <= 5, + "Input X must have 2 to 5 dimensions."); + + const int64_t C = + (data_layout == DataLayout::kNCHW ? x_dims[1] + : x_dims[x_dims.size() - 1]); + + PADDLE_ENFORCE_EQ(ctx->GetInputDim("Scale").size(), 1UL); + PADDLE_ENFORCE_EQ(ctx->GetInputDim("Scale")[0], C); + PADDLE_ENFORCE_EQ(ctx->GetInputDim("Bias").size(), 1UL); + PADDLE_ENFORCE_EQ(ctx->GetInputDim("Bias")[0], C); + + ctx->SetOutputDim("Y", x_dims); + ctx->SetOutputDim("MeanOut", {C}); + ctx->SetOutputDim("VarianceOut", {C}); + ctx->SetOutputDim("SavedMean", {C}); + ctx->SetOutputDim("SavedVariance", {C}); + ctx->ShareLoD("X", "Y"); +} + +framework::OpKernelType BatchNormOp::GetExpectedKernelType( + const framework::ExecutionContext &ctx) const { + auto input_data_type = ctx.Input("X")->type(); + // By default, the type of the scale, bias, mean, + // and var tensors should both be float. (For float or float16 input tensor) + // or double (For double input tensor). + auto bn_param_type = framework::proto::VarType::FP32; + if (input_data_type == framework::proto::VarType::FP64) { + bn_param_type = framework::proto::VarType::FP64; + } + PADDLE_ENFORCE_EQ(bn_param_type, ctx.Input("Scale")->type(), + "Scale input should be of float type"); + PADDLE_ENFORCE_EQ(bn_param_type, ctx.Input("Bias")->type(), + "Bias input should be of float type"); + PADDLE_ENFORCE_EQ(bn_param_type, ctx.Input("Mean")->type(), + "Mean input should be of float type"); + PADDLE_ENFORCE_EQ(bn_param_type, ctx.Input("Variance")->type(), + "Variance input should be of float type"); + + // TODO(pzelazko-intel): enable MKLDNN layout when it's ready + framework::LibraryType library = framework::LibraryType::kPlain; + framework::DataLayout layout = framework::DataLayout::kAnyLayout; #ifdef PADDLE_WITH_MKLDNN - if (library == framework::LibraryType::kPlain && - platform::CanMKLDNNBeUsed(ctx)) { - library = framework::LibraryType::kMKLDNN; - layout = framework::DataLayout::kMKLDNN; - } -#endif - - return framework::OpKernelType(input_data_type, ctx.GetPlace(), layout, - library); + if (library == framework::LibraryType::kPlain && + platform::CanMKLDNNBeUsed(ctx)) { + library = framework::LibraryType::kMKLDNN; + layout = framework::DataLayout::kMKLDNN; } -}; +#endif -class BatchNormOpMaker : public framework::OpProtoAndCheckerMaker { - public: - void Make() override { - AddAttr("is_test", "").SetDefault(false); - AddAttr("momentum", "").SetDefault(0.9); - AddAttr("epsilon", "") - .SetDefault(1e-5) - .AddCustomChecker([](const float &epsilon) { - PADDLE_ENFORCE(epsilon >= 0.0f && epsilon <= 0.001f, - "'epsilon' should be between 0.0 and 0.001."); - }); - AddAttr("data_layout", "").SetDefault("NCHW"); - AddInput("X", "The input tensor"); - AddInput("Scale", - "Scale is a 1-dimensional tensor of size C " - "that is applied to the output"); - AddInput("Bias", - "Bias is a 1-dimensional tensor of size C " - "that is applied to the output"); - AddInput("Mean", - "The global mean (for training) or " - "estimated mean (for testing)"); - AddInput("Variance", - "The global variance (for training) " - "or estimated Variance (for testing)"); - AddOutput("Y", "result after normalization").Reuse("X"); - AddOutput("MeanOut", - "Share memory with Mean. " - "Store the global mean when training") - .Reuse("Mean"); - AddOutput("VarianceOut", - "Share memory with Variance. " - "Store the global Variance when training") - .Reuse("Variance"); - AddOutput("SavedMean", - "Mean of the current mini batch, " - "will apply to output when training") - .AsIntermediate(); - AddOutput("SavedVariance", - "Variance of the current mini batch, " - "will apply to output when training") - .AsIntermediate(); - AddAttr("use_mkldnn", - "(bool, default false) Only used in mkldnn kernel") - .SetDefault(false); - AddAttr("fuse_with_relu", - "(bool, default false) Only used in mkldnn kernel") - .SetDefault(false); - AddComment(R"DOC( + return framework::OpKernelType(input_data_type, ctx.GetPlace(), layout, + library); +} + +void BatchNormOpMaker::Make() { + AddAttr("is_test", + "(bool, default false) Set to true for inference only, false " + "for training. Some layers may run faster when this is true.") + .SetDefault(false); + AddAttr("momentum", "").SetDefault(0.9); + AddAttr("epsilon", "") + .SetDefault(1e-5) + .AddCustomChecker([](const float &epsilon) { + PADDLE_ENFORCE(epsilon >= 0.0f && epsilon <= 0.001f, + "'epsilon' should be between 0.0 and 0.001."); + }); + AddAttr("data_layout", "").SetDefault("NCHW"); + AddInput("X", "The input tensor"); + AddInput("Scale", + "Scale is a 1-dimensional tensor of size C " + "that is applied to the output"); + AddInput("Bias", + "Bias is a 1-dimensional tensor of size C " + "that is applied to the output"); + AddInput("Mean", + "The global mean (for training) or " + "estimated mean (for testing)"); + AddInput("Variance", + "The global variance (for training) " + "or estimated Variance (for testing)"); + AddOutput("Y", "result after normalization"); + AddOutput("MeanOut", + "Share memory with Mean. " + "Store the global mean when training"); + AddOutput("VarianceOut", + "Share memory with Variance. " + "Store the global Variance when training"); + AddOutput("SavedMean", + "Mean of the current mini batch, " + "will apply to output when training") + .AsIntermediate(); + AddOutput("SavedVariance", + "Variance of the current mini batch, " + "will apply to output when training") + .AsIntermediate(); + AddAttr("use_mkldnn", + "(bool, default false) Only used in mkldnn kernel") + .SetDefault(false); + AddAttr("fuse_with_relu", + "(bool, default false) Only used in mkldnn kernel") + .SetDefault(false); + AddAttr("use_global_stats", + "(bool, default false) Whether to use global mean and " + "variance. In inference or test mode, set use_global_stats " + "to true or is_test true. the behavior is equivalent. " + "In train mode, when setting use_global_stats True, the " + "global mean and variance are also used during train time, " + "the BN acts as scaling and shiffting.") + .SetDefault(false); + AddComment(R"DOC( Batch Normalization. Batch Norm has been implemented as discussed in the paper: @@ -169,8 +178,7 @@ The required data format for this layer is one of the following: 2. NCHW `[batch, in_channels, in_height, in_width]` )DOC"); - } -}; +} template class BatchNormKernel @@ -180,6 +188,10 @@ class BatchNormKernel const float epsilon = ctx.Attr("epsilon"); const float momentum = ctx.Attr("momentum"); const bool is_test = ctx.Attr("is_test"); + const bool use_global_stats = ctx.Attr("use_global_stats"); + + bool global_stats = is_test || use_global_stats; + const std::string data_layout_str = ctx.Attr("data_layout"); const DataLayout data_layout = framework::StringToDataLayout(data_layout_str); @@ -207,7 +219,7 @@ class BatchNormKernel saved_mean->mutable_data(ctx.GetPlace()); saved_variance->mutable_data(ctx.GetPlace()); - if (!is_test) { + if (!global_stats) { // saved_xx is use just in this batch of data EigenVectorArrayMap saved_mean_e( saved_mean->mutable_data(ctx.GetPlace()), C); @@ -224,7 +236,7 @@ class BatchNormKernel if ((N * sample_size) == 1) { LOG(WARNING) << "Only 1 element in normalization dimension, " << "we skip the batch norm calculation, let y = x."; - framework::TensorCopySync(*x, ctx.GetPlace(), y); + framework::TensorCopy(*x, ctx.GetPlace(), y); return; } @@ -267,7 +279,7 @@ class BatchNormKernel // use SavedMean and SavedVariance to do normalize Eigen::Array inv_std(C); - if (is_test) { + if (global_stats) { ConstEigenVectorArrayMap var_arr( ctx.Input("Variance")->data(), C); inv_std = (var_arr + epsilon).sqrt().inverse(); @@ -279,8 +291,8 @@ class BatchNormKernel inv_std = saved_inv_std; } ConstEigenVectorArrayMap mean_arr( - is_test ? ctx.Input("Mean")->data() - : ctx.Output("SavedMean")->data(), + global_stats ? ctx.Input("Mean")->data() + : ctx.Output("SavedMean")->data(), C); // ((x - est_mean) * (inv_var) * scale + bias @@ -319,69 +331,75 @@ class BatchNormKernel } }; -class BatchNormGradOp : public framework::OperatorWithKernel { - public: - using framework::OperatorWithKernel::OperatorWithKernel; - - void InferShape(framework::InferShapeContext *ctx) const override { - // check input - PADDLE_ENFORCE(ctx->HasInput("X")); - PADDLE_ENFORCE(ctx->HasInput("Scale"), ""); - PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Y")), ""); - PADDLE_ENFORCE(ctx->HasInput("SavedMean"), ""); - PADDLE_ENFORCE(ctx->HasInput("SavedVariance"), ""); - - // check output - PADDLE_ENFORCE(ctx->HasOutput(framework::GradVarName("X")), ""); - PADDLE_ENFORCE(ctx->HasOutput(framework::GradVarName("Scale")), ""); - PADDLE_ENFORCE(ctx->HasOutput(framework::GradVarName("Bias")), ""); - - const auto x_dims = ctx->GetInputDim("X"); - const DataLayout data_layout = framework::StringToDataLayout( - ctx->Attrs().Get("data_layout")); - const int C = - (data_layout == DataLayout::kNCHW ? x_dims[1] - : x_dims[x_dims.size() - 1]); +void BatchNormGradOp::InferShape(framework::InferShapeContext *ctx) const { + // check input + PADDLE_ENFORCE(ctx->HasInput("X")); + PADDLE_ENFORCE(ctx->HasInput("Scale"), "Input(scale) should not be null."); + PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Y")), + "Input(Y@GRAD) should not be null."); + PADDLE_ENFORCE(ctx->HasInput("SavedMean"), + "Input(SavedMean) should not be null."); + PADDLE_ENFORCE(ctx->HasInput("SavedVariance"), + "Input(SavedVariance) should not be null"); + + // check output + PADDLE_ENFORCE(ctx->HasOutput(framework::GradVarName("X")), ""); + if (ctx->HasOutput(framework::GradVarName("Scale"))) { + PADDLE_ENFORCE(ctx->HasOutput(framework::GradVarName("Bias")), + "Output(Scale@GRAD) and Output(Bias@GRAD) should not be " + "null at same time"); + } + const bool use_global_stats = ctx->Attrs().Get("use_global_stats"); + if (use_global_stats) { + PADDLE_ENFORCE(!ctx->Attrs().Get("use_mkldnn"), + "Using global stats during training is not supported " + "in gradient op kernel of batch_norm_mkldnn_op now."); + } - ctx->SetOutputDim(framework::GradVarName("X"), x_dims); + const auto x_dims = ctx->GetInputDim("X"); + const DataLayout data_layout = framework::StringToDataLayout( + ctx->Attrs().Get("data_layout")); + const int C = (data_layout == DataLayout::kNCHW ? x_dims[1] + : x_dims[x_dims.size() - 1]); + + ctx->SetOutputDim(framework::GradVarName("X"), x_dims); + if (ctx->HasOutput(framework::GradVarName("Scale"))) { ctx->SetOutputDim(framework::GradVarName("Scale"), {C}); ctx->SetOutputDim(framework::GradVarName("Bias"), {C}); } +} - protected: - framework::OpKernelType GetExpectedKernelType( - const framework::ExecutionContext &ctx) const override { - const auto *var = ctx.InputVar(framework::GradVarName("Y")); - if (var == nullptr) { - PADDLE_THROW("can't find Y@GRAD"); - } - const Tensor *t = nullptr; - if (var->IsType()) { - t = &var->Get(); - } else if (var->IsType()) { - t = &var->Get(); - } - if (t == nullptr) { - PADDLE_THROW("can't find Y@GRAD"); - } +framework::OpKernelType BatchNormGradOp::GetExpectedKernelType( + const framework::ExecutionContext &ctx) const { + const auto *var = ctx.InputVar(framework::GradVarName("Y")); + if (var == nullptr) { + PADDLE_THROW("can't find Y@GRAD"); + } + const Tensor *t = nullptr; + if (var->IsType()) { + t = &var->Get(); + } else if (var->IsType()) { + t = &var->Get(); + } + if (t == nullptr) { + PADDLE_THROW("can't find Y@GRAD"); + } - // TODO(pzelazko-intel): enable MKLDNN layout when it's ready - framework::LibraryType library = framework::LibraryType::kPlain; - framework::DataLayout layout = framework::DataLayout::kAnyLayout; + // TODO(pzelazko-intel): enable MKLDNN layout when it's ready + framework::LibraryType library = framework::LibraryType::kPlain; + framework::DataLayout layout = framework::DataLayout::kAnyLayout; #ifdef PADDLE_WITH_MKLDNN - if (library == framework::LibraryType::kPlain && - platform::CanMKLDNNBeUsed(ctx)) { - library = framework::LibraryType::kMKLDNN; - layout = framework::DataLayout::kMKLDNN; - } + if (library == framework::LibraryType::kPlain && + platform::CanMKLDNNBeUsed(ctx)) { + library = framework::LibraryType::kMKLDNN; + layout = framework::DataLayout::kMKLDNN; + } #endif - return framework::OpKernelType( - framework::ToDataType(ctx.Input("X")->type()), ctx.GetPlace(), - layout, library); - } -}; + return framework::OpKernelType(ctx.Input("X")->type(), ctx.GetPlace(), + layout, library); +} template class BatchNormGradKernel @@ -395,6 +413,8 @@ class BatchNormGradKernel // SavedVariance have been reverted in forward operator const auto *saved_inv_variance = ctx.Input("SavedVariance"); const std::string data_layout_str = ctx.Attr("data_layout"); + const bool use_global_stats = ctx.Attr("use_global_stats"); + const float epsilon = ctx.Attr("epsilon"); const DataLayout data_layout = framework::StringToDataLayout(data_layout_str); @@ -409,38 +429,60 @@ class BatchNormGradKernel : x_dims[x_dims.size() - 1]); const int sample_size = x->numel() / N / C; - ConstEigenVectorArrayMap scale_arr(scale->data(), C); - ConstEigenVectorArrayMap mean_arr(saved_mean->data(), C); - ConstEigenVectorArrayMap inv_var_arr(saved_inv_variance->data(), C); - // init output auto *d_x = ctx.Output(framework::GradVarName("X")); auto *d_scale = ctx.Output(framework::GradVarName("Scale")); auto *d_bias = ctx.Output(framework::GradVarName("Bias")); d_x->mutable_data(ctx.GetPlace()); - d_scale->mutable_data(ctx.GetPlace()); - d_bias->mutable_data(ctx.GetPlace()); + + const T *mean_data = saved_mean->data(); + const T *inv_var_data = saved_inv_variance->data(); + Tensor inv_var_tensor; + if (use_global_stats) { + const auto *running_mean = ctx.Input("Mean"); + const auto *running_variance = ctx.Input("Variance"); + mean_data = running_mean->data(); + T *running_inv_var_data = inv_var_tensor.mutable_data(ctx.GetPlace()); + EigenVectorArrayMap inv_var_tmp(running_inv_var_data, C); + ConstEigenVectorArrayMap var_arr(running_variance->data(), C); + + inv_var_tmp = (var_arr + epsilon).sqrt().inverse().eval(); + inv_var_data = running_inv_var_data; + } + + ConstEigenVectorArrayMap scale_arr(scale->data(), C); + ConstEigenVectorArrayMap mean_arr(mean_data, C); + ConstEigenVectorArrayMap inv_var_arr(inv_var_data, C); + + T *d_bias_data = nullptr; + T *d_scale_data = nullptr; + if (d_scale && d_bias) { + d_scale->mutable_data(ctx.GetPlace()); + d_bias->mutable_data(ctx.GetPlace()); + d_bias_data = d_bias->mutable_data(ctx.GetPlace()); + d_scale_data = d_scale->mutable_data(ctx.GetPlace()); + } // d_bias = np.sum(d_y, axis=0) // d_scale = np.sum((X - mean) / inv_std * dy, axis=0) // d_x = (1. / N) * scale * inv_var * (N * d_y - np.sum(d_y, axis=0) // - (X - mean) * inv_var * inv_var * np.sum(d_y * (X - mean), axis=0)) + EigenVectorArrayMap d_bias_arr(d_bias_data, C); + EigenVectorArrayMap d_scale_arr(d_scale_data, C); - EigenVectorArrayMap d_bias_arr(d_bias->mutable_data(ctx.GetPlace()), - C); - EigenVectorArrayMap d_scale_arr(d_scale->mutable_data(ctx.GetPlace()), - C); - - d_bias_arr.setZero(); - d_scale_arr.setZero(); + if (d_scale && d_bias) { + d_bias_arr.setZero(); + d_scale_arr.setZero(); + } - if ((N * sample_size) == 1) { - framework::TensorCopySync(*d_y, ctx.GetPlace(), d_x); + if ((N * sample_size) == 1 && !use_global_stats) { + framework::TensorCopy(*d_y, ctx.GetPlace(), d_x); return; } - const auto scale_inv_var_nhw = scale_arr * inv_var_arr / (N * sample_size); + int scale_coefff = use_global_stats ? 1 : N * sample_size; + const auto scale_inv_var_nhw = scale_arr * inv_var_arr / scale_coefff; switch (data_layout) { case DataLayout::kNCHW: { @@ -450,19 +492,29 @@ class BatchNormGradKernel sample_size, N * C); d_x_arr.setZero(); - for (int nc = 0; nc < N * C; ++nc) { - int c = nc % C; - d_bias_arr(c) += d_y_arr.col(nc).sum(); - d_scale_arr(c) += - ((x_arr.col(nc) - mean_arr(c)) * inv_var_arr(c) * d_y_arr.col(nc)) - .sum(); + if (d_scale && d_bias) { + for (int nc = 0; nc < N * C; ++nc) { + int c = nc % C; + d_bias_arr(c) += d_y_arr.col(nc).sum(); + d_scale_arr(c) += ((x_arr.col(nc) - mean_arr(c)) * inv_var_arr(c) * + d_y_arr.col(nc)) + .sum(); + } } - for (int nc = 0; nc < N * C; ++nc) { - int c = nc % C; - d_x_arr.col(nc) += - scale_inv_var_nhw(c) * - (d_y_arr.col(nc) * N * sample_size - d_bias_arr(c) - - (x_arr.col(nc) - mean_arr[c]) * d_scale_arr(c) * inv_var_arr(c)); + if (!use_global_stats) { + for (int nc = 0; nc < N * C; ++nc) { + int c = nc % C; + d_x_arr.col(nc) += + scale_inv_var_nhw(c) * + (d_y_arr.col(nc) * N * sample_size - d_bias_arr(c) - + (x_arr.col(nc) - mean_arr[c]) * d_scale_arr(c) * + inv_var_arr(c)); + } + } else { + for (int nc = 0; nc < N * C; ++nc) { + int c = nc % C; + d_x_arr.col(nc) += scale_inv_var_nhw(c) * d_y_arr.col(nc); + } } break; } @@ -478,15 +530,27 @@ class BatchNormGradKernel const auto d_y_mul_x_minus_mean_row_sum = (d_y_arr * x_minus_mean).rowwise().sum(); const auto inv_var_sqr = inv_var_arr * inv_var_arr; - for (int nhw = 0; nhw < N * sample_size; ++nhw) { - d_bias_arr += d_y_arr.col(nhw); - d_scale_arr += - (x_arr.col(nhw) - mean_arr) * inv_var_arr * d_y_arr.col(nhw); - d_x_arr.col(nhw) += - scale_inv_var_nhw * - (d_y_arr.col(nhw) * N * sample_size - d_y_row_sum - - x_minus_mean.col(nhw) * inv_var_sqr * - d_y_mul_x_minus_mean_row_sum); + + if (d_scale && d_bias) { + for (int nhw = 0; nhw < N * sample_size; ++nhw) { + d_bias_arr += d_y_arr.col(nhw); + d_scale_arr += + (x_arr.col(nhw) - mean_arr) * inv_var_arr * d_y_arr.col(nhw); + } + } + + if (!use_global_stats) { + for (int nhw = 0; nhw < N * sample_size; ++nhw) { + d_x_arr.col(nhw) += + scale_inv_var_nhw * + (d_y_arr.col(nhw) * N * sample_size - d_y_row_sum - + x_minus_mean.col(nhw) * inv_var_sqr * + d_y_mul_x_minus_mean_row_sum); + } + } else { + for (int nhw = 0; nhw < N * sample_size; ++nhw) { + d_x_arr.col(nhw) += scale_inv_var_nhw * d_y_arr.col(nhw); + } } break; } @@ -496,29 +560,54 @@ class BatchNormGradKernel } }; -class BatchNormGradMaker : public framework::SingleGradOpDescMaker { - public: - using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; +std::unique_ptr BatchNormGradMaker::Apply() const { + auto *op = new framework::OpDesc(); + op->SetType(GradOpType()); + op->SetInput("X", Input("X")); + op->SetInput(framework::GradVarName("Y"), OutputGrad("Y")); + + op->SetInput("Scale", Input("Scale")); + op->SetInput("Bias", Input("Bias")); + op->SetInput("SavedMean", Output("SavedMean")); + op->SetInput("SavedVariance", Output("SavedVariance")); + + // used when setting use_global_stats True during training + if (boost::get(GetAttr("use_global_stats"))) { + op->SetInput("Mean", Output("MeanOut")); + op->SetInput("Variance", Output("VarianceOut")); + } - protected: - std::unique_ptr Apply() const override { - auto *op = new framework::OpDesc(); - op->SetType("batch_norm_grad"); - op->SetInput("X", Input("X")); - op->SetInput(framework::GradVarName("Y"), OutputGrad("Y")); + op->SetAttrMap(Attrs()); - op->SetInput("Scale", Input("Scale")); - op->SetInput("Bias", Input("Bias")); - op->SetInput("SavedMean", Output("SavedMean")); - op->SetInput("SavedVariance", Output("SavedVariance")); + op->SetOutput(framework::GradVarName("X"), InputGrad("X")); + op->SetOutput(framework::GradVarName("Scale"), InputGrad("Scale")); + op->SetOutput(framework::GradVarName("Bias"), InputGrad("Bias")); - op->SetAttrMap(Attrs()); + return std::unique_ptr(op); +} - op->SetOutput(framework::GradVarName("X"), InputGrad("X")); - op->SetOutput(framework::GradVarName("Scale"), InputGrad("Scale")); - op->SetOutput(framework::GradVarName("Bias"), InputGrad("Bias")); +class BatchNormInplaceInToOut : public framework::InplaceOpInference { + public: + std::unordered_map operator()( + const framework::OpDesc &op_desc) const override { + std::unordered_map inplace_in_to_out = { + {"Mean", "MeanOut"}, {"Variance", "VarianceOut"}, {"X", "Y"}, + }; + return inplace_in_to_out; + } +}; - return std::unique_ptr(op); +class BatchNormGradInplaceInToOut : public framework::InplaceOpInference { + public: + std::unordered_map operator()( + const framework::OpDesc &op_desc) const override { + std::unordered_map inplace_in_to_out = { + // Scale, Bias, SavedMean, SavedVariance shape is [batch_size, C] + {framework::GradVarName("Y"), framework::GradVarName("X")}, + {"SavedMean", framework::GradVarName("Scale")}, + {"SavedVariance", framework::GradVarName("Bias")}, + }; + return inplace_in_to_out; } }; @@ -527,8 +616,10 @@ class BatchNormGradMaker : public framework::SingleGradOpDescMaker { namespace ops = paddle::operators; REGISTER_OPERATOR(batch_norm, ops::BatchNormOp, ops::BatchNormOpMaker, - ops::BatchNormGradMaker); -REGISTER_OPERATOR(batch_norm_grad, ops::BatchNormGradOp); + ops::BatchNormOpInferVarType, ops::BatchNormGradMaker) +// ops::BatchNormInplaceInToOut); +REGISTER_OPERATOR(batch_norm_grad, ops::BatchNormGradOp) +// ops::BatchNormGradInplaceInToOut); REGISTER_OP_CPU_KERNEL( batch_norm, ops::BatchNormKernel, diff --git a/paddle/fluid/operators/batch_norm_op.cu b/paddle/fluid/operators/batch_norm_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..f8baf082597d6152257e2ea74f14b6903a7be332 --- /dev/null +++ b/paddle/fluid/operators/batch_norm_op.cu @@ -0,0 +1,418 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include +#include +#include "cub/cub.cuh" +#include "paddle/fluid/framework/data_layout.h" +#include "paddle/fluid/operators/batch_norm_op.h" +#include "paddle/fluid/operators/math/math_function.h" +#include "paddle/fluid/platform/cudnn_helper.h" +#include "paddle/fluid/platform/float16.h" + +// CUDNN_BATCHNORM_SPATIAL_PERSISTENT in batchnorm. This mode can be faster in +// some tasks because an optimized path may be selected for CUDNN_DATA_FLOAT +// and CUDNN_DATA_HALF data types, compute capability 6.0 or higher. The +// reason we set it to false by default is that this mode may use scaled +// atomic integer reduction that may cause a numerical overflow for certain +// input data range. +DEFINE_bool(cudnn_batchnorm_spatial_persistent, false, + "Whether enable CUDNN_BATCHNORM_SPATIAL_PERSISTENT mode for cudnn " + "batch_norm, defalut is False."); + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; +using DataLayout = framework::DataLayout; +template +using CudnnDataType = platform::CudnnDataType; +template +using BatchNormParamType = typename CudnnDataType::BatchNormParamType; + +template +class BatchNormKernel + : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext &ctx) const override { + PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), + "It must use CUDAPlace."); + double epsilon = static_cast(ctx.Attr("epsilon")); + const float momentum = ctx.Attr("momentum"); + const bool is_test = ctx.Attr("is_test"); + const bool use_global_stats = ctx.Attr("use_global_stats"); + const std::string data_layout_str = ctx.Attr("data_layout"); + const DataLayout data_layout = + framework::StringToDataLayout(data_layout_str); + + // Get the size for each dimension. + // NCHW [batch_size, in_channels, in_height, in_width] + const auto *x = ctx.Input("X"); + const auto &x_dims = x->dims(); + PADDLE_ENFORCE(x_dims.size() >= 2 && x_dims.size() <= 5, + "The Input dim size should be between 2 and 5"); + int N, C, H, W, D; + ExtractNCWHD(x_dims, data_layout, &N, &C, &H, &W, &D); + + auto *y = ctx.Output("Y"); + y->mutable_data(ctx.GetPlace()); + + // ------------------- cudnn descriptors --------------------- + cudnnTensorDescriptor_t data_desc_; + cudnnTensorDescriptor_t bn_param_desc_; + cudnnBatchNormMode_t mode_; + + CUDNN_ENFORCE(platform::dynload::cudnnCreateTensorDescriptor(&data_desc_)); + CUDNN_ENFORCE( + platform::dynload::cudnnCreateTensorDescriptor(&bn_param_desc_)); + + if (epsilon <= CUDNN_BN_MIN_EPSILON - FLT_EPSILON) { + LOG(ERROR) << "Provided epsilon is smaller than " + << "CUDNN_BN_MIN_EPSILON. Setting it to " + << "CUDNN_BN_MIN_EPSILON instead."; + } + epsilon = std::max(epsilon, CUDNN_BN_MIN_EPSILON); +#if CUDNN_VERSION_MIN(7, 0, 0) + if (FLAGS_cudnn_batchnorm_spatial_persistent) { + mode_ = CUDNN_BATCHNORM_SPATIAL_PERSISTENT; + } else { + mode_ = CUDNN_BATCHNORM_SPATIAL; + } +#else + mode_ = CUDNN_BATCHNORM_SPATIAL; +#endif + + VLOG(3) << "Setting descriptors."; + std::vector dims; + std::vector strides; + if (data_layout == DataLayout::kNCHW) { + dims = {N, C, H, W, D}; + strides = {C * H * W * D, H * W * D, W * D, D, 1}; + } else { + dims = {N, C, H, W, D}; + strides = {H * W * D * C, 1, W * D * C, D * C, C}; + } + CUDNN_ENFORCE(platform::dynload::cudnnSetTensorNdDescriptor( + data_desc_, CudnnDataType::type, + x_dims.size() > 3 ? x_dims.size() : 4, dims.data(), strides.data())); + // Note: PERSISTENT not implemented for inference + CUDNN_ENFORCE(platform::dynload::cudnnDeriveBNTensorDescriptor( + bn_param_desc_, data_desc_, is_test ? CUDNN_BATCHNORM_SPATIAL : mode_)); + + const auto *scale = ctx.Input("Scale"); + const auto *bias = ctx.Input("Bias"); + + auto &dev_ctx = ctx.template device_context(); + + auto handle = dev_ctx.cudnn_handle(); + + // Now, depending on whether we are running test or not, we have two paths. + if (is_test || use_global_stats) { + // only when test we use input to do computation. + const auto *est_mean = ctx.Input("Mean"); + const auto *est_var = ctx.Input("Variance"); + // Run inference mode. + PADDLE_ENFORCE_EQ(est_mean->dims().size(), 1UL); + PADDLE_ENFORCE_EQ(est_var->dims().size(), 1UL); + PADDLE_ENFORCE_EQ(est_mean->dims()[0], C); + PADDLE_ENFORCE_EQ(est_var->dims()[0], C); + + CUDNN_ENFORCE(platform::dynload::cudnnBatchNormalizationForwardInference( + handle, + // Note: PERSISTENT not implemented for inference + CUDNN_BATCHNORM_SPATIAL, CudnnDataType::kOne(), + CudnnDataType::kZero(), data_desc_, x->template data(), + data_desc_, y->template mutable_data(ctx.GetPlace()), + bn_param_desc_, scale->template data>(), + bias->template data>(), + est_mean->template data>(), + est_var->template data>(), epsilon)); + } else { + // Run training mode. + // obtain running mean and running inv var, and see if we need to + // initialize them. + + auto *mean_out = ctx.Output("MeanOut"); + auto *variance_out = ctx.Output("VarianceOut"); + mean_out->mutable_data>(ctx.GetPlace()); + variance_out->mutable_data>(ctx.GetPlace()); + + auto *saved_mean = ctx.Output("SavedMean"); + auto *saved_variance = ctx.Output("SavedVariance"); + saved_mean->mutable_data>(ctx.GetPlace()); + saved_variance->mutable_data>(ctx.GetPlace()); + math::SetConstant> + functor; + functor(dev_ctx, saved_mean, static_cast>(0)); + functor(dev_ctx, saved_variance, static_cast>(0)); + + if ((N * H * W * D) == 1) { + LOG(WARNING) << "Only 1 element in normalization dimension, " + << "we skip the batch norm calculation, let y = x."; + framework::TensorCopy(*x, ctx.GetPlace(), y); + } else { + double this_factor = 1. - momentum; + + CUDNN_ENFORCE(platform::dynload::cudnnBatchNormalizationForwardTraining( + handle, mode_, CudnnDataType::kOne(), CudnnDataType::kZero(), + data_desc_, x->template data(), data_desc_, + y->template mutable_data(ctx.GetPlace()), bn_param_desc_, + scale->template data>(), + bias->template data>(), this_factor, + mean_out->template mutable_data>( + ctx.GetPlace()), + variance_out->template mutable_data>( + ctx.GetPlace()), + epsilon, saved_mean->template mutable_data>( + ctx.GetPlace()), + saved_variance->template mutable_data>( + ctx.GetPlace()))); + } + } + + // clean when exit. + CUDNN_ENFORCE(platform::dynload::cudnnDestroyTensorDescriptor(data_desc_)); + CUDNN_ENFORCE( + platform::dynload::cudnnDestroyTensorDescriptor(bn_param_desc_)); + } +}; + +template +static __global__ void KeBNBackwardScaleBias( + const T *dy, const T *x, const BatchNormParamType *mean, + const BatchNormParamType *variance, const double epsilon, const int N, + const int C, const int HxW, BatchNormParamType *dscale, + BatchNormParamType *dbias) { + const int outer_size = C; + const int inner_size = N * HxW; + typedef cub::BlockReduce, BlockDim> BlockReduce; + __shared__ typename BlockReduce::TempStorage ds_storage; + __shared__ typename BlockReduce::TempStorage db_storage; + + for (int i = blockIdx.x; i < outer_size; i += gridDim.x) { + BatchNormParamType ds_sum = static_cast>(0); + BatchNormParamType db_sum = static_cast>(0); + + BatchNormParamType inv_var_i = 1.0 / sqrt(variance[i] + epsilon); + BatchNormParamType mean_i = mean[i]; + for (int j = threadIdx.x; j < inner_size; j += blockDim.x) { + const int index = layout == framework::DataLayout::kNCHW + ? (j / HxW * C + i) * HxW + j % HxW + : j * outer_size + i; + ds_sum += static_cast>(dy[index]) * + (static_cast>(x[index]) - mean_i); + db_sum += static_cast>(dy[index]); + } + ds_sum = BlockReduce(ds_storage).Reduce(ds_sum, cub::Sum()); + db_sum = BlockReduce(db_storage).Reduce(db_sum, cub::Sum()); + if (threadIdx.x == 0) { + dscale[i] = ds_sum * inv_var_i; + dbias[i] = db_sum; + } + __syncthreads(); + } +} + +template +static __global__ void KeBNBackwardData(const T *dy, + const BatchNormParamType *scale, + const BatchNormParamType *variance, + const double epsilon, const int C, + const int HxW, const int num, T *dx) { + int gid = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + for (int i = gid; i < num; i += stride) { + const int c = layout == framework::DataLayout::kNCHW ? i / HxW % C : i % C; + BatchNormParamType inv_var = 1.0 / sqrt(variance[c] + epsilon); + dx[i] = static_cast(static_cast>(dy[i]) * + scale[c] * inv_var); + } +} + +template +class BatchNormGradKernel + : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext &ctx) const override { + PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), + "It must use CUDAPlace."); + double epsilon = static_cast(ctx.Attr("epsilon")); + const std::string data_layout_str = ctx.Attr("data_layout"); + const bool use_global_stats = ctx.Attr("use_global_stats"); + + const DataLayout data_layout = + framework::StringToDataLayout(data_layout_str); + const auto *x = ctx.Input("X"); + const auto *d_y = ctx.Input(framework::GradVarName("Y")); + const auto *scale = ctx.Input("Scale"); + + const auto &x_dims = x->dims(); + + PADDLE_ENFORCE(x_dims.size() >= 2 && x_dims.size() <= 5, + "The Input dim size should be between 2 and 5"); + int N, C, H, W, D; + ExtractNCWHD(x_dims, data_layout, &N, &C, &H, &W, &D); + + // init output + auto *d_x = ctx.Output(framework::GradVarName("X")); + auto *d_scale = ctx.Output(framework::GradVarName("Scale")); + auto *d_bias = ctx.Output(framework::GradVarName("Bias")); + + d_x->mutable_data(ctx.GetPlace()); + if (d_scale && d_bias) { + d_scale->mutable_data>(ctx.GetPlace()); + d_bias->mutable_data>(ctx.GetPlace()); + } + PADDLE_ENFORCE_EQ(scale->dims().size(), 1UL); + PADDLE_ENFORCE_EQ(scale->dims()[0], C); + + std::vector dims; + std::vector strides; + if (data_layout == DataLayout::kNCHW) { + dims = {N, C, H, W, D}; + strides = {C * H * W * D, H * W * D, W * D, D, 1}; + } else { + dims = {N, C, H, W, D}; + strides = {H * W * C * D, 1, W * D * C, D * C, C}; + } + + auto &dev_ctx = ctx.template device_context(); + if (!use_global_stats) { + if ((N * H * W * D) == 1) { + framework::TensorCopy(*d_y, ctx.GetPlace(), d_x); + math::SetConstant> + functor; + functor(dev_ctx, d_scale, static_cast>(0)); + functor(dev_ctx, d_bias, static_cast>(0)); + return; + } + + // ------------------- cudnn descriptors --------------------- + cudnnTensorDescriptor_t data_desc_; + cudnnTensorDescriptor_t bn_param_desc_; + cudnnBatchNormMode_t mode_; + + CUDNN_ENFORCE( + platform::dynload::cudnnCreateTensorDescriptor(&data_desc_)); + CUDNN_ENFORCE( + platform::dynload::cudnnCreateTensorDescriptor(&bn_param_desc_)); + if (epsilon <= CUDNN_BN_MIN_EPSILON - FLT_EPSILON) { + LOG(ERROR) << "Provided epsilon is smaller than " + << "CUDNN_BN_MIN_EPSILON. Setting it to " + << "CUDNN_BN_MIN_EPSILON instead."; + } + epsilon = std::max(epsilon, CUDNN_BN_MIN_EPSILON); +#if CUDNN_VERSION_MIN(7, 0, 0) + if (FLAGS_cudnn_batchnorm_spatial_persistent) { + mode_ = CUDNN_BATCHNORM_SPATIAL_PERSISTENT; + } else { + mode_ = CUDNN_BATCHNORM_SPATIAL; + } +#else + mode_ = CUDNN_BATCHNORM_SPATIAL; +#endif + + CUDNN_ENFORCE(platform::dynload::cudnnSetTensorNdDescriptor( + data_desc_, CudnnDataType::type, + x_dims.size() > 3 ? x_dims.size() : 4, dims.data(), strides.data())); + CUDNN_ENFORCE(platform::dynload::cudnnDeriveBNTensorDescriptor( + bn_param_desc_, data_desc_, mode_)); + + const auto *saved_mean = ctx.Input("SavedMean"); + const auto *saved_var = ctx.Input("SavedVariance"); + const void *saved_mean_data = + saved_mean->template data>(); + const void *saved_var_data = + saved_var->template data>(); + + CUDNN_ENFORCE(platform::dynload::cudnnBatchNormalizationBackward( + dev_ctx.cudnn_handle(), mode_, CudnnDataType::kOne(), + CudnnDataType::kZero(), CudnnDataType::kOne(), + CudnnDataType::kZero(), data_desc_, x->template data(), + data_desc_, d_y->template data(), data_desc_, + d_x->template mutable_data(ctx.GetPlace()), bn_param_desc_, + scale->template data>(), + d_scale->template mutable_data>(ctx.GetPlace()), + d_bias->template mutable_data>(ctx.GetPlace()), + epsilon, saved_mean_data, saved_var_data)); + + // clean when exit. + CUDNN_ENFORCE( + platform::dynload::cudnnDestroyTensorDescriptor(data_desc_)); + CUDNN_ENFORCE( + platform::dynload::cudnnDestroyTensorDescriptor(bn_param_desc_)); + } else { + const auto *running_mean = ctx.Input("Mean"); + const auto *running_var = ctx.Input("Variance"); + + const auto *running_mean_data = + running_mean->template data>(); + const auto *running_var_data = + running_var->template data>(); + + const int num = x->numel(); + const int block = 512; + int max_threads = dev_ctx.GetMaxPhysicalThreadCount(); + const int max_blocks = std::max(max_threads / block, 1); + int grid1 = (num + block - 1) / block; + int grid2 = std::min(C, max_blocks); + + if (data_layout == framework::DataLayout::kNCHW) { + if (d_x) { + KeBNBackwardData<<< + grid1, block, 0, dev_ctx.stream()>>>( + d_y->data(), scale->data>(), + running_var_data, epsilon, C, H * W, num, d_x->data()); + } + if (d_scale && d_bias) { + KeBNBackwardScaleBias<<< + grid2, block, 0, dev_ctx.stream()>>>( + d_y->data(), x->data(), running_mean_data, running_var_data, + epsilon, N, C, H * W * D, d_scale->data>(), + d_bias->data>()); + } + } else { + if (d_x) { + KeBNBackwardData<<< + grid1, block, 0, dev_ctx.stream()>>>( + d_y->data(), scale->data>(), + running_var_data, epsilon, C, H * W, num, d_x->data()); + } + if (d_scale && d_bias) { + KeBNBackwardScaleBias<<< + grid2, block, 0, dev_ctx.stream()>>>( + d_y->data(), x->data(), running_mean_data, running_var_data, + epsilon, N, C, H * W * D, d_scale->data>(), + d_bias->data>()); + } + } + } + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +namespace plat = paddle::platform; +REGISTER_OP_CUDA_KERNEL( + batch_norm, ops::BatchNormKernel, + ops::BatchNormKernel, + ops::BatchNormKernel); +REGISTER_OP_CUDA_KERNEL( + batch_norm_grad, ops::BatchNormGradKernel, + ops::BatchNormGradKernel, + ops::BatchNormGradKernel); diff --git a/paddle/fluid/operators/batch_norm_op.cu.cc b/paddle/fluid/operators/batch_norm_op.cu.cc deleted file mode 100644 index ca6cd8669352fd5814f25a04433ca97fe4abe9ff..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/batch_norm_op.cu.cc +++ /dev/null @@ -1,307 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/operators/batch_norm_op.h" -#include -#include "paddle/fluid/framework/data_layout.h" -#include "paddle/fluid/operators/math/math_function.h" -#include "paddle/fluid/platform/cudnn_helper.h" -#include "paddle/fluid/platform/float16.h" - -namespace paddle { -namespace operators { - -using Tensor = framework::Tensor; -using DataLayout = framework::DataLayout; -template -using CudnnDataType = platform::CudnnDataType; -template -using BatchNormParamType = typename CudnnDataType::BatchNormParamType; - -void ExtractNCWHD(const framework::DDim &dims, const DataLayout &data_layout, - int *N, int *C, int *H, int *W, int *D) { - *N = dims[0]; - if (dims.size() == 2) { - *C = dims[1]; - *H = 1; - *W = 1; - *D = 1; - } else { - *C = data_layout == DataLayout::kNCHW ? dims[1] : dims[dims.size() - 1]; - *H = data_layout == DataLayout::kNCHW ? dims[2] : dims[1]; - *W = dims.size() > 3 - ? (data_layout == DataLayout::kNCHW ? dims[3] : dims[2]) - : 1; - *D = dims.size() > 4 - ? (data_layout == DataLayout::kNCHW ? dims[4] : dims[3]) - : 1; - } -} - -template -class BatchNormKernel - : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext &ctx) const override { - PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), - "It must use CUDAPlace."); - double epsilon = static_cast(ctx.Attr("epsilon")); - const float momentum = ctx.Attr("momentum"); - const bool is_test = ctx.Attr("is_test"); - const std::string data_layout_str = ctx.Attr("data_layout"); - const DataLayout data_layout = - framework::StringToDataLayout(data_layout_str); - - // Get the size for each dimension. - // NCHW [batch_size, in_channels, in_height, in_width] - const auto *x = ctx.Input("X"); - const auto &x_dims = x->dims(); - PADDLE_ENFORCE(x_dims.size() >= 2 && x_dims.size() <= 5, - "The Input dim size should be between 2 and 5"); - int N, C, H, W, D; - ExtractNCWHD(x_dims, data_layout, &N, &C, &H, &W, &D); - - auto *y = ctx.Output("Y"); - y->mutable_data(ctx.GetPlace()); - - // ------------------- cudnn descriptors --------------------- - cudnnTensorDescriptor_t data_desc_; - cudnnTensorDescriptor_t bn_param_desc_; - cudnnBatchNormMode_t mode_; - - CUDNN_ENFORCE(platform::dynload::cudnnCreateTensorDescriptor(&data_desc_)); - CUDNN_ENFORCE( - platform::dynload::cudnnCreateTensorDescriptor(&bn_param_desc_)); - - if (epsilon <= CUDNN_BN_MIN_EPSILON - FLT_EPSILON) { - LOG(ERROR) << "Provided epsilon is smaller than " - << "CUDNN_BN_MIN_EPSILON. Setting it to " - << "CUDNN_BN_MIN_EPSILON instead."; - } - epsilon = std::max(epsilon, CUDNN_BN_MIN_EPSILON); -#if CUDNN_VERSION_MIN(7, 0, 0) - mode_ = CUDNN_BATCHNORM_SPATIAL_PERSISTENT; -#else - mode_ = CUDNN_BATCHNORM_SPATIAL; -#endif - - VLOG(3) << "Setting descriptors."; - std::vector dims; - std::vector strides; - if (data_layout == DataLayout::kNCHW) { - dims = {N, C, H, W, D}; - strides = {C * H * W * D, H * W * D, W * D, D, 1}; - } else { - dims = {N, C, H, W, D}; - strides = {H * W * D * C, 1, W * D * C, D * C, C}; - } - CUDNN_ENFORCE(platform::dynload::cudnnSetTensorNdDescriptor( - data_desc_, CudnnDataType::type, - x_dims.size() > 3 ? x_dims.size() : 4, dims.data(), strides.data())); - // Note: PERSISTENT not implemented for inference - CUDNN_ENFORCE(platform::dynload::cudnnDeriveBNTensorDescriptor( - bn_param_desc_, data_desc_, is_test ? CUDNN_BATCHNORM_SPATIAL : mode_)); - - const auto *scale = ctx.Input("Scale"); - const auto *bias = ctx.Input("Bias"); - - auto &dev_ctx = ctx.template device_context(); - - auto handle = dev_ctx.cudnn_handle(); - - // Now, depending on whether we are running test or not, we have two paths. - if (is_test) { - // only when test we use input to do computation. - const auto *est_mean = ctx.Input("Mean"); - const auto *est_var = ctx.Input("Variance"); - // Run inference mode. - PADDLE_ENFORCE_EQ(est_mean->dims().size(), 1UL); - PADDLE_ENFORCE_EQ(est_var->dims().size(), 1UL); - PADDLE_ENFORCE_EQ(est_mean->dims()[0], C); - PADDLE_ENFORCE_EQ(est_var->dims()[0], C); - - CUDNN_ENFORCE(platform::dynload::cudnnBatchNormalizationForwardInference( - handle, - // Note: PERSISTENT not implemented for inference - CUDNN_BATCHNORM_SPATIAL, CudnnDataType::kOne(), - CudnnDataType::kZero(), data_desc_, x->template data(), - data_desc_, y->template mutable_data(ctx.GetPlace()), - bn_param_desc_, scale->template data>(), - bias->template data>(), - est_mean->template data>(), - est_var->template data>(), epsilon)); - } else { - // Run training mode. - // obtain running mean and running inv var, and see if we need to - // initialize them. - - auto *mean_out = ctx.Output("MeanOut"); - auto *variance_out = ctx.Output("VarianceOut"); - mean_out->mutable_data>(ctx.GetPlace()); - variance_out->mutable_data>(ctx.GetPlace()); - - auto *saved_mean = ctx.Output("SavedMean"); - auto *saved_variance = ctx.Output("SavedVariance"); - saved_mean->mutable_data>(ctx.GetPlace()); - saved_variance->mutable_data>(ctx.GetPlace()); - math::SetConstant> - functor; - functor(dev_ctx, saved_mean, static_cast>(0)); - functor(dev_ctx, saved_variance, static_cast>(0)); - - if ((N * H * W * D) == 1) { - LOG(WARNING) << "Only 1 element in normalization dimension, " - << "we skip the batch norm calculation, let y = x."; - framework::TensorCopySync(*x, ctx.GetPlace(), y); - } else { - double this_factor = 1. - momentum; - - CUDNN_ENFORCE(platform::dynload::cudnnBatchNormalizationForwardTraining( - handle, mode_, CudnnDataType::kOne(), CudnnDataType::kZero(), - data_desc_, x->template data(), data_desc_, - y->template mutable_data(ctx.GetPlace()), bn_param_desc_, - scale->template data>(), - bias->template data>(), this_factor, - mean_out->template mutable_data>( - ctx.GetPlace()), - variance_out->template mutable_data>( - ctx.GetPlace()), - epsilon, saved_mean->template mutable_data>( - ctx.GetPlace()), - saved_variance->template mutable_data>( - ctx.GetPlace()))); - } - } - - // clean when exit. - CUDNN_ENFORCE(platform::dynload::cudnnDestroyTensorDescriptor(data_desc_)); - CUDNN_ENFORCE( - platform::dynload::cudnnDestroyTensorDescriptor(bn_param_desc_)); - } -}; - -template -class BatchNormGradKernel - : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext &ctx) const override { - PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), - "It must use CUDAPlace."); - double epsilon = static_cast(ctx.Attr("epsilon")); - const std::string data_layout_str = ctx.Attr("data_layout"); - const DataLayout data_layout = - framework::StringToDataLayout(data_layout_str); - const auto *x = ctx.Input("X"); - const auto *d_y = ctx.Input(framework::GradVarName("Y")); - const auto *scale = ctx.Input("Scale"); - - const auto &x_dims = x->dims(); - - PADDLE_ENFORCE(x_dims.size() >= 2 && x_dims.size() <= 5, - "The Input dim size should be between 2 and 5"); - int N, C, H, W, D; - ExtractNCWHD(x_dims, data_layout, &N, &C, &H, &W, &D); - - // init output - auto *d_x = ctx.Output(framework::GradVarName("X")); - auto *d_scale = ctx.Output(framework::GradVarName("Scale")); - auto *d_bias = ctx.Output(framework::GradVarName("Bias")); - - d_x->mutable_data(ctx.GetPlace()); - d_scale->mutable_data(ctx.GetPlace()); - d_bias->mutable_data(ctx.GetPlace()); - - auto &dev_ctx = ctx.template device_context(); - if ((N * H * W * D) == 1) { - framework::TensorCopySync(*d_y, ctx.GetPlace(), d_x); - math::SetConstant> - functor; - functor(dev_ctx, d_scale, static_cast>(0)); - functor(dev_ctx, d_bias, static_cast>(0)); - return; - } - - PADDLE_ENFORCE_EQ(scale->dims().size(), 1UL); - PADDLE_ENFORCE_EQ(scale->dims()[0], C); - - // ------------------- cudnn descriptors --------------------- - cudnnTensorDescriptor_t data_desc_; - cudnnTensorDescriptor_t bn_param_desc_; - cudnnBatchNormMode_t mode_; - - CUDNN_ENFORCE(platform::dynload::cudnnCreateTensorDescriptor(&data_desc_)); - CUDNN_ENFORCE( - platform::dynload::cudnnCreateTensorDescriptor(&bn_param_desc_)); - if (epsilon <= CUDNN_BN_MIN_EPSILON - FLT_EPSILON) { - LOG(ERROR) << "Provided epsilon is smaller than " - << "CUDNN_BN_MIN_EPSILON. Setting it to " - << "CUDNN_BN_MIN_EPSILON instead."; - } - epsilon = std::max(epsilon, CUDNN_BN_MIN_EPSILON); -#if CUDNN_VERSION_MIN(7, 0, 0) - mode_ = CUDNN_BATCHNORM_SPATIAL_PERSISTENT; -#else - mode_ = CUDNN_BATCHNORM_SPATIAL; -#endif - - std::vector dims; - std::vector strides; - if (data_layout == DataLayout::kNCHW) { - dims = {N, C, H, W, D}; - strides = {C * H * W * D, H * W * D, W * D, D, 1}; - } else { - dims = {N, C, H, W, D}; - strides = {H * W * C * D, 1, W * D * C, D * C, C}; - } - CUDNN_ENFORCE(platform::dynload::cudnnSetTensorNdDescriptor( - data_desc_, CudnnDataType::type, - x_dims.size() > 3 ? x_dims.size() : 4, dims.data(), strides.data())); - CUDNN_ENFORCE(platform::dynload::cudnnDeriveBNTensorDescriptor( - bn_param_desc_, data_desc_, mode_)); - - const auto *saved_mean = ctx.Input("SavedMean"); - const auto *saved_var = ctx.Input("SavedVariance"); - const void *saved_mean_data = saved_mean->template data(); - const void *saved_var_data = saved_var->template data(); - - CUDNN_ENFORCE(platform::dynload::cudnnBatchNormalizationBackward( - dev_ctx.cudnn_handle(), mode_, CudnnDataType::kOne(), - CudnnDataType::kZero(), CudnnDataType::kOne(), - CudnnDataType::kZero(), data_desc_, x->template data(), - data_desc_, d_y->template data(), data_desc_, - d_x->template mutable_data(ctx.GetPlace()), bn_param_desc_, - scale->template data(), - d_scale->template mutable_data(ctx.GetPlace()), - d_bias->template mutable_data(ctx.GetPlace()), epsilon, - saved_mean_data, saved_var_data)); - - // clean when exit. - CUDNN_ENFORCE(platform::dynload::cudnnDestroyTensorDescriptor(data_desc_)); - CUDNN_ENFORCE( - platform::dynload::cudnnDestroyTensorDescriptor(bn_param_desc_)); - } -}; - -} // namespace operators -} // namespace paddle - -namespace ops = paddle::operators; -namespace plat = paddle::platform; -REGISTER_OP_CUDA_KERNEL( - batch_norm, ops::BatchNormKernel, - ops::BatchNormKernel, - ops::BatchNormKernel); -REGISTER_OP_CUDA_KERNEL( - batch_norm_grad, ops::BatchNormGradKernel, - ops::BatchNormGradKernel); diff --git a/paddle/fluid/operators/batch_norm_op.h b/paddle/fluid/operators/batch_norm_op.h index 5e3d630d6889e445c5e84fa836d2d81bb7266779..6e89d73eb236ee7844c7de3c273e0b0f275a3e33 100644 --- a/paddle/fluid/operators/batch_norm_op.h +++ b/paddle/fluid/operators/batch_norm_op.h @@ -13,6 +13,9 @@ See the License for the specific language governing permissions and limitations under the License. */ #pragma once +#include +#include +#include #include "paddle/fluid/framework/eigen.h" #include "paddle/fluid/framework/op_registry.h" @@ -35,17 +38,84 @@ template using ConstEigenVectorArrayMap = Eigen::Map>; +class BatchNormOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + void InferShape(framework::InferShapeContext *ctx) const override; + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext &ctx) const override; +}; + +class BatchNormGradOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + void InferShape(framework::InferShapeContext *ctx) const override; + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext &ctx) const override; +}; + +class BatchNormOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override; +}; + +class BatchNormGradMaker : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + protected: + std::unique_ptr Apply() const override; + + virtual std::string GradOpType() const { + return this->ForwardOpType() + "_grad"; + } +}; + +class BatchNormOpInferVarType + : public framework::PassInDtypeAndVarTypeToOutput { + protected: + std::unordered_map GetInputOutputWithSameType() + const override { + return std::unordered_map{{"X", /*->*/ "Y"}}; + } +}; + template class BatchNormKernel : public framework::OpKernel { public: - void Compute(const framework::ExecutionContext& ctx) const override; + void Compute(const framework::ExecutionContext &ctx) const override; }; template class BatchNormGradKernel : public framework::OpKernel { public: - void Compute(const framework::ExecutionContext& ctx) const override; + void Compute(const framework::ExecutionContext &ctx) const override; }; +inline void ExtractNCWHD(const framework::DDim &dims, + const DataLayout &data_layout, int *N, int *C, int *H, + int *W, int *D) { + *N = dims[0]; + if (dims.size() == 2) { + *C = dims[1]; + *H = 1; + *W = 1; + *D = 1; + } else { + *C = data_layout == DataLayout::kNCHW ? dims[1] : dims[dims.size() - 1]; + *H = data_layout == DataLayout::kNCHW ? dims[2] : dims[1]; + *W = dims.size() > 3 + ? (data_layout == DataLayout::kNCHW ? dims[3] : dims[2]) + : 1; + *D = dims.size() > 4 + ? (data_layout == DataLayout::kNCHW ? dims[4] : dims[3]) + : 1; + } +} + } // namespace operators } // namespace paddle diff --git a/paddle/fluid/operators/batch_size_like.h b/paddle/fluid/operators/batch_size_like.h index fc15d56891cf7af10a91ca22a09c84fa2e52d465..7e2740f148f1d273310f44ed4a35d413e7201394 100644 --- a/paddle/fluid/operators/batch_size_like.h +++ b/paddle/fluid/operators/batch_size_like.h @@ -74,5 +74,8 @@ class BatchSizeLikeOpMaker : public framework::OpProtoAndCheckerMaker { virtual void Apply() = 0; }; +DECLARE_NO_NEED_BUFFER_VARS_INFERENCE(BatchSizeLikeNoNeedBufferVarsInference, + "Input"); + } // namespace operators } // namespace paddle diff --git a/paddle/fluid/operators/beam_search_decode_op.cc b/paddle/fluid/operators/beam_search_decode_op.cc index b6cb935814e25b31d4104f9ce24fe952680cb491..4cef49280dfb5207a9d94df42d94657f03ec838f 100644 --- a/paddle/fluid/operators/beam_search_decode_op.cc +++ b/paddle/fluid/operators/beam_search_decode_op.cc @@ -79,6 +79,9 @@ struct BeamSearchDecodeFunctor { bool tensor_on_gpu_; size_t beam_size_; int end_id_; + // TODO(Superjomn) Here might result serious performance issue in the + // concurrency + // scenarios. const LoDTensorArray& step_ids_origin_; const LoDTensorArray& step_scores_origin_; LoDTensorArray step_ids_ = LoDTensorArray(); @@ -119,7 +122,8 @@ class BeamSearchDecodeOp : public framework::OperatorBase { platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance(); auto& dev_ctx = *pool.Get(dev_place); - framework::ExecutionContext ctx(*this, scope, dev_ctx); + framework::RuntimeContext run_ctx(Inputs(), Outputs(), scope); + framework::ExecutionContext ctx(*this, scope, dev_ctx, run_ctx, nullptr); const LoDTensorArray* ids = ctx.Input("Ids"); const LoDTensorArray* scores = ctx.Input("Scores"); @@ -142,7 +146,7 @@ class BeamSearchDecodeOp : public framework::OperatorBase { LoDTensor* sentenceScores = ctx.Output("SentenceScores"); framework::VisitDataType( - framework::ToDataType(scores->at(0).type()), + scores->at(0).type(), BeamSearchDecodeFunctor(*ids, *scores, sentenceIds, sentenceScores, beam_size, end_id)); } @@ -174,10 +178,10 @@ Beam Search Decode Operator. This Operator constructs the full hypotheses for each source sentence by walking back along the LoDTensorArray Input(ids) whose lods can be used to restore the path in the beam search tree. -The Output(SentenceIds) and Output(SentenceScores) separately contain the -generated id sequences and the corresponding scores. The shapes and lods of the -two LodTensor are same. The lod level is 2 and the two levels separately -indicate how many hypotheses each source sentence has and how many ids each +The Output(SentenceIds) and Output(SentenceScores) separately contain the +generated id sequences and the corresponding scores. The shapes and lods of the +two LodTensor are same. The lod level is 2 and the two levels separately +indicate how many hypotheses each source sentence has and how many ids each hypothesis has. )DOC"); } @@ -199,15 +203,12 @@ class BeamSearchDecodeInferShape : public framework::InferShapeBase { class BeamSearchDecodeInferVarType : public framework::VarTypeInference { public: - void operator()(const framework::OpDesc& op_desc, - framework::BlockDesc* block) const override { - for (auto& o : op_desc.Output("SentenceIds")) { - auto& sentence_ids = block->FindRecursiveOrCreateVar(o); - sentence_ids.SetType(framework::proto::VarType::LOD_TENSOR); + void operator()(framework::InferVarTypeContext* ctx) const override { + for (auto& o : ctx->Output("SentenceIds")) { + ctx->SetType(o, framework::proto::VarType::LOD_TENSOR); } - for (auto& o : op_desc.Output("SentenceScores")) { - auto& sentence_scores = block->FindRecursiveOrCreateVar(o); - sentence_scores.SetType(framework::proto::VarType::LOD_TENSOR); + for (auto& o : ctx->Output("SentenceScores")) { + ctx->SetType(o, framework::proto::VarType::LOD_TENSOR); } } }; diff --git a/paddle/fluid/operators/beam_search_decode_op.h b/paddle/fluid/operators/beam_search_decode_op.h index 6aefc5446f167eebb0da673b3fbdf7ed128daa98..0b883c3158fb922caae2e731875bbb8d43a1e9ca 100644 --- a/paddle/fluid/operators/beam_search_decode_op.h +++ b/paddle/fluid/operators/beam_search_decode_op.h @@ -122,7 +122,7 @@ void BeamSearchDecoder::ConvertSentenceVectorToLodTensor( auto cpu_place = std::unique_ptr( new paddle::platform::CPUPlace()); - paddle::platform::CPUDeviceContext cpu_ctx(*cpu_place.get()); + paddle::platform::CPUDeviceContext cpu_ctx(*cpu_place); framework::LoD lod; lod.push_back(source_level_lod); diff --git a/paddle/fluid/operators/beam_search_op.cc b/paddle/fluid/operators/beam_search_op.cc index 62771d09f112785ca1ba741a0ba239b1f0234633..a6aa35e0569364d79c15aea6e6dbc6ca670d49f0 100644 --- a/paddle/fluid/operators/beam_search_op.cc +++ b/paddle/fluid/operators/beam_search_op.cc @@ -12,205 +12,15 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include -#include +#include "paddle/fluid/operators/beam_search_op.h" + #include #include - -#include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/framework/op_registry.h" -#include "paddle/fluid/operators/beam_search_op.h" namespace paddle { namespace operators { -void BeamSearch::operator()(const framework::LoDTensor &pre_ids, - const framework::LoDTensor &pre_scores, - framework::LoDTensor *selected_ids, - framework::LoDTensor *selected_scores) { - auto abs_lod = framework::ToAbsOffset(ids_->lod()); - auto &high_level = abs_lod[lod_level_]; - - auto items = SelectTopBeamSizeItems(pre_ids, pre_scores); - auto selected_items = ToMap(items, high_level.back()); - VLOG(3) << "selected_items:"; - for (size_t i = 0; i < selected_items.size(); ++i) { - VLOG(3) << "offset:" << i; - for (auto &item : selected_items[i]) { - VLOG(3) << ItemToString(item); - } - } - - PruneEndBeams(pre_ids, &selected_items); - // calculate the output tensor's height - size_t num_instances = std::accumulate( - std::begin(selected_items), std::end(selected_items), 0, - [](size_t a, std::vector &b) { return a + b.size(); }); - // the output tensor shape should be [num_instances, 1] - auto dims = framework::make_ddim( - std::vector({static_cast(num_instances), 1})); - selected_ids->Resize(dims); - selected_scores->Resize(dims); - - std::map> hash; - framework::LoD new_lod; - auto *ids_data = selected_ids->mutable_data(platform::CPUPlace()); - auto *scores_data = - selected_scores->mutable_data(platform::CPUPlace()); - - // fill in data - std::vector low_level; - size_t low_offset = 0; - for (auto &items : selected_items) { - low_level.push_back(low_offset); - for (auto &item : items) { - ids_data[low_offset] = item.id; - scores_data[low_offset] = item.score; - low_offset++; - } - } - low_level.push_back(low_offset); - - // fill lod - framework::LoD lod(2); - lod[0].assign(high_level.begin(), high_level.end()); - lod[1].assign(low_level.begin(), low_level.end()); - if (!framework::CheckLoD(lod)) { - PADDLE_THROW("lod %s is not right", framework::LoDToString(lod)); - } - selected_ids->set_lod(lod); - selected_scores->set_lod(lod); -} - -void BeamSearch::PruneEndBeams(const framework::LoDTensor &pre_ids, - std::vector> *items) { - auto *pre_ids_data = pre_ids.data(); - auto abs_lod = framework::ToAbsOffset(ids_->lod()); - auto &high_level = abs_lod[lod_level_]; - for (size_t src_idx = 0; src_idx < high_level.size() - 1; ++src_idx) { - size_t src_prefix_start = high_level[src_idx]; - size_t src_prefix_end = high_level[src_idx + 1]; - bool finish_flag = true; - for (size_t offset = src_prefix_start; offset < src_prefix_end; offset++) { - for (auto &item : items->at(offset)) { - if (item.id != static_cast(end_id_) || - pre_ids_data[offset] != end_id_) { - finish_flag = false; - break; - } - } - if (!finish_flag) break; - } - if (finish_flag) { // all branchs of the beam (source sentence) end and - // prune this beam - for (size_t offset = src_prefix_start; offset < src_prefix_end; offset++) - items->at(offset).clear(); - } - } -} - -std::vector> BeamSearch::ToMap( - const std::vector> &items, size_t element_num) { - std::vector> result; - result.resize(element_num); - for (auto &entries : items) { - for (const auto &item : entries) { - result[item.offset].push_back(item); - } - } - return result; -} - -std::vector> BeamSearch::SelectTopBeamSizeItems( - const framework::LoDTensor &pre_ids, - const framework::LoDTensor &pre_scores) { - std::vector> result; - std::vector items; - // for each source sentence, select the top beam_size items across all - // candidate sets. - while (NextItemSet(pre_ids, pre_scores, &items)) { - std::nth_element( - std::begin(items), std::begin(items) + beam_size_, std::end(items), - [](const Item &a, const Item &b) { return a.score > b.score; }); - // prune the top beam_size items. - if (items.size() > beam_size_) { - items.resize(beam_size_); - } - result.emplace_back(items); - } - VLOG(3) << "SelectTopBeamSizeItems result size " << result.size(); - for (auto &items : result) { - VLOG(3) << "item set:"; - for (auto &item : items) { - VLOG(3) << ItemToString(item); - } - } - - return result; -} - -// the candidates of a source -bool BeamSearch::NextItemSet(const framework::LoDTensor &pre_ids, - const framework::LoDTensor &pre_scores, - std::vector *items) { - if (sent_offset_ >= ids_->NumElements(lod_level_)) { - return false; - } - // find the current candidates - auto ids = *ids_; - auto scores = *scores_; - - auto abs_lod = framework::ToAbsOffset(ids.lod()); - - auto *ids_data = ids.data(); - auto *scores_data = scores.data(); - - size_t instance_dim = 1; - for (int i = 1; i < ids.dims().size(); i++) { - instance_dim *= ids.dims()[i]; - } - - auto *pre_ids_data = pre_ids.data(); - auto *pre_scores_data = pre_scores.data(); - items->clear(); - items->reserve(framework::product(ids.dims())); - for (size_t offset = abs_lod[lod_level_][sent_offset_]; - offset < abs_lod[lod_level_][sent_offset_ + 1]; offset++) { - auto pre_id = pre_ids_data[offset]; - auto pre_score = pre_scores_data[offset]; - if (pre_id == end_id_) { - // Allocate all probability mass to eos_id for finished branchs and the - // other candidate ids can be ignored. - items->emplace_back(offset, end_id_, pre_score); - } else { - for (size_t d = 0; d < instance_dim; d++) { - const size_t dim_offset = offset * instance_dim + d; - items->emplace_back(offset, ids_data[dim_offset], - scores_data[dim_offset]); - } - } - } - - sent_offset_++; - return true; -} - -std::ostream &operator<<(std::ostream &os, const BeamSearch::Item &item) { - os << "{"; - os << "offset: " << item.offset << ", "; - os << "id: " << item.id << ", "; - os << "score: " << item.score << ""; - os << "}"; - - return os; -} - -std::string ItemToString(const BeamSearch::Item &item) { - std::ostringstream stream; - stream << item; - return stream.str(); -} - class BeamSearchOpMaker : public framework::OpProtoAndCheckerMaker { public: void Make() override { @@ -219,32 +29,43 @@ class BeamSearchOpMaker : public framework::OpProtoAndCheckerMaker { "(LoDTensor) The LoDTensor containing the selected ids at the " "previous step. It should be a tensor with shape (batch_size, 1) " "and lod `[[0, 1, ... , batch_size], [0, 1, ..., batch_size]]` at " - "thefirst step."); + "the first step."); AddInput("pre_scores", "(LoDTensor) The LoDTensor containing the accumulated " "scores corresponding to the selected ids at the previous step."); AddInput("ids", "(LoDTensor) The LoDTensor containing the candidates ids. Its " - "shape should be (batch_size * beam_size, K), where K supposed to " - "be beam_size."); + "shape should be (batch_size * beam_size, W). If not set, it will " + "be calculated out according to Input(scores) in this operator.") + .AsDispensable(); AddInput("scores", - "(LoDTensor) The LodTensor containing the accumulated scores " - "corresponding to Input(ids) and its shape is the same as the " - "shape of Input(ids)."); + "(LoDTensor) The LoDTensor containing the current scores " + "corresponding to Input(ids). If Input(ids) is not nullptr, its " + "shape is the same as that of Input(ids)." + "If is_accumulated is true, Input(scores) is accumulated scores " + "and will be used derectedly. Else, each score will be " + "transformed to the log field and accumulate Input(pre_sores) " + "first."); AddOutput("selected_ids", "A LodTensor that stores the IDs selected by beam search."); AddOutput("selected_scores", "A LoDTensor containing the accumulated scores corresponding to " "Output(selected_ids)."); + AddOutput("parent_idx", + "A Tensor preserving the selected_ids' parent indice in pre_ids.") + .AsDispensable(); // Attributes stored in AttributeMap AddAttr("level", "the level of LoDTensor"); AddAttr("beam_size", "beam size for beam search"); AddAttr("end_id", "the token id which indicates the end of a sequence"); + AddAttr("is_accumulated", + "Whether the Input(scores) is accumulated scores.") + .SetDefault(true); AddComment(R"DOC( -This operator does the search in beams for one time step. +This operator does the search in beams for one time step. Specifically, it selects the top-K candidate word ids of current step from Input(ids) according to their Input(scores) for all source sentences, where K is Attr(beam_size) and Input(ids), Input(scores) are predicted results @@ -265,10 +86,9 @@ class BeamSearchOp : public framework::OperatorWithKernel { public: using framework::OperatorWithKernel::OperatorWithKernel; - protected: void InferShape(framework::InferShapeContext *ctx) const override { for (const std::string &arg : - std::vector({"pre_ids", "ids", "scores"})) { + std::vector({"pre_ids", "scores"})) { PADDLE_ENFORCE(ctx->HasInput(arg), "BeamSearch need input argument '%s'", arg); } @@ -279,27 +99,33 @@ class BeamSearchOp : public framework::OperatorWithKernel { } } + protected: framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext &ctx) const override { - framework::OpKernelType kt = framework::OpKernelType( - framework::ToDataType( - ctx.Input("pre_ids")->type()), - platform::CPUPlace()); - return kt; + auto *scores = ctx.Input("scores"); + size_t level = ctx.Attr("level"); + size_t batch_size = scores->lod()[level].size() - 1; + // The current CUDA kernel only support cases with batch_size < 4. + // Compute on CPU for cases with batch_size > 4. + if (batch_size <= 4) { + return framework::OpKernelType( + ctx.Input("pre_ids")->type(), ctx.GetPlace()); + } else { + return framework::OpKernelType( + ctx.Input("pre_ids")->type(), + platform::CPUPlace()); + } } }; class BeamSearchInferVarType : public framework::VarTypeInference { public: - void operator()(const framework::OpDesc &op_desc, - framework::BlockDesc *block) const override { - for (auto &o : op_desc.Output("selected_ids")) { - auto &selected_ids = block->FindRecursiveOrCreateVar(o); - selected_ids.SetType(framework::proto::VarType::LOD_TENSOR); + void operator()(framework::InferVarTypeContext *ctx) const override { + for (auto &o : ctx->Output("selected_ids")) { + ctx->SetType(o, framework::proto::VarType::LOD_TENSOR); } - for (auto &o : op_desc.Output("selected_scores")) { - auto &selected_scores = block->FindRecursiveOrCreateVar(o); - selected_scores.SetType(framework::proto::VarType::LOD_TENSOR); + for (auto &o : ctx->Output("selected_scores")) { + ctx->SetType(o, framework::proto::VarType::LOD_TENSOR); } } }; diff --git a/paddle/fluid/operators/beam_search_op.cu.cc b/paddle/fluid/operators/beam_search_op.cu.cc new file mode 100644 index 0000000000000000000000000000000000000000..4ef9476eee5d3fac4decd7273da824b2f2349199 --- /dev/null +++ b/paddle/fluid/operators/beam_search_op.cu.cc @@ -0,0 +1,24 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/beam_search_op.h" +#include "paddle/fluid/framework/op_registry.h" + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL( + beam_search, + ops::BeamSearchOpKernel, + ops::BeamSearchOpKernel, + ops::BeamSearchOpKernel, + ops::BeamSearchOpKernel); diff --git a/paddle/fluid/operators/beam_search_op.h b/paddle/fluid/operators/beam_search_op.h index b5e2ed05924cc8b7bc06058b9b1103ba10be486e..3d32ea0cc9686a709b185087d76d12f266663d03 100644 --- a/paddle/fluid/operators/beam_search_op.h +++ b/paddle/fluid/operators/beam_search_op.h @@ -4,7 +4,7 @@ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at -http://www.apache.org/licenses/LICENSE-2.0 + http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, @@ -14,187 +14,12 @@ limitations under the License. */ #pragma once -#include -#include -#include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/operators/math/beam_search.h" namespace paddle { namespace operators { -/* - * This is an implementation of beam search. - * - * To explain the details, lets take machine translation task for example, in - * this task, one source sentence is translated to multiple target sentences, - * during this period, one sentence will be translated to multiple translation - * prefixes(target sentence that have not ended), in each time step a prefix - * will have some candidates, input the candidate ids and their corresponding - * scores (probabilities), it will sort and select the top beam_size candidates - * for each source sentence, and store the selected candidates's score and their - * corresponding ids to LoDTensors. - * - * A detailed example: - * - * Input - * - * ids: - * LoD (should have 2 levels) - * first level: [0, 1, 4] - * second level: [0, 1, 2, 3, 4] - * - * tensor's data - * [ - * [4, 2, 5] - * [2, 1, 3] - * [3, 5, 2] - * [8, 2, 1] - * ] - * - * scores: - * LoD same as `ids` - * tensor's data - * [ - * [0.5, 0.3, 0.2] - * [0.6, 0.3, 0.1] - * [0.9, 0.5, 0.1] - * [0.7, 0.5, 0.1] - * ] - * - * the inputs means that there are 2 source sentences to translate, and the - * first source has 1 prefix, the second source has 2 prefix. - * - * lets assume beam size is 2, and the beam search's output should be - * LoD - * first level: - * [0, 1, 2] - * second level: - * [0, 2, 4] - * - * id tensor's data - * [[ - * 4, - * 1, - * 3, - * 8, - * ]] - * - * score tensor's data - * [[ - * 0.5, - * 0.3, - * 0.9, - * 0.7 - * ]] - * - * TODO all the prune operations should be in the beam search, so it is better - * to split the beam search algorithm into a sequence of smaller operators, and - * the prune operators can be inserted in this sequence. - */ -class BeamSearch { - public: - // TODO(superjom) make type customizable - using id_t = size_t; - using score_t = float; - /* - * Input the arguments that needed by this class. - */ - BeamSearch(const framework::LoDTensor& ids, - const framework::LoDTensor& scores, size_t level, size_t beam_size, - int end_id) - : beam_size_(beam_size), - ids_(&ids), - scores_(&scores), - lod_level_(level), - end_id_(end_id) {} - - /* - * The main function of beam search. - * - * @selected_ids: a [None, 1]-shaped tensor with LoD. - * In a machine translation model, it might be the candidate term id sets, - * each set stored as a varience-length sequence. - * The format might be described with a two-level LoD - * - [[0 1] - * - [0 1 2]] - * - [[] - * - [0 1]] - * the first level of LoD tells that there are two source sentences. The - * second level describes the details of the candidate id set's offsets in - * the - * source sentences. - * - * @selected_scores: a LoD tensor with the same shape and LoD with - * selected_ids. - * It stores the corresponding scores of candidate ids in selected_ids. - * - * Return false if all the input tensor is empty, in machine translation task - * that means no candidates is provided, and the task will stop running. - */ - void operator()(const framework::LoDTensor& pre_ids, - const framework::LoDTensor& pre_scores, - framework::LoDTensor* selected_ids, - framework::LoDTensor* selected_scores); - /* - * The basic items help to sort. - */ - struct Item { - Item() {} - Item(size_t offset, size_t id, float score) - : offset(offset), id(id), score(score) {} - // offset in the higher lod level. - size_t offset; - // // prefix id in the lower lod level. - // size_t prefix; - // the candidate id - id_t id; - // the corresponding score - score_t score; - }; - - protected: - /* - * Prune the source sentences all branchs finished, and it is optional. - * Pruning must one step later than finishing (thus pre_ids is needed here), - * since the end tokens must be writed out. - */ - void PruneEndBeams(const framework::LoDTensor& pre_ids, - std::vector>* items); - - /* - * Transform the items into a map whose key is offset, value is the items. - * NOTE low performance. - */ - std::vector> ToMap( - const std::vector>& inputs, size_t element_num); - - /* - * For each source, select top beam_size records. - */ - std::vector> SelectTopBeamSizeItems( - const framework::LoDTensor& pre_ids, - const framework::LoDTensor& pre_scores); - - /* - * Get the items of next source sequence, return false if no remaining items. - */ - bool NextItemSet(const framework::LoDTensor& pre_ids, - const framework::LoDTensor& pre_scores, - std::vector* items); - - private: - size_t beam_size_; - const framework::LoDTensor* ids_; - const framework::LoDTensor* scores_; - size_t lod_level_{0}; - size_t sent_offset_{0}; - int end_id_{0}; -}; - -std::ostream& operator<<(std::ostream& os, const BeamSearch::Item& item); - -std::string ItemToString(const BeamSearch::Item& item); - template class BeamSearchOpKernel : public framework::OpKernel { public: @@ -203,7 +28,7 @@ class BeamSearchOpKernel : public framework::OpKernel { auto* scores = context.Input("scores"); auto* pre_ids = context.Input("pre_ids"); auto* pre_scores = context.Input("pre_scores"); - PADDLE_ENFORCE_NOT_NULL(ids); + PADDLE_ENFORCE_NOT_NULL(scores); PADDLE_ENFORCE_NOT_NULL(pre_ids); PADDLE_ENFORCE_NOT_NULL(pre_scores); @@ -211,14 +36,21 @@ class BeamSearchOpKernel : public framework::OpKernel { size_t level = context.Attr("level"); size_t beam_size = context.Attr("beam_size"); int end_id = context.Attr("end_id"); - BeamSearch alg(*ids, *scores, level, beam_size, end_id); + bool is_accumulated = context.Attr("is_accumulated"); + auto selected_ids = context.Output("selected_ids"); auto selected_scores = context.Output("selected_scores"); + auto* parent_idx = context.Output("parent_idx"); PADDLE_ENFORCE_NOT_NULL(selected_ids); PADDLE_ENFORCE_NOT_NULL(selected_scores); - alg(*pre_ids, *pre_scores, selected_ids, selected_scores); + + math::BeamSearchFunctor alg; + alg(context.template device_context(), pre_ids, pre_scores, + ids, scores, selected_ids, selected_scores, parent_idx, level, + beam_size, end_id, is_accumulated); } }; + } // namespace operators } // namespace paddle diff --git a/paddle/fluid/operators/beam_search_op_test.cc b/paddle/fluid/operators/beam_search_op_test.cc deleted file mode 100644 index c4f4b478fbfc87e4178155132781214575c1e6b0..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/beam_search_op_test.cc +++ /dev/null @@ -1,91 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. */ - -#include "paddle/fluid/operators/beam_search_op.h" - -#include -#include - -namespace paddle { -namespace test { - -using std::vector; -using framework::LoDTensor; -using framework::LoD; -using operators::BeamSearch; -using paddle::platform::CPUPlace; -using std::cout; -using std::endl; - -void CreateInput(LoDTensor* ids, LoDTensor* scores) { - LoD lod; - vector level0({0, 2, 4}); - vector level1({0, 1, 2, 3, 4}); - lod.push_back(level0); - lod.push_back(level1); - ids->set_lod(lod); - scores->set_lod(lod); - - auto dims = framework::make_ddim(vector({4, 3})); - ids->Resize(dims); - scores->Resize(dims); - CPUPlace place; - - auto* ids_data = ids->mutable_data(place); - auto* scores_data = scores->mutable_data(place); - vector _ids({4, 2, 5, 2, 1, 3, 3, 5, 2, 8, 2, 1}); - vector _scores( - {0.5, 0.3, 0.2, 0.6, 0.3, 0.1, 0.9, 0.5, 0.1, 0.7, 0.5, 0.1}); - - for (int i = 0; i < 12; i++) { - ids_data[i] = _ids[i]; - scores_data[i] = _scores[i]; - } -} - -TEST(beam_search_op, run) { - CPUPlace place; - LoDTensor ids, scores; - CreateInput(&ids, &scores); - - LoDTensor pre_ids; - pre_ids.Resize(framework::make_ddim(vector(4, 1))); - for (int i = 0; i < 4; i++) { - pre_ids.mutable_data(place)[i] = i + 1; - } - LoDTensor pre_scores; - pre_scores.Resize(framework::make_ddim(vector(4, 1))); - for (int i = 0; i < 4; i++) { - pre_scores.mutable_data(place)[i] = 0.1 * (i + 1); - } - - BeamSearch beamsearch(ids, scores, (size_t)0, (size_t)2, 0); - LoDTensor sids, sscores; - beamsearch(pre_ids, pre_scores, &sids, &sscores); - - LOG(INFO) << "score: " << sscores << endl; - - ASSERT_EQ(sids.lod(), sscores.lod()); - - vector tids({4, 2, 3, 8}); - vector tscores({0.5, 0.6, 0.9, 0.7}); - - for (int i = 0; i < 4; i++) { - ASSERT_EQ(tids[i], sids.data()[i]); - ASSERT_EQ(tscores[i], sscores.data()[i]); - } -} - -} // namespace test -} // namespace paddle diff --git a/paddle/fluid/operators/benchmark/CMakeLists.txt b/paddle/fluid/operators/benchmark/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..54008336a9f67f0123ba1cfa6fcea35b79b7ac4c --- /dev/null +++ b/paddle/fluid/operators/benchmark/CMakeLists.txt @@ -0,0 +1,3 @@ +cc_test(op_tester SRCS op_tester.cc op_tester_config.cc + DEPS memory timer framework_proto proto_desc lod_tensor op_registry + device_context scope ${GLOB_OP_LIB} ${GLOB_OPERATOR_DEPS}) diff --git a/paddle/fluid/operators/benchmark/op_tester.cc b/paddle/fluid/operators/benchmark/op_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..fec091255f6391b77cd2858905f3aa2e5dd8baff --- /dev/null +++ b/paddle/fluid/operators/benchmark/op_tester.cc @@ -0,0 +1,509 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/benchmark/op_tester.h" +#include +#include "gflags/gflags.h" +#include "gtest/gtest.h" +#include "paddle/fluid/framework/op_info.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/variable_helper.h" +#include "paddle/fluid/platform/init.h" +#include "paddle/fluid/platform/profiler.h" +#include "paddle/fluid/platform/timer.h" +#include "paddle/fluid/pybind/pybind.h" + +namespace paddle { +namespace operators { +namespace benchmark { + +DEFINE_string(op_config_list, "", "Path of op config file."); +DEFINE_int32(specified_config_id, -1, "Test the specified op config."); + +void OpTester::Init(const std::string &filename) { + Init(OpTesterConfig(filename)); +} + +void OpTester::Init(const OpTesterConfig &config) { + config_ = config; + + auto &op_desc_info = framework::OpInfoMap::Instance(); + // Initialize the OpDesc + if (op_desc_info.Has(config_.op_type)) { + type_ = config_.op_type; + + CreateOpDesc(); + CreateInputVarDesc(); + CreateOutputVarDesc(); + } else { + LOG(FATAL) << "Op \"" << config_.op_type << "\" is not registered."; + } + + if (config_.device_id >= 0) { + place_ = paddle::platform::CUDAPlace(config_.device_id); + } else { + place_ = paddle::platform::CPUPlace(); + } + + framework::InitDevices(false); + scope_.reset(new paddle::framework::Scope()); + + op_ = framework::OpRegistry::CreateOp(op_desc_); + CreateVariables(scope_.get()); +} + +void OpTester::Run() { + if (config_.print_debug_string) { + LOG(INFO) << DebugString(); + } + + // Warm up + RunImpl(); + + platform::Timer timer; + if (config_.profile) { + if (platform::is_cpu_place(place_)) { + platform::EnableProfiler(platform::ProfilerState::kCPU); + } else { +#ifdef PADDLE_WITH_CUDA + platform::EnableProfiler(platform::ProfilerState::kAll); + platform::SetDeviceId(config_.device_id); +#else + PADDLE_THROW("'CUDAPlace' is not supported in CPU only device."); +#endif + } + + timer.Start(); + for (int i = config_.repeat; i > 0; --i) { + RunImpl(); + } + timer.Pause(); + platform::DisableProfiler(platform::EventSortingKey::kDefault, + "op_tester_profiler"); + } else { + timer.Start(); + for (int i = config_.repeat; i > 0; --i) { + RunImpl(); + } + timer.Pause(); + } + config_.runtime = timer.ElapsedMS() / config_.repeat; + LOG(INFO) << "=== Run " << config_.repeat + << " times, latency: " << config_.runtime << " ms ==="; +} + +void OpTester::RunImpl() { + op_->Run(*scope_, place_); + platform::DeviceContextPool::Instance().Get(place_)->Wait(); + scope_->DropKids(); +} + +std::vector OpTester::GetOpProtoInputNames() { + std::vector input_names; + const framework::proto::OpProto &proto = + framework::OpInfoMap::Instance().Get(type_).Proto(); + for (int i = 0; i != proto.inputs_size(); ++i) { + const auto &input = proto.inputs(i); + input_names.push_back(input.name()); + } + return input_names; +} + +std::vector OpTester::GetOpProtoOutputNames() { + std::vector output_names; + const framework::proto::OpProto &proto = + framework::OpInfoMap::Instance().Get(type_).Proto(); + for (int i = 0; i != proto.outputs_size(); ++i) { + const auto &output = proto.outputs(i); + output_names.push_back(output.name()); + } + return output_names; +} + +std::unordered_map +OpTester::GetOpProtoAttrNames() { + std::unordered_map attr_types; + const framework::proto::OpProto &proto = + framework::OpInfoMap::Instance().Get(type_).Proto(); + const std::vector skipped_attrs = { + framework::OpProtoAndCheckerMaker::OpRoleAttrName(), + framework::OpProtoAndCheckerMaker::OpRoleVarAttrName(), + framework::OpProtoAndCheckerMaker::OpNamescopeAttrName(), + framework::OpProtoAndCheckerMaker::OpCreationCallstackAttrName()}; + for (int i = 0; i != proto.attrs_size(); ++i) { + const auto &attr = proto.attrs(i); + if (!Has(skipped_attrs, attr.name())) { + VLOG(4) << "attr: " << attr.name() << ", type: " << attr.type(); + attr_types[attr.name()] = attr.type(); + } + } + return attr_types; +} + +framework::proto::VarType::Type OpTester::TransToVarType(std::string str) { + if (str == "int32") { + return framework::proto::VarType::INT32; + } else if (str == "int64") { + return framework::proto::VarType::INT64; + } else if (str == "fp32") { + return framework::proto::VarType::FP32; + } else if (str == "fp64") { + return framework::proto::VarType::FP64; + } else { + PADDLE_THROW("Unsupported dtype %s.", str.c_str()); + } +} + +void OpTester::CreateInputVarDesc() { + std::vector input_names = GetOpProtoInputNames(); + for (auto &name : input_names) { + const OpInputConfig *input = config_.GetInput(name); + if (input == nullptr) { + LOG(FATAL) << "The input " << name << " of op " << config_.op_type + << " is not correctlly provided."; + } + + std::string var_name = config_.op_type + "." + name; + framework::VarDesc *var = Var(var_name); + // Need to support more type + var->SetType(framework::proto::VarType::LOD_TENSOR); + var->SetPersistable(false); + var->SetDataType(TransToVarType(input->dtype)); + var->SetShape(input->dims); + + op_desc_.SetInput(name, {var_name}); + inputs_[var_name] = *input; + } +} + +void OpTester::CreateOutputVarDesc() { + std::vector output_names = GetOpProtoOutputNames(); + for (auto &name : output_names) { + std::string var_name = config_.op_type + "." + name; + framework::VarDesc *var = Var(var_name); + // Need to support more type + var->SetType(framework::proto::VarType::LOD_TENSOR); + var->SetPersistable(false); + var->SetDataType(framework::proto::VarType::FP32); + + op_desc_.SetOutput(name, {var_name}); + } +} + +void OpTester::CreateOpDesc() { + op_desc_.SetType(config_.op_type); + std::unordered_map attr_types = + GetOpProtoAttrNames(); + for (auto item : config_.attrs) { + const std::string &name = item.first; + if (attr_types.find(name) == attr_types.end()) { + LOG(FATAL) << "Operator " << type_ << " do not have attr " << name; + } + + const std::string &value_str = item.second; + const framework::proto::AttrType &type = attr_types[name]; + switch (type) { + case framework::proto::AttrType::BOOLEAN: + break; + case framework::proto::AttrType::INT: { + int value = StringTo(value_str); + op_desc_.SetAttr(name, {value}); + } break; + case framework::proto::AttrType::FLOAT: { + float value = StringTo(value_str); + op_desc_.SetAttr(name, {value}); + } break; + case framework::proto::AttrType::STRING: { + op_desc_.SetAttr(name, {value_str}); + } break; + case framework::proto::AttrType::BOOLEANS: + case framework::proto::AttrType::INTS: + case framework::proto::AttrType::FLOATS: + case framework::proto::AttrType::STRINGS: + LOG(FATAL) << "Not supported yet."; + break; + case framework::proto::AttrType::LONG: { + int64_t value = StringTo(value_str); + op_desc_.SetAttr(name, value); + } break; + case framework::proto::AttrType::LONGS: + default: + PADDLE_THROW("Unsupport attr type %d", type); + } + } +} + +framework::VarDesc *OpTester::Var(const std::string &name) { + auto it = vars_.find(name); + if (it != vars_.end()) { + return it->second.get(); + } + auto *var = new framework::VarDesc(name); + vars_[name].reset(var); + return var; +} + +template +void OpTester::SetupTensor(framework::LoDTensor *tensor, + const std::vector &shape, T lower, T upper, + const std::string &initializer) { + static unsigned int seed = 100; + std::mt19937 rng(seed++); + std::uniform_real_distribution uniform_dist(0, 1); + + T *ptr = tensor->mutable_data(framework::make_ddim(shape), place_); + + framework::LoDTensor cpu_tensor; + T *cpu_ptr = nullptr; + + if (!platform::is_cpu_place(place_)) { + cpu_ptr = cpu_tensor.mutable_data(framework::make_ddim(shape), + platform::CPUPlace()); + } else { + cpu_ptr = ptr; + } + + if (initializer == "random") { + for (int i = 0; i < cpu_tensor.numel(); ++i) { + cpu_ptr[i] = static_cast(uniform_dist(rng) * (upper - lower) + lower); + } + } else if (initializer == "natural") { + for (int i = 0; i < cpu_tensor.numel(); ++i) { + cpu_ptr[i] = lower + i; + } + } else if (initializer == "zeros") { + for (int i = 0; i < cpu_tensor.numel(); ++i) { + cpu_ptr[i] = 0; + } + } else { + PADDLE_THROW("Unsupported initializer %s.", initializer.c_str()); + } + + if (!platform::is_cpu_place(place_)) { + TensorCopySync(cpu_tensor, place_, tensor); + } +} + +void OpTester::CreateVariables(framework::Scope *scope) { + for (auto &item : vars_) { + auto &var = item.second; + if (var->Name() == framework::kEmptyVarName) { + continue; + } + + auto *ptr = scope->Var(var->Name()); + framework::InitializeVariable(ptr, var->GetType()); + if (var->Persistable()) { + VLOG(3) << "Create Variable " << var->Name() + << " global, which pointer is " << ptr; + } else { + VLOG(3) << "Create Variable " << var->Name() + << " locally, which pointer is " << ptr; + } + } + + for (auto &item : inputs_) { + // Allocate memory for input tensor + auto &var_name = item.first; + VLOG(3) << "Allocate memory for tensor " << var_name; + + auto &var_desc = vars_[var_name]; + std::vector shape = var_desc->GetShape(); + + auto *var = scope->Var(var_name); + auto *tensor = var->GetMutable(); + const auto &data_type = var_desc->GetDataType(); + if (data_type == framework::proto::VarType::INT32) { + SetupTensor(tensor, shape, 0, 1, item.second.initializer); + } else if (data_type == framework::proto::VarType::INT64) { + SetupTensor(tensor, shape, 0, 1, item.second.initializer); + } else if (data_type == framework::proto::VarType::FP32) { + SetupTensor(tensor, shape, static_cast(0.0), + static_cast(1.0), item.second.initializer); + } else if (data_type == framework::proto::VarType::FP64) { + SetupTensor(tensor, shape, static_cast(0.0), + static_cast(1.0), item.second.initializer); + } else { + PADDLE_THROW("Unsupported dtype %d.", data_type); + } + + VLOG(3) << "Set lod for tensor " << var_name; + std::vector> &lod_vec = item.second.lod; + framework::LoD lod; + for (size_t i = 0; i < lod_vec.size(); ++i) { + lod.push_back(lod_vec[i]); + } + tensor->set_lod(lod); + } +} + +static std::string GenSpaces(int count) { + std::stringstream ss; + for (int i = 0; i < count; ++i) { + ss << " "; + } + return ss.str(); +} + +std::string OpTester::DebugString() { + std::stringstream ss; + int count = 0; + for (auto &item : vars_) { + auto &var = item.second; + ss << GenSpaces(count++) << "vars {\n"; + ss << GenSpaces(count) << "name: \"" << var->Name() << "\"\n"; + ss << GenSpaces(count++) << "type: {\n"; + ss << GenSpaces(count) << "type: LOD_TENSOR\n"; + ss << GenSpaces(count++) << "lod_tensor {\n"; + ss << GenSpaces(count++) << "tensor {\n"; + const auto &data_type = var->GetDataType(); + if (data_type == framework::proto::VarType::INT32) { + ss << GenSpaces(count) << "data_type: INT32\n"; + } else if (data_type == framework::proto::VarType::INT64) { + ss << GenSpaces(count) << "data_type: INT64\n"; + } else if (data_type == framework::proto::VarType::FP32) { + ss << GenSpaces(count) << "data_type: FP32\n"; + } else if (data_type == framework::proto::VarType::FP64) { + ss << GenSpaces(count) << "data_type: FP64\n"; + } + std::vector shape = var->GetShape(); + for (auto d : shape) { + ss << GenSpaces(count) << "dims: " << d << "\n"; + } + ss << GenSpaces(--count) << "}\n"; + ss << GenSpaces(--count) << "}\n"; + ss << GenSpaces(--count) << "}\n"; + ss << GenSpaces(count) << "persistable: " << var->Persistable() << "\n"; + ss << GenSpaces(--count) << "}\n"; + } + ss << GenSpaces(count++) << "ops {\n"; + for (auto &name : op_desc_.InputNames()) { + ss << GenSpaces(count++) << "inputs {\n"; + ss << GenSpaces(count) << "parameters: \"" << name << "\"\n"; + ss << GenSpaces(count) << "arguments: \"" << op_desc_.Input(name)[0] + << "\"\n"; + ss << GenSpaces(--count) << "}\n"; + } + for (auto &name : op_desc_.OutputNames()) { + ss << GenSpaces(count++) << "outputs {\n"; + ss << GenSpaces(count) << "parameters: \"" << name << "\"\n"; + ss << GenSpaces(count) << "arguments: \"" << op_desc_.Output(name)[0] + << "\"\n"; + ss << GenSpaces(--count) << "}\n"; + } + ss << GenSpaces(count) << "type: " << op_desc_.Type() << "\n"; + for (auto &name : op_desc_.AttrNames()) { + ss << GenSpaces(count++) << "attrs {\n"; + const auto &attr_type = op_desc_.GetAttrType(name); + const auto &attr = op_desc_.GetAttr(name); + ss << GenSpaces(count) << "name: \"" << name << "\"\n"; + switch (attr_type) { + case framework::proto::AttrType::BOOLEAN: { + ss << GenSpaces(count) << "type: BOOLEAN\n"; + ss << GenSpaces(count) << "b: " << boost::get(attr) << "\n"; + } break; + case framework::proto::AttrType::INT: { + ss << GenSpaces(count) << "type: INT\n"; + ss << GenSpaces(count) << "i: " << boost::get(attr) << "\n"; + } break; + case framework::proto::AttrType::FLOAT: { + ss << GenSpaces(count) << "type: FLOAT\n"; + ss << GenSpaces(count) << "f: " << boost::get(attr) << "\n"; + } break; + case framework::proto::AttrType::STRING: { + ss << GenSpaces(count) << "type: STRING\n"; + ss << GenSpaces(count) << "s: \"" << boost::get(attr) + << "\"\n"; + } break; + case framework::proto::AttrType::BOOLEANS: { + ss << GenSpaces(count) << "type: BOOLEANS\n"; + ss << GenSpaces(count) << "bools: " + << "\n"; + } break; + case framework::proto::AttrType::INTS: { + ss << GenSpaces(count) << "type: INTS\n"; + ss << GenSpaces(count) << "ints: " + << "\n"; + } break; + case framework::proto::AttrType::FLOATS: { + ss << GenSpaces(count) << "type: FLOATS\n"; + ss << GenSpaces(count) << "floats: " + << "\n"; + } break; + case framework::proto::AttrType::STRINGS: { + ss << GenSpaces(count) << "type: STRINGS\n"; + ss << GenSpaces(count) << "strings: " + << "\n"; + } break; + case framework::proto::AttrType::LONG: { + ss << GenSpaces(count) << "type: LONG\n"; + ss << GenSpaces(count) << "l: " << boost::get(attr) << "\n"; + } break; + case framework::proto::AttrType::LONGS: { + ss << GenSpaces(count) << "type: LONGS\n"; + ss << GenSpaces(count) << "longs: " + << "\n"; + } break; + default: + PADDLE_THROW("Unsupport attr type %d", attr_type); + } + ss << GenSpaces(--count) << "}\n"; + } + ss << GenSpaces(--count) << "}\n"; + return ss.str(); +} + +TEST(op_tester, base) { + if (!FLAGS_op_config_list.empty()) { + std::ifstream fin(FLAGS_op_config_list, std::ios::in | std::ios::binary); + PADDLE_ENFORCE(static_cast(fin), "Cannot open file %s", + FLAGS_op_config_list.c_str()); + std::vector op_configs; + while (!fin.eof()) { + VLOG(4) << "Reading config " << op_configs.size() << "..."; + OpTesterConfig config; + bool result = config.Init(fin); + if (result) { + op_configs.push_back(config); + } + } + if (FLAGS_specified_config_id >= 0 && + FLAGS_specified_config_id < static_cast(op_configs.size())) { + OpTester tester; + tester.Init(op_configs[FLAGS_specified_config_id]); + tester.Run(); + } else { + for (size_t i = 0; i < op_configs.size(); ++i) { + OpTester tester; + tester.Init(op_configs[i]); + tester.Run(); + } + } + } else { + OpTester tester; + OpTesterConfig config; + config.op_type = "elementwise_add"; + config.inputs.resize(2); + config.inputs[0].name = "X"; + config.inputs[0].dims = {64, 64}; + config.inputs[1].name = "Y"; + config.inputs[1].dims = {64, 1}; + tester.Init(config); + tester.Run(); + } +} + +} // namespace benchmark +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/benchmark/op_tester.h b/paddle/fluid/operators/benchmark/op_tester.h new file mode 100644 index 0000000000000000000000000000000000000000..328389293c4b71a2f1fefbc3bf26fd46b79ec6e2 --- /dev/null +++ b/paddle/fluid/operators/benchmark/op_tester.h @@ -0,0 +1,75 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include +#include +#include "paddle/fluid/framework/ddim.h" +#include "paddle/fluid/framework/op_desc.h" +#include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/operators/benchmark/op_tester_config.h" + +namespace paddle { +namespace operators { +namespace benchmark { + +class OpTester { + public: + OpTester() {} + + void Init(const std::string &filename); + void Init(const OpTesterConfig &config); + + void Run(); + + std::string DebugString(); + + private: + std::vector GetOpProtoInputNames(); + std::vector GetOpProtoOutputNames(); + std::unordered_map + GetOpProtoAttrNames(); + + framework::proto::VarType::Type TransToVarType(std::string str); + void CreateInputVarDesc(); + void CreateOutputVarDesc(); + void CreateOpDesc(); + + framework::VarDesc *Var(const std::string &name); + void CreateVariables(framework::Scope *scope); + + template + void SetupTensor(framework::LoDTensor *input, + const std::vector &shape, T lower, T upper, + const std::string &initializer); + + void RunImpl(); + + private: + OpTesterConfig config_; + std::string type_; + framework::OpDesc op_desc_; + std::unordered_map> vars_; + std::unordered_map inputs_; + std::unique_ptr op_; + platform::Place place_; + std::unique_ptr scope_; +}; + +} // namespace benchmark +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/benchmark/op_tester_config.cc b/paddle/fluid/operators/benchmark/op_tester_config.cc new file mode 100644 index 0000000000000000000000000000000000000000..b4878ab04244cf6b54d323943fc1fbf4e3882660 --- /dev/null +++ b/paddle/fluid/operators/benchmark/op_tester_config.cc @@ -0,0 +1,226 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/benchmark/op_tester_config.h" +#include +#include "paddle/fluid/platform/enforce.h" + +namespace paddle { +namespace operators { +namespace benchmark { + +static const char kStartSeparator[] = "{"; +static const char kEndSeparator[] = "}"; +static const char kSepBetweenItems[] = ";"; + +static bool StartWith(const std::string& str, const std::string& substr) { + return str.find(substr) == 0; +} + +static bool EndWith(const std::string& str, const std::string& substr) { + return str.rfind(substr) == (str.length() - substr.length()); +} + +static void EraseEndSep(std::string* str, + std::string substr = kSepBetweenItems) { + if (EndWith(*str, substr)) { + str->erase(str->length() - substr.length(), str->length()); + } +} + +OpInputConfig::OpInputConfig(std::istream& is) { + std::string sep; + is >> sep; + if (sep == kStartSeparator) { + while (sep != kEndSeparator) { + is >> sep; + if (sep == "name" || sep == "name:") { + is >> name; + EraseEndSep(&name); + } else if (sep == "dtype" || sep == "dtype:") { + ParseDType(is); + } else if (sep == "initializer" || sep == "initializer:") { + ParseInitializer(is); + } else if (sep == "dims" || sep == "dims:") { + ParseDims(is); + } else if (sep == "lod" || sep == "lod:") { + ParseLoD(is); + } + } + } +} + +void OpInputConfig::ParseDType(std::istream& is) { + std::string dtype_str; + is >> dtype_str; + EraseEndSep(&dtype_str); + + if (dtype_str == "int32" || dtype_str == "int") { + dtype = "int32"; + } else if (dtype_str == "int64" || dtype_str == "long") { + dtype = "int64"; + } else if (dtype_str == "fp32" || dtype_str == "float") { + dtype = "fp32"; + } else if (dtype_str == "fp64" || dtype_str == "double") { + dtype = "fp64"; + } else { + PADDLE_THROW("Unsupported dtype %s", dtype_str.c_str()); + } + VLOG(4) << "dtype of input " << name << " is: " << dtype; +} + +void OpInputConfig::ParseInitializer(std::istream& is) { + std::string initializer_str; + is >> initializer_str; + EraseEndSep(&initializer_str); + + const std::vector supported_initializers = {"random", "natural", + "zeros"}; + if (!Has(supported_initializers, initializer_str)) { + PADDLE_THROW("Unsupported initializer %s", initializer_str.c_str()); + } + + initializer = initializer_str; + VLOG(4) << "initializer of input " << name << " is: " << initializer; +} + +void OpInputConfig::ParseDims(std::istream& is) { + std::string dims_str; + is >> dims_str; + + dims.clear(); + std::string token; + std::istringstream token_stream(dims_str); + while (std::getline(token_stream, token, 'x')) { + dims.push_back(std::stoi(token)); + } +} + +void OpInputConfig::ParseLoD(std::istream& is) { + std::string lod_str; + std::string start_sep = + std::string(kStartSeparator) + std::string(kStartSeparator); + std::string end_sep = std::string(kEndSeparator) + std::string(kEndSeparator); + + std::string sep; + is >> sep; + if (StartWith(sep, start_sep)) { + lod_str += sep; + while (!EndWith(sep, end_sep)) { + is >> sep; + lod_str += sep; + } + } + EraseEndSep(&lod_str); + PADDLE_ENFORCE_GE(lod_str.length(), 4U); + VLOG(4) << "lod: " << lod_str << ", length: " << lod_str.length(); + + // Parse the lod_str + lod.clear(); + for (size_t i = 1; i < lod_str.length() - 1;) { + if (lod_str[i] == '{') { + std::vector level; + while (lod_str[i] != '}') { + ++i; + + std::string number; + while (lod_str[i] >= '0' && lod_str[i] <= '9') { + number += lod_str[i]; + ++i; + } + level.push_back(StringTo(number)); + } + lod.push_back(level); + } else if (lod_str[i] == '}') { + ++i; + } + } +} + +OpTesterConfig::OpTesterConfig(const std::string& filename) { + std::ifstream fin(filename, std::ios::in | std::ios::binary); + PADDLE_ENFORCE(static_cast(fin), "Cannot open file %s", + filename.c_str()); + + Init(fin); +} + +bool OpTesterConfig::Init(std::istream& is) { + std::string sep; + is >> sep; + if (sep == kStartSeparator) { + while (sep != kEndSeparator) { + is >> sep; + if (sep == "op_type" || sep == "op_type:") { + is >> op_type; + } else if (sep == "device_id" || sep == "device_id:") { + is >> device_id; + } else if (sep == "repeat" || sep == "repeat:") { + is >> repeat; + } else if (sep == "profile" || sep == "profile:") { + is >> profile; + } else if (sep == "print_debug_string" || sep == "print_debug_string:") { + is >> print_debug_string; + } else if (sep == "input" || sep == "input:") { + OpInputConfig input_config(is); + inputs.push_back(input_config); + } else if (sep == "attrs" || sep == "attrs:") { + ParseAttrs(is); + } else { + if (sep != kEndSeparator) { + return false; + } + } + } + } else { + return false; + } + return true; +} + +bool OpTesterConfig::ParseAttrs(std::istream& is) { + std::string sep; + is >> sep; + if (sep == kStartSeparator) { + while (true) { + std::string key; + is >> key; + if (key == kEndSeparator) { + break; + } + + std::string value; + is >> value; + EraseEndSep(&key, ":"); + EraseEndSep(&value); + VLOG(4) << "attrs: " << key << ", " << value; + + attrs[key] = value; + } + } + return true; +} + +const OpInputConfig* OpTesterConfig::GetInput(const std::string& name) { + for (size_t i = 0; i < inputs.size(); ++i) { + if (inputs[i].name == name) { + return &inputs[i]; + } + } + return nullptr; +} + +} // namespace benchmark +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/benchmark/op_tester_config.h b/paddle/fluid/operators/benchmark/op_tester_config.h new file mode 100644 index 0000000000000000000000000000000000000000..5803f82ac28867a481875c2af607290c5d366146 --- /dev/null +++ b/paddle/fluid/operators/benchmark/op_tester_config.h @@ -0,0 +1,82 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include +#include +#include + +namespace paddle { +namespace operators { +namespace benchmark { + +struct OpInputConfig { + OpInputConfig() {} + explicit OpInputConfig(std::istream& is); + + void ParseDType(std::istream& is); + void ParseInitializer(std::istream& is); + void ParseDims(std::istream& is); + void ParseLoD(std::istream& is); + + std::string name; + std::string dtype{"fp32"}; // int32/int, int64/long, fp32/float, fp64/double + std::string initializer{"random"}; // random, natural + std::vector dims; + std::vector> lod; +}; + +struct OpTesterConfig { + OpTesterConfig() {} + explicit OpTesterConfig(const std::string& filename); + + bool Init(std::istream& is); + + bool ParseAttrs(std::istream& is); + + const OpInputConfig* GetInput(const std::string& name); + + std::string op_type; + std::vector inputs; + std::unordered_map attrs; + int device_id{-1}; // CPU: -1 + int repeat{1}; + int profile{0}; + int print_debug_string{0}; + double runtime{0.0}; +}; + +static bool Has(const std::vector& vec, const std::string& item) { + for (size_t i = 0; i < vec.size(); ++i) { + if (vec[i] == item) { + return true; + } + } + return false; +} + +template +T StringTo(const std::string& str) { + std::istringstream is(str); + T value; + is >> value; + return value; +} + +} // namespace benchmark +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/bilinear_interp_op.cc b/paddle/fluid/operators/bilinear_interp_op.cc deleted file mode 100644 index 2dc3399da183fbcf7664066f6f7ce12db3dc6d5e..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/bilinear_interp_op.cc +++ /dev/null @@ -1,116 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve. - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - http://www.apache.org/licenses/LICENSE-2.0 - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. */ - -#include "paddle/fluid/operators/bilinear_interp_op.h" -#include -#include "paddle/fluid/framework/op_registry.h" - -namespace paddle { -namespace operators { - -using framework::Tensor; - -class BilinearInterpOp : public framework::OperatorWithKernel { - public: - using framework::OperatorWithKernel::OperatorWithKernel; - - protected: - void InferShape(framework::InferShapeContext* ctx) const override { - PADDLE_ENFORCE(ctx->HasInput("X"), - "Input(X) of BilinearInterOp should not be null."); - PADDLE_ENFORCE(ctx->HasOutput("Out"), - "Output(Out) of BilinearInterOp should not be null."); - - auto dim_x = ctx->GetInputDim("X"); // NCHW format - int out_h = ctx->Attrs().Get("out_h"); - int out_w = ctx->Attrs().Get("out_w"); - PADDLE_ENFORCE_EQ(dim_x.size(), 4, "X's dimension must be 4"); - - if (ctx->HasInput("OutSize")) { - auto out_size_dim = ctx->GetInputDim("OutSize"); - PADDLE_ENFORCE_EQ(out_size_dim.size(), 1, - "OutSize's dimension size must be 1"); - PADDLE_ENFORCE_EQ(out_size_dim[0], 2, "OutSize's dim[0] must be 2"); - } - std::vector dim_out({dim_x[0], dim_x[1], out_h, out_w}); - ctx->SetOutputDim("Out", framework::make_ddim(dim_out)); - } - - protected: - framework::OpKernelType GetExpectedKernelType( - const framework::ExecutionContext& ctx) const override { - return framework::OpKernelType( - framework::ToDataType(ctx.Input("X")->type()), ctx.GetPlace()); - } -}; - -class BilinearInterpOpMaker : public framework::OpProtoAndCheckerMaker { - public: - void Make() override { - AddInput("X", - "The input tensor of bilinear interpolation, " - "This is a 4-D tensor with shape of (N x C x h x w)"); - AddInput("OutSize", - "This is a 1-D tensor with two number. " - "The first number is height and the second number is width.") - .AsDispensable(); - AddOutput("Out", "The dimension of output is (N x C x out_h x out_w)"); - - AddAttr("out_h", "output height of bilinear interpolation op."); - AddAttr("out_w", "output width of bilinear interpolation op."); - AddComment(R"DOC( - Bilinear interpolation is an extension of linear interpolation for - interpolating functions of two variables (e.g. H-direction and - W-direction in this op) on a rectilinear 2D grid. - - The key idea is to perform linear interpolation first in one - direction, and then again in the other direction. - - For details, please refer to Wikipedia: - https://en.wikipedia.org/wiki/Bilinear_interpolation - )DOC"); - } -}; - -class BilinearInterpOpGrad : public framework::OperatorWithKernel { - public: - using framework::OperatorWithKernel::OperatorWithKernel; - - protected: - void InferShape(framework::InferShapeContext* ctx) const override { - PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null"); - PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")), - "Input(Out@GRAD) should not be null"); - auto dim_x = ctx->GetInputDim("X"); - if (ctx->HasOutput(framework::GradVarName("X"))) { - ctx->SetOutputDim(framework::GradVarName("X"), dim_x); - } - } - - framework::OpKernelType GetExpectedKernelType( - const framework::ExecutionContext& ctx) const override { - return framework::OpKernelType( - framework::ToDataType(ctx.Input("X")->type()), ctx.GetPlace()); - } -}; - -} // namespace operators -} // namespace paddle - -namespace ops = paddle::operators; -REGISTER_OPERATOR(bilinear_interp, ops::BilinearInterpOp, - ops::BilinearInterpOpMaker, - paddle::framework::DefaultGradOpDescMaker); -REGISTER_OPERATOR(bilinear_interp_grad, ops::BilinearInterpOpGrad); -REGISTER_OP_CPU_KERNEL(bilinear_interp, ops::BilinearInterpKernel, - ops::BilinearInterpKernel); -REGISTER_OP_CPU_KERNEL(bilinear_interp_grad, - ops::BilinearInterpGradKernel); diff --git a/paddle/fluid/operators/bilinear_interp_op.cu b/paddle/fluid/operators/bilinear_interp_op.cu deleted file mode 100644 index 4c1971538495c6f111e9db18f4014786f6f0dd58..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/bilinear_interp_op.cu +++ /dev/null @@ -1,207 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve. - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - http://www.apache.org/licenses/LICENSE-2.0 - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. */ - -#include "paddle/fluid/operators/bilinear_interp_op.h" -#include "paddle/fluid/platform/cuda_primitives.h" - -namespace paddle { -namespace operators { - -using framework::Tensor; - -template -__global__ void KeBilinearInterpFw( - const T* in, const size_t in_img_h, const size_t in_img_w, - const size_t input_h, const size_t input_w, T* out, const size_t out_img_h, - const size_t out_img_w, const size_t output_h, const size_t output_w, - const size_t num_channels, const T ratio_h, const T ratioW) { - int nthreads = output_h * output_w; - int tid = blockIdx.x * blockDim.x + threadIdx.x; - if (tid < nthreads) { - int out_id_h = tid / output_w; - int out_id_w = tid % output_w; - int in_img_size = input_w / num_channels; - int out_img_size = output_w / num_channels; - int channel_id = out_id_w / out_img_size; - - int out_img_idy = (out_id_w % out_img_size) / out_img_w; - int in_img_idy = ratio_h * out_img_idy; - int h_id = (in_img_idy < in_img_h - 1) ? 1 : 0; - T h1lambda = ratio_h * out_img_idy - in_img_idy; - T h2lambda = 1.f - h1lambda; - - int out_img_idx = tid % out_img_w; - int in_img_idx = ratioW * out_img_idx; - int w_id = (in_img_idx < in_img_w - 1) ? 1 : 0; - T w1lambda = ratioW * out_img_idx - in_img_idx; - T w2lambda = 1.f - w1lambda; - - const T* in_pos = &in[out_id_h * input_w + channel_id * in_img_size + - in_img_idy * in_img_w + in_img_idx]; - - // bilinear interpolation - out[out_id_h * output_w + out_id_w] = - h2lambda * (w2lambda * in_pos[0] + w1lambda * in_pos[w_id]) + - h1lambda * (w2lambda * in_pos[h_id * in_img_w] + - w1lambda * in_pos[h_id * in_img_w + w_id]); - } -} - -template -__global__ void KeBilinearInterpBw( - T* in, const size_t in_img_h, const size_t in_img_w, const size_t input_h, - const size_t input_w, const T* out, const size_t out_img_h, - const size_t out_img_w, const size_t output_h, const size_t output_w, - const size_t num_channels, const T ratio_h, const T ratioW) { - int nthreads = output_h * output_w; - int tid = blockIdx.x * blockDim.x + threadIdx.x; - if (tid < nthreads) { - int out_id_h = tid / output_w; - int out_id_w = tid % output_w; - int in_img_size = input_w / num_channels; - int out_img_size = output_w / num_channels; - int channel_id = out_id_w / out_img_size; - - int out_img_idy = (out_id_w % out_img_size) / out_img_w; - int in_img_idy = ratio_h * out_img_idy; - int h_id = (in_img_idy < in_img_h - 1) ? 1 : 0; - T h1lambda = ratio_h * out_img_idy - in_img_idy; - T h2lambda = 1.f - h1lambda; - - int out_img_idx = tid % out_img_w; - int in_img_idx = ratioW * out_img_idx; - int w_id = (in_img_idx < in_img_w - 1) ? 1 : 0; - T w1lambda = ratioW * out_img_idx - in_img_idx; - T w2lambda = 1.f - w1lambda; - - T* in_pos = &in[out_id_h * input_w + channel_id * in_img_size + - in_img_idy * in_img_w + in_img_idx]; - const T* out_pos = &out[out_id_h * output_w + out_id_w]; - atomicAdd(&in_pos[0], h2lambda * w2lambda * out_pos[0]); - atomicAdd(&in_pos[w_id], h2lambda * w1lambda * out_pos[0]); - atomicAdd(&in_pos[h_id * in_img_w], h1lambda * w2lambda * out_pos[0]); - atomicAdd(&in_pos[h_id * in_img_w + w_id], - h1lambda * w1lambda * out_pos[0]); - } -} - -template -class BilinearInterpOpCUDAKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& ctx) const override { - PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), - "This kernel only runs on GPU device."); - auto* input_t = ctx.Input("X"); // float tensor - auto* output_t = ctx.Output("Out"); // float tensor - auto* input = input_t->data(); - - int out_h = ctx.Attr("out_h"); - int out_w = ctx.Attr("out_w"); - auto out_dims = output_t->dims(); - auto out_size_t = ctx.Input("OutSize"); - if (out_size_t != nullptr) { - Tensor sizes; - framework::TensorCopy(*out_size_t, platform::CPUPlace(), &sizes); - auto size_data = sizes.data(); - out_h = size_data[0]; - out_w = size_data[1]; - } - auto* output = output_t->mutable_data( - {out_dims[0], out_dims[1], out_h, out_w}, ctx.GetPlace()); - - int batch_size = input_t->dims()[0]; - int channels = input_t->dims()[1]; - int in_h = input_t->dims()[2]; - int in_w = input_t->dims()[3]; - - int in_hw = in_h * in_w; - int out_hw = out_h * out_w; - int in_chw = channels * in_hw; - int out_chw = channels * out_hw; - - T ratio_h = (out_h > 1) ? static_cast(in_h - 1) / (out_h - 1) : 0.f; - T ratio_w = (out_w > 1) ? static_cast(in_w - 1) / (out_w - 1) : 0.f; - - if (in_h == out_h && in_w == out_w) { - memcpy(output, input, input_t->numel() * sizeof(T)); - } else { - int threadNum = batch_size * out_chw; - int blocks = (threadNum + 1024 - 1) / 1024; - - KeBilinearInterpFw< - T><<>>( - input, in_h, in_w, batch_size, in_chw, output, out_h, out_w, - batch_size, out_chw, channels, ratio_h, ratio_w); - } - } -}; - -template -class BilinearInterpGradOpCUDAKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& ctx) const override { - auto* d_input_t = ctx.Output(framework::GradVarName("X")); - auto* d_output_t = ctx.Input(framework::GradVarName("Out")); - auto* d_output = d_output_t->data(); - auto* d_input = d_input_t->mutable_data(ctx.GetPlace()); - - auto& device_ctx = - ctx.template device_context(); - math::SetConstant zero; - zero(device_ctx, d_input_t, static_cast(0.0)); - - int out_h = ctx.Attr("out_h"); - int out_w = ctx.Attr("out_w"); - - auto out_size_t = ctx.Input("OutSize"); - if (out_size_t != nullptr) { - Tensor sizes; - framework::TensorCopy(*out_size_t, platform::CPUPlace(), &sizes); - auto size_data = sizes.data(); - out_h = size_data[0]; - out_w = size_data[1]; - } - - int batch_size = d_input_t->dims()[0]; - int channels = d_input_t->dims()[1]; - int in_h = d_input_t->dims()[2]; - int in_w = d_input_t->dims()[3]; - - int in_hw = in_h * in_w; - int out_hw = out_h * out_w; - int in_chw = channels * in_hw; - int out_chw = channels * out_hw; - - T ratio_h = (out_h > 1) ? static_cast(in_h - 1) / (out_h - 1) : 0.f; - T ratio_w = (out_w > 1) ? static_cast(in_w - 1) / (out_w - 1) : 0.f; - - if (in_h == out_h && in_w == out_w) { - memcpy(d_input, d_output, d_input_t->numel() * sizeof(T)); - } else { - int threadNum = batch_size * out_chw; - int blocks = (threadNum + 1024 - 1) / 1024; - - KeBilinearInterpBw< - T><<>>( - d_input, in_h, in_w, batch_size, in_chw, d_output, out_h, out_w, - batch_size, out_chw, channels, ratio_h, ratio_w); - } - } -}; - -} // namespace operators -} // namespace paddle - -namespace ops = paddle::operators; -REGISTER_OP_CUDA_KERNEL(bilinear_interp, - ops::BilinearInterpOpCUDAKernel); -REGISTER_OP_CUDA_KERNEL(bilinear_interp_grad, - ops::BilinearInterpGradOpCUDAKernel); diff --git a/paddle/fluid/operators/bilinear_interp_op.h b/paddle/fluid/operators/bilinear_interp_op.h deleted file mode 100644 index 70847cb8c1abe2e94bc844ab8117d1f23fea533b..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/bilinear_interp_op.h +++ /dev/null @@ -1,163 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve. - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - http://www.apache.org/licenses/LICENSE-2.0 - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. */ - -#pragma once -#include "paddle/fluid/framework/op_registry.h" -#include "paddle/fluid/operators/math/math_function.h" - -namespace paddle { -namespace operators { - -using Tensor = framework::Tensor; - -template -class BilinearInterpKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& ctx) const override { - auto* input_t = ctx.Input("X"); // float tensor - auto* output_t = ctx.Output("Out"); // float tensor - auto out_dims = output_t->dims(); - auto* input = input_t->data(); - int out_h = ctx.Attr("out_h"); - int out_w = ctx.Attr("out_w"); - auto out_size_t = ctx.Input("OutSize"); - if (out_size_t != nullptr) { - auto out_size_data = out_size_t->data(); - out_h = out_size_data[0]; - out_w = out_size_data[1]; - } - auto* output = output_t->mutable_data( - {out_dims[0], out_dims[1], out_h, out_w}, ctx.GetPlace()); - int batch_size = input_t->dims()[0]; - int channels = input_t->dims()[1]; - int in_h = input_t->dims()[2]; - int in_w = input_t->dims()[3]; - - int in_hw = in_h * in_w; - int out_hw = out_h * out_w; - int in_chw = channels * in_hw; - int out_chw = channels * out_hw; - - float ratio_h = - (out_h > 1) ? static_cast(in_h - 1) / (out_h - 1) : 0.f; - float ratio_w = - (out_w > 1) ? static_cast(in_w - 1) / (out_w - 1) : 0.f; - - if (in_h == out_h && in_w == out_w) { - memcpy(output, input, input_t->numel() * sizeof(T)); - } else { - for (int k = 0; k < batch_size; ++k) { // loop for batches - for (int i = 0; i < out_h; ++i) { // loop for images - int h = ratio_h * i; - int hid = (h < in_h - 1) ? 1 : 0; - float h1lambda = ratio_h * i - h; - float h2lambda = 1.f - h1lambda; - - for (int j = 0; j < out_w; ++j) { - int w = ratio_w * j; - int wid = (w < in_w - 1) ? 1 : 0; - float w1lambda = ratio_w * j - w; - float w2lambda = 1.f - w1lambda; - // calculate four position for bilinear interpolation - const T* in_pos = &input[k * in_chw + h * in_w + w]; - T* out_pos = &output[k * out_chw + i * out_w + j]; - - for (int c = 0; c < channels; ++c) { // loop for channels - // bilinear interpolation - out_pos[0] = static_cast( - h2lambda * (w2lambda * in_pos[0] + w1lambda * in_pos[wid]) + - h1lambda * (w2lambda * in_pos[hid * in_w] + - w1lambda * in_pos[hid * in_w + wid])); - in_pos += in_hw; - out_pos += out_hw; - } - } - } - } - } - } -}; - -template -class BilinearInterpGradKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& ctx) const override { - auto* d_input_t = ctx.Output(framework::GradVarName("X")); - auto* d_output_t = ctx.Input(framework::GradVarName("Out")); - auto* d_output = d_output_t->data(); - auto* d_input = d_input_t->mutable_data(ctx.GetPlace()); - auto& device_ctx = - ctx.template device_context(); - math::SetConstant zero; - zero(device_ctx, d_input_t, static_cast(0.0)); - - int out_h = ctx.Attr("out_h"); - int out_w = ctx.Attr("out_w"); - - auto out_size_t = ctx.Input("OutSize"); - if (out_size_t != nullptr) { - auto out_size_data = out_size_t->data(); - out_h = out_size_data[0]; - out_w = out_size_data[1]; - } - - int batch_size = d_input_t->dims()[0]; - int channels = d_input_t->dims()[1]; - int in_h = d_input_t->dims()[2]; - int in_w = d_input_t->dims()[3]; - - int in_hw = in_h * in_w; - int out_hw = out_h * out_w; - int in_chw = channels * in_hw; - int out_chw = channels * out_hw; - - float ratio_h = - (out_h > 1) ? static_cast(in_h - 1) / (out_h - 1) : 0.f; - float ratio_w = - (out_w > 1) ? static_cast(in_w - 1) / (out_w - 1) : 0.f; - - if (in_h == out_h && in_w == out_w) { - memcpy(d_input, d_output, d_input_t->numel() * sizeof(T)); - } else { - for (int k = 0; k < batch_size; ++k) { // loop for batches - for (int i = 0; i < out_h; ++i) { // loop for images - int h = ratio_h * i; - int hid = (h < in_h - 1) ? 1 : 0; - float h1lambda = ratio_h * i - h; - float h2lambda = 1 - h1lambda; - - for (int j = 0; j < out_w; ++j) { - int w = ratio_w * j; - int wid = (w < in_w - 1) ? 1 : 0; - float w1lambda = ratio_w * j - w; - float w2lambda = 1 - w1lambda; - T* in_pos = &d_input[k * in_chw + h * in_w + w]; - const T* out_pos = &d_output[k * out_chw + i * out_w + j]; - - for (int c = 0; c < channels; ++c) { // loop for channels - in_pos[0] += static_cast(h2lambda * w2lambda * out_pos[0]); - in_pos[wid] += static_cast(h2lambda * w1lambda * out_pos[0]); - in_pos[hid * in_w] += - static_cast(h1lambda * w2lambda * out_pos[0]); - in_pos[hid * in_w + wid] += - static_cast(h1lambda * w1lambda * out_pos[0]); - in_pos += in_hw; - out_pos += out_hw; - } - } - } - } - } - } -}; - -} // namespace operators -} // namespace paddle diff --git a/paddle/fluid/operators/bilinear_tensor_product_op.cc b/paddle/fluid/operators/bilinear_tensor_product_op.cc index 8d261a118a75ee16027faf60341cefd30c3cdbba..bd69f422e5dbd5a5dc95150b10daa302f47ec5ff 100644 --- a/paddle/fluid/operators/bilinear_tensor_product_op.cc +++ b/paddle/fluid/operators/bilinear_tensor_product_op.cc @@ -13,6 +13,9 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/operators/bilinear_tensor_product_op.h" +#include +#include +#include namespace paddle { namespace operators { @@ -121,15 +124,9 @@ class BilinearTensorProductOpGrad : public framework::OperatorWithKernel { "The second dimension of input(Out@GRAD) must be equal to " "the third dimension of the Input(Weight)."); - if (ctx->HasInput("Bias")) { - auto bias_dims = ctx->GetInputDim("Bias"); - PADDLE_ENFORCE_EQ( - bias_dims[1], out_dims[1], - "The second dimension of input(Out@GRAD) must be equal to " - "the second dimension of the Input(Bias)."); - auto bias_grad_name = framework::GradVarName("Bias"); - if (ctx->HasOutput(bias_grad_name)) - ctx->SetOutputDim(bias_grad_name, bias_dims); + auto bias_grad_name = framework::GradVarName("Bias"); + if (ctx->HasOutput(bias_grad_name)) { + ctx->SetOutputDim(bias_grad_name, {1, out_dims[1]}); } auto x_grad_name = framework::GradVarName("X"); @@ -148,13 +145,39 @@ class BilinearTensorProductOpGrad : public framework::OperatorWithKernel { } }; +class BilinearTensorProductGradOpDescMaker + : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + protected: + std::unique_ptr Apply() const override { + std::unique_ptr op(new framework::OpDesc()); + op->SetType("bilinear_tensor_product_grad"); + op->SetAttrMap(Attrs()); + op->SetInput("X", Input("X")); + op->SetInput("Y", Input("Y")); + op->SetInput("Weight", Input("Weight")); + if (ForwardOp().Inputs().count("Bias") > 0) { + op->SetOutput(framework::GradVarName("Bias"), InputGrad("Bias")); + } + + op->SetOutput(framework::GradVarName("X"), InputGrad("X")); + op->SetOutput(framework::GradVarName("Y"), InputGrad("Y")); + op->SetOutput(framework::GradVarName("Weight"), InputGrad("Weight")); + op->SetInput(framework::GradVarName("Out"), OutputGrad("Out")); + + return op; + } +}; + } // namespace operators } // namespace paddle namespace ops = paddle::operators; REGISTER_OPERATOR(bilinear_tensor_product, ops::BilinearTensorProductOp, ops::BilinearTensorProductOpMaker, - paddle::framework::DefaultGradOpDescMaker); + ops::BilinearTensorProductGradOpDescMaker); REGISTER_OPERATOR(bilinear_tensor_product_grad, ops::BilinearTensorProductOpGrad); REGISTER_OP_CPU_KERNEL( diff --git a/paddle/fluid/operators/bilinear_tensor_product_op.cu b/paddle/fluid/operators/bilinear_tensor_product_op.cu index 9426ffbe174c7daf9f24525f5f7ca12d986042f4..c2b4f69e6854522b91dfd9fb5f738c0e5ffc77b1 100644 --- a/paddle/fluid/operators/bilinear_tensor_product_op.cu +++ b/paddle/fluid/operators/bilinear_tensor_product_op.cu @@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#define EIGEN_USE_GPU #include "paddle/fluid/operators/bilinear_tensor_product_op.h" namespace ops = paddle::operators; diff --git a/paddle/fluid/operators/bilinear_tensor_product_op.h b/paddle/fluid/operators/bilinear_tensor_product_op.h index f23336f7b98d6d71d155373cff3515a8463aecbe..5017c3a457abc8865b9c20bec1c7c1429a4dfef4 100644 --- a/paddle/fluid/operators/bilinear_tensor_product_op.h +++ b/paddle/fluid/operators/bilinear_tensor_product_op.h @@ -70,7 +70,7 @@ class BilinearTensorProductKernel : public framework::OpKernel { if (bias) { auto bias_vec = EigenMatrix::From(*bias); Eigen::DSizes bcast(batch_size, 1); - output_mat.device(place) = bias_vec.broadcast(bcast) + output_mat; + output_mat.device(place) = bias_vec.broadcast(bcast).eval() + output_mat; } } }; @@ -99,13 +99,13 @@ class BilinearTensorProductGradKernel : public framework::OpKernel { auto d_out_mat = EigenMatrix::From(*d_out); auto& place = *ctx.template device_context().eigen_device(); auto& dev_ctx = ctx.template device_context(); - // Create the intermediate variable to caculate the Output(Y@Grad). + // Create the intermediate variable to calculate the Output(Y@Grad). Tensor x_scale; x_scale.mutable_data(framework::make_ddim({batch_size, x_dim}), ctx.GetPlace()); auto x_scale_mat = EigenMatrix::From(x_scale); - // Create the intermediate variable to caculate the Output(X@Grad). + // Create the intermediate variable to calculate the Output(X@Grad). Tensor y_scale; y_scale.mutable_data(framework::make_ddim({batch_size, y_dim}), ctx.GetPlace()); @@ -113,65 +113,64 @@ class BilinearTensorProductGradKernel : public framework::OpKernel { math::SetConstant set_zero; - // Set Output(X@Grad) be zero. if (d_x) { d_x->mutable_data(ctx.GetPlace()); set_zero(dev_ctx, d_x, static_cast(0)); } - // Set Output(Y@Grad) be zero. if (d_y) { d_y->mutable_data(ctx.GetPlace()); set_zero(dev_ctx, d_y, static_cast(0)); } + if (d_weight) { + d_weight->mutable_data(ctx.GetPlace()); + } + auto blas = math::GetBlas(ctx); // Caculate the Output(X@Grad) and Output(Y@Grad). - if (d_x || d_y) { + if (d_x || d_y || d_weight) { Eigen::DSizes bcast_for_x(1, y_dim); Eigen::DSizes bcast_for_y(1, x_dim); + Eigen::DSizes bcast_for_weight(1, x_dim); + for (int i = 0; i < out_dim; ++i) { Tensor weight_i = weight->Slice(i, i + 1).Resize( framework::make_ddim({x_dim, y_dim})); auto output_vec = d_out_mat.chip(i, 1); + if (d_x) { y_scale_mat.device(place) = output_vec.reshape(Eigen::DSizes(batch_size, 1)) - .broadcast(bcast_for_x) * + .broadcast(bcast_for_x) + .eval() * y_mat; blas.GEMM(CblasNoTrans, CblasTrans, batch_size, x_dim, y_dim, 1, y_scale.data(), weight_i.data(), 1, d_x->data()); } - if (d_y) { - x_scale_mat.device(place) = + + if (d_y || d_weight) { + auto output_vec_y = output_vec.reshape(Eigen::DSizes(batch_size, 1)) - .broadcast(bcast_for_y) * - x_mat; - blas.GEMM(CblasNoTrans, CblasNoTrans, batch_size, y_dim, x_dim, 1, - x_scale.data(), weight_i.data(), 1, d_y->data()); + .broadcast(bcast_for_y) + .eval(); + x_scale_mat.device(place) = output_vec_y * x_mat; + if (d_y) { + blas.GEMM(CblasNoTrans, CblasNoTrans, batch_size, y_dim, x_dim, 1, + x_scale.data(), weight_i.data(), 1, d_y->data()); + } + if (d_weight) { + Tensor d_weight_i = d_weight->Slice(i, i + 1).Resize( + framework::make_ddim({x_dim, y_dim})); + blas.GEMM(CblasTrans, CblasNoTrans, x_dim, y_dim, batch_size, 1, + x_scale.data(), y->data(), 0, d_weight_i.data()); + } } } } - // Caculate the gradient of Input(Weight). - if (d_weight) { - d_weight->mutable_data(ctx.GetPlace()); - Eigen::DSizes bcast_for_weight(1, x_dim); - for (int i = 0; i < out_dim; ++i) { - Tensor d_weight_i = d_weight->Slice(i, i + 1).Resize( - framework::make_ddim({x_dim, y_dim})); - auto output_vec = d_out_mat.chip(i, 1); - x_scale_mat.device(place) = - output_vec.reshape(Eigen::DSizes(batch_size, 1)) - .broadcast(bcast_for_weight) * - x_mat; - blas.GEMM(CblasTrans, CblasNoTrans, x_dim, y_dim, batch_size, 1, - x_scale.data(), y->data(), 0, d_weight_i.data()); - } - } - - // Caculate the gradient of Input(Bias). + // calculate the gradient of Input(Bias). if (d_bias) { d_bias->mutable_data(ctx.GetPlace()); auto d_bias_mat = framework::EigenVector::Flatten(*d_bias); diff --git a/paddle/fluid/operators/bpr_loss_op.cc b/paddle/fluid/operators/bpr_loss_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..b2dbaecfcfd67cc679d02e22d4e89cfedeeba80c --- /dev/null +++ b/paddle/fluid/operators/bpr_loss_op.cc @@ -0,0 +1,161 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/bpr_loss_op.h" +#include + +namespace paddle { +namespace operators { + +class BprLossOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should be not null."); + PADDLE_ENFORCE(ctx->HasInput("Label"), "Input(Label) should be not null."); + PADDLE_ENFORCE(ctx->HasOutput("Y"), "Output(Y) should be not null."); + + auto x_dims = ctx->GetInputDim("X"); + auto label_dims = ctx->GetInputDim("Label"); + int rank = x_dims.size(); + PADDLE_ENFORCE_EQ(rank, label_dims.size(), + "Input(X) and Input(Label) shall have the same rank."); + PADDLE_ENFORCE_EQ(framework::slice_ddim(x_dims, 0, rank - 1), + framework::slice_ddim(label_dims, 0, rank - 1), + "Input(X) and Input(Label) shall have the same shape " + "except the last dimension."); + + auto y_dims = x_dims; + y_dims[rank - 1] = 1; + ctx->SetOutputDim("Y", y_dims); + ctx->ShareLoD("X", /*->*/ "Y"); + } + + protected: + // Explicitly set that the data type of computation kernel of Seq-bpr + // is determined by its input "X". + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + return framework::OpKernelType(ctx.Input("X")->type(), + platform::CPUPlace()); + } +}; + +class BprLossGradientOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should be not null."); + PADDLE_ENFORCE(ctx->HasInput("Label"), "Input(Label) should be not null."); + PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Y")), + "Input(Y@GRAD) shoudl be not null."); + PADDLE_ENFORCE(ctx->HasOutput(framework::GradVarName("X")), + "Output(X@GRAD) should be not null."); + + auto x_dims = ctx->GetInputDim("X"); + auto label_dims = ctx->GetInputDim("Label"); + auto dy_dims = ctx->GetInputDim(framework::GradVarName("Y")); + int rank = x_dims.size(); + PADDLE_ENFORCE_EQ(dy_dims.size(), rank, + "Input(Y@Grad) and Input(X) should have the same rank."); + PADDLE_ENFORCE_EQ(label_dims.size(), rank, + "Input(Label) and Input(X) should have the same rank."); + PADDLE_ENFORCE_EQ(framework::slice_ddim(x_dims, 0, rank - 1), + framework::slice_ddim(label_dims, 0, rank - 1), + "The Input(X) and Input(Label) should have the same " + "shape except the last dimension."); + PADDLE_ENFORCE_EQ(framework::slice_ddim(x_dims, 0, rank - 1), + framework::slice_ddim(dy_dims, 0, rank - 1), + "The Input(X) and Input(Y@Grad) should have the same " + "shape except the last dimension."); + PADDLE_ENFORCE_EQ(dy_dims[rank - 1], 1, + "The last dimension of Input(Y@Grad) should be 1."); + PADDLE_ENFORCE_EQ(label_dims[rank - 1], 1, + " the last dimension of Input(Label) should be 1."); + ctx->SetOutputDim(framework::GradVarName("X"), x_dims); + ctx->ShareLoD("X", framework::GradVarName("X")); + } + + protected: + // Explicitly set that the data type of computation kernel of cross_entropy + // is determined by its input "X". + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + return framework::OpKernelType(ctx.Input("X")->type(), + platform::CPUPlace()); + } +}; + +class BprLossOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("X", + "(Tensor, default Tensor), a tensor whose last dimension " + "size is equal to the number of classes. This input is a " + "real number."); + AddInput( + "Label", + "(Tensor), the tensor which represents the ground truth. It has the " + "same shape with 'X' except the last dimension. the last dimension " + "size is 1."); + AddOutput("Y", + "(Tensor, default Tensor), a tensor whose shape is same " + "with 'X' except that the last dimension size is 1. It " + "represents the sequence bpr loss."); + AddComment(R"DOC( +Bayesian Personalized Ranking Loss Operator. + +This operator belongs to pairwise ranking loss. Label is the desired item. +The loss at a given point in one session is defined as: +$Y[i] = -\frac{1}{N_{i}} * \sum_{j=0}^{N_{i}}\log(\sigma(X[i, Label[i]]-X[i, j]))$ + +Learn more details by reading paper (https://arxiv.org/abs/1511.06939) + +)DOC"); + } +}; + +class BprLossGradDescMaker : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + protected: + std::unique_ptr Apply() const override { + std::unique_ptr op(new framework::OpDesc()); + op->SetType("bpr_loss_grad"); + op->SetInput("X", Input("X")); + op->SetInput("Label", Input("Label")); + op->SetInput(framework::GradVarName("Y"), OutputGrad("Y")); + op->SetOutput(framework::GradVarName("X"), InputGrad("X")); + op->SetAttrMap(Attrs()); + return op; + } +}; +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +using CPUCtx = paddle::platform::CPUDeviceContext; + +REGISTER_OPERATOR(bpr_loss, ops::BprLossOp, ops::BprLossOpMaker, + ops::BprLossGradDescMaker); +REGISTER_OPERATOR(bpr_loss_grad, ops::BprLossGradientOp); +REGISTER_OP_CPU_KERNEL(bpr_loss, ops::BprLossOpKernel, + ops::BprLossOpKernel); +REGISTER_OP_CPU_KERNEL(bpr_loss_grad, + ops::BprLossGradientOpKernel, + ops::BprLossGradientOpKernel); diff --git a/paddle/fluid/operators/bpr_loss_op.h b/paddle/fluid/operators/bpr_loss_op.h new file mode 100644 index 0000000000000000000000000000000000000000..f9570e4e2ed0d9ac8739410eb7cd7397ad09fae4 --- /dev/null +++ b/paddle/fluid/operators/bpr_loss_op.h @@ -0,0 +1,118 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include "paddle/fluid/framework/eigen.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/math/math_function.h" +#include "paddle/fluid/platform/for_range.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; +/*Todo: + *Find a way to adapt TolerableValue, using blas or eigen. + */ +template +struct TolerableValue { + HOSTDEVICE T operator()(const T& x) const { + PADDLE_ASSERT(std::is_floating_point::value); + const T kApproInf = 1e20; + if (x == INFINITY) return kApproInf; + if (x == -INFINITY) return -kApproInf; + return x; + } +}; + +template +class BprLossOpKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* x = ctx.Input("X"); + auto* label = ctx.Input("Label"); + auto* y = ctx.Output("Y"); + y->mutable_data(ctx.GetPlace()); + int rank = x->dims().size(); + + Tensor x_2d = framework::ReshapeToMatrix(*x, rank - 1); + Tensor labels_2d = framework::ReshapeToMatrix(*label, rank - 1); + Tensor y_2d = framework::ReshapeToMatrix(*y, rank - 1); + + const framework::Tensor* logits = &x_2d; + const framework::Tensor* labels = &labels_2d; + framework::Tensor* out = &y_2d; + + const int step_size = logits->dims()[0]; + const int class_num = logits->dims()[1]; + const T* logits_data = logits->data(); + T* loss_data = out->data(); + + const int64_t* label_data = labels->data(); + for (int i = 0; i < step_size; ++i) { + int lbl_pos = label_data[i]; + PADDLE_ENFORCE_GE(lbl_pos, 0); + PADDLE_ENFORCE_LT(lbl_pos, class_num); + int index_pos = i * class_num + lbl_pos; + T sum = static_cast(0); + for (int j = 0; j < class_num; j++) { + if (j == lbl_pos) continue; + int index_neg = i * class_num + j; + sum += TolerableValue()(-std::log( + 1.0f + TolerableValue()(std::exp(logits_data[index_neg] - + logits_data[index_pos])))); + } + loss_data[i] = -sum / (class_num - 1); + } + } +}; + +template +class BprLossGradientOpKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* x = ctx.Input("X"); + auto* dy = ctx.Input(framework::GradVarName("Y")); + auto* label = ctx.Input("Label"); + auto* dx = ctx.Output(framework::GradVarName("X")); + + const size_t step_size = static_cast(x->dims()[0]); + const size_t num_classes = static_cast(x->dims()[1]); + T* dx_data = dx->mutable_data(ctx.GetPlace()); + const T* dy_data = dy->data(); + const T* x_data = x->data(); + const int64_t* label_data = label->data(); + + for (size_t sample_id = 0; sample_id < step_size; sample_id++) { + for (size_t x_offset = sample_id * num_classes; + x_offset < (sample_id + 1) * num_classes; x_offset++) { + dx_data[x_offset] = static_cast(0); + } + auto p_index = sample_id * num_classes + label_data[sample_id]; + for (size_t ni = 0; ni < num_classes; ni++) { + if (label_data[sample_id] == ni) continue; + auto n_index = sample_id * num_classes + ni; + auto grad_ = -dy_data[sample_id] / + ((num_classes - 1) * + (1.0f + TolerableValue()(std::exp(x_data[p_index] - + x_data[n_index])))); + dx_data[p_index] += grad_; + dx_data[n_index] -= grad_; + } + } + } +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/cast_op.cc b/paddle/fluid/operators/cast_op.cc index 8d6a498dc941e44688ec8a2b49a6e080608f9b85..0c517cc757ca3f6f1ff7f4191ab2d529890b7154 100644 --- a/paddle/fluid/operators/cast_op.cc +++ b/paddle/fluid/operators/cast_op.cc @@ -13,6 +13,7 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/operators/cast_op.h" +#include #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/platform/float16.h" @@ -30,7 +31,8 @@ class CastOpProtoMaker : public framework::OpProtoAndCheckerMaker { Cast Operator. This Operator casts the input tensor to another data type and -returns tha Output Tensor. +returns the Output Tensor. It's meaningless if the output dtype equals +the input dtype, but it's fine if you do so. )DOC"); } diff --git a/paddle/fluid/operators/checkpoint_notify_op.cc b/paddle/fluid/operators/checkpoint_notify_op.cc deleted file mode 100644 index 3a2527e407bb179c4873fa3ffe2e8f22fb47faf7..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/checkpoint_notify_op.cc +++ /dev/null @@ -1,88 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include // NOLINT -#include - -#include "paddle/fluid/framework/data_type.h" -#include "paddle/fluid/framework/lod_tensor.h" -#include "paddle/fluid/framework/op_registry.h" -#include "paddle/fluid/operators/detail/macros.h" -#include "paddle/fluid/operators/send_recv_util.h" -#include "paddle/fluid/string/printf.h" - -namespace paddle { -namespace operators { - -class CheckpointNotifyOp : public framework::OperatorBase { - public: - CheckpointNotifyOp(const std::string& type, - const framework::VariableNameMap& inputs, - const framework::VariableNameMap& outputs, - const framework::AttributeMap& attrs) - : OperatorBase(type, inputs, outputs, attrs) {} - - void RunImpl(const framework::Scope& scope, - const platform::Place& place) const override { - std::vector epmap = Attr>("epmap"); - std::string dir = Attr("dir"); - std::string lookup_table_name = Attr("lookup_table"); - - distributed::RPCClient* rpc_client = - distributed::RPCClient::GetInstance(); - for (size_t i = 0; i < epmap.size(); i++) { - auto lookup_table_save_dir = - string::Sprintf("%s/%s_%d", dir, lookup_table_name, i); - rpc_client->AsyncCheckpointNotify(epmap[i], lookup_table_save_dir); - VLOG(3) << "checkpoint notify sending lookup table: " << lookup_table_name - << " and dir:" << dir << " to " << epmap[i]; - } - PADDLE_ENFORCE(rpc_client->Wait(), "internal error in RPCClient"); - } -}; - -class CheckpointNotifyOpMaker : public framework::OpProtoAndCheckerMaker { - public: - void Make() { - AddAttr>("epmap", - "(string vector, default 127.0.0.1:6164)" - "Parameter Server endpoints in the order") - .SetDefault({"127.0.0.1:6164"}); - AddAttr( - "dir", "(string, default '') indicate the folder checkpoint will use"); - AddAttr("lookup_table", - "(string, default '') the lookup table name"); - AddComment(R"DOC( -CheckpointNotify operator - -This operator will send lookup table and it's checkpoint direcoty to listen_and_serve op at -the parameter server. -)DOC"); - } -}; - -class CheckpointNotifyOpShapeInference : public framework::InferShapeBase { - public: - void operator()(framework::InferShapeContext* ctx) const override {} -}; - -} // namespace operators -} // namespace paddle - -namespace ops = paddle::operators; - -REGISTER_OPERATOR(checkpoint_notify, ops::CheckpointNotifyOp, - paddle::framework::EmptyGradOpMaker, - ops::CheckpointNotifyOpMaker, - ops::CheckpointNotifyOpShapeInference); diff --git a/paddle/fluid/operators/clip_by_norm_op.cc b/paddle/fluid/operators/clip_by_norm_op.cc index eae86a373be278cbb3ea9425b2ff0169f8faa99e..5720b295ecf8171540803aaadff43dfdcb20553b 100644 --- a/paddle/fluid/operators/clip_by_norm_op.cc +++ b/paddle/fluid/operators/clip_by_norm_op.cc @@ -14,69 +14,10 @@ limitations under the License. */ #include "paddle/fluid/operators/clip_by_norm_op.h" -namespace paddle { -namespace operators { - -class ClipByNormOp : public framework::OperatorWithKernel { - public: - using framework::OperatorWithKernel::OperatorWithKernel; - - protected: - void InferShape(framework::InferShapeContext* ctx) const override { - PADDLE_ENFORCE(ctx->HasInput("X"), - "Input(X) of ClipByNormOp should not be null."); - PADDLE_ENFORCE(ctx->HasOutput("Out"), - "Output(Out) of ClipByNormOp should not be null."); - auto max_norm = ctx->Attrs().Get("max_norm"); - PADDLE_ENFORCE_GT(max_norm, 0, "max_norm should be greater than 0."); - auto x_dims = ctx->GetInputDim("X"); - ctx->SetOutputDim("Out", x_dims); - ctx->ShareLoD("X", /*->*/ "Out"); - } -}; - -class ClipByNormOpMaker : public framework::OpProtoAndCheckerMaker { - public: - void Make() override { - AddInput("X", - "(Tensor) The input of clip_by_norm op." - "The number of dimensions must be between [1, 9]."); - AddOutput("Out", - "(Tensor) The output of clip_by_norm op with shape as input(X)"); - AddAttr("max_norm", "(float) The maximum norm value."); - AddComment(R"DOC( -ClipByNorm Operator. - -This operator limits the L2 norm of the input $X$ within $max\_norm$. -If the L2 norm of $X$ is less than or equal to $max\_norm$, $Out$ will be -the same as $X$. If the L2 norm of $X$ is greater than $max\_norm$, $X$ will -be linearly scaled to make the L2 norm of $Out$ equal to $max\_norm$, as -shown in the following formula: - -$$ -Out = \\frac{max\\_norm * X}{norm(X)}, -$$ - -where $norm(X)$ represents the L2 norm of $X$. - -Examples: - .. code-block:: python - - data = fluid.layer.data( - name='data', shape=[2, 4, 6], dtype='float32') - reshaped = fluid.layers.clip_by_norm( - x=data, max_norm=0.5) - -)DOC"); - } -}; - -} // namespace operators -} // namespace paddle - namespace ops = paddle::operators; REGISTER_OP_WITHOUT_GRADIENT(clip_by_norm, ops::ClipByNormOp, ops::ClipByNormOpMaker); + REGISTER_OP_CPU_KERNEL( clip_by_norm, ops::ClipByNormKernel); diff --git a/paddle/fluid/operators/clip_by_norm_op.h b/paddle/fluid/operators/clip_by_norm_op.h index 855c4d70677395992e2bf685c910cbea2d37b20b..d8baa4b8b235fdea7a3dc51ac7db1c004d49334a 100644 --- a/paddle/fluid/operators/clip_by_norm_op.h +++ b/paddle/fluid/operators/clip_by_norm_op.h @@ -64,7 +64,7 @@ class ClipByNormKernel : public framework::OpKernel { output->mutable_data(context.GetPlace()); } else { PADDLE_THROW("Unexpected branch, input variable type is %s", - in_var->Type().name()); + framework::ToTypeName(in_var->Type())); } PADDLE_ENFORCE_NOT_NULL(input); @@ -83,5 +83,59 @@ class ClipByNormKernel : public framework::OpKernel { } }; +class ClipByNormOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + protected: + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), + "Input(X) of ClipByNormOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("Out"), + "Output(Out) of ClipByNormOp should not be null."); + auto max_norm = ctx->Attrs().Get("max_norm"); + PADDLE_ENFORCE_GT(max_norm, 0, "max_norm should be greater than 0."); + auto x_dims = ctx->GetInputDim("X"); + ctx->SetOutputDim("Out", x_dims); + ctx->ShareLoD("X", /*->*/ "Out"); + } +}; + +class ClipByNormOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("X", + "(Tensor) The input of clip_by_norm op." + "The number of dimensions must be between [1, 9]."); + AddOutput("Out", + "(Tensor) The output of clip_by_norm op with shape as input(X)"); + AddAttr("max_norm", "(float) The maximum norm value."); + AddComment(R"DOC( +ClipByNorm Operator. + +This operator limits the L2 norm of the input $X$ within $max\_norm$. +If the L2 norm of $X$ is less than or equal to $max\_norm$, $Out$ will be +the same as $X$. If the L2 norm of $X$ is greater than $max\_norm$, $X$ will +be linearly scaled to make the L2 norm of $Out$ equal to $max\_norm$, as +shown in the following formula: + +$$ +Out = \\frac{max\\_norm * X}{norm(X)}, +$$ + +where $norm(X)$ represents the L2 norm of $X$. + +Examples: + .. code-block:: python + + data = fluid.layer.data( + name='data', shape=[2, 4, 6], dtype='float32') + reshaped = fluid.layers.clip_by_norm( + x=data, max_norm=0.5) + +)DOC"); + } +}; + } // namespace operators } // namespace paddle diff --git a/paddle/fluid/operators/clip_op.cc b/paddle/fluid/operators/clip_op.cc index a679f7e2536a0a44148193f423f5ffe11b5e35fc..4fc6ae365ec61326670775ab13b854235f19266f 100644 --- a/paddle/fluid/operators/clip_op.cc +++ b/paddle/fluid/operators/clip_op.cc @@ -13,6 +13,7 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/operators/clip_op.h" +#include namespace paddle { namespace operators { @@ -76,12 +77,28 @@ class ClipOpGrad : public framework::OperatorWithKernel { } }; +class ClipGradOpDescMaker : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + protected: + std::unique_ptr Apply() const override { + std::unique_ptr op(new framework::OpDesc()); + op->SetType("clip_grad"); + op->SetInput("X", Input("X")); + op->SetInput(framework::GradVarName("Out"), OutputGrad("Out")); + op->SetOutput(framework::GradVarName("X"), InputGrad("X")); + op->SetAttrMap(Attrs()); + return op; + } +}; + } // namespace operators } // namespace paddle namespace ops = paddle::operators; REGISTER_OPERATOR(clip, ops::ClipOp, ops::ClipOpMaker, - paddle::framework::DefaultGradOpDescMaker); + ops::ClipGradOpDescMaker); REGISTER_OPERATOR(clip_grad, ops::ClipOpGrad); REGISTER_OP_CPU_KERNEL( clip, ops::ClipKernel); diff --git a/paddle/fluid/operators/compare_op.cc b/paddle/fluid/operators/compare_op.cc deleted file mode 100644 index f40b1ba338d429c248103eeb930ac7e1bb690218..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/compare_op.cc +++ /dev/null @@ -1,115 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/operators/compare_op.h" -#include -#include "paddle/fluid/framework/op_registry.h" - -namespace paddle { -namespace operators { -template -class CompareOpProtoMaker : public framework::OpProtoAndCheckerMaker { - public: - void Make() override { - OpComment comment; - AddInput("X", string::Sprintf("the left hand operand of %s operator", - comment.type)); - AddInput("Y", string::Sprintf("the right hand operand of %s operator", - comment.type)); - AddAttr("force_cpu", - "Force fill output variable to cpu " - "memory. Otherwise, fill output variable to the running " - "device [default true].") - .SetDefault(true); - AddOutput("Out", string::Sprintf("n-dim bool tensor. Each element is %s", - comment.equation)); - AddComment(string::Sprintf(R"DOC( -It operates element-wise on X and Y, and returns the Out. Each of them is a -N-dim tensor. X and Y could be any type. The each element of the Out tensor is -calculated by $%s$ -)DOC", - comment.equation)); - AddAttr( - "axis", - "The start dimension index for broadcasting Y onto X. [default -1]") - .SetDefault(-1) - .EqualGreaterThan(-1); - } -}; - -template -class CompareOpInferShape : public framework::InferShapeBase { - public: - void operator()(framework::InferShapeContext *context) const override { - OpComment comment; - PADDLE_ENFORCE(context->HasInput("X"), "%s operator must has input X", - comment.type); - PADDLE_ENFORCE(context->HasInput("Y"), "%s operator must has input Y", - comment.type); - auto dim_x = context->GetInputDim("X"); - auto dim_y = context->GetInputDim("Y"); - PADDLE_ENFORCE_GE(dim_x.size(), dim_y.size(), - "The size of dim_y should not be greater than dim_x's."); - - context->SetOutputDim("Out", context->GetInputDim("X")); - context->ShareLoD("X", "Out"); - } -}; - -class CompareOp : public framework::OperatorWithKernel { - public: - using framework::OperatorWithKernel::OperatorWithKernel; - - protected: - framework::OpKernelType GetExpectedKernelType( - const framework::ExecutionContext &ctx) const override { - framework::OpKernelType kt = OperatorWithKernel::GetExpectedKernelType(ctx); - // CompareOp kernel's device type is decided by input tensor place - bool force_cpu = ctx.Attr("force_cpu"); - kt.place_ = force_cpu ? platform::CPUPlace() - : ctx.Input("X")->place(); - return kt; - } -}; - -} // namespace operators -} // namespace paddle - -#define REGISTER_COMPARE_OP(op_type, _equation) \ - struct _##op_type##Comment { \ - static char type[]; \ - static char equation[]; \ - }; \ - char _##op_type##Comment::type[]{#op_type}; \ - char _##op_type##Comment::equation[]{_equation}; \ - REGISTER_OPERATOR( \ - op_type, ::paddle::operators::CompareOp, \ - ::paddle::operators::CompareOpProtoMaker<_##op_type##Comment>, \ - ::paddle::operators::CompareOpInferShape<_##op_type##Comment>, \ - ::paddle::framework::EmptyGradOpMaker); - -REGISTER_COMPARE_OP(less_than, "Out = X < Y"); -REGISTER_COMPARE_KERNEL(less_than, CPU, paddle::operators::LessThanFunctor); -REGISTER_COMPARE_OP(less_equal, "Out = X <= Y"); -REGISTER_COMPARE_KERNEL(less_equal, CPU, paddle::operators::LessEqualFunctor); -REGISTER_COMPARE_OP(greater_than, "Out = X > Y"); -REGISTER_COMPARE_KERNEL(greater_than, CPU, - paddle::operators::GreaterThanFunctor); -REGISTER_COMPARE_OP(greater_equal, "Out = X >= Y"); -REGISTER_COMPARE_KERNEL(greater_equal, CPU, - paddle::operators::GreaterEqualFunctor); -REGISTER_COMPARE_OP(equal, "Out = X == Y"); -REGISTER_COMPARE_KERNEL(equal, CPU, paddle::operators::EqualFunctor); -REGISTER_COMPARE_OP(not_equal, "Out = X != Y"); -REGISTER_COMPARE_KERNEL(not_equal, CPU, paddle::operators::NotEqualFunctor); diff --git a/paddle/fluid/operators/compare_op.cu b/paddle/fluid/operators/compare_op.cu deleted file mode 100644 index 1bf85c64fb5b4d79c62118959fd72b13ed1c63ed..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/compare_op.cu +++ /dev/null @@ -1,24 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/operators/compare_op.h" - -REGISTER_COMPARE_KERNEL(less_than, CUDA, paddle::operators::LessThanFunctor); -REGISTER_COMPARE_KERNEL(less_equal, CUDA, paddle::operators::LessEqualFunctor); -REGISTER_COMPARE_KERNEL(greater_than, CUDA, - paddle::operators::GreaterThanFunctor); -REGISTER_COMPARE_KERNEL(greater_equal, CUDA, - paddle::operators::GreaterEqualFunctor); -REGISTER_COMPARE_KERNEL(equal, CUDA, paddle::operators::EqualFunctor); -REGISTER_COMPARE_KERNEL(not_equal, CUDA, paddle::operators::NotEqualFunctor); diff --git a/paddle/fluid/operators/compare_op.h b/paddle/fluid/operators/compare_op.h deleted file mode 100644 index 1cbabdaf6767815c1fedba0eabec9b5de678e047..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/compare_op.h +++ /dev/null @@ -1,100 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#pragma once -#include -#include -#include "paddle/fluid/framework/op_registry.h" -#include "paddle/fluid/operators/elementwise_op_function.h" -#include "paddle/fluid/platform/transform.h" - -namespace paddle { -namespace operators { - -template -struct LessThanFunctor { - using ELEM_TYPE = T; - HOSTDEVICE bool operator()(const T& a, const T& b) const { return a < b; } -}; - -template -struct LessEqualFunctor { - using ELEM_TYPE = T; - HOSTDEVICE bool operator()(const T& a, const T& b) const { return a <= b; } -}; - -template -struct GreaterThanFunctor { - using ELEM_TYPE = T; - HOSTDEVICE bool operator()(const T& a, const T& b) const { return a > b; } -}; - -template -struct GreaterEqualFunctor { - using ELEM_TYPE = T; - HOSTDEVICE bool operator()(const T& a, const T& b) const { return a >= b; } -}; - -template -struct EqualFunctor { - using ELEM_TYPE = T; - HOSTDEVICE bool operator()(const T& a, const T& b) const { - if (std::is_floating_point::value) { - // This branch will be optimized while compiling if T is integer. It is - // safe to cast a and b to double. - return fabs(static_cast(a - b)) < 1e-8; - } else { - return (a == b); - } - } -}; - -template -struct NotEqualFunctor { - using ELEM_TYPE = T; - HOSTDEVICE bool operator()(const T& a, const T& b) const { - return !EqualFunctor()(a, b); - } -}; - -template -class CompareOpKernel - : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& context) const override { - using T = typename Functor::ELEM_TYPE; - using Tensor = framework::Tensor; - - auto* x = context.Input("X"); - auto* y = context.Input("Y"); - auto* z = context.Output("Out"); - int axis = context.Attr("axis"); - ElementwiseComputeEx(context, x, y, axis, - Functor(), z); - } -}; - -} // namespace operators -} // namespace paddle - -#define REGISTER_COMPARE_KERNEL(op_type, dev, functor) \ - REGISTER_OP_##dev##_KERNEL( \ - op_type, ::paddle::operators::CompareOpKernel< \ - ::paddle::platform::dev##DeviceContext, functor>, \ - ::paddle::operators::CompareOpKernel< \ - ::paddle::platform::dev##DeviceContext, functor>, \ - ::paddle::operators::CompareOpKernel< \ - ::paddle::platform::dev##DeviceContext, functor>, \ - ::paddle::operators::CompareOpKernel< \ - ::paddle::platform::dev##DeviceContext, functor>); diff --git a/paddle/fluid/operators/concat_op.cc b/paddle/fluid/operators/concat_op.cc index 57817da71adfd80faad29a48b05ba2f326de6c07..1f71555180361a1522b7a1c8383fe128bc4edcd0 100644 --- a/paddle/fluid/operators/concat_op.cc +++ b/paddle/fluid/operators/concat_op.cc @@ -13,10 +13,14 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/operators/concat_op.h" - +#include #include #include +#ifdef PADDLE_WITH_MKLDNN +#include +#endif + namespace paddle { namespace operators { using framework::Tensor; @@ -47,9 +51,19 @@ class ConcatOp : public framework::OperatorWithKernel { if (j == axis) { out_dims[axis] += ins[i][j]; } else { - PADDLE_ENFORCE_EQ(out_dims[j], ins[i][j], - "Input tensors should have the same " - "elements except the specify axis."); + if (ctx->IsRuntime()) { + // check all shape in run time + PADDLE_ENFORCE_EQ(out_dims[j], ins[i][j], + "Input tensors should have the same " + "elements except the specify axis."); + } else { + // not check -1 with other in compile time + if (out_dims[j] > 0 && ins[i][j] > 0) { + PADDLE_ENFORCE_EQ(out_dims[j], ins[i][j], + "Input tensors should have the same " + "elements except the specify axis."); + } + } } } } @@ -59,6 +73,22 @@ class ConcatOp : public framework::OperatorWithKernel { ctx->SetOutputDim("Out", out_dims); ctx->ShareLoD("X", /*->*/ "Out"); } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext &ctx) const override { + auto input_data_type = + framework::GetDataTypeOfVar(ctx.MultiInputVar("X")[0]); + +#ifdef PADDLE_WITH_MKLDNN + if (platform::CanMKLDNNBeUsed(ctx)) { + return framework::OpKernelType(input_data_type, ctx.GetPlace(), + framework::DataLayout::kMKLDNN, + framework::LibraryType::kMKLDNN); + } +#endif + return framework::OpKernelType(input_data_type, ctx.GetPlace()); + } }; class ConcatOpMaker : public framework::OpProtoAndCheckerMaker { @@ -66,6 +96,10 @@ class ConcatOpMaker : public framework::OpProtoAndCheckerMaker { void Make() override { AddInput("X", "Input tensors of concat operator.").AsDuplicable(); AddOutput("Out", "Output tensor of concat operator."); + AddAttr( + "use_mkldnn", + "(bool, default false) Indicates if MKL-DNN kernel will be used") + .SetDefault(false); AddAttr("axis", "The axis along which the input tensors will be concatenated.") .SetDefault(0); @@ -87,11 +121,7 @@ Examples: class ConcatOpGrad : public framework::OperatorWithKernel { public: - ConcatOpGrad(const std::string &type, - const framework::VariableNameMap &inputs, - const framework::VariableNameMap &outputs, - const framework::AttributeMap &attrs) - : OperatorWithKernel(type, inputs, outputs, attrs) {} + using framework::OperatorWithKernel::OperatorWithKernel; void InferShape(framework::InferShapeContext *ctx) const override { auto in_x = "X"; @@ -109,6 +139,33 @@ class ConcatOpGrad : public framework::OperatorWithKernel { } } } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext &ctx) const override { + return framework::OpKernelType( + ctx.Input(framework::GradVarName("Out"))->type(), + ctx.GetPlace()); + } +}; + +DECLARE_NO_NEED_BUFFER_VARS_INFERENCE(ConcatOpGradNoNeedBufferVarInference, + "X"); + +class ConcatGradOpDescMaker : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + protected: + std::unique_ptr Apply() const override { + std::unique_ptr op(new framework::OpDesc()); + op->SetType("concat_grad"); + op->SetInput("X", Input("X")); + op->SetInput(framework::GradVarName("Out"), OutputGrad("Out")); + op->SetOutput(framework::GradVarName("X"), InputGrad("X", false)); + op->SetAttrMap(Attrs()); + return op; + } }; } // namespace operators @@ -116,9 +173,9 @@ class ConcatOpGrad : public framework::OperatorWithKernel { namespace ops = paddle::operators; REGISTER_OPERATOR(concat, ops::ConcatOp, ops::ConcatOpMaker, - paddle::framework::DefaultGradOpDescMaker< - false> /* set false to disable empty grad */); -REGISTER_OPERATOR(concat_grad, ops::ConcatOpGrad); + ops::ConcatGradOpDescMaker); +REGISTER_OPERATOR(concat_grad, ops::ConcatOpGrad, + ops::ConcatOpGradNoNeedBufferVarInference); REGISTER_OP_CPU_KERNEL( concat, ops::ConcatKernel, ops::ConcatKernel, diff --git a/paddle/fluid/operators/concat_op.h b/paddle/fluid/operators/concat_op.h index b2c6495c442cd02679825425becc2160c303dcc6..bd474be0facb349c53a8766412311296383a86c5 100644 --- a/paddle/fluid/operators/concat_op.h +++ b/paddle/fluid/operators/concat_op.h @@ -17,7 +17,7 @@ limitations under the License. */ #include #include #include "paddle/fluid/framework/op_registry.h" -#include "paddle/fluid/operators/math/concat.h" +#include "paddle/fluid/operators/math/concat_and_split.h" #include "paddle/fluid/operators/strided_memcpy.h" namespace paddle { @@ -89,29 +89,17 @@ class ConcatGradKernel : public framework::OpKernel { outputs.push_back(nullptr); } } + auto& dev_ctx = ctx.template device_context(); // Sometimes direct copies will be faster, this maybe need deeply analysis. if (axis == 0 && outs.size() < 10) { - size_t input_offset = 0; - const auto in_stride = framework::stride_numel(out_grad->dims()); - - for (size_t i = 0; i < outs.size(); ++i) { - auto out_stride = framework::stride_numel(ins[i]->dims()); - auto* out = outputs[i]; - if (out != nullptr) { - StridedNumelCopyWithAxis( - ctx.device_context(), axis, out->data(), out_stride, - out_grad->data() + input_offset, in_stride, out_stride[axis]); - } - input_offset += out_stride[axis]; - } + std::vector ref_shape; + ref_shape.insert(ref_shape.begin(), ins.begin(), ins.end()); + StridedMemcpyWithAxis0(dev_ctx, *out_grad, ref_shape, &outputs); } else { - auto& dev_ctx = ctx.template device_context(); - paddle::operators::math::ConcatGradFunctor - concat_grad_functor; - concat_grad_functor(dev_ctx, *out_grad, - ctx.MultiInput("X"), - static_cast(axis), &outputs); + math::SplitFunctor split_functor; + split_functor(dev_ctx, *out_grad, ctx.MultiInput("X"), + static_cast(axis), &outputs); } } }; diff --git a/paddle/fluid/operators/conditional_block_op.cc b/paddle/fluid/operators/conditional_block_op.cc deleted file mode 100644 index 135254ce6b6bf9add7bb1f0c3f645ed47081fba4..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/conditional_block_op.cc +++ /dev/null @@ -1,257 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ -#include -#include "paddle/fluid/framework/executor.h" -#include "paddle/fluid/framework/op_registry.h" -#include "paddle/fluid/framework/var_type.h" - -namespace paddle { -namespace operators { - -class ConditionalOp : public framework::OperatorBase { - public: - ConditionalOp(const std::string &type, - const framework::VariableNameMap &inputs, - const framework::VariableNameMap &outputs, - const framework::AttributeMap &attrs) - : OperatorBase(type, inputs, outputs, attrs) {} - - protected: - std::vector InputTensors( - const framework::Scope &scope, const std::string &in_name) const { - std::vector retv; - auto xs = Inputs(in_name); - retv.resize(xs.size(), nullptr); - std::transform( - xs.begin(), xs.end(), retv.begin(), - [&scope](const std::string &var_name) -> const framework::LoDTensor * { - auto *var = scope.FindVar(var_name); - PADDLE_ENFORCE(var != nullptr, "Cannot find variable %s", var_name); - return &var->Get(); - }); - return retv; - } - - bool ScalarCondition( - const std::vector &ips) const { - if (!(ips.size() == 1UL && ips[0]->IsInitialized())) { - PADDLE_THROW("should have one initialized input as condition"); - } - if (!(framework::IsType(ips[0]->type()) && // NOLINT - ips[0]->numel() == 1)) { - PADDLE_THROW( - "condition input's data type should be bool, " - "numel should be 1, actual numel is %d", - ips[0]->numel()); - } - bool res = false; - if (platform::is_gpu_place(ips[0]->place())) { -#ifdef PADDLE_WITH_CUDA - framework::LoDTensor cpu_tensor; - framework::TensorCopy(*ips[0], platform::CPUPlace(), &cpu_tensor); - platform::DeviceContextPool::Instance().Get(ips[0]->place())->Wait(); - res = cpu_tensor.data()[0]; -#endif - } else { - res = ips[0]->data()[0]; - } - return res; - } -}; - -class ConditionalBlockOp : public ConditionalOp { - public: - ConditionalBlockOp(const std::string &type, - const framework::VariableNameMap &inputs, - const framework::VariableNameMap &outputs, - const framework::AttributeMap &attrs) - : ConditionalOp(type, inputs, outputs, attrs) {} - - private: - void RunImpl(const framework::Scope &scope, - const platform::Place &dev_place) const override { - bool need_run; - if (Attr("is_scalar_condition")) { - // When is_scalar_condition is True, the conditional variable is a scalar, - // whether need to execute the operators in sub-block depends on the - // conditional variable (Cond). - auto xs = InputTensors(scope, "Cond"); - need_run = ScalarCondition(xs); - } else { - // When is_scalar_condition is False, the conditional variable maybe a - // vector or tensor, whether need to execute the operators in sub-block - // depends on the input variables (Input). - auto xs = InputTensors(scope, "Input"); - need_run = std::all_of( - xs.begin(), xs.end(), - [](const framework::LoDTensor *t) { return t->numel() != 0; }); - } - - if (need_run) { - auto *scope_var = scope.FindVar(Output("Scope")); - PADDLE_ENFORCE(scope_var != nullptr, "Must set scope"); - auto *scopes = scope_var->GetMutable>(); - scopes->resize(1); - scopes->front() = &scope.NewScope(); - auto &cur_scope = *scopes->front(); - - framework::Executor exec(dev_place); - auto *block = Attr("sub_block"); - exec.Run(*block->Program(), &cur_scope, block->ID(), false); - } - } -}; - -class ConditionalBlockOpProtoMaker : public framework::OpProtoAndCheckerMaker { - public: - void Make() override { - AddInput("Cond", - "The conditional variable of this operator. If Cond is empty, the " - "whole sub-block will not be executed.") - .AsDuplicable(); - AddInput("Input", "The input variables of the sub-block.").AsDuplicable(); - AddOutput("Out", "The output variables of the sub-block.").AsDuplicable(); - AddOutput("Scope", - "(std::vector) The step scope of conditional block. To " - "unify the conditional block, rnn and while op, the type of " - "scope is std::vector"); - AddAttr( - "sub_block", "The step block of conditional block operator"); - AddAttr("is_scalar_condition", - "The conditional variable (Cond) is used as scalar " - "condition.") - .SetDefault(false); - AddComment(R"DOC(Conditional block operator - -If `is_scalar_condition` is True, the conditional variable (Cond) is a scalar, -run the operators in sub-block if Cond is True. - -If `is_scalar_condition` is False, the conditional variable (Cond) is a vector or -tensor, run the operators in sub-block if all of input variables are not empty. - - -)DOC"); - } -}; - -class ConditionalBlockGradOp : public ConditionalOp { - public: - ConditionalBlockGradOp(const std::string &type, - const framework::VariableNameMap &inputs, - const framework::VariableNameMap &outputs, - const framework::AttributeMap &attrs) - : ConditionalOp(type, inputs, outputs, attrs) {} - - private: - void RunImpl(const framework::Scope &scope, - const platform::Place &dev_place) const override { - bool need_run; - if (Attr("is_scalar_condition")) { - auto xs = this->InputTensors(scope, "Cond"); - need_run = ScalarCondition(xs); - } else { - auto xs = this->InputTensors(scope, "Input"); - need_run = std::all_of( - xs.begin(), xs.end(), - [](const framework::LoDTensor *t) { return t->numel() != 0; }); - } - - if (need_run) { - auto *scope_var = scope.FindVar(Input("Scope")); - PADDLE_ENFORCE(scope_var != nullptr, "Must set scope"); - auto &scopes = scope_var->Get>(); - framework::Scope &cur_scope = *scopes[0]; - - framework::Executor exec(dev_place); - auto *block = Attr("sub_block"); - exec.Run(*block->Program(), &cur_scope, block->ID(), false); - - AssignLocalGradientToGlobal(dev_place, cur_scope, Inputs("Input"), - Outputs(framework::GradVarName("Input"))); - - AssignLocalGradientToGlobal(dev_place, cur_scope, Inputs("Cond"), - Outputs(framework::GradVarName("Cond"))); - } - } - - private: - void AssignLocalGradientToGlobal( - const platform::Place &place, const framework::Scope &cur_scope, - const std::vector &p_names, - const std::vector &pg_names) const { - for (size_t i = 0; i < p_names.size(); ++i) { - auto out_grad_name = pg_names[i]; - auto in_grad_name = framework::GradVarName(p_names[i]); - auto *in_var = cur_scope.FindVar(in_grad_name); - if (in_var == nullptr) { - continue; - } - auto new_in_grad_name = cur_scope.Rename(in_grad_name); - auto assign = framework::OpRegistry::CreateOp( - "assign", {{"X", {new_in_grad_name}}}, {{"Out", {out_grad_name}}}, - framework::AttributeMap{}); - assign->Run(cur_scope, place); - cur_scope.Rename(new_in_grad_name, in_grad_name); - } - } -}; - -class ConditionalBlockGradInferShape : public framework::InferShapeBase { - public: - void operator()(framework::InferShapeContext *context) const override { - PADDLE_ENFORCE(context->HasInputs("Cond")); - if (context->HasInputs("Input")) { - PADDLE_ENFORCE(context->HasOutputs(framework::GradVarName("Input"))); - context->SetOutputsDim(framework::GradVarName("Input"), - context->GetInputsDim("Input")); - } - if (context->HasOutputs(framework::GradVarName("Cond"))) { - context->SetOutputsDim(framework::GradVarName("Cond"), - context->GetInputsDim("Cond")); - } - } -}; - -class ConditionalBlockGradMaker : public framework::SingleGradOpDescMaker { - public: - using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; - - protected: - std::unique_ptr Apply() const override { - auto grad_op = new framework::OpDesc(); - grad_op->SetType("conditional_block_grad"); - grad_op->SetInput("Cond", Input("Cond")); - grad_op->SetInput("Input", Input("Input")); - grad_op->SetInput("Out", Output("Out")); - grad_op->SetInput(framework::GradVarName("Out"), OutputGrad("Out")); - grad_op->SetInput("Scope", Output("Scope")); - grad_op->SetOutput(framework::GradVarName("Cond"), - InputGrad("Cond", false)); - grad_op->SetOutput(framework::GradVarName("Input"), - InputGrad("Input", false)); - grad_op->SetBlockAttr("sub_block", this->grad_block_[0]); - grad_op->SetAttr("is_scalar_condition", GetAttr("is_scalar_condition")); - return std::unique_ptr(grad_op); - } -}; - -} // namespace operators -} // namespace paddle - -namespace ops = paddle::operators; -REGISTER_OPERATOR(conditional_block, ops::ConditionalBlockOp, - ops::ConditionalBlockOpProtoMaker, - ops::ConditionalBlockGradMaker); -REGISTER_OPERATOR(conditional_block_grad, ops::ConditionalBlockGradOp, - ops::ConditionalBlockGradInferShape); diff --git a/paddle/fluid/operators/controlflow/CMakeLists.txt b/paddle/fluid/operators/controlflow/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..7aa1c44eaafe53034b19ee52c59cc94d3a1269da --- /dev/null +++ b/paddle/fluid/operators/controlflow/CMakeLists.txt @@ -0,0 +1,5 @@ +include(operators) +register_operators(DEPS naive_executor) +cc_library(while_op_helper SRCS while_op_helper.cc DEPS operator) + +file(APPEND ${pybind_file} "USE_OP(less_than);\nUSE_OP(logical_and);\nUSE_NO_KERNEL_OP(read_from_array);\n") diff --git a/paddle/fluid/operators/controlflow/compare_op.cc b/paddle/fluid/operators/controlflow/compare_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..5d3f9b43f8c08d356319fa0b9ccaf808811d3d39 --- /dev/null +++ b/paddle/fluid/operators/controlflow/compare_op.cc @@ -0,0 +1,139 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/controlflow/compare_op.h" +#include +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { + +template +class CompareOpKernel + : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& context) const override { + using T = typename Functor::ELEM_TYPE; + using Tensor = framework::Tensor; + + auto* x = context.Input("X"); + auto* y = context.Input("Y"); + auto* z = context.Output("Out"); + int axis = context.Attr("axis"); + + if (x->numel() == 1 && y->numel() == 1) { + bool* z_data = z->mutable_data(context.GetPlace()); + z_data[0] = Functor()(x->data()[0], y->data()[0]); + } else { + ElementwiseComputeEx( + context, x, y, axis, Functor(), z); + } + } +}; + +template +class CompareOpProtoMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + OpComment comment; + AddInput("X", string::Sprintf("the left hand operand of %s operator", + comment.type)); + AddInput("Y", string::Sprintf("the right hand operand of %s operator", + comment.type)); + AddAttr( + "axis", + "The start dimension index for broadcasting Y onto X. [default -1]") + .SetDefault(-1) + .EqualGreaterThan(-1); + AddAttr("force_cpu", + "Force fill output variable to cpu " + "memory. Otherwise, fill output variable to the running " + "device [default true].") + .SetDefault(true); + AddOutput("Out", string::Sprintf("n-dim bool tensor. Each element is %s", + comment.equation)); + AddComment(string::Sprintf(R"DOC( +It operates element-wise on X and Y, and returns the Out. Each of them is a +N-dim tensor. X and Y could be any type. The each element of the Out tensor is +calculated by $%s$ +)DOC", + comment.equation)); + } +}; + +template +class CompareOpInferShape : public framework::InferShapeBase { + public: + void operator()(framework::InferShapeContext* context) const override { + OpComment comment; + PADDLE_ENFORCE(context->HasInput("X"), "%s operator must has input X", + comment.type); + PADDLE_ENFORCE(context->HasInput("Y"), "%s operator must has input Y", + comment.type); + auto dim_x = context->GetInputDim("X"); + auto dim_y = context->GetInputDim("Y"); + PADDLE_ENFORCE_GE(dim_x.size(), dim_y.size(), + "The size of dim_y should not be greater than dim_x's."); + + context->SetOutputDim("Out", context->GetInputDim("X")); + context->ShareLoD("X", "Out"); + } +}; + +class CompareOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + framework::OpKernelType kt = OperatorWithKernel::GetExpectedKernelType(ctx); + // CompareOp kernel's device type is decided by input tensor place + bool force_cpu = ctx.Attr("force_cpu"); + kt.place_ = force_cpu ? platform::CPUPlace() + : ctx.Input("X")->place(); + return kt; + } +}; + +} // namespace operators +} // namespace paddle + +#define REGISTER_COMPARE_OP(op_type, _equation) \ + struct _##op_type##Comment { \ + static char type[]; \ + static char equation[]; \ + }; \ + char _##op_type##Comment::type[]{#op_type}; \ + char _##op_type##Comment::equation[]{_equation}; \ + REGISTER_OPERATOR( \ + op_type, ::paddle::operators::CompareOp, \ + ::paddle::operators::CompareOpProtoMaker<_##op_type##Comment>, \ + ::paddle::operators::CompareOpInferShape<_##op_type##Comment>, \ + ::paddle::framework::EmptyGradOpMaker); + +REGISTER_COMPARE_OP(less_than, "Out = X < Y"); +REGISTER_COMPARE_KERNEL(less_than, CPU, paddle::operators::LessThanFunctor); +REGISTER_COMPARE_OP(less_equal, "Out = X <= Y"); +REGISTER_COMPARE_KERNEL(less_equal, CPU, paddle::operators::LessEqualFunctor); +REGISTER_COMPARE_OP(greater_than, "Out = X > Y"); +REGISTER_COMPARE_KERNEL(greater_than, CPU, + paddle::operators::GreaterThanFunctor); +REGISTER_COMPARE_OP(greater_equal, "Out = X >= Y"); +REGISTER_COMPARE_KERNEL(greater_equal, CPU, + paddle::operators::GreaterEqualFunctor); +REGISTER_COMPARE_OP(equal, "Out = X == Y"); +REGISTER_COMPARE_KERNEL(equal, CPU, paddle::operators::EqualFunctor); +REGISTER_COMPARE_OP(not_equal, "Out = X != Y"); +REGISTER_COMPARE_KERNEL(not_equal, CPU, paddle::operators::NotEqualFunctor); diff --git a/paddle/fluid/operators/controlflow/compare_op.cu b/paddle/fluid/operators/controlflow/compare_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..b1f306358359764b919f9e570cf44f9733a7d178 --- /dev/null +++ b/paddle/fluid/operators/controlflow/compare_op.cu @@ -0,0 +1,24 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/controlflow/compare_op.h" + +REGISTER_COMPARE_KERNEL(less_than, CUDA, paddle::operators::LessThanFunctor); +REGISTER_COMPARE_KERNEL(less_equal, CUDA, paddle::operators::LessEqualFunctor); +REGISTER_COMPARE_KERNEL(greater_than, CUDA, + paddle::operators::GreaterThanFunctor); +REGISTER_COMPARE_KERNEL(greater_equal, CUDA, + paddle::operators::GreaterEqualFunctor); +REGISTER_COMPARE_KERNEL(equal, CUDA, paddle::operators::EqualFunctor); +REGISTER_COMPARE_KERNEL(not_equal, CUDA, paddle::operators::NotEqualFunctor); diff --git a/paddle/fluid/operators/controlflow/compare_op.h b/paddle/fluid/operators/controlflow/compare_op.h new file mode 100644 index 0000000000000000000000000000000000000000..b7529e4ae632d31524846d9d5aa4b1883f4509a1 --- /dev/null +++ b/paddle/fluid/operators/controlflow/compare_op.h @@ -0,0 +1,100 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/elementwise/elementwise_op_function.h" +#include "paddle/fluid/platform/transform.h" + +namespace paddle { +namespace operators { + +template +struct LessThanFunctor { + using ELEM_TYPE = T; + HOSTDEVICE bool operator()(const T& a, const T& b) const { return a < b; } +}; + +template +struct LessEqualFunctor { + using ELEM_TYPE = T; + HOSTDEVICE bool operator()(const T& a, const T& b) const { return a <= b; } +}; + +template +struct GreaterThanFunctor { + using ELEM_TYPE = T; + HOSTDEVICE bool operator()(const T& a, const T& b) const { return a > b; } +}; + +template +struct GreaterEqualFunctor { + using ELEM_TYPE = T; + HOSTDEVICE bool operator()(const T& a, const T& b) const { return a >= b; } +}; + +template +struct EqualFunctor { + using ELEM_TYPE = T; + HOSTDEVICE bool operator()(const T& a, const T& b) const { + if (std::is_floating_point::value) { + // This branch will be optimized while compiling if T is integer. It is + // safe to cast a and b to double. + return fabs(static_cast(a - b)) < 1e-8; + } else { + return (a == b); + } + } +}; + +template +struct NotEqualFunctor { + using ELEM_TYPE = T; + HOSTDEVICE bool operator()(const T& a, const T& b) const { + return !EqualFunctor()(a, b); + } +}; + +template +class CompareOpKernel + : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& context) const override { + using T = typename Functor::ELEM_TYPE; + using Tensor = framework::Tensor; + + auto* x = context.Input("X"); + auto* y = context.Input("Y"); + auto* z = context.Output("Out"); + int axis = context.Attr("axis"); + ElementwiseComputeEx(context, x, y, axis, + Functor(), z); + } +}; + +} // namespace operators +} // namespace paddle + +#define REGISTER_COMPARE_KERNEL(op_type, dev, functor) \ + REGISTER_OP_##dev##_KERNEL( \ + op_type, ::paddle::operators::CompareOpKernel< \ + ::paddle::platform::dev##DeviceContext, functor>, \ + ::paddle::operators::CompareOpKernel< \ + ::paddle::platform::dev##DeviceContext, functor>, \ + ::paddle::operators::CompareOpKernel< \ + ::paddle::platform::dev##DeviceContext, functor>, \ + ::paddle::operators::CompareOpKernel< \ + ::paddle::platform::dev##DeviceContext, functor>); diff --git a/paddle/fluid/operators/controlflow/conditional_block_op.cc b/paddle/fluid/operators/controlflow/conditional_block_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..f0dc718195506e89bf9fecc0eb5e0d5117275a33 --- /dev/null +++ b/paddle/fluid/operators/controlflow/conditional_block_op.cc @@ -0,0 +1,276 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ +#include +#include +#include +#include +#include "paddle/fluid/framework/executor.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/var_type.h" + +namespace paddle { +namespace operators { + +class ConditionalOp : public framework::OperatorBase { + public: + ConditionalOp(const std::string &type, + const framework::VariableNameMap &inputs, + const framework::VariableNameMap &outputs, + const framework::AttributeMap &attrs) + : OperatorBase(type, inputs, outputs, attrs) {} + + protected: + std::vector InputTensors( + const framework::Scope &scope, const std::string &in_name) const { + std::vector retv; + auto xs = Inputs(in_name); + retv.resize(xs.size(), nullptr); + std::transform( + xs.begin(), xs.end(), retv.begin(), + [&scope](const std::string &var_name) -> const framework::LoDTensor * { + auto *var = scope.FindVar(var_name); + PADDLE_ENFORCE(var != nullptr, "Cannot find variable %s", var_name); + return &var->Get(); + }); + return retv; + } + + bool ScalarCondition( + const std::vector &ips) const { + if (!(ips.size() == 1UL && ips[0]->IsInitialized())) { + PADDLE_THROW("should have one initialized input as condition"); + } + + PADDLE_ENFORCE(ips[0]->type() == framework::proto::VarType::BOOL && + ips[0]->numel() == 1, + "condition input's data type should be bool, " + "numel should be 1, actual numel is %d", + ips[0]->numel()); + bool res = false; + if (platform::is_gpu_place(ips[0]->place())) { +#ifdef PADDLE_WITH_CUDA + framework::LoDTensor cpu_tensor; + framework::TensorCopy(*ips[0], platform::CPUPlace(), &cpu_tensor); + platform::DeviceContextPool::Instance().Get(ips[0]->place())->Wait(); + res = cpu_tensor.data()[0]; +#endif + } else { + res = ips[0]->data()[0]; + } + return res; + } +}; + +class ConditionalBlockOp : public ConditionalOp { + public: + ConditionalBlockOp(const std::string &type, + const framework::VariableNameMap &inputs, + const framework::VariableNameMap &outputs, + const framework::AttributeMap &attrs) + : ConditionalOp(type, inputs, outputs, attrs) {} + + private: + void RunImpl(const framework::Scope &scope, + const platform::Place &dev_place) const override { + bool need_run; + if (Attr("is_scalar_condition")) { + // When is_scalar_condition is True, the conditional variable is a scalar, + // whether need to execute the operators in sub-block depends on the + // conditional variable (Cond). + auto xs = InputTensors(scope, "Cond"); + need_run = ScalarCondition(xs); + } else { + // When is_scalar_condition is False, the conditional variable maybe a + // vector or tensor, whether need to execute the operators in sub-block + // depends on the input variables (Input). + auto xs = InputTensors(scope, "Input"); + need_run = std::all_of( + xs.begin(), xs.end(), + [](const framework::LoDTensor *t) { return t->numel() != 0; }); + } + + if (need_run) { + auto *scope_var = scope.FindVar(Output("Scope")); + PADDLE_ENFORCE(scope_var != nullptr, "Must set scope"); + auto *scopes = scope_var->GetMutable>(); + scopes->resize(1); + scopes->front() = &scope.NewScope(); + auto &cur_scope = *scopes->front(); + + framework::Executor exec(dev_place); + auto *block = Attr("sub_block"); + exec.Run(*block->Program(), &cur_scope, block->ID(), false); + } + } +}; + +class ConditionalBlockOpProtoMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("Cond", + "The conditional variable of this operator. If Cond is empty, the " + "whole sub-block will not be executed.") + .AsDuplicable(); + AddInput("Input", "The input variables of the sub-block.").AsDuplicable(); + AddOutput("Out", "The output variables of the sub-block.").AsDuplicable(); + AddOutput("Scope", + "(std::vector) The step scope of conditional block. To " + "unify the conditional block, rnn and while op, the type of " + "scope is std::vector"); + AddAttr( + "sub_block", "The step block of conditional block operator"); + AddAttr("is_scalar_condition", + "The conditional variable (Cond) is used as scalar " + "condition.") + .SetDefault(false); + AddComment(R"DOC(Conditional block operator + +If `is_scalar_condition` is True, the conditional variable (Cond) is a scalar, +run the operators in sub-block if Cond is True. + +If `is_scalar_condition` is False, the conditional variable (Cond) is a vector or +tensor, run the operators in sub-block if all of input variables are not empty. + + +)DOC"); + } +}; + +class ConditionalBlockGradOp : public ConditionalOp { + public: + ConditionalBlockGradOp(const std::string &type, + const framework::VariableNameMap &inputs, + const framework::VariableNameMap &outputs, + const framework::AttributeMap &attrs) + : ConditionalOp(type, inputs, outputs, attrs) {} + + private: + void RunImpl(const framework::Scope &scope, + const platform::Place &dev_place) const override { + bool need_run; + if (Attr("is_scalar_condition")) { + auto xs = this->InputTensors(scope, "Cond"); + need_run = ScalarCondition(xs); + } else { + auto xs = this->InputTensors(scope, "Input"); + need_run = std::all_of( + xs.begin(), xs.end(), + [](const framework::LoDTensor *t) { return t->numel() != 0; }); + } + + if (need_run) { + auto *scope_var = scope.FindVar(Input("Scope")); + PADDLE_ENFORCE(scope_var != nullptr, "Must set scope"); + auto &scopes = scope_var->Get>(); + framework::Scope &cur_scope = *scopes[0]; + + framework::Executor exec(dev_place); + auto *block = Attr("sub_block"); + + const auto &ins = Inputs("Input"); + const auto &d_ins = Outputs(framework::GradVarName("Input")); + const auto &conds = Inputs("Cond"); + const auto &d_conds = Outputs(framework::GradVarName("Cond")); + + std::vector ins_conds_grads; + ins_conds_grads.reserve(ins.size() + conds.size()); + for (auto &in : ins) { + ins_conds_grads.emplace_back(framework::GradVarName(in)); + } + for (auto &cond : conds) { + ins_conds_grads.emplace_back(framework::GradVarName(cond)); + } + + exec.Run(*block->Program(), &cur_scope, block->ID(), false, true, + ins_conds_grads); + + AssignLocalGradientToGlobal(dev_place, cur_scope, ins_conds_grads.data(), + ins.size(), d_ins); + + AssignLocalGradientToGlobal(dev_place, cur_scope, + ins_conds_grads.data() + ins.size(), + conds.size(), d_conds); + } + } + + private: + void AssignLocalGradientToGlobal( + const platform::Place &place, const framework::Scope &cur_scope, + const std::string *p_grad_names, size_t p_grad_names_num, + const std::vector &pg_names) const { + for (size_t i = 0; i < p_grad_names_num; ++i) { + auto out_grad_name = pg_names[i]; + const auto &in_grad_name = p_grad_names[i]; + auto *in_var = cur_scope.FindVar(in_grad_name); + if (in_var == nullptr) { + continue; + } + auto new_in_grad_name = cur_scope.Rename(in_grad_name); + auto assign = framework::OpRegistry::CreateOp( + "assign", {{"X", {new_in_grad_name}}}, {{"Out", {out_grad_name}}}, + framework::AttributeMap{}); + assign->Run(cur_scope, place); + cur_scope.Rename(new_in_grad_name, in_grad_name); + } + } +}; + +class ConditionalBlockGradInferShape : public framework::InferShapeBase { + public: + void operator()(framework::InferShapeContext *context) const override { + PADDLE_ENFORCE(context->HasInputs("Cond")); + if (context->HasInputs("Input")) { + PADDLE_ENFORCE(context->HasOutputs(framework::GradVarName("Input"))); + context->SetOutputsDim(framework::GradVarName("Input"), + context->GetInputsDim("Input")); + } + if (context->HasOutputs(framework::GradVarName("Cond"))) { + context->SetOutputsDim(framework::GradVarName("Cond"), + context->GetInputsDim("Cond")); + } + } +}; + +class ConditionalBlockGradMaker : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + protected: + std::unique_ptr Apply() const override { + auto grad_op = new framework::OpDesc(); + grad_op->SetType("conditional_block_grad"); + grad_op->SetInput("Cond", Input("Cond")); + grad_op->SetInput("Input", Input("Input")); + grad_op->SetInput("Out", Output("Out")); + grad_op->SetInput(framework::GradVarName("Out"), OutputGrad("Out")); + grad_op->SetInput("Scope", Output("Scope")); + grad_op->SetOutput(framework::GradVarName("Cond"), + InputGrad("Cond", false)); + grad_op->SetOutput(framework::GradVarName("Input"), + InputGrad("Input", false)); + grad_op->SetBlockAttr("sub_block", this->grad_block_[0]); + grad_op->SetAttr("is_scalar_condition", GetAttr("is_scalar_condition")); + return std::unique_ptr(grad_op); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(conditional_block, ops::ConditionalBlockOp, + ops::ConditionalBlockOpProtoMaker, + ops::ConditionalBlockGradMaker); +REGISTER_OPERATOR(conditional_block_grad, ops::ConditionalBlockGradOp, + ops::ConditionalBlockGradInferShape); diff --git a/paddle/fluid/operators/controlflow/feed_op.cc b/paddle/fluid/operators/controlflow/feed_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..0dfed7f5cc1e929c1fb566df1a7dfb4b2450323b --- /dev/null +++ b/paddle/fluid/operators/controlflow/feed_op.cc @@ -0,0 +1,86 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/feed_fetch_type.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/operator.h" + +namespace paddle { +namespace operators { +class FeedOp : public framework::OperatorBase { + public: + FeedOp(const std::string &type, const framework::VariableNameMap &inputs, + const framework::VariableNameMap &outputs, + const framework::AttributeMap &attrs) + : OperatorBase(type, inputs, outputs, attrs) {} + + private: + void RunImpl(const framework::Scope &scope, + const platform::Place &place) const override { + // get device context from pool + auto *dev_ctx = platform::DeviceContextPool::Instance().Get(place); + + auto feed_var_name = Input("X"); + auto *feed_var = scope.FindVar(feed_var_name); + + PADDLE_ENFORCE(feed_var != nullptr, + "Cannot find feed_var in scope, feed_var_name is %s", + feed_var_name); + + auto out_name = this->Output("Out"); + auto *out_var = scope.FindVar(out_name); + PADDLE_ENFORCE(out_var != nullptr, + "Cannot find out_var in scope, out_var_name is %s", + out_name); + + auto col = Attr("col"); + + VLOG(3) << "Feed Var " << feed_var_name << "'s " << col << " column to var " + << out_name; + + auto &feed_list = feed_var->Get(); + PADDLE_ENFORCE_LT(static_cast(col), feed_list.size()); + auto &feed_item = feed_list.at(static_cast(col)); + auto *out_item = out_var->GetMutable(); + + if (platform::is_same_place(feed_item.place(), place)) { + out_item->ShareDataWith(feed_item); + } else { + framework::TensorCopy(feed_item, place, *dev_ctx, out_item); + } + out_item->set_lod(feed_item.lod()); + } +}; + +class FeedOpInfoMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("X", "The input of feed op"); + AddOutput("Out", "The output of feed op"); + AddAttr("col", "(int) The column of feed"); + AddComment(R"DOC( +Feed Operator. + +It should not be configured by users directly. + +)DOC"); + } +}; + +} // namespace operators +} // namespace paddle + +REGISTER_OPERATOR(feed, paddle::operators::FeedOp, + paddle::framework::EmptyGradOpMaker, + paddle::operators::FeedOpInfoMaker); diff --git a/paddle/fluid/operators/fetch_op.cc b/paddle/fluid/operators/controlflow/fetch_op.cc similarity index 100% rename from paddle/fluid/operators/fetch_op.cc rename to paddle/fluid/operators/controlflow/fetch_op.cc diff --git a/paddle/fluid/operators/controlflow/get_places_op.cc b/paddle/fluid/operators/controlflow/get_places_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..fa77f97419b6d605e478709e13413606ff124572 --- /dev/null +++ b/paddle/fluid/operators/controlflow/get_places_op.cc @@ -0,0 +1,116 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include // NOLINT +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/detail/safe_ref.h" +#include "paddle/fluid/platform/place.h" +#ifdef PADDLE_WITH_CUDA +#include "paddle/fluid/platform/gpu_info.h" +#endif + +namespace paddle { +namespace operators { + +static size_t CUDADevCount() { +#ifdef PADDLE_WITH_CUDA + return platform::GetCUDADeviceCount(); +#else + return 0UL; +#endif +} + +class GetPlacesOp : public framework::OperatorBase { + public: + GetPlacesOp(const std::string &type, const framework::VariableNameMap &inputs, + const framework::VariableNameMap &outputs, + const framework::AttributeMap &attrs) + : OperatorBase(type, inputs, outputs, attrs) {} + + private: + void RunImpl(const framework::Scope &scope, + const platform::Place &place) const override { + bool is_gpu; + if (Attr("device_type") == "AUTO") { + is_gpu = platform::is_gpu_place(place); + } else { + is_gpu = Attr("device_type") == "CUDA"; + } + auto device_count = static_cast(Attr("device_count")); + if (device_count == 0) { + device_count = + is_gpu ? CUDADevCount() : std::thread::hardware_concurrency(); + } + PADDLE_ENFORCE_NE(device_count, 0UL, "Cannot indicate %s device count", + is_gpu ? "GPU" : "CPU"); + + auto out_var_name = Output("Out"); + auto &places = + *(detail::Ref(scope.FindVar(out_var_name), + "Output variable %s cannot be found", out_var_name) + .GetMutable()); + places.reserve(device_count); + if (is_gpu) { + PADDLE_ENFORCE_LE(device_count, CUDADevCount(), + "Only %d CUDA devices found, cannot set to %d", + CUDADevCount(), device_count); + for (size_t i = 0; i < device_count; ++i) { + places.emplace_back(platform::CUDAPlace(static_cast(i))); + } + } else { + for (size_t i = 0; i < device_count; ++i) { + places.emplace_back(platform::CPUPlace()); + } + } + } +}; + +class GetPlacesOpProtoMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddOutput("Out", "vector of Place"); + AddAttr("device_count", "device count").SetDefault(0); + AddAttr("device_type", "device type") + .InEnum({"CUDA", "CPU", "AUTO"}) + .SetDefault("AUTO"); + AddComment(R"DOC( +Returns a list of places based on arguments. The list will be used for parallel +execution. +)DOC"); + } +}; + +class GetPlacesInferVarType : public framework::VarTypeInference { + public: + void operator()(framework::InferVarTypeContext *ctx) const override { + for (auto &o_name : ctx->Output("Out")) { + ctx->SetType(o_name, framework::proto::VarType::PLACE_LIST); + } + } +}; + +class GetPlacesInferShape : public framework::InferShapeBase { + public: + void operator()(framework::InferShapeContext *context) const override { + // Do nothing + } +}; + +} // namespace operators +} // namespace paddle +namespace ops = paddle::operators; + +REGISTER_OPERATOR(get_places, ops::GetPlacesOp, ops::GetPlacesOpProtoMaker, + ops::GetPlacesInferVarType, ops::GetPlacesInferShape, + paddle::framework::EmptyGradOpMaker); diff --git a/paddle/fluid/operators/controlflow/logical_op.cc b/paddle/fluid/operators/controlflow/logical_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..2e7f3edd55c3353bacddec3dd4ffaba9e0208136 --- /dev/null +++ b/paddle/fluid/operators/controlflow/logical_op.cc @@ -0,0 +1,149 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/controlflow/logical_op.h" +#include +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { +template +class BinaryLogicalOpProtoMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + OpComment comment; + AddInput("X", + string::Sprintf("(LoDTensor) Left hand operand of %s operator", + comment.type)); + AddInput("Y", + string::Sprintf("(LoDTensor) Right hand operand of %s operator", + comment.type)); + AddOutput("Out", string::Sprintf( + "(LoDTensor) n-dim bool tensor. Each element is %s", + comment.equation)); + AddComment(string::Sprintf(R"DOC(%s Operator + +It operates element-wise on X and Y, and returns the Out. X, Y and Out are N-dim boolean tensors. +Each element of Out is calculated by %s +)DOC", + comment.type, comment.equation)); + } +}; + +template +class UnaryLogicalOpProtoMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + OpComment comment; + AddInput("X", string::Sprintf("(LoDTensor) Operand of %s operator", + comment.type)); + AddOutput("Out", string::Sprintf( + "(LoDTensor) n-dim bool tensor. Each element is %s", + comment.equation)); + AddComment(string::Sprintf(R"DOC(%s Operator + +It operates element-wise on X, and returns the Out. X and Out are N-dim boolean tensors. +Each element of Out is calculated by %s +)DOC", + comment.type, comment.equation)); + } +}; + +template +class BinaryLogicalOpInferShape : public framework::InferShapeBase { + public: + void operator()(framework::InferShapeContext *context) const override { + OpComment comment; + PADDLE_ENFORCE(context->HasInput("X"), + "Input(X) of %s operator must not be null", comment.type); + PADDLE_ENFORCE(context->HasInput("Y"), + "Input(Y) of %s operator must not be null", comment.type); + auto dim_x = context->GetInputDim("X"); + auto dim_y = context->GetInputDim("Y"); + PADDLE_ENFORCE_EQ(framework::product(dim_x), framework::product(dim_y), + "The number of elements in X and Y should be same"); + + context->SetOutputDim("Out", context->GetInputDim("X")); + context->ShareLoD("X", "Out"); + } +}; + +template +class UnaryLogicalOpInferShape : public framework::InferShapeBase { + public: + void operator()(framework::InferShapeContext *context) const override { + OpComment comment; + PADDLE_ENFORCE(context->HasInput("X"), + "Input(X) of %s operator must not be null", comment.type); + context->SetOutputDim("Out", context->GetInputDim("X")); + context->ShareLoD("X", "Out"); + } +}; + +class LogicalOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext &ctx) const override { + framework::OpKernelType kt = OperatorWithKernel::GetExpectedKernelType(ctx); + // LogicalOp kernel's device type is decided by input tensor place + kt.place_ = ctx.Input("X")->place(); + return kt; + } +}; + +} // namespace operators +} // namespace paddle + +#define REGISTER_BINARY_LOGICAL_OP(op_type, _equation) \ + struct _##op_type##Comment { \ + static char type[]; \ + static char equation[]; \ + }; \ + char _##op_type##Comment::type[]{#op_type}; \ + char _##op_type##Comment::equation[]{_equation}; \ + REGISTER_OPERATOR( \ + op_type, ::paddle::operators::LogicalOp, \ + ::paddle::operators::BinaryLogicalOpProtoMaker<_##op_type##Comment>, \ + ::paddle::operators::BinaryLogicalOpInferShape<_##op_type##Comment>, \ + ::paddle::framework::EmptyGradOpMaker); + +#define REGISTER_UNARY_LOGICAL_OP(op_type, _equation) \ + struct _##op_type##Comment { \ + static char type[]; \ + static char equation[]; \ + }; \ + char _##op_type##Comment::type[]{#op_type}; \ + char _##op_type##Comment::equation[]{_equation}; \ + REGISTER_OPERATOR( \ + op_type, ::paddle::operators::LogicalOp, \ + ::paddle::operators::UnaryLogicalOpProtoMaker<_##op_type##Comment>, \ + ::paddle::operators::UnaryLogicalOpInferShape<_##op_type##Comment>, \ + ::paddle::framework::EmptyGradOpMaker); + +REGISTER_BINARY_LOGICAL_OP(logical_and, "$$Out = X \\&\\& Y$$"); +REGISTER_BINARY_LOGICAL_KERNEL(logical_and, CPU, + paddle::operators::LogicalAndFunctor); +REGISTER_BINARY_LOGICAL_OP(logical_or, "$$Out = X || Y$$"); +REGISTER_BINARY_LOGICAL_KERNEL(logical_or, CPU, + paddle::operators::LogicalOrFunctor); +REGISTER_UNARY_LOGICAL_OP(logical_not, "$$Out = !X$$"); +REGISTER_UNARY_LOGICAL_KERNEL(logical_not, CPU, + paddle::operators::LogicalNotFunctor); +REGISTER_BINARY_LOGICAL_OP(logical_xor, + "$$Out = (X || Y) \\&\\& !(X \\&\\& Y)$$"); +REGISTER_BINARY_LOGICAL_KERNEL(logical_xor, CPU, + paddle::operators::LogicalXorFunctor); diff --git a/paddle/fluid/operators/controlflow/logical_op.cu b/paddle/fluid/operators/controlflow/logical_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..7ca54b488bfbb260c422941b82145f092a150be7 --- /dev/null +++ b/paddle/fluid/operators/controlflow/logical_op.cu @@ -0,0 +1,24 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/controlflow/logical_op.h" + +REGISTER_BINARY_LOGICAL_KERNEL(logical_and, CUDA, + paddle::operators::LogicalAndFunctor); +REGISTER_BINARY_LOGICAL_KERNEL(logical_or, CUDA, + paddle::operators::LogicalOrFunctor); +REGISTER_UNARY_LOGICAL_KERNEL(logical_not, CUDA, + paddle::operators::LogicalNotFunctor); +REGISTER_BINARY_LOGICAL_KERNEL(logical_xor, CUDA, + paddle::operators::LogicalXorFunctor); diff --git a/paddle/fluid/operators/logical_op.h b/paddle/fluid/operators/controlflow/logical_op.h similarity index 100% rename from paddle/fluid/operators/logical_op.h rename to paddle/fluid/operators/controlflow/logical_op.h diff --git a/paddle/fluid/operators/controlflow/tensor_array_read_write_op.cc b/paddle/fluid/operators/controlflow/tensor_array_read_write_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..45f18ac9255bdd75d8cbb5e1dd30ebba52260850 --- /dev/null +++ b/paddle/fluid/operators/controlflow/tensor_array_read_write_op.cc @@ -0,0 +1,230 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ +#include "paddle/fluid/operators/array_operator.h" +#include "paddle/fluid/operators/detail/safe_ref.h" +namespace paddle { +namespace operators { + +class WriteToArrayOp : public ArrayOp { + public: + WriteToArrayOp(const std::string &type, + const framework::VariableNameMap &inputs, + const framework::VariableNameMap &outputs, + const framework::AttributeMap &attrs) + : ArrayOp(type, inputs, outputs, attrs) {} + + private: + void RunImpl(const framework::Scope &scope, + const platform::Place &place) const override { + auto *x = scope.FindVar(Input("X")); + if (x == nullptr) return; + auto &x_tensor = x->Get(); + size_t offset = GetOffset(scope, place); + auto *out = + scope.FindVar(Output("Out"))->GetMutable(); + if (offset >= out->size()) { + VLOG(10) << "Resize " << Output("Out") << " from " << out->size() + << " to " << offset + 1; + out->resize(offset + 1); + } + auto *out_tensor = &out->at(offset); + out_tensor->set_lod(x_tensor.lod()); + if (x_tensor.memory_size() > 0) { + platform::DeviceContextPool &pool = + platform::DeviceContextPool::Instance(); + auto &dev_ctx = *pool.Get(place); + + TensorCopy(x_tensor, place, dev_ctx, out_tensor); + } else { + VLOG(10) << "WARNING: The input tensor 'x_tensor' holds no memory, so " + "nothing has been written to output array[" + << offset << "]."; + } + } +}; + +class WriteToArrayOpProtoMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("X", "(LoDTensor) the tensor will be written to tensor array"); + AddInput( + "I", + "(Tensor) the subscript index in tensor array. The number of element " + "should be 1"); + AddOutput("Out", "(TensorArray) the tensor array will be written"); + AddComment(R"DOC( +WriteToArray Operator. + +This operator writes a LoDTensor to a LoDTensor array. + +Assume $T$ is LoDTensor, $i$ is the subscript of the array, and $A$ is the array. The +equation is + +$$A[i] = T$$ + +)DOC"); + } +}; + +class WriteToArrayInferShape : public framework::InferShapeBase { + public: + void operator()(framework::InferShapeContext *context) const override { + PADDLE_ENFORCE(context->HasInput("I"), "Must set the subscript index"); + PADDLE_ENFORCE_EQ(framework::product(context->GetInputDim("I")), 1, + "The number of element of subscript index must be 1"); + if (!context->HasInput("X")) { + return; + } + PADDLE_ENFORCE(context->HasOutput("Out"), NotHasOutError()); + context->SetOutputDim("Out", context->GetInputDim("X")); + } + + protected: + virtual const char *NotHasXError() const { return "Must set the lod tensor"; } + + virtual const char *NotHasOutError() const { + return "Must set the lod tensor array"; + } +}; + +class WriteToArrayInferVarType : public framework::VarTypeInference { + public: + void operator()(framework::InferVarTypeContext *ctx) const override { + auto x_name = ctx->Input("X")[0]; + auto out_name = ctx->Output("Out")[0]; + VLOG(10) << "Set Variable " << out_name << " as LOD_TENSOR_ARRAY"; + ctx->SetType(out_name, framework::proto::VarType::LOD_TENSOR_ARRAY); + if (ctx->HasVar(x_name)) { + ctx->SetDataType(out_name, ctx->GetDataType(x_name)); + } + } +}; + +class ReadFromArrayOp : public ArrayOp { + public: + ReadFromArrayOp(const std::string &type, + const framework::VariableNameMap &inputs, + const framework::VariableNameMap &outputs, + const framework::AttributeMap &attrs) + : ArrayOp(type, inputs, outputs, attrs) {} + + private: + void RunImpl(const framework::Scope &scope, + const platform::Place &place) const override { + auto *x = scope.FindVar(Input("X")); + PADDLE_ENFORCE(x != nullptr, "X must be set"); + auto &x_array = x->Get(); + auto *out = scope.FindVar(Output("Out")); + PADDLE_ENFORCE(out != nullptr, "Out must be set"); + size_t offset = GetOffset(scope, place); + if (offset < x_array.size()) { + auto *out_tensor = out->GetMutable(); + platform::DeviceContextPool &pool = + platform::DeviceContextPool::Instance(); + auto &dev_ctx = *pool.Get(place); + framework::TensorCopy(x_array[offset], place, dev_ctx, out_tensor); + out_tensor->set_lod(x_array[offset].lod()); + } else { + VLOG(10) << "offset " << offset << " >= " << x_array.size(); + } + } +}; + +class ReadFromArrayProtoMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("X", "(TensorArray) the array will be read from."); + AddInput("I", + "(Tensor) the subscript index in tensor array. The number of " + "element should be 1"); + AddOutput("Out", "(LoDTensor) the tensor will be read from."); + AddComment(R"DOC( +ReadFromArray Operator. + +Read a LoDTensor from a LoDTensor Array. + +Assume $T$ is LoDTensor, $i$ is the subscript of the array, and $A$ is the array. The +equation is + +$$T = A[i]$$ + +)DOC"); + } +}; + +class ReadFromArrayInferShape : public WriteToArrayInferShape { + public: + void operator()(framework::InferShapeContext *context) const override { + WriteToArrayInferShape::operator()(context); + if (!context->HasInput("X")) { + return; + } + + // FIXME: just for compile time. + if (!context->IsRuntime()) { + context->ShareLoD("X", /*->*/ "Out"); + } + } + + protected: + const char *NotHasXError() const override { + return "The input array X must be set"; + } + const char *NotHasOutError() const override { + return "The output tensor out must be set"; + } +}; + +class WriteToArrayGradMaker : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + protected: + std::unique_ptr Apply() const override { + auto *grad_op = new framework::OpDesc(); + grad_op->SetType("read_from_array"); + grad_op->SetInput("I", Input("I")); + grad_op->SetInput("X", OutputGrad("Out")); + grad_op->SetOutput("Out", InputGrad("X")); + grad_op->SetAttrMap(Attrs()); + return std::unique_ptr(grad_op); + } +}; + +class ReadFromArrayGradMaker : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + protected: + std::unique_ptr Apply() const override { + auto *grad_op = new framework::OpDesc(); + grad_op->SetType("write_to_array"); + grad_op->SetInput("I", Input("I")); + grad_op->SetInput("X", OutputGrad("Out")); + grad_op->SetOutput("Out", InputGrad("X")); + grad_op->SetAttrMap(Attrs()); + return std::unique_ptr(grad_op); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(write_to_array, ops::WriteToArrayOp, + ops::WriteToArrayInferShape, ops::WriteToArrayOpProtoMaker, + ops::WriteToArrayGradMaker, ops::WriteToArrayInferVarType); +REGISTER_OPERATOR(read_from_array, ops::ReadFromArrayOp, + ops::ReadFromArrayInferShape, ops::ReadFromArrayProtoMaker, + ops::ReadFromArrayGradMaker); diff --git a/paddle/fluid/operators/controlflow/while_op.cc b/paddle/fluid/operators/controlflow/while_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..b3219208825cd1aea4c869064ff8f5fa8d3300fd --- /dev/null +++ b/paddle/fluid/operators/controlflow/while_op.cc @@ -0,0 +1,459 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include "paddle/fluid/framework/executor.h" +#include "paddle/fluid/framework/lod_tensor_array.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/framework/var_type.h" +#include "paddle/fluid/operators/controlflow/while_op_helper.h" +#include "paddle/fluid/operators/detail/safe_ref.h" + +namespace paddle { +namespace operators { + +using StepScopeVar = std::vector; +using LoDTensor = framework::LoDTensor; + +namespace { // NOLINT +static std::string GetSkipEagerDeletionVarsDebugString( + const std::vector &vars) { + std::string str = "Skip " + std::to_string(vars.size()) + + " var(s) in eager deletion mode: "; + for (auto &var : vars) { + str.append(var); + str.push_back(' '); + } + return str; +} +} // NOLINT + +class WhileOp : public framework::OperatorBase { + public: + WhileOp(const std::string &type, const framework::VariableNameMap &inputs, + const framework::VariableNameMap &outputs, + const framework::AttributeMap &attrs) + : framework::OperatorBase(type, inputs, outputs, attrs) {} + + private: + void RunImpl(const framework::Scope &scope, + const platform::Place &dev_place) const override { + PADDLE_ENFORCE_NOT_NULL(scope.FindVar(Input(kCondition))); + + auto &cond = scope.FindVar(Input(kCondition))->Get(); + PADDLE_ENFORCE_EQ(cond.dims(), paddle::framework::make_ddim({1})); + + framework::Executor executor(dev_place); + auto *block = Attr(kStepBlock); + + auto *program = block->Program(); + + auto step_scopes = + scope.FindVar(Output(kStepScopes))->GetMutable(); + + PADDLE_ENFORCE(platform::is_cpu_place(cond.place()), + "Condition of while op must in CPU memory."); + + bool is_test = Attr("is_test"); + auto &skip_vars = Attr>(kSkipEagerDeletionVars); + VLOG(2) << GetSkipEagerDeletionVarsDebugString(skip_vars); + + auto ctx = executor.Prepare(*program, block->ID(), skip_vars); + if (!is_test) { + while (cond.data()[0]) { + auto ¤t_scope = scope.NewScope(); + step_scopes->push_back(¤t_scope); + executor.RunPreparedContext(ctx.get(), ¤t_scope, false, true, + true); + } + } else { + auto ¤t_scope = scope.NewScope(); + executor.CreateVariables(*program, ¤t_scope, block->ID()); + while (cond.data()[0]) { + for (auto &name : current_scope.LocalVarNames()) { + auto *var = current_scope.Var(name); + if (var->IsType()) { + // Clear all lod information for all lod_tensors. + auto *t = var->GetMutable(); + framework::LoD empty_lod; + t->set_lod(empty_lod); + } else if (var->IsType()) { + // Clear elements of all tensor arrays. + auto *t = var->GetMutable(); + t->clear(); + } + } + executor.RunPreparedContext(ctx.get(), ¤t_scope, false, false, + false); + } + scope.DeleteScope(¤t_scope); + } + } +}; + +class WhileOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput(kX, + "A set of variables, which are required by operators inside the " + "block of While Op.") + .AsDuplicable(); + AddInput( + kCondition, + "(Bool) An scalar. When it's False, the While Op will be terminated.") + .AsDuplicable(); + AddOutput(kOutputs, + "A set of variables, which will be assigned with values " + "generated by the operators inside the block of While Op.") + .AsDuplicable(); + AddOutput(kStepScopes, + "(StepScopeVar) A vector of local scope, which size equals the " + "step number of While Op. The i'th scope storages temporary " + "variables generated in the i'th step."); + AddAttr(kStepBlock, + "The step block inside WhileOp"); + AddAttr("is_test", + "(bool, default false) Set to true for inference only, false " + "for training. Some layers may run faster when this is true.") + .SetDefault(false); + AddAttr>(kSkipEagerDeletionVars, + "Vars that would skip eager deletion." + "Users should not set this manually.") + .SetDefault(std::vector()); + AddComment(R"DOC( +)DOC"); + } +}; + +class WhileGradOp : public framework::OperatorBase { + public: + WhileGradOp(const std::string &type, const framework::VariableNameMap &inputs, + const framework::VariableNameMap &outputs, + const framework::AttributeMap &attrs) + : framework::OperatorBase(type, inputs, outputs, attrs) {} + + private: + void RunImpl(const framework::Scope &scope, + const platform::Place &dev_place) const override { + PADDLE_ENFORCE(!Attr("is_test"), + "GradOp is only callable when is_test is false"); + // get device context from pool + platform::DeviceContextPool &pool = platform::DeviceContextPool::Instance(); + auto &dev_ctx = *pool.Get(dev_place); + framework::Executor executor(dev_place); + auto *block = Attr(kStepBlock); + auto *program = block->Program(); + + auto &skip_vars = Attr>(kSkipEagerDeletionVars); + VLOG(2) << GetSkipEagerDeletionVarsDebugString(skip_vars); + auto ctx = executor.Prepare(*program, block->ID(), skip_vars); + + auto *step_scopes = + scope.FindVar(Input(kStepScopes))->GetMutable(); + + auto outside_og_names = Inputs(framework::GradVarName(kOutputs)); + auto inside_og_names = + Attr>("original_output_grad"); + + PADDLE_ENFORCE_EQ(outside_og_names.size(), inside_og_names.size()); + + for (auto cur_scope_iter = step_scopes->rbegin(); + cur_scope_iter != step_scopes->rend(); ++cur_scope_iter) { + VLOG(3) << "Start backward at time_step " + << cur_scope_iter - step_scopes->rbegin(); + framework::Scope &cur_scope = **cur_scope_iter; + // Link OG from outside to inside + for (size_t i = 0; i < outside_og_names.size(); ++i) { + auto outside_og_name = outside_og_names[i]; + auto inside_og_name = inside_og_names[i]; + VLOG(8) << "Linking outside " << outside_og_name << " --> inside " + << inside_og_name; + if (scope.FindVar(outside_og_name) == nullptr) { + continue; + } + + auto &og_outside = + detail::Ref(scope.FindVar(outside_og_name), + "Cannot find Outside Gradient %s", outside_og_name); + auto &og_inside = + detail::Ref(cur_scope.Var(inside_og_name), + "Cannot find inside gradient %s", inside_og_name); + if (og_outside.IsType()) { + auto &outside_tensor = og_outside.Get(); + auto &inside_tensor = + detail::Ref(og_inside.GetMutable()); + inside_tensor.set_lod(outside_tensor.lod()); + inside_tensor.ShareDataWith(outside_tensor); + } else if (og_outside.IsType()) { + auto &outside_array = og_outside.Get(); + auto &inside_array = + detail::Ref(og_inside.GetMutable()); + VLOG(8) << outside_og_name << " size = " << outside_array.size(); + inside_array.resize(outside_array.size()); + + for (size_t j = 0; j < inside_array.size(); ++j) { + VLOG(8) << j << " " << outside_array[j].numel(); + if (outside_array[j].numel() != 0) { + inside_array[j].set_lod(outside_array[j].lod()); + inside_array[j].ShareDataWith(outside_array[j]); + } else { + PADDLE_ENFORCE_EQ(inside_array[j].numel(), 0); + } + } + } else { + PADDLE_THROW("Currently only support LoDTensor and LoDTensorArray."); + } + } + executor.RunPreparedContext(ctx.get(), *cur_scope_iter, false, true, + true); + + // The Outputs(kXGRAD) contains the names of the gradient of parameters + // and inputs. + auto &pg_ig_names = Outputs(kXGRAD); + auto &p_names = Inputs(kX); + PADDLE_ENFORCE_EQ(pg_ig_names.size(), p_names.size()); + for (size_t param_id = 0; param_id < pg_ig_names.size(); ++param_id) { + if (pg_ig_names[param_id] == framework::kEmptyVarName) { + continue; // parameter doesn't have gradient + } + auto inside_grad_name = framework::GradVarName(p_names[param_id]); + + // for some grad_op, their input doesn't have gradient, + // for example lookup_table_grad_op, the input(Idx) doesn't have + // gradient. + auto pg_ig_var = cur_scope.FindVar(inside_grad_name); + PADDLE_ENFORCE(pg_ig_var != nullptr); + if (pg_ig_var->IsType()) { + auto pg_ig_lod_t_arr = + pg_ig_var->GetMutable(); + bool empty = true; + for (auto &each : *pg_ig_lod_t_arr) { + if (each.numel() != 0) { + empty = false; + break; + } + } + if (empty) { + LOG(WARNING) << pg_ig_names[param_id] + << " is not found in cur_scope."; + continue; + } + } + + // // TODO(tonyyang-svail): Not sure we need the following + // // If does not compute gradient of that variable inside rnn, + // just + // // continue + // if (local_var_names.find(inside_grad_name) == + // local_var_names.end()) { + // continue; + // } + + // zero gradient variable in step 0 + if (cur_scope_iter == step_scopes->rbegin()) { + auto *var = (*cur_scope_iter)->FindVar(inside_grad_name); + PADDLE_ENFORCE_NOT_NULL(var, "Can not find var %s", inside_grad_name); + PADDLE_ENFORCE( + var->IsType() || + var->IsType(), + "Currently the type of var only can be LoDTensorArray, " + "or LoDTensor, but the received var[%s] is %s.", + inside_grad_name, framework::ToTypeName(var->Type())); + + if (var->IsType()) { + auto &inside_tensor = var->Get(); + framework::AttributeMap attrs; + attrs["dtype"] = inside_tensor.type(); + attrs["shape"] = framework::vectorize2int(inside_tensor.dims()); + attrs["value"] = 0.0f; + + auto var_name = pg_ig_names[param_id]; + auto zero_op = framework::OpRegistry::CreateOp( + "fill_constant", framework::VariableNameMap{}, + {{"Out", {var_name}}}, attrs); + zero_op->Run(scope, dev_place); + scope.FindVar(var_name) + ->GetMutable() + ->set_lod(inside_tensor.lod()); + } + } + auto new_inside_name = cur_scope.Rename(inside_grad_name); + auto sum_op = framework::OpRegistry::CreateOp( + "sum", {{"X", {pg_ig_names[param_id], new_inside_name}}}, + {{"Out", {pg_ig_names[param_id]}}}, + framework::AttributeMap{{"use_mkldnn", {false}}}); + sum_op->Run(cur_scope, dev_place); + cur_scope.Rename(new_inside_name, inside_grad_name); + } + dev_ctx.Wait(); + const_cast(scope).DeleteScope(&cur_scope); + } + } +}; + +class WhileGradOpDescMaker : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + protected: + std::unique_ptr Apply() const override { + auto *while_grad = new framework::OpDesc(); + while_grad->SetType("while_grad"); + while_grad->SetInput(kX, Input(kX)); + while_grad->SetInput(kOutputs, Output(kOutputs)); + while_grad->SetInput(kStepScopes, Output(kStepScopes)); + + auto *grad_block = this->grad_block_[0]; + auto *fwd_block = grad_block->ForwardBlock(); + auto *parent_block = grad_block->ParentBlock(); + + // Not all of IGs will be generated by inner gradient operators of while op. + // Ignore IGs that is not generated by the inside block. + std::unordered_set inner_op_outputs; + for (const auto *op : grad_block->AllOps()) { + for (auto &oname : op->OutputArgumentNames()) { + inner_op_outputs.insert(oname); + } + } + auto igs = InputGrad(kX, /*do not drop empty gradient*/ false); + for (auto &each_ig : igs) { + if (inner_op_outputs.find(each_ig) == inner_op_outputs.end()) { + VLOG(8) << "Ignore " << each_ig; + each_ig = framework::kEmptyVarName; + } + } + while_grad->SetOutput(framework::GradVarName(kX), igs); + + // OG should be re-calculated by step blocks, since many outputs of while op + // do not need to calculate gradients. + std::unordered_set block_ins; + block_ins.reserve(Input(kX).size() + Output(kOutputs).size()); + for (auto &p : Input(kX)) { + block_ins.insert(p); + } + for (auto &o : Output(kOutputs)) { + block_ins.insert(o); + } + std::unordered_set output_grads; + for (const auto *op : grad_block->AllOps()) { + for (auto &input_name : op->InputArgumentNames()) { + // If the input of Op has been recorded or is generated by the forward + // block, do not make it as input again. + + // The input is located in I/O or other op's outputs or the variable is + // located in grad_block's parents + if (block_ins.find(input_name) != block_ins.end() || + (fwd_block->FindVarRecursive(input_name) != nullptr || + parent_block->FindVarRecursive(input_name) != nullptr)) { + continue; + } + + output_grads.insert(input_name); + } + for (auto &output_name : op->OutputArgumentNames()) { + block_ins.insert(output_name); + } + } + + std::vector output_grads_list; + output_grads_list.resize(output_grads.size()); + std::copy(output_grads.begin(), output_grads.end(), + output_grads_list.begin()); + while_grad->SetInput(framework::GradVarName(kOutputs), output_grads_list); + + while_grad->SetAttrMap(this->Attrs()); + while_grad->SetBlockAttr(kStepBlock, grad_block); + // record the original output gradient names, since the gradient name of + // while operator could be renamed. + while_grad->SetAttr("original_output_grad", output_grads_list); + + while_grad->SetAttr(kSkipEagerDeletionVars, std::vector()); + + return std::unique_ptr(while_grad); + } +}; + +class WhileGradOpVarTypeInference : public framework::VarTypeInference { + public: + void operator()(framework::InferVarTypeContext *ctx) const override { + auto p_names = ctx->Input(kX); + auto pg_ig_names = ctx->Output(framework::GradVarName(kX)); + + for (size_t i = 0; i < p_names.size(); ++i) { + if (ctx->HasVar(pg_ig_names[i])) { + VLOG(5) << "Setting " << pg_ig_names[i] << " following " << p_names[i] + << " type: " << ctx->GetType(p_names[i]); + ctx->SetType(pg_ig_names[i], ctx->GetType(p_names[i])); + ctx->SetDataType(pg_ig_names[i], ctx->GetDataType(p_names[i])); + } + } + } +}; + +class WhileGradOpShapeInference : public framework::InferShapeBase { + public: + void operator()(framework::InferShapeContext *ctx) const override { + ctx->HasInputs(kX); + ctx->HasOutputs(framework::GradVarName(kX)); + ctx->HasInputs(kOutputs); + ctx->HasInputs(framework::GradVarName(kOutputs)); + + auto pg_ig_names = ctx->Outputs(kXGRAD); + std::vector in_var_ptrs = + ctx->GetInputVarPtrs(kX); + std::vector out_var_ptrs = + ctx->GetOutputVarPtrs(kXGRAD); + PADDLE_ENFORCE(in_var_ptrs.size() == out_var_ptrs.size()); + + for (size_t i = 0; i < in_var_ptrs.size(); ++i) { + if (pg_ig_names[i] == framework::kEmptyVarName) { + continue; + } + if (ctx->IsRuntime()) { + framework::Variable *in_var = + boost::get(in_var_ptrs[i]); + framework::Variable *out_var = + boost::get(out_var_ptrs[i]); + + auto type = framework::ToVarType(in_var->Type()); + if (type == framework::proto::VarType::LOD_TENSOR) { + out_var->GetMutable()->Resize( + in_var->Get().dims()); + } else if (type == framework::proto::VarType::SELECTED_ROWS) { + out_var->GetMutable()->set_height( + in_var->Get().GetCompleteDims()[0]); + } else if (type == framework::proto::VarType::LOD_TENSOR_ARRAY) { + PADDLE_THROW("WhileGradOp doesn't support type %d", + static_cast(type)); + } + } else { + framework::VarDesc *in_var = + boost::get(in_var_ptrs[i]); + boost::get(out_var_ptrs[i]) + ->SetShape(in_var->GetShape()); + } + } + } +}; + +} // namespace operators +} // namespace paddle + +REGISTER_OPERATOR(while, paddle::operators::WhileOp, + paddle::operators::WhileOpMaker, + paddle::operators::WhileGradOpDescMaker); +REGISTER_OPERATOR(while_grad, paddle::operators::WhileGradOp, + paddle::operators::WhileGradOpShapeInference, + paddle::operators::WhileGradOpVarTypeInference); diff --git a/paddle/fluid/operators/controlflow/while_op_helper.cc b/paddle/fluid/operators/controlflow/while_op_helper.cc new file mode 100644 index 0000000000000000000000000000000000000000..2cbd94a061b5b369d67b6e0995d6b8fd45801828 --- /dev/null +++ b/paddle/fluid/operators/controlflow/while_op_helper.cc @@ -0,0 +1,291 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/operators/controlflow/while_op_helper.h" +#include +#include +#include +#include "paddle/fluid/framework/program_desc.h" + +namespace paddle { +namespace operators { + +// OpVariant is a wrapper class of OpDesc and OperatorBase +// So that API would be the same. +class OpVariant { + struct InputsVisitor + : public boost::static_visitor { + template + const framework::VariableNameMap *operator()(const OpType *op) const { + return &(op->Inputs()); + } + }; + + struct OutputsVisitor + : public boost::static_visitor { + template + const framework::VariableNameMap *operator()(const OpType *op) const { + return &(op->Outputs()); + } + }; + + struct AttributeMapVisitor + : public boost::static_visitor { + const framework::AttributeMap *operator()( + const framework::OpDesc *op) const { + return &(op->GetAttrMap()); + } + + const framework::AttributeMap *operator()( + const framework::OperatorBase *op) const { + return &(op->Attrs()); + } + }; + + struct RawPointerVisitor : public boost::static_visitor { + template + const void *operator()(const OpType *op) const { + return op; + } + }; + + public: + OpVariant(const framework::OperatorBase *op) : op_(op) {} // NOLINT + + OpVariant(const framework::OpDesc *op) : op_(op) {} // NOLINT + + const framework::VariableNameMap &Inputs() const { + return *boost::apply_visitor(InputsVisitor(), op_); + } + + const framework::VariableNameMap &Outputs() const { + return *boost::apply_visitor(OutputsVisitor(), op_); + } + + const framework::AttributeMap &Attrs() const { + return *boost::apply_visitor(AttributeMapVisitor(), op_); + } + + template + const AttrType &Attr(const std::string &name) const { + auto &attrs = Attrs(); + auto it = attrs.find(name); + PADDLE_ENFORCE(it != attrs.end(), "Cannot find attribute %s", name); + return boost::get(it->second); + } + + bool operator==(const OpVariant &other) const { + return RawPointer() == other.RawPointer(); + } + + const void *RawPointer() const { + return boost::apply_visitor(RawPointerVisitor(), op_); + } + + int which() const { return static_cast(op_.which()); } + + struct Hasher { + size_t operator()(const OpVariant &op) const { + return reinterpret_cast(op.RawPointer()); + } + }; + + private: + const boost::variant + op_; +}; + +static std::string GetDebugString(const std::vector &names) { + if (names.empty()) return ""; + std::string ret = names[0]; + for (size_t i = 1; i < names.size(); ++i) { + ret += (" " + names[i]); + } + return ret; +} + +// Set skip variables of while_op and while_grad_op +// These variables should be skipped when eager deletion enables. +// It is because: +// 1. while_grad_op needs some variables defined in while_op. +// 2. while_grad_op needs variables from the previous time step. +static void SetSkipVars(const OpVariant &op, std::vector attr) { + auto &attrs = const_cast(op.Attrs()); + VLOG(2) << "Prepare to skip " << attr.size() + << " var(s): " << GetDebugString(attr); + attrs[kSkipEagerDeletionVars] = std::move(attr); +} + +// Check whether the forward while_op and while_grad_op match +// The program may have many while_ops. +static bool IsMatchedWhileOpAndWhileGradOp(const OpVariant &fwd_op, + const OpVariant &grad_op) { + return fwd_op.Inputs().at(kX) == grad_op.Inputs().at(kX) && + fwd_op.Outputs().at(kOutputs) == grad_op.Inputs().at(kOutputs); +} + +// Test whether the variable is skippable in forward while_op +// The variable is skippable in while_op when the variable used in while_grad +// is not from grad_block. +static bool IsSkippableVar(const std::string &name, + framework::BlockDesc *grad_block) { + return name != framework::kEmptyVarName && !grad_block->HasVar(name); +} + +static void ModifyWhileOpAndWhileGradOpAttr(const OpVariant &fwd_op, + const OpVariant &bwd_op) { + auto *grad_block = bwd_op.Attr(kStepBlock); + + // Find all skippable variables in forward while_op + std::unordered_set forward_skip_vars; + for (auto *op_desc : grad_block->AllOps()) { + for (auto &in_arg_name : op_desc->InputArgumentNames()) { + if (IsSkippableVar(in_arg_name, grad_block)) { + forward_skip_vars.insert(in_arg_name); + } + } + + for (auto &out_arg_name : op_desc->OutputArgumentNames()) { + if (IsSkippableVar(out_arg_name, grad_block)) { + forward_skip_vars.insert(out_arg_name); + } + } + } + + SetSkipVars(fwd_op, std::vector(forward_skip_vars.begin(), + forward_skip_vars.end())); + + // Find all skippable variables in while_grad_op + // The skipped variables are those which would be used across time steps. + auto &fwd_input = fwd_op.Inputs().at(kX); + auto &in_grads = bwd_op.Outputs().at(framework::GradVarName(kX)); + PADDLE_ENFORCE_EQ( + fwd_input.size(), in_grads.size(), + "Backward input gradient number does not match forward input number."); + + std::unordered_set backward_skip_vars; + for (size_t i = 0; i < in_grads.size(); ++i) { + if (in_grads[i] == framework::kEmptyVarName) { + continue; + } + backward_skip_vars.insert(in_grads[i]); + backward_skip_vars.insert(framework::GradVarName(fwd_input[i])); + } + + SetSkipVars(bwd_op, std::vector(backward_skip_vars.begin(), + backward_skip_vars.end())); +} + +// Find all while_ops and while_grad_ops in the graph or program +// The while_grad_op and while_op may located in different blocks +// So we should traverse all blocks in the program and find them out. +static void FindAllWhileAndWhileGradOp(std::vector *while_ops, + std::vector *while_grad_ops) { + PADDLE_ENFORCE_GE(while_ops->size(), while_grad_ops->size()); + + if (while_ops->empty()) return; + + const auto *program = + while_ops->front().Attr(kStepBlock)->Program(); + for (size_t i = 1; i < program->Size(); ++i) { + auto &block = program->Block(i); + for (size_t j = 0; j < block.OpSize(); ++j) { + auto *op = block.Op(j); + if (op->Type() == "while") { + while_ops->emplace_back(op); + } else if (op->Type() == "while_grad") { + while_grad_ops->emplace_back(op); + } + } + } + + PADDLE_ENFORCE_GE(while_ops->size(), while_grad_ops->size(), + "There are extra while_grad ops in the graph or program"); +} + +static void PrepareSafeEagerDeletionOnWhileOpAndWhileGradOpImpl( + std::vector *while_ops, std::vector *while_grad_ops) { + FindAllWhileAndWhileGradOp(while_ops, while_grad_ops); + + VLOG(2) << "Found while op num: " << while_ops->size() + << ", while grad op num: " << while_grad_ops->size(); + + if (while_grad_ops->empty()) { + return; + } + + std::unordered_set while_op_set( + while_ops->begin(), while_ops->end()); + + for (auto &bwd_op : *while_grad_ops) { + const OpVariant *matched_fwd_op = nullptr; + for (auto &fwd_op : while_op_set) { + if (IsMatchedWhileOpAndWhileGradOp(fwd_op, bwd_op)) { + PADDLE_ENFORCE(matched_fwd_op == nullptr, + "Found multiple matched while ops"); + matched_fwd_op = &fwd_op; + } + } + PADDLE_ENFORCE_NOT_NULL(matched_fwd_op, + "Cannot find matched forward while op."); + ModifyWhileOpAndWhileGradOpAttr(*matched_fwd_op, bwd_op); + while_op_set.erase(*matched_fwd_op); + } +} + +void PrepareSafeEagerDeletionOnWhileOpAndWhileGradOp( + int block_id, + const std::vector> &all_ops) { + // If block_id is not 0, returns + // This is because all while_ops and while_grad_ops in the whole program + // would be processed when block_id is 0 (i.e. when Executor::Run() or + // ParallelExecutor constructs). + + // What's more, all while_ops and while_grad_ops must be processed when + // block_id is zero. If not, while_op may run first and erase variables + // used in while_grad_op, and in this moment, while_grad_ops may be not + // constructed yet. + if (block_id != 0) return; + + std::vector fwd_ops, bwd_ops; + for (auto &op : all_ops) { + if (op->Type() == "while") { + fwd_ops.emplace_back(op.get()); + } else if (op->Type() == "while_grad") { + bwd_ops.emplace_back(op.get()); + } + } + PrepareSafeEagerDeletionOnWhileOpAndWhileGradOpImpl(&fwd_ops, &bwd_ops); +} + +void PrepareSafeEagerDeletionOnWhileOpAndWhileGradOp( + const std::vector &while_ops, + const std::vector &while_grad_ops) { + std::vector fwd_ops, bwd_ops; + fwd_ops.reserve(while_ops.size()); + for (auto *op : while_ops) { + fwd_ops.emplace_back(op); + } + + bwd_ops.reserve(while_grad_ops.size()); + for (auto *op : while_grad_ops) { + bwd_ops.emplace_back(op); + } + + PrepareSafeEagerDeletionOnWhileOpAndWhileGradOpImpl(&fwd_ops, &bwd_ops); +} + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/controlflow/while_op_helper.h b/paddle/fluid/operators/controlflow/while_op_helper.h new file mode 100644 index 0000000000000000000000000000000000000000..456ba8642b9bd32a1236d112cc8b387ae6a279d3 --- /dev/null +++ b/paddle/fluid/operators/controlflow/while_op_helper.h @@ -0,0 +1,43 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/platform/variant.h" + +namespace paddle { +namespace operators { + +static constexpr char kStepBlock[] = "sub_block"; +static constexpr char kCondition[] = "Condition"; +static constexpr char kStepScopes[] = "StepScopes"; +static constexpr char kX[] = "X"; +static constexpr char kXGRAD[] = "X@GRAD"; +static constexpr char kOutputs[] = "Out"; +static constexpr char kSkipEagerDeletionVars[] = "skip_eager_deletion_vars"; + +void PrepareSafeEagerDeletionOnWhileOpAndWhileGradOp( + int block_id, + const std::vector> &all_ops); + +void PrepareSafeEagerDeletionOnWhileOpAndWhileGradOp( + const std::vector &while_ops, + const std::vector &while_grad_ops); + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/conv_cudnn_op.cu.cc b/paddle/fluid/operators/conv_cudnn_op.cu.cc index 4a7a6bcf7154d5680de751e3c933be46fb09fd74..9e5ccd928e9d6012c1da3baa17521dcac0c8ff2f 100644 --- a/paddle/fluid/operators/conv_cudnn_op.cu.cc +++ b/paddle/fluid/operators/conv_cudnn_op.cu.cc @@ -15,15 +15,22 @@ limitations under the License. */ #include "paddle/fluid/framework/eigen.h" #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/memory/memory.h" +#include "paddle/fluid/operators/conv_cudnn_op_cache.h" #include "paddle/fluid/operators/conv_op.h" #include "paddle/fluid/platform/assert.h" #include "paddle/fluid/platform/cudnn_helper.h" #include "paddle/fluid/platform/float16.h" +#include "paddle/fluid/platform/profiler.h" DEFINE_bool(cudnn_deterministic, false, "Whether allow using an autotuning algorithm for convolution " "operator. The autotuning algorithm may be non-deterministic. If " "true, the algorithm is deterministic."); +DEFINE_uint64(conv_workspace_size_limit, 4096, + "cuDNN convolution workspace limit in MB unit."); +DEFINE_bool(cudnn_exhaustive_search, false, + "Whether enable exhaustive search for cuDNN convolution or " + "not, defalut is False."); namespace paddle { namespace operators { @@ -35,14 +42,13 @@ using ScopedConvolutionDescriptor = platform::ScopedConvolutionDescriptor; using DataLayout = platform::DataLayout; template using ScalingParamType = typename platform::CudnnDataType::ScalingParamType; - -static constexpr size_t kCONV_CUDNN_WORKSPACE_LIMIT_BYTES = - static_cast(1024) * 1024 * 1024; +using framework::AlgorithmsCache; template class CUDNNConvOpKernel : public framework::OpKernel { public: void Compute(const framework::ExecutionContext& ctx) const override { + auto& dev_ctx = ctx.template device_context(); PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), "It must use CUDAPlace."); auto* input = ctx.Input("Input"); @@ -55,6 +61,8 @@ class CUDNNConvOpKernel : public framework::OpKernel { int groups = ctx.Attr("groups"); int64_t user_workspace_size = static_cast(ctx.Attr("workspace_size_MB")); + bool exhaustive_search = + FLAGS_cudnn_exhaustive_search || ctx.Attr("exhaustive_search"); const T* input_data = input->data(); const T* filter_data = filter->data(); @@ -120,19 +128,18 @@ class CUDNNConvOpKernel : public framework::OpKernel { // ------------------- cudnn conv workspace --------------------- size_t workspace_size_in_bytes; // final workspace to allocate. size_t workspace_size_limit = kCONV_CUDNN_WORKSPACE_LIMIT_BYTES; - if (user_workspace_size > 0) { - workspace_size_limit = user_workspace_size * 1024 * 1024; + if (FLAGS_conv_workspace_size_limit > 0 || user_workspace_size > 0) { + int64_t max_user_size = + std::max(static_cast(FLAGS_conv_workspace_size_limit), + user_workspace_size); + workspace_size_limit = max_user_size * 1024 * 1024; } + // ------------------- cudnn conv algorithm --------------------- cudnnConvolutionFwdAlgo_t algo; - auto& dev_ctx = ctx.template device_context(); auto handle = dev_ctx.cudnn_handle(); - CUDNN_ENFORCE(platform::dynload::cudnnGetConvolutionForwardAlgorithm( - handle, cudnn_input_desc, cudnn_filter_desc, cudnn_conv_desc, - cudnn_output_desc, CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT, - workspace_size_limit, &algo)); - + bool half_float = false; #if CUDA_VERSION >= 9000 && CUDNN_VERSION_MIN(7, 0, 1) // Tensor core is supported since the volta GPU and // is only enabled when input and filter data are float16 @@ -143,11 +150,62 @@ class CUDNNConvOpKernel : public framework::OpKernel { cudnn_conv_desc, CUDNN_TENSOR_OP_MATH)); // Currently tensor core is only enabled using this algo algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM; + half_float = true; + VLOG(5) << "use cudnn_tensor_op_math"; } else { CUDNN_ENFORCE(platform::dynload::cudnnSetConvolutionMathType( cudnn_conv_desc, CUDNN_DEFAULT_MATH)); + VLOG(5) << "NOT use cudnn_tensor_op_math"; } #endif + Tensor cudnn_workspace; + void* cudnn_workspace_ptr = nullptr; + + auto x_dims = framework::vectorize(input->dims()); + auto f_dims = framework::vectorize(filter->dims()); + if ((!exhaustive_search) && (!half_float)) { + CUDNN_ENFORCE(platform::dynload::cudnnGetConvolutionForwardAlgorithm( + handle, cudnn_input_desc, cudnn_filter_desc, cudnn_conv_desc, + cudnn_output_desc, CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT, + workspace_size_limit, &algo)); + VLOG(3) << "cuDNN forward algo " << algo; + } else if (exhaustive_search && (!half_float)) { + AlgorithmsCache& algo_cache = + ctx.GetKernelConfig>(0); + cudnn_workspace = + ctx.AllocateTmpTensor( + framework::make_ddim( + {static_cast(workspace_size_limit)}), + dev_ctx); + cudnn_workspace_ptr = static_cast(cudnn_workspace.data()); + + algo = algo_cache.GetAlgorithm( + x_dims, f_dims, strides, paddings, dilations, 0, [&]() { + int returned_algo_count; + std::array + fwd_perf_stat; + + CUDNN_ENFORCE( + platform::dynload::cudnnFindConvolutionForwardAlgorithmEx( + handle, cudnn_input_desc, input_data, cudnn_filter_desc, + filter_data, cudnn_conv_desc, cudnn_output_desc, + output_data, kNUM_CUDNN_FWD_ALGS, &returned_algo_count, + fwd_perf_stat.data(), cudnn_workspace_ptr, + workspace_size_limit)); + + VLOG(3) << "Perf result: (algo: stat, time, memory)"; + for (int i = 0; i < returned_algo_count; ++i) { + const auto& stat = fwd_perf_stat[i]; + VLOG(3) << stat.algo << ": " << stat.status << " " << stat.time + << " " << stat.memory; + } + return fwd_perf_stat[0].algo; + }); + VLOG(3) << "choose algo " << algo; + } else { + PADDLE_ENFORCE(half_float, + "cuDNN exhaustive search doesn't support half float."); + } // get workspace size able to allocate CUDNN_ENFORCE(platform::dynload::cudnnGetConvolutionForwardWorkspaceSize( @@ -158,17 +216,23 @@ class CUDNNConvOpKernel : public framework::OpKernel { PADDLE_ENFORCE_LE(workspace_size_in_bytes, workspace_size_limit, "workspace_size to be allocated exceeds the limit"); + // Allocate on GPU memory + if (!cudnn_workspace_ptr) { + cudnn_workspace = + ctx.AllocateTmpTensor( + framework::make_ddim( + {static_cast(workspace_size_in_bytes)}), + dev_ctx); + cudnn_workspace_ptr = static_cast(cudnn_workspace.data()); + } // ------------------- cudnn conv forward --------------------- ScalingParamType alpha = 1.0f, beta = 0.0f; for (int i = 0; i < groups; i++) { - auto cudnn_func = [&](void* cudnn_workspace) { - CUDNN_ENFORCE(platform::dynload::cudnnConvolutionForward( - handle, &alpha, cudnn_input_desc, input_data + i * group_offset_in, - cudnn_filter_desc, filter_data + i * group_offset_filter, - cudnn_conv_desc, algo, cudnn_workspace, workspace_size_in_bytes, - &beta, cudnn_output_desc, output_data + i * group_offset_out)); - }; - dev_ctx.RunCudnnFuncWithWorkspace(cudnn_func, workspace_size_in_bytes); + CUDNN_ENFORCE(platform::dynload::cudnnConvolutionForward( + handle, &alpha, cudnn_input_desc, input_data + i * group_offset_in, + cudnn_filter_desc, filter_data + i * group_offset_filter, + cudnn_conv_desc, algo, cudnn_workspace_ptr, workspace_size_in_bytes, + &beta, cudnn_output_desc, output_data + i * group_offset_out)); } } }; @@ -177,6 +241,7 @@ template class CUDNNConvGradOpKernel : public framework::OpKernel { public: void Compute(const framework::ExecutionContext& ctx) const override { + auto& dev_ctx = ctx.template device_context(); PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), "It must use CUDAPlace."); auto input = ctx.Input("Input"); @@ -195,6 +260,13 @@ class CUDNNConvGradOpKernel : public framework::OpKernel { int groups = ctx.Attr("groups"); int64_t user_workspace_size = static_cast(ctx.Attr("workspace_size_MB")); + bool exhaustive_search = + FLAGS_cudnn_exhaustive_search || ctx.Attr("exhaustive_search"); + if (exhaustive_search && FLAGS_cudnn_deterministic) { + PADDLE_THROW( + "Cann't set exhaustive_search True and " + "FLAGS_cudnn_deterministic True at same time."); + } // ------------------- cudnn descriptors --------------------- ScopedTensorDescriptor input_desc; @@ -228,6 +300,21 @@ class CUDNNConvGradOpKernel : public framework::OpKernel { cudnnFilterDescriptor_t cudnn_filter_desc = filter_desc.descriptor( layout, framework::vectorize2int(filter->dims()), groups); +#if CUDA_VERSION >= 9000 && CUDNN_VERSION_MIN(7, 0, 1) + // Enable Tensor Core for cudnn backward + if (dev_ctx.GetComputeCapability() >= 70 && + std::type_index(typeid(T)) == + std::type_index(typeid(platform::float16))) { + CUDNN_ENFORCE(platform::dynload::cudnnSetConvolutionMathType( + cudnn_conv_desc, CUDNN_TENSOR_OP_MATH)); + VLOG(5) << "use cudnn_tensor_op_math for backward"; + } else { + CUDNN_ENFORCE(platform::dynload::cudnnSetConvolutionMathType( + cudnn_conv_desc, CUDNN_DEFAULT_MATH)); + VLOG(5) << "NOT use cudnn_tensor_op_math for backward"; + } +#endif + int input_channels = input->dims()[1]; int input_height, input_width, input_depth; if (input->dims().size() == 5) { @@ -262,14 +349,62 @@ class CUDNNConvGradOpKernel : public framework::OpKernel { cudnnConvolutionBwdFilterAlgo_t filter_algo; size_t workspace_size_in_bytes = 0, tmp_size = 0; size_t workspace_size_limit = kCONV_CUDNN_WORKSPACE_LIMIT_BYTES; - if (user_workspace_size > 0) { - workspace_size_limit = user_workspace_size * 1024 * 1024; + if (FLAGS_conv_workspace_size_limit > 0 || user_workspace_size > 0) { + int64_t max_user_size = + std::max(static_cast(FLAGS_conv_workspace_size_limit), + user_workspace_size); + workspace_size_limit = max_user_size * 1024 * 1024; } - auto& dev_ctx = ctx.template device_context(); + Tensor cudnn_workspace; + void* cudnn_workspace_ptr = nullptr; + if ((input_data || filter_data) && exhaustive_search) { + cudnn_workspace = + ctx.AllocateTmpTensor( + framework::make_ddim( + {static_cast(workspace_size_limit)}), + dev_ctx); + cudnn_workspace_ptr = static_cast(cudnn_workspace.data()); + } + + auto x_dims = framework::vectorize(input->dims()); + auto f_dims = framework::vectorize(filter->dims()); auto handle = dev_ctx.cudnn_handle(); if (input_grad) { - if (!FLAGS_cudnn_deterministic) { + T* input_grad_data = input_grad->mutable_data(ctx.GetPlace()); + if (exhaustive_search) { + AlgorithmsCache& data_algo_cache = + ctx.GetKernelConfig>( + 0); + + data_algo = data_algo_cache.GetAlgorithm( + x_dims, f_dims, strides, paddings, dilations, 0, [&]() { + int returned_algo_count; + std::array + data_perf_stat; + + CUDNN_ENFORCE(platform::dynload:: + cudnnFindConvolutionBackwardDataAlgorithmEx( + handle, cudnn_filter_desc, filter_data, + cudnn_output_grad_desc, output_grad_data, + cudnn_conv_desc, cudnn_input_desc, + input_grad_data, kNUM_CUDNN_BWD_DATA_ALGS, + &returned_algo_count, data_perf_stat.data(), + cudnn_workspace_ptr, workspace_size_limit)); + + VLOG(3) << "Perf result: (algo: stat, time, memory)"; + for (int i = 0; i < returned_algo_count; ++i) { + const auto& stat = data_perf_stat[i]; + VLOG(3) << stat.algo << ": " << stat.status << " " << stat.time + << " " << stat.memory; + } + return data_perf_stat[0].algo; + }); + VLOG(3) << "cuDNN backward data algo " << data_algo; + } else if (FLAGS_cudnn_deterministic) { + data_algo = CUDNN_CONVOLUTION_BWD_DATA_ALGO_1; + } else { CUDNN_ENFORCE( platform::dynload::cudnnGetConvolutionBackwardDataAlgorithm( handle, cudnn_filter_desc, @@ -282,10 +417,7 @@ class CUDNNConvGradOpKernel : public framework::OpKernel { cudnn_input_desc, CUDNN_CONVOLUTION_BWD_DATA_SPECIFY_WORKSPACE_LIMIT, workspace_size_limit, &data_algo)); - } else { - data_algo = CUDNN_CONVOLUTION_BWD_DATA_ALGO_1; } - CUDNN_ENFORCE( platform::dynload::cudnnGetConvolutionBackwardDataWorkspaceSize( handle, cudnn_filter_desc, cudnn_output_grad_desc, @@ -294,17 +426,41 @@ class CUDNNConvGradOpKernel : public framework::OpKernel { } if (filter_grad) { - if (!FLAGS_cudnn_deterministic) { + T* filter_grad_data = filter_grad->mutable_data(ctx.GetPlace()); + if (exhaustive_search) { + AlgorithmsCache& f_algo_cache = + ctx.GetKernelConfig< + AlgorithmsCache>(1); + + filter_algo = f_algo_cache.GetAlgorithm( + x_dims, f_dims, strides, paddings, dilations, 0, [&]() { + int returned_algo_count; + std::array + filter_perf_stat; + + CUDNN_ENFORCE( + platform::dynload:: + cudnnFindConvolutionBackwardFilterAlgorithmEx( + handle, cudnn_input_desc, input_data, + cudnn_output_grad_desc, output_grad_data, + cudnn_conv_desc, cudnn_filter_desc, filter_grad_data, + kNUM_CUDNN_BWD_FILTER_ALGS, &returned_algo_count, + filter_perf_stat.data(), cudnn_workspace_ptr, + workspace_size_limit)); + return filter_perf_stat[0].algo; + }); + VLOG(3) << "cuDNN backward filter algo " << filter_algo; + } else if (FLAGS_cudnn_deterministic) { + filter_algo = CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1; + } else { CUDNN_ENFORCE( platform::dynload::cudnnGetConvolutionBackwardFilterAlgorithm( handle, cudnn_input_desc, cudnn_output_grad_desc, cudnn_conv_desc, cudnn_filter_desc, CUDNN_CONVOLUTION_BWD_FILTER_SPECIFY_WORKSPACE_LIMIT, workspace_size_limit, &filter_algo)); - } else { - filter_algo = CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1; } - CUDNN_ENFORCE( platform::dynload::cudnnGetConvolutionBackwardFilterWorkspaceSize( handle, cudnn_input_desc, cudnn_output_grad_desc, cudnn_conv_desc, @@ -312,6 +468,16 @@ class CUDNNConvGradOpKernel : public framework::OpKernel { workspace_size_in_bytes = std::max(workspace_size_in_bytes, tmp_size); } + // ------------------- cudnn conv workspace --------------------- + if (!cudnn_workspace_ptr) { + cudnn_workspace = + ctx.AllocateTmpTensor( + framework::make_ddim( + {static_cast(workspace_size_in_bytes)}), + dev_ctx); + cudnn_workspace_ptr = static_cast(cudnn_workspace.data()); + } + // ------------------- cudnn conv backward data --------------------- ScalingParamType alpha = 1.0f, beta = 0.0f; if (input_grad) { @@ -319,15 +485,12 @@ class CUDNNConvGradOpKernel : public framework::OpKernel { // Because beta is zero, it is unnecessary to reset input_grad. for (int i = 0; i < groups; i++) { - auto cudnn_func = [&](void* cudnn_workspace) { - CUDNN_ENFORCE(platform::dynload::cudnnConvolutionBackwardData( - handle, &alpha, cudnn_filter_desc, - filter_data + i * group_offset_filter, cudnn_output_grad_desc, - output_grad_data + i * group_offset_out, cudnn_conv_desc, - data_algo, cudnn_workspace, workspace_size_in_bytes, &beta, - cudnn_input_desc, input_grad_data + i * group_offset_in)); - }; - dev_ctx.RunCudnnFuncWithWorkspace(cudnn_func, workspace_size_in_bytes); + CUDNN_ENFORCE(platform::dynload::cudnnConvolutionBackwardData( + handle, &alpha, cudnn_filter_desc, + filter_data + i * group_offset_filter, cudnn_output_grad_desc, + output_grad_data + i * group_offset_out, cudnn_conv_desc, data_algo, + cudnn_workspace_ptr, workspace_size_in_bytes, &beta, + cudnn_input_desc, input_grad_data + i * group_offset_in)); } } // ------------------- cudnn conv backward filter --------------------- @@ -335,15 +498,12 @@ class CUDNNConvGradOpKernel : public framework::OpKernel { T* filter_grad_data = filter_grad->mutable_data(ctx.GetPlace()); // Because beta is zero, it is unnecessary to reset filter_grad. for (int i = 0; i < groups; i++) { - auto cudnn_func = [&](void* cudnn_workspace) { - CUDNN_ENFORCE(platform::dynload::cudnnConvolutionBackwardFilter( - handle, &alpha, cudnn_input_desc, - input_data + i * group_offset_in, cudnn_output_grad_desc, - output_grad_data + i * group_offset_out, cudnn_conv_desc, - filter_algo, cudnn_workspace, workspace_size_in_bytes, &beta, - cudnn_filter_desc, filter_grad_data + i * group_offset_filter)); - }; - dev_ctx.RunCudnnFuncWithWorkspace(cudnn_func, workspace_size_in_bytes); + CUDNN_ENFORCE(platform::dynload::cudnnConvolutionBackwardFilter( + handle, &alpha, cudnn_input_desc, input_data + i * group_offset_in, + cudnn_output_grad_desc, output_grad_data + i * group_offset_out, + cudnn_conv_desc, filter_algo, cudnn_workspace_ptr, + workspace_size_in_bytes, &beta, cudnn_filter_desc, + filter_grad_data + i * group_offset_filter)); } } } @@ -359,7 +519,8 @@ REGISTER_OP_KERNEL(conv2d, CUDNN, plat::CUDAPlace, paddle::operators::CUDNNConvOpKernel); REGISTER_OP_KERNEL(conv2d_grad, CUDNN, plat::CUDAPlace, paddle::operators::CUDNNConvGradOpKernel, - paddle::operators::CUDNNConvGradOpKernel); + paddle::operators::CUDNNConvGradOpKernel, + paddle::operators::CUDNNConvGradOpKernel); REGISTER_OP_KERNEL(conv3d, CUDNN, plat::CUDAPlace, paddle::operators::CUDNNConvOpKernel, diff --git a/paddle/fluid/operators/conv_cudnn_op_cache.h b/paddle/fluid/operators/conv_cudnn_op_cache.h new file mode 100644 index 0000000000000000000000000000000000000000..de92b75a501dfc300bb8b52ebfa7903995847218 --- /dev/null +++ b/paddle/fluid/operators/conv_cudnn_op_cache.h @@ -0,0 +1,51 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include +#include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/platform/cudnn_helper.h" + +DECLARE_uint64(conv_workspace_size_limit); +DECLARE_bool(cudnn_exhaustive_search); +DECLARE_int64(cudnn_exhaustive_search_times); + +namespace paddle { +namespace operators { + +static constexpr char kCUDNNFwdAlgoCache[] = "kCUDNNFwdAlgoCache"; +static constexpr char kCUDNNBwdDataAlgoCache[] = "kCUDNNBwdDataAlgoCache"; +static constexpr char kCUDNNBwdFilterAlgoCache[] = "kCUDNNBwdFilterAlgoCache"; + +static constexpr size_t kCONV_CUDNN_WORKSPACE_LIMIT_BYTES = + static_cast(1024) * 1024 * 1024; + +#if CUDNN_VERSION_MIN(6, 0, 5) +static constexpr size_t kNUM_CUDNN_FWD_ALGS = CUDNN_CONVOLUTION_FWD_ALGO_COUNT; +static constexpr size_t kNUM_CUDNN_BWD_FILTER_ALGS = + CUDNN_CONVOLUTION_BWD_FILTER_ALGO_COUNT; +static constexpr size_t kNUM_CUDNN_BWD_DATA_ALGS = + CUDNN_CONVOLUTION_BWD_DATA_ALGO_COUNT; +#else +// cuDNN v5 has no CUDNN_CONVOLUTION_FWD_ALGO_COUNT etc. +static constexpr size_t kNUM_CUDNN_FWD_ALGS = 7; +static constexpr size_t kNUM_CUDNN_BWD_FILTER_ALGS = 4; +static constexpr size_t kNUM_CUDNN_BWD_DATA_ALGS = 5; +#endif + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/conv_fusion_op.cc b/paddle/fluid/operators/conv_fusion_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..23b8087e781da30ed7b66ba651f8071ecb7aaf50 --- /dev/null +++ b/paddle/fluid/operators/conv_fusion_op.cc @@ -0,0 +1,108 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include "paddle/fluid/operators/conv_op.h" +#ifdef PADDLE_WITH_CUDA +#include "paddle/fluid/platform/cudnn_helper.h" +#endif + +namespace paddle { +namespace operators { + +// This fused conv follows the equation: +// y = act ( alpha1 * conv(x) + alpha2 * z + bias ). +// here, y is Output, +// x is Input, +// z is ResidualData, +// bias is Bias +// When `split_channels` is set, y will be splitted into multiple outputs, +// each output has split_channels[i] number of channels. +class Conv2DFusionOpMaker : public Conv2DOpMaker { + protected: + void Apply() override { + AddAttr( + "activation", + "The activation type can be 'identity', 'sigmoid', 'relu', 'relu6' " + "'relux' , 'tanh', 'band_pass'") + .SetDefault("relu"); + AddAttr>( + "split_channels", + "When `split_channels` are set, there will be multiple outputs, the " + "output size is equal to the number of `split_channels`.") + .SetDefault({}); + AddOutput("Outputs", + "This Outputs is used when setting `split_channels`." + "Usually used to fuse conv with same input and same filter size, " + "padding, stride, dilation size.") + .AsDuplicable() + .AsDispensable(); + AddInput("AlgoCache", + "The cache of convolution algorithm, a RAW type variable.") + .AsDispensable(); + AddAttr( + "search_times", + "The number of exhaustive search times for convolution algorithm.") + .SetDefault(-1); + } +}; + +class Conv2DFusionOpInferShape : public framework::InferShapeBase { + public: + void operator()(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("Input"), + "Input(Input) of ConvOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput("Filter"), + "Input(Filter) of ConvOp should not be null."); + auto in_dims = ctx->GetInputDim("Input"); + auto filter_dims = ctx->GetInputDim("Filter"); + + std::vector strides = ctx->Attrs().Get>("strides"); + std::vector paddings = ctx->Attrs().Get>("paddings"); + std::vector dilations = + ctx->Attrs().Get>("dilations"); + + std::vector oshape({in_dims[0], filter_dims[0]}); + for (size_t i = 0; i < strides.size(); ++i) { + oshape.push_back(ConvOutputSize(in_dims[i + 2], filter_dims[i + 2], + dilations[i], paddings[i], strides[i])); + } + PADDLE_ENFORCE(ctx->HasOutput("Output"), + "Output(Output) of ConvOp should not be null."); + ctx->SetOutputDim("Output", framework::make_ddim(oshape)); + std::vector channels = + ctx->Attrs().Get>("split_channels"); + if (channels.size()) { + PADDLE_ENFORCE(ctx->HasOutputs("Outputs"), + "Output(Outputs) of ConvOp should not be null."); + std::vector oshapes; + oshapes.reserve(channels.size()); + for (size_t i = 0; i < channels.size(); ++i) { + oshapes.push_back({oshape[0], channels[i], oshape[2], oshape[3]}); + } + ctx->SetOutputsDim("Outputs", oshapes); + } + } +}; + +// TODO(qingqing): add gradient operator for conv2d_fusion + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(conv2d_fusion, ops::ConvOp, ops::Conv2DFusionOpMaker, + ops::Conv2DFusionOpInferShape, ops::ConvOpInferVarType, + paddle::framework::EmptyGradOpMaker); diff --git a/paddle/fluid/operators/conv_fusion_op.cu.cc b/paddle/fluid/operators/conv_fusion_op.cu.cc new file mode 100644 index 0000000000000000000000000000000000000000..64152829b4f000e545054e528edca33dfe96ec56 --- /dev/null +++ b/paddle/fluid/operators/conv_fusion_op.cu.cc @@ -0,0 +1,235 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/conv_cudnn_op_cache.h" +#include "paddle/fluid/platform/cudnn_helper.h" + +DEFINE_int64(cudnn_exhaustive_search_times, -1, + "Exhaustive search times for cuDNN convolution, " + "defalut is 1, only search once."); + +namespace paddle { +namespace operators { + +#if CUDNN_VERSION >= 7100 +using Tensor = framework::Tensor; +using ScopedTensorDescriptor = platform::ScopedTensorDescriptor; +using ScopedFilterDescriptor = platform::ScopedFilterDescriptor; +using ScopedConvolutionDescriptor = platform::ScopedConvolutionDescriptor; +using ScopedActivationDescriptor = platform::ScopedActivationDescriptor; +using DataLayout = platform::DataLayout; +using framework::AlgorithmsCache; + +template +using ScalingParamType = typename platform::CudnnDataType::ScalingParamType; + +template +class CUDNNConvFusionOpKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto& dev_ctx = ctx.template device_context(); + auto* input = ctx.Input("Input"); + auto* filter = ctx.Input("Filter"); + auto* bias = ctx.Input("Bias"); + PADDLE_ENFORCE(bias, "The bias should not be null."); + auto* residual = ctx.Input("ResidualData"); + auto* output = ctx.Output("Output"); + + std::vector strides = ctx.Attr>("strides"); + std::vector paddings = ctx.Attr>("paddings"); + std::vector dilations = ctx.Attr>("dilations"); + const std::string activation = ctx.Attr("activation"); + int groups = ctx.Attr("groups"); + int64_t user_workspace_size = + static_cast(ctx.Attr("workspace_size_MB")); + bool exhaustive_search = + FLAGS_cudnn_exhaustive_search || ctx.Attr("exhaustive_search"); + + const T* input_data = input->data(); + const T* filter_data = filter->data(); + const T* bias_data = bias->data(); + T* output_data = output->mutable_data(ctx.GetPlace()); + const T* residual_data = residual ? residual->data() : output_data; + + // ------------------- cudnn descriptors --------------------- + ScopedTensorDescriptor input_desc; + ScopedTensorDescriptor output_desc; + ScopedFilterDescriptor filter_desc; + ScopedTensorDescriptor bias_desc; + ScopedConvolutionDescriptor conv_desc; + ScopedActivationDescriptor act_desc; + DataLayout layout = DataLayout::kNCHW; + if (input->dims().size() == 5) { + layout = DataLayout::kNCDHW; + } + + cudnnConvolutionDescriptor_t cudnn_conv_desc = + conv_desc.descriptor(paddings, strides, dilations); + CUDNN_ENFORCE(platform::dynload::cudnnSetConvolutionGroupCount( + cudnn_conv_desc, groups)); + + cudnnTensorDescriptor_t cudnn_input_desc = input_desc.descriptor( + layout, framework::vectorize2int(input->dims())); + cudnnTensorDescriptor_t cudnn_output_desc = output_desc.descriptor( + layout, framework::vectorize2int(output->dims())); + cudnnFilterDescriptor_t cudnn_filter_desc = filter_desc.descriptor( + layout, framework::vectorize2int(filter->dims())); + // Now only support NCHW + std::vector bias_dim = {1, static_cast(output->dims()[1]), 1, 1}; + cudnnTensorDescriptor_t cudnn_bias_desc = + bias_desc.descriptor(layout, bias_dim); + cudnnActivationDescriptor_t cudnn_act_desc = + act_desc.descriptor(activation); + + // ------------------- cudnn conv workspace --------------------- + size_t workspace_size_in_bytes; // final workspace to allocate. + size_t workspace_size_limit = kCONV_CUDNN_WORKSPACE_LIMIT_BYTES; + if (FLAGS_conv_workspace_size_limit > 0 || user_workspace_size > 0) { + int64_t max_user_size = + std::max(static_cast(FLAGS_conv_workspace_size_limit), + user_workspace_size); + workspace_size_limit = max_user_size * 1024 * 1024; + } + + // ------------------- cudnn conv algorithm --------------------- + cudnnConvolutionFwdAlgo_t algo; + auto handle = dev_ctx.cudnn_handle(); + auto workspace_handle = dev_ctx.cudnn_workspace_handle(); + + CUDNN_ENFORCE(platform::dynload::cudnnSetConvolutionMathType( + cudnn_conv_desc, CUDNN_DEFAULT_MATH)); + + auto x_dims = framework::vectorize(input->dims()); + auto f_dims = framework::vectorize(filter->dims()); + if (!exhaustive_search) { + CUDNN_ENFORCE(platform::dynload::cudnnGetConvolutionForwardAlgorithm( + handle, cudnn_input_desc, cudnn_filter_desc, cudnn_conv_desc, + cudnn_output_desc, CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT, + workspace_size_limit, &algo)); + VLOG(3) << "cuDNN forward algo " << algo; + } else { + auto search_func = [&]() { + int returned_algo_count; + std::array + fwd_perf_stat; + auto cudnn_find_func = [&](void* cudnn_workspace) { + CUDNN_ENFORCE( + platform::dynload::cudnnFindConvolutionForwardAlgorithmEx( + handle, cudnn_input_desc, input_data, cudnn_filter_desc, + filter_data, cudnn_conv_desc, cudnn_output_desc, output_data, + kNUM_CUDNN_FWD_ALGS, &returned_algo_count, + fwd_perf_stat.data(), cudnn_workspace, workspace_size_limit)); + }; + workspace_handle.RunFunc(cudnn_find_func, workspace_size_limit); + VLOG(3) << "Perf result: (algo: stat, time, memory)"; + for (int i = 0; i < returned_algo_count; ++i) { + const auto& stat = fwd_perf_stat[i]; + VLOG(3) << stat.algo << ": " << stat.status << " " << stat.time << " " + << stat.memory; + } + return fwd_perf_stat[0].algo; + }; + AlgorithmsCache& algo_cache = + ctx.GetKernelConfig>(0); + int search_times = ctx.Attr("search_times"); + search_times = std::max( + static_cast(FLAGS_cudnn_exhaustive_search_times), search_times); + // TODO(dangqingqing): Unify this if-else. + if (search_times > 0) { + // The searched algo will be cached by `search_times` times for + // different input dimension. For other dimensions, select the algo + // of closest area. + algo = algo_cache.GetAlgorithm(x_dims[2] * x_dims[3], search_times, 0, + search_func); + } else { + algo = algo_cache.GetAlgorithm(x_dims, f_dims, strides, paddings, + dilations, 0, search_func); + } + VLOG(3) << "choose algo " << algo; + } + + CUDNN_ENFORCE(platform::dynload::cudnnGetConvolutionForwardWorkspaceSize( + handle, cudnn_input_desc, cudnn_filter_desc, cudnn_conv_desc, + cudnn_output_desc, algo, &workspace_size_in_bytes)); + PADDLE_ENFORCE_LE(workspace_size_in_bytes, workspace_size_limit, + "workspace_size to be allocated exceeds the limit"); + + if ((activation == "identity") && (!residual)) { + // Only the CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algo is + // enabled with CUDNN_ACTIVATION_IDENTITY in cuDNN lib. + // But test in some case, the speed is slower, change to use + // cudnnConvolutionForward and cudnnAddTensor + // ------------- cudnn conv forward and bias add --------------------- + ScalingParamType alpha = 1.0f, beta = 0.0f; + auto cudnn_func = [&](void* cudnn_workspace) { + CUDNN_ENFORCE(platform::dynload::cudnnConvolutionForward( + handle, &alpha, cudnn_input_desc, input_data, cudnn_filter_desc, + filter_data, cudnn_conv_desc, algo, cudnn_workspace, + workspace_size_in_bytes, &beta, cudnn_output_desc, output_data)); + }; + workspace_handle.RunFunc(cudnn_func, workspace_size_in_bytes); + CUDNN_ENFORCE(platform::dynload::cudnnAddTensor( + handle, &alpha, cudnn_bias_desc, bias_data, &alpha, cudnn_output_desc, + output_data)); + } else { + if (activation == "identity") { + algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM; + } + // ------------------- cudnn conv+bias+act forward -------------------- + ScalingParamType alpha1 = 1.0f; + ScalingParamType alpha2 = residual ? 1.0f : 0.0f; + auto cudnn_func = [&](void* cudnn_workspace) { + CUDNN_ENFORCE(platform::dynload::cudnnConvolutionBiasActivationForward( + handle, &alpha1, cudnn_input_desc, input_data, cudnn_filter_desc, + filter_data, cudnn_conv_desc, algo, cudnn_workspace, + workspace_size_in_bytes, &alpha2, cudnn_output_desc, residual_data, + cudnn_bias_desc, bias_data, cudnn_act_desc, cudnn_output_desc, + output_data)); + }; + workspace_handle.RunFunc(cudnn_func, workspace_size_in_bytes); + } + std::vector channels = ctx.Attr>("split_channels"); + if (channels.size()) { + auto outs = ctx.MultiOutput("Outputs"); + if (x_dims[0] == 1) { + // share data with Output + framework::Tensor t; + t.ShareDataWith(*output); + auto y_dims = output->dims(); + t.Resize({y_dims[1], y_dims[2], y_dims[3]}); + int s = 0; + for (size_t i = 0; i < channels.size(); ++i) { + int e = s + channels[i]; + outs[i]->ShareDataWith(t.Slice(s, e)); + outs[i]->Resize({x_dims[0], channels[i], y_dims[2], y_dims[3]}); + s = e; + } + } else { + // TODO(qingiqng): do copy when batch size large than 1 + PADDLE_THROW("Batch size greater than 1 is Unsupported"); + } + } + } +}; +#endif + +} // namespace operators +} // namespace paddle + +#if CUDNN_VERSION >= 7100 +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL(conv2d_fusion, ops::CUDNNConvFusionOpKernel, + ops::CUDNNConvFusionOpKernel); +#endif diff --git a/paddle/fluid/operators/conv_mkldnn_op.cc b/paddle/fluid/operators/conv_mkldnn_op.cc deleted file mode 100644 index eae65968285703f5882d910e29bc5d8e1511cba6..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/conv_mkldnn_op.cc +++ /dev/null @@ -1,680 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. */ - -#include "paddle/fluid/operators/conv_op.h" -#include "paddle/fluid/platform/mkldnn_helper.h" - -namespace paddle { -namespace operators { - -using framework::DataLayout; -using mkldnn::memory; -using mkldnn::primitive; -using mkldnn::reorder; -using mkldnn::stream; -using platform::to_void_cast; -using platform::GetMKLDNNFormat; - -class ConvMKLDNNHandler : public platform::MKLDNNHandler { - public: - ConvMKLDNNHandler( - std::shared_ptr conv_pd, - const platform::MKLDNNDeviceContext& dev_ctx, mkldnn::engine engine, - const std::string& base_key) - : platform::MKLDNNHandler(dev_ctx, engine, base_key) { - conv_pd_ = conv_pd; - } - - ConvMKLDNNHandler( - std::shared_ptr conv_pd, - std::shared_ptr - conv_bwd_data_pd, - std::shared_ptr - conv_bwd_weights_pd, - const platform::MKLDNNDeviceContext& dev_ctx, mkldnn::engine engine, - const std::string& base_key) - : platform::MKLDNNHandler(dev_ctx, engine, base_key), - conv_pd_(conv_pd), - conv_bwd_weights_pd_(conv_bwd_weights_pd), - conv_bwd_data_pd_(conv_bwd_data_pd) { - // If we are in Grad operatgor then update a key with BWD suffix to - // distinguish from FWD memory primitives - key_ += "-BWD"; - } - - size_t GetDstMemorySize() const { - return conv_pd_->dst_primitive_desc().get_size(); - } - - size_t GetDiffWeightsMemorySize() const { - return conv_bwd_weights_pd_->diff_weights_primitive_desc().get_size(); - } - - size_t GetDiffSourceMemorySize() const { - return conv_bwd_data_pd_->diff_src_primitive_desc().get_size(); - } - - std::shared_ptr AcquireSrcMemoryFromWeightsPrimitive( - const std::shared_ptr user_memory_p, - std::vector& pipeline) { // NOLINT - auto src_pd = conv_bwd_weights_pd_->src_primitive_desc(); - auto user_pd = user_memory_p->get_primitive_desc(); - return this->AcquireMemory(src_pd, user_pd, user_memory_p, - "@weights-src_mem_p", pipeline); - } - - std::shared_ptr AcquireDiffDstMemoryFromWeightsPrimitive( - const std::shared_ptr user_memory_p, - std::vector& pipeline) { // NOLINT - auto diff_dst_pd = conv_bwd_weights_pd_->diff_dst_primitive_desc(); - auto user_pd = user_memory_p->get_primitive_desc(); - return this->AcquireMemory(diff_dst_pd, user_pd, user_memory_p, - "@weights-diff_dst_mem_p", pipeline); - } - - std::shared_ptr AcquireDiffWeightsMemoryFromWeightsPrimitive( - void* ptr) { - return this->AcquireMemoryFromPrimitive( - conv_bwd_weights_pd_->diff_weights_primitive_desc(), ptr, - "@diff_weights_mem_p"); - } - - std::shared_ptr AcquireDiffDstMemoryFromDataPrimitive( - const std::shared_ptr user_memory_p, - std::vector& pipeline) { // NOLINT - auto diff_dst_pd = conv_bwd_data_pd_->diff_dst_primitive_desc(); - auto user_pd = user_memory_p->get_primitive_desc(); - return this->AcquireMemory(diff_dst_pd, user_pd, user_memory_p, - "@data-diff_dst_mem_p", pipeline); - } - - std::shared_ptr AcquireWeightsMemoryFromDataPrimitive( - const std::shared_ptr user_weights_memory_p, - std::vector& pipeline) { // NOLINT - auto weights_pd = conv_bwd_data_pd_->weights_primitive_desc(); - auto user_pd = user_weights_memory_p->get_primitive_desc(); - return this->AcquireMemory(weights_pd, user_pd, user_weights_memory_p, - "@data-weights_mem_p", pipeline); - } - - std::shared_ptr AcquireDiffSrcMemoryFromDataPrimitive( - void* ptr) { - return this->AcquireMemoryFromPrimitive( - conv_bwd_data_pd_->diff_src_primitive_desc(), ptr, "@diff_src_mem_p"); - } - - std::shared_ptr AcquireDstMemoryFromPrimitive(void* ptr) { - return this->AcquireMemoryFromPrimitive(conv_pd_->dst_primitive_desc(), ptr, - "@dst_mem_p"); - } - - std::shared_ptr AcquireSrcMemoryFromPrimitive( - const std::shared_ptr user_memory_p, - std::vector& pipeline) { // NOLINT - auto src_pd = conv_pd_->src_primitive_desc(); - auto user_pd = user_memory_p->get_primitive_desc(); - return this->AcquireMemory(src_pd, user_pd, user_memory_p, "@src_mem_p", - pipeline); - } - - std::shared_ptr AcquireWeightsMemoryFromPrimitive( - const std::shared_ptr user_weights_memory_p, - std::vector& pipeline, // NOLINT - bool is_persistent = false) { - auto user_weights_pd = user_weights_memory_p->get_primitive_desc(); - auto weights_pd = conv_pd_->weights_primitive_desc(); - return this->AcquireMemory(weights_pd, user_weights_pd, - user_weights_memory_p, "@weights_mem_p", - pipeline, is_persistent); - } - - std::shared_ptr AcquireBiasMemoryFromPrimitive( - const std::shared_ptr user_bias_memory_p, - std::vector& pipeline) { // NOLINT - auto user_bias_pd = user_bias_memory_p->get_primitive_desc(); - auto bias_pd = conv_pd_->bias_primitive_desc(); - return this->AcquireMemory(bias_pd, user_bias_pd, user_bias_memory_p, - "@bias_mem_p", pipeline); - } - - std::shared_ptr AcquireConvolution( - std::shared_ptr src_memory_p, - std::shared_ptr weights_memory_p, - std::shared_ptr dst_memory_p) { - auto prim_key = key_ + "@conv_p"; - auto conv_p = std::static_pointer_cast( - dev_ctx_.GetBlob(prim_key)); - PADDLE_ENFORCE((conv_p != nullptr) || (is_reusing_ == false), - "Fail to find convolution primitive in device context"); - if (conv_p == nullptr) { - conv_p = std::make_shared( - *conv_pd_, *(src_memory_p), *(weights_memory_p.get()), - *(dst_memory_p.get())); - - dev_ctx_.SetBlob(prim_key, conv_p); - } else { - is_reusing_ = true; - } - return conv_p; - } - - std::shared_ptr AcquireConvolution( - std::shared_ptr src_memory_p, - std::shared_ptr weights_memory_p, - std::shared_ptr bias_memory_p, - std::shared_ptr dst_memory_p) { - auto prim_key = key_ + "@conv_p"; - auto conv_p = std::static_pointer_cast( - dev_ctx_.GetBlob(prim_key)); - PADDLE_ENFORCE((conv_p != nullptr) || (is_reusing_ == false), - "Fail to find convolution primitive in device context"); - if (conv_p == nullptr) { - conv_p = std::make_shared( - *conv_pd_, *(src_memory_p), *(weights_memory_p.get()), - *(bias_memory_p.get()), *(dst_memory_p.get())); - - dev_ctx_.SetBlob(prim_key, conv_p); - } else { - is_reusing_ = true; - } - return conv_p; - } - - std::shared_ptr - AcquireConvolutionBackwardWeights( - std::shared_ptr src_memory_p, - std::shared_ptr diff_dst_memory_p, - std::shared_ptr diff_weights_memory_p) { - auto prim_key = key_ + "@conv_bwd_weights_p"; - auto conv_bwd_weights_p = - std::static_pointer_cast( - dev_ctx_.GetBlob(prim_key)); - PADDLE_ENFORCE( - (conv_bwd_weights_p != nullptr) || (is_reusing_ == false), - "Fail to find convolution bwd weights primitive in device context"); - if (conv_bwd_weights_p == nullptr) { - // create backward conv primitive for weights - conv_bwd_weights_p = - std::make_shared( - *conv_bwd_weights_pd_, *src_memory_p, *diff_dst_memory_p, - *diff_weights_memory_p); - dev_ctx_.SetBlob(prim_key, conv_bwd_weights_p); - } else { - is_reusing_ = true; - } - return conv_bwd_weights_p; - } - - std::shared_ptr - AcquireConvolutionBackwardData( - std::shared_ptr diff_dst_memory_p, - std::shared_ptr weights_memory_p, - std::shared_ptr diff_src_memory_p) { - auto prim_key = key_ + "@conv_bwd_data_p"; - auto conv_bwd_data_p = - std::static_pointer_cast( - dev_ctx_.GetBlob(prim_key)); - PADDLE_ENFORCE( - (conv_bwd_data_p != nullptr) || (is_reusing_ == false), - "Fail to find convolution bwd data primitive in device context"); - if (conv_bwd_data_p == nullptr) { - conv_bwd_data_p = std::make_shared( - *conv_bwd_data_pd_, *diff_dst_memory_p, *weights_memory_p, - *diff_src_memory_p); - dev_ctx_.SetBlob(prim_key, conv_bwd_data_p); - } else { - is_reusing_ = true; - } - return conv_bwd_data_p; - } - - // Generate keys for storing/retriving primitives for this operator - // TODO(jczaja): Make hashing function more optimial - static std::string GetHash(memory::dims& input_dims, // NOLINT - memory::dims& weights_dims, // NOLINT - std::vector& strides, // NOLINT - std::vector& paddings, // NOLINT - std::vector& dilations, // NOLINT - int groups, const std::string& suffix) { - return dims2str(input_dims) + dims2str(weights_dims) + dims2str(strides) + - dims2str(paddings) + dims2str(dilations) + std::to_string(groups) + - suffix; - } - - private: - std::shared_ptr conv_pd_; - std::shared_ptr - conv_bwd_weights_pd_; - std::shared_ptr - conv_bwd_data_pd_; -}; - -template -class ConvMKLDNNOpKernel : public paddle::framework::OpKernel { - public: - void Compute(const paddle::framework::ExecutionContext& ctx) const override { - PADDLE_ENFORCE(paddle::platform::is_cpu_place(ctx.GetPlace()), - "It must use CPUPlace."); - - const bool is_test = ctx.Attr("is_test"); - - auto& dev_ctx = - ctx.template device_context(); - const auto& mkldnn_engine = dev_ctx.GetEngine(); - - auto* input = ctx.Input("Input"); - auto* filter = ctx.Input("Filter"); - auto* bias = ctx.HasInput("Bias") ? ctx.Input("Bias") : nullptr; - auto* output = ctx.Output("Output"); - - PADDLE_ENFORCE(input->layout() == DataLayout::kMKLDNN && - input->format() != memory::format::format_undef, - "Wrong layout/format set for Input tensor"); - PADDLE_ENFORCE(filter->layout() == DataLayout::kMKLDNN && - filter->format() != memory::format::format_undef, - "Wrong layout/format set for Filter tensor"); - PADDLE_ENFORCE(input->dims().size() == 4, - "Input must be with 4 dimensions, i.e. NCHW"); - PADDLE_ENFORCE(filter->dims().size() == 4, - "Filter must be with 4 dimensions, i.e. OIHW"); - if (bias) { - PADDLE_ENFORCE(bias->layout() == DataLayout::kMKLDNN && - bias->format() != memory::format::format_undef, - "Wrong layout/format set for Bias tensor"); - PADDLE_ENFORCE(bias->dims().size() == 1, - "Bias must only have 1 dimension, i.e. X"); - } - - std::vector strides = ctx.Attr>("strides"); - std::vector paddings = ctx.Attr>("paddings"); - std::vector dilations = ctx.Attr>("dilations"); - bool fuse_relu = ctx.Attr("fuse_relu"); - bool fuse_eltwise = ctx.Attr("fuse_eltwise"); - int groups = ctx.Attr("groups"); - - // TODO: add support for dilation - PADDLE_ENFORCE( - dilations.size() == 2 && dilations[0] == 1 && dilations[1] == 1, - "dilation in convolution is not implemented yet"); - - const T* input_data = input->data(); - const T* filter_data = filter->data(); - - std::vector src_tz = paddle::framework::vectorize2int(input->dims()); - std::vector weights_tz = - paddle::framework::vectorize2int(filter->dims()); - int g = std::max(groups, 1); - if (g > 1) { - int o = weights_tz[0]; - int i = weights_tz[1]; - int h = weights_tz[2]; - int w = weights_tz[3]; - weights_tz.resize(5); - weights_tz[0] = g; - weights_tz[1] = o / g; - weights_tz[2] = i; - weights_tz[3] = h; - weights_tz[4] = w; - } - std::vector dst_tz = paddle::framework::vectorize2int(output->dims()); - - // Get unique name for storing MKLDNN primitives - const std::string key = ConvMKLDNNHandler::GetHash( - src_tz, weights_tz, strides, paddings, dilations, groups, - ctx.op().Output("Output")); - const std::string key_conv_pd = key + "@conv_pd"; - - std::vector pipeline; - - auto user_src_md = platform::MKLDNNMemDesc( - {src_tz}, platform::MKLDNNGetDataType(), input->format()); - auto user_weights_md = platform::MKLDNNMemDesc( - {weights_tz}, platform::MKLDNNGetDataType(), - (g == 1) ? filter->format() : mkldnn::memory::format::goihw); - - /* create memory descriptor for convolution without specified format - * ('any') which lets a primitive (convolution in this case) choose - * the memory format preferred for best performance - */ - std::string data_format = ctx.Attr("data_format"); - auto chosen_memory_format = - platform::data_format_to_memory_format(data_format); - - auto src_md = platform::MKLDNNMemDesc( - src_tz, platform::MKLDNNGetDataType(), chosen_memory_format); - auto weights_md = platform::MKLDNNMemDesc( - weights_tz, platform::MKLDNNGetDataType(), - (g == 1) ? chosen_memory_format : mkldnn::memory::format::goihw); - std::vector bias_tz; // TODO(mgallus): avoid empty vector creation. - // Currently used whenever bias is != nullptr. - auto dst_md = platform::MKLDNNMemDesc( - dst_tz, platform::MKLDNNGetDataType(), chosen_memory_format); - - // create a conv primitive descriptor and save it for usage in backward - std::shared_ptr conv_pd; - if (bias) { - bias_tz = paddle::framework::vectorize2int(bias->dims()); - auto bias_md = platform::MKLDNNMemDesc( - bias_tz, platform::MKLDNNGetDataType(), memory::format::x); - conv_pd = ConvFwdPrimitiveDesc(src_md, weights_md, bias_md, dst_md, - strides, paddings, mkldnn_engine, - fuse_relu, fuse_eltwise); - } else { - conv_pd = - ConvFwdPrimitiveDesc(src_md, weights_md, dst_md, strides, paddings, - mkldnn_engine, fuse_relu, fuse_eltwise); - } - // Save conv_pd/src_memory/weights_memory for backward pass - dev_ctx.SetBlob(key_conv_pd, conv_pd); - - ConvMKLDNNHandler handler(conv_pd, dev_ctx, mkldnn_engine, key); - - // create mkldnn memory from input tensors (data/weights) - auto user_src_memory_p = - handler.AcquireSrcMemory(user_src_md, to_void_cast(input_data)); - auto user_weights_memory_p = handler.AcquireWeightsMemory( - user_weights_md, to_void_cast(filter_data)); - - T* output_data = - output->mutable_data(ctx.GetPlace(), handler.GetDstMemorySize()); - // create reorder primitive if the input format is not the preferred one - auto src_memory_p = - handler.AcquireSrcMemoryFromPrimitive(user_src_memory_p, pipeline); - auto weights_memory_p = handler.AcquireWeightsMemoryFromPrimitive( - user_weights_memory_p, pipeline, is_test); - auto dst_memory_p = - handler.AcquireDstMemoryFromPrimitive(to_void_cast(output_data)); - - // create convolution op primitive - std::shared_ptr conv_p; - if (bias) { - const T* bias_data = bias->data(); - auto user_bias_md = platform::MKLDNNMemDesc( - {bias_tz}, platform::MKLDNNGetDataType(), memory::format::x); - auto user_bias_memory_p = - handler.AcquireBiasMemory(user_bias_md, to_void_cast(bias_data)); - - auto bias_memory_p = - handler.AcquireBiasMemoryFromPrimitive(user_bias_memory_p, pipeline); - conv_p = handler.AcquireConvolution(src_memory_p, weights_memory_p, - bias_memory_p, dst_memory_p); - } else { - conv_p = handler.AcquireConvolution(src_memory_p, weights_memory_p, - dst_memory_p); - } - - // push primitive to stream and wait until it's executed - pipeline.push_back(*conv_p); - stream(stream::kind::eager).submit(pipeline).wait(); - - output->set_layout(DataLayout::kMKLDNN); - output->set_format(GetMKLDNNFormat(*dst_memory_p)); - } - - private: - mkldnn::primitive_attr CreatePostOps(bool fuse_relu, - bool fuse_eltwise) const { - mkldnn::primitive_attr conv_attr; - mkldnn::post_ops post_operations; - // Fusion with Elementwise layer relies on adding a sum post-operation with - // the scale parameter. It is assumed that when fuse_eltwise is true, the - // Output tensor contains the data coming from residual connection. The - // result of this post_op is: Output = scale * Output + Conv_Out. - if (fuse_eltwise) { - post_operations.append_sum(1.0f); - } - // Fusion with ReLU layer is executed through the PostOps feature. Create a - // PostOps object and configure it to execute an eltwise relu operation. - if (fuse_relu) { - constexpr float scale = 1.0f; - constexpr float negative_slope = 0.0f; - constexpr float placeholder = 0.0f; - post_operations.append_eltwise(scale, mkldnn::algorithm::eltwise_relu, - negative_slope, placeholder); - } - conv_attr.set_post_ops(post_operations); - return conv_attr; - } - - std::unique_ptr - ConvFwdPrimitiveDesc(const memory::desc& src, const memory::desc& weights, - const memory::desc& dst, const std::vector& strides, - const std::vector& paddings, - const mkldnn::engine& engine, const bool fuse_relu, - const bool fuse_eltwise) const { - memory::dims stride_dims = {strides[0], strides[1]}; - memory::dims padding_dims = {paddings[0], paddings[1]}; - - auto conv_desc = mkldnn::convolution_forward::desc( - mkldnn::prop_kind::forward, mkldnn::convolution_direct, src, weights, - dst, stride_dims, padding_dims, padding_dims, - mkldnn::padding_kind::zero); - - mkldnn::primitive_attr conv_attr = CreatePostOps(fuse_relu, fuse_eltwise); - - auto p_conv_pd = new mkldnn::convolution_forward::primitive_desc( - conv_desc, conv_attr, engine); - - return std::unique_ptr( - p_conv_pd); - } - - std::unique_ptr - ConvFwdPrimitiveDesc(const memory::desc& src, const memory::desc& weights, - const memory::desc& bias, const memory::desc& dst, - const std::vector& strides, - const std::vector& paddings, - const mkldnn::engine& engine, const bool fuse_relu, - const bool fuse_eltwise) const { - memory::dims stride_dims = {strides[0], strides[1]}; - memory::dims padding_dims = {paddings[0], paddings[1]}; - - auto conv_desc = mkldnn::convolution_forward::desc( - mkldnn::prop_kind::forward, mkldnn::convolution_direct, src, weights, - bias, dst, stride_dims, padding_dims, padding_dims, - mkldnn::padding_kind::zero); - - mkldnn::primitive_attr conv_attr = CreatePostOps(fuse_relu, fuse_eltwise); - - auto p_conv_pd = new mkldnn::convolution_forward::primitive_desc( - conv_desc, conv_attr, engine); - - return std::unique_ptr( - p_conv_pd); - } -}; - -template -class ConvMKLDNNGradOpKernel : public paddle::framework::OpKernel { - public: - void Compute(const paddle::framework::ExecutionContext& ctx) const override { - PADDLE_ENFORCE(paddle::platform::is_cpu_place(ctx.GetPlace()), - "It must use CPUPlace."); - - auto& dev_ctx = - ctx.template device_context(); - const auto& mkldnn_engine = dev_ctx.GetEngine(); - - const Tensor* input = ctx.Input("Input"); - const Tensor* filter = ctx.Input("Filter"); - const Tensor* output = ctx.Input("Output"); - const Tensor* output_grad = - ctx.Input(framework::GradVarName("Output")); - Tensor* input_grad = ctx.Output(framework::GradVarName("Input")); - Tensor* filter_grad = ctx.Output(framework::GradVarName("Filter")); - - PADDLE_ENFORCE(input->layout() == DataLayout::kMKLDNN && - input->format() != memory::format::format_undef, - "Wrong layout/format set for Input tensor"); - PADDLE_ENFORCE(filter->layout() == DataLayout::kMKLDNN && - filter->format() != memory::format::format_undef, - "Wrong layout/format set for Filter tensor"); - PADDLE_ENFORCE(output->layout() == DataLayout::kMKLDNN && - output->format() != memory::format::format_undef, - "Wrong layout/format set for Output tensor"); - PADDLE_ENFORCE(output_grad->layout() == DataLayout::kMKLDNN && - output_grad->format() != memory::format::format_undef, - "Wrong layout/format set for output_grad tensor"); - - if (!input_grad && !filter_grad) return; - - std::vector strides = ctx.Attr>("strides"); - std::vector paddings = ctx.Attr>("paddings"); - std::vector dilations = ctx.Attr>("dilations"); - int groups = ctx.Attr("groups"); - - const T* input_data = input->data(); - const T* filter_data = filter->data(); - const T* output_grad_data = output_grad->data(); - T* input_grad_data = nullptr; - T* filter_grad_data = nullptr; - - std::vector src_tz = paddle::framework::vectorize2int(input->dims()); - std::vector weights_tz = - paddle::framework::vectorize2int(filter->dims()); - std::vector dst_tz = paddle::framework::vectorize2int(output->dims()); - - // Get an unique name from "argument" name of "Output" variable - // as well as attributes of primitive to be created - // This name will be used as key when saving info into device context - const std::string key = - ConvMKLDNNHandler::GetHash(src_tz, weights_tz, strides, paddings, - dilations, groups, ctx.op().Input("Output")); - - const std::string key_conv_pd = key + "@conv_pd"; - std::vector pipeline; - - // Create user memory descriptors - auto user_src_md = platform::MKLDNNMemDesc( - {src_tz}, platform::MKLDNNGetDataType(), input->format()); - auto user_weights_md = platform::MKLDNNMemDesc( - {weights_tz}, platform::MKLDNNGetDataType(), filter->format()); - auto user_diff_dst_md = platform::MKLDNNMemDesc( - {dst_tz}, platform::MKLDNNGetDataType(), output_grad->format()); - - /* create memory descriptor for conv backward without specified format - * ('any') which lets a primitive (conv backward in this case) choose - * the memory format preferred for best performance - */ - std::string data_format = ctx.Attr("data_format"); - auto chosen_memory_format = - platform::data_format_to_memory_format(data_format); - - auto src_md = platform::MKLDNNMemDesc( - src_tz, platform::MKLDNNGetDataType(), chosen_memory_format); - auto diff_src_md = platform::MKLDNNMemDesc( - src_tz, platform::MKLDNNGetDataType(), chosen_memory_format); - auto weights_md = platform::MKLDNNMemDesc( - weights_tz, platform::MKLDNNGetDataType(), chosen_memory_format); - auto diff_weights_md = platform::MKLDNNMemDesc( - weights_tz, platform::MKLDNNGetDataType(), chosen_memory_format); - auto diff_dst_md = platform::MKLDNNMemDesc( - dst_tz, platform::MKLDNNGetDataType(), chosen_memory_format); - - // Retrieve conv_pd from device context - auto conv_pd = - std::static_pointer_cast( - dev_ctx.GetBlob(key_conv_pd)); - PADDLE_ENFORCE(conv_pd != nullptr, - "Fail to find conv_pd in device context"); - - // create backward convolution weights primitive descriptor - auto conv_bwd_weights_desc = mkldnn::convolution_backward_weights::desc( - mkldnn::convolution_direct, src_md, diff_weights_md, diff_dst_md, - strides, paddings, paddings, mkldnn::padding_kind::zero); - auto conv_bwd_weights_pd = - std::make_shared( - conv_bwd_weights_desc, mkldnn_engine, *conv_pd); - - // create backward convolution data primitive descriptor - auto conv_bwd_data_desc = mkldnn::convolution_backward_data::desc( - mkldnn::convolution_direct, diff_src_md, weights_md, diff_dst_md, - strides, paddings, paddings, mkldnn::padding_kind::zero); - auto conv_bwd_data_pd = - std::make_shared( - conv_bwd_data_desc, mkldnn_engine, *conv_pd); - - ConvMKLDNNHandler handler(conv_pd, conv_bwd_data_pd, conv_bwd_weights_pd, - dev_ctx, mkldnn_engine, key); - - // create mkldnn memory from input tensors (data/weights) - auto user_src_memory_p = - handler.AcquireSrcMemory(user_src_md, to_void_cast(input_data)); - auto user_weights_memory_p = handler.AcquireWeightsMemory( - user_weights_md, to_void_cast(filter_data)); - auto user_diff_dst_memory_p = handler.AcquireDiffDstMemory( - user_diff_dst_md, to_void_cast(output_grad_data)); - - // create backward conv primitive for weights - if (filter_grad) { - auto src_memory_p = handler.AcquireSrcMemoryFromWeightsPrimitive( - user_src_memory_p, pipeline); - - auto diff_dst_memory_4filter_p = - handler.AcquireDiffDstMemoryFromWeightsPrimitive( - user_diff_dst_memory_p, pipeline); - - const size_t size = handler.GetDiffWeightsMemorySize(); - filter_grad_data = filter_grad->mutable_data(ctx.GetPlace(), size); - - auto diff_weights_memory_p = - handler.AcquireDiffWeightsMemoryFromWeightsPrimitive( - reinterpret_cast(filter_grad_data)); - - auto conv_bwd_weights_p = handler.AcquireConvolutionBackwardWeights( - src_memory_p, diff_dst_memory_4filter_p, diff_weights_memory_p); - - // push primitive to stream and wait until it's executed - pipeline.push_back(*conv_bwd_weights_p); - - filter_grad->set_layout(DataLayout::kMKLDNN); - filter_grad->set_format(GetMKLDNNFormat(*diff_weights_memory_p)); - } - - if (input_grad) { - auto weights_memory_p = handler.AcquireWeightsMemoryFromDataPrimitive( - user_weights_memory_p, pipeline); - - auto diff_dst_memory_4data_p = - handler.AcquireDiffDstMemoryFromDataPrimitive(user_diff_dst_memory_p, - pipeline); - - const size_t size = handler.GetDiffSourceMemorySize(); - input_grad_data = input_grad->mutable_data(ctx.GetPlace(), size); - - auto diff_src_memory_p = handler.AcquireDiffSrcMemoryFromDataPrimitive( - reinterpret_cast(input_grad_data)); - - auto conv_bwd_data_p = handler.AcquireConvolutionBackwardData( - diff_dst_memory_4data_p, weights_memory_p, diff_src_memory_p); - - pipeline.push_back(*conv_bwd_data_p); - - input_grad->set_layout(DataLayout::kMKLDNN); - input_grad->set_format(GetMKLDNNFormat(*diff_src_memory_p)); - } - stream(stream::kind::eager).submit(pipeline).wait(); - } // Compute() -}; - -} // namespace operators -} // namespace paddle - -namespace ops = paddle::operators; - -REGISTER_OP_KERNEL(conv2d, MKLDNN, ::paddle::platform::CPUPlace, - ops::ConvMKLDNNOpKernel); - -REGISTER_OP_KERNEL(conv2d_grad, MKLDNN, ::paddle::platform::CPUPlace, - ops::ConvMKLDNNGradOpKernel); diff --git a/paddle/fluid/operators/conv_op.cc b/paddle/fluid/operators/conv_op.cc index 8f84bf71a7f77606bed6672f0830e3fc80165a42..619e12e6ba7c73e46beafadd50770aedfb52c964 100644 --- a/paddle/fluid/operators/conv_op.cc +++ b/paddle/fluid/operators/conv_op.cc @@ -14,10 +14,12 @@ limitations under the License. */ #include "paddle/fluid/operators/conv_op.h" +#include #include #include #ifdef PADDLE_WITH_CUDA +#include "paddle/fluid/operators/conv_cudnn_op_cache.h" #include "paddle/fluid/platform/cudnn_helper.h" #endif #ifdef PADDLE_WITH_MKLDNN @@ -44,7 +46,9 @@ void ConvOp::InferShape(framework::InferShapeContext* ctx) const { std::vector dilations = ctx->Attrs().Get>("dilations"); PADDLE_ENFORCE(in_dims.size() == 4 || in_dims.size() == 5, - "Conv intput should be 4-D or 5-D tensor."); + "Conv intput should be 4-D or 5-D tensor, get %u", + in_dims.size()); + PADDLE_ENFORCE_EQ( in_dims.size(), filter_dims.size(), "Conv input dimension and filter dimension should be the same."); @@ -74,8 +78,11 @@ void ConvOp::InferShape(framework::InferShapeContext* ctx) const { framework::OpKernelType ConvOp::GetExpectedKernelType( const framework::ExecutionContext& ctx) const { + int customized_type_value = + framework::OpKernelType::kDefaultCustomizedTypeValue; framework::LibraryType library{framework::LibraryType::kPlain}; // TODO(pzelazko-intel): enable MKLDNN layout when it's ready + auto input_data_type = ctx.Input("Input")->type(); std::string data_format = ctx.Attr("data_format"); framework::DataLayout layout = framework::StringToDataLayout(data_format); @@ -89,27 +96,46 @@ framework::OpKernelType ConvOp::GetExpectedKernelType( platform::CanMKLDNNBeUsed(ctx)) { library = framework::LibraryType::kMKLDNN; layout = framework::DataLayout::kMKLDNN; + customized_type_value = + (input_data_type == framework::DataTypeTrait::DataType || + input_data_type == framework::DataTypeTrait::DataType) + ? kConvMKLDNNINT8 + : kConvMKLDNNFP32; } #endif - auto input_data_type = - framework::ToDataType(ctx.Input("Input")->type()); - auto filter_data_type = - framework::ToDataType(ctx.Input("Filter")->type()); - PADDLE_ENFORCE_EQ(input_data_type, filter_data_type, - "input and filter data type should be consistent"); - + if (input_data_type != framework::proto::VarType::INT8 && + input_data_type != framework::proto::VarType::UINT8) { + auto filter_data_type = ctx.Input("Filter")->type(); + PADDLE_ENFORCE_EQ(input_data_type, filter_data_type, + "input and filter data type should be consistent"); + } if (input_data_type == framework::proto::VarType::FP16) { PADDLE_ENFORCE_EQ(library, framework::LibraryType::kCUDNN, "float16 can only be used when CUDNN is used"); } - return framework::OpKernelType(input_data_type, ctx.GetPlace(), layout, - library); + auto type = framework::OpKernelType(input_data_type, ctx.GetPlace(), layout, + library, customized_type_value); +#ifdef PADDLE_WITH_CUDA + std::vector& configs = kernel_configs_map_[type]; + // TODO(dangqingqing): Currently conv_fusion_op use cudnn but sets use_cudnn + // to false. It should be fixed and then here should only create if library + // is kCUDNN. + if (configs.empty()) { + std::shared_ptr> p( + new framework::AlgorithmsCache()); + configs.push_back(p); + } +#endif + return type; } void Conv2DOpMaker::Make() { - AddAttr("is_test", "").SetDefault(false); + AddAttr("is_test", + "(bool, default false) Set to true for inference only, false " + "for training. Some layers may run faster when this is true.") + .SetDefault(false); AddInput( "Input", "(Tensor) The input tensor of convolution operator. " @@ -128,10 +154,14 @@ void Conv2DOpMaker::Make() { "The format of output tensor is X (one-dimensional) of size equal" "to the number of output channels. Only used with MKL-DNN.") .AsDispensable(); + AddInput("ResidualData", + "(Tensor) Tensor with residual data " + "to which convolution output will be added." + "Used with fuse_residual_connection fusion.") + .AsDispensable(); AddOutput("Output", "(Tensor) The output tensor of convolution operator. " - "The format of output tensor is also NCHW.") - .Reuse("Input"); + "The format of output tensor is also NCHW."); AddAttr>("strides", "(vector default:{1, 1}), the " "strides(h_stride, w_stride) of " @@ -159,15 +189,44 @@ void Conv2DOpMaker::Make() { "use_cudnn", "(bool, default false) Only used in cudnn kernel, need install cudnn") .SetDefault(false); + AddAttr("fuse_relu_before_depthwise_conv", + "(bool, default false) Only used in cuda depthwise kernel") + .SetDefault(false); AddAttr("use_mkldnn", "(bool, default false) Only used in mkldnn kernel") .SetDefault(false); + AddAttr("use_quantizer", + "(bool, default false) " + "Set to true for operators that should be quantized and use " + "int8 kernel. " + "Only used on CPU.") + .SetDefault(false); AddAttr("fuse_relu", "(bool, default false) Only used in mkldnn kernel") .SetDefault(false); - AddAttr("fuse_eltwise", + AddAttr("fuse_residual_connection", "(bool, default false) Only used in mkldnn kernel. Used " - "whenever convolution output is connected via skip connection " - "to a previous layer.") + "whenever convolution output is as an input to residual " + "connection.") + .SetDefault(false); + AddAttr("Scale_in", + "Scale_in to be used for int8 input data." + "Only used with MKL-DNN INT8.") + .SetDefault(1.0f); + AddAttr("Scale_out", + "Scale_out to be used for int8 output data." + "Only used with MKL-DNN INT8.") + .SetDefault(1.0f); + AddAttr("Scale_in_eltwise", + "Scale_in_eltwise to be used for int8 eltwise input data." + "Only used with MKL-DNN INT8.") + .SetDefault(1.0f); + AddAttr>("Scale_weights", + "Scale_weights to be used for int8 weights data." + "Only used with MKL-DNN INT8.") + .SetDefault({1.0f}); + AddAttr("force_fp32_output", + "(bool, default false) Force INT8 kernel output FP32, only " + "used in MKL-DNN INT8") .SetDefault(false); AddAttr( "data_format", @@ -185,6 +244,11 @@ void Conv2DOpMaker::Make() { "workspace size can increase performance but also requires " "better hardware. This size should be chosen carefully.") .SetDefault(4096); + AddAttr("exhaustive_search", + "(bool, default false) cuDNN has many algorithm to calculation " + "convolution, whether enable exhaustive search " + "for cuDNN convolution or not, defalut is False.") + .SetDefault(false); AddComment(R"DOC( Convolution Operator. @@ -213,9 +277,14 @@ $$ W_{out}= \frac{(W_{in} + 2 * paddings[1] - (dilations[1] * (W_f - 1) + 1))}{strides[1]}+ 1 $$ )DOC"); + Apply(); } void Conv3DOpMaker::Make() { + AddAttr("is_test", + "(bool, default false) Set to true for inference only, false " + "for training. Some layers may run faster when this is true.") + .SetDefault(false); AddInput( "Input", "(Tensor) The input tensor of convolution operator. " @@ -231,10 +300,14 @@ void Conv3DOpMaker::Make() { "is the width of the filter." "If the groups attribute is greater than 1, C equals the number of " "input image channels divided by the groups."); + AddInput("ResidualData", + "(Tensor) Tensor with residual data " + "to which convolution output will be added." + "Used with fuse_residual_connection fusion.") + .AsDispensable(); AddOutput("Output", "(Tensor) The output tensor of convolution operator." - "The format of output tensor is also NCDHW.") - .Reuse("Input"); + "The format of output tensor is also NCDHW."); AddAttr>("strides", "(vector, default:{1, 1, 1}), the " "strides(d_stride, h_stride, w_stride) of " @@ -265,6 +338,13 @@ void Conv3DOpMaker::Make() { AddAttr("use_mkldnn", "(bool, default false) Only used in mkldnn kernel") .SetDefault(false); + AddAttr("fuse_relu", "(bool, default false) Only used in mkldnn kernel") + .SetDefault(false); + AddAttr("fuse_residual_connection", + "(bool, default false) Only used in mkldnn kernel. Used " + "whenever convolution output is as an input to residual " + "connection.") + .SetDefault(false); AddAttr( "data_format", "(string, default NCHW) Only used in " @@ -272,6 +352,9 @@ void Conv3DOpMaker::Make() { "Defaults to \"NHWC\". Specify the data format of the output data, " "the input will be transformed automatically. ") .SetDefault("AnyLayout"); + AddAttr("force_fp32_output", + "(bool, default false) Only used in mkldnn INT8 kernel") + .SetDefault(false); // TODO(dzhwinter): need to registered layout transform function AddAttr("workspace_size_MB", "Only used in cudnn kernel. workspace size for cudnn, in MB, " @@ -280,7 +363,11 @@ void Conv3DOpMaker::Make() { "workspace size can increase performance but also requires " "better hardware. This size should be chosen carefully.") .SetDefault(4096); - + AddAttr("exhaustive_search", + "(bool, default false) cuDNN has many algorithm to calculation " + "convolution, whether enable exhaustive search " + "for cuDNN convolution or not, defalut is False.") + .SetDefault(false); AddComment(R"DOC( Convolution3D Operator. @@ -310,6 +397,7 @@ Example: W_{out}= \frac{(W_{in} + 2 * paddings[2] - (dilations[2] * (W_f - 1) + 1))}{ strides[2]}+ 1 $$ )DOC"); + Apply(); } void ConvOpGrad::InferShape(framework::InferShapeContext* ctx) const { @@ -325,6 +413,8 @@ void ConvOpGrad::InferShape(framework::InferShapeContext* ctx) const { framework::OpKernelType ConvOpGrad::GetExpectedKernelType( const framework::ExecutionContext& ctx) const { + int customized_type_value = + framework::OpKernelType::kDefaultCustomizedTypeValue; framework::LibraryType library_{framework::LibraryType::kPlain}; // TODO(pzelazko-intel): enable MKLDNN layout when it's ready std::string data_format = ctx.Attr("data_format"); @@ -340,28 +430,91 @@ framework::OpKernelType ConvOpGrad::GetExpectedKernelType( platform::CanMKLDNNBeUsed(ctx)) { library_ = framework::LibraryType::kMKLDNN; layout_ = framework::DataLayout::kMKLDNN; + customized_type_value = kConvMKLDNNFP32; } #endif - return framework::OpKernelType( - framework::ToDataType(ctx.Input("Input")->type()), ctx.GetPlace(), - layout_, library_); + auto type = framework::OpKernelType(ctx.Input("Input")->type(), + ctx.GetPlace(), layout_, library_, + customized_type_value); +#ifdef PADDLE_WITH_CUDA + if (library_ == framework::LibraryType::kCUDNN) { + std::vector& configs = kernel_configs_map_[type]; + if (configs.empty()) { + std::shared_ptr> + p(new framework::AlgorithmsCache()); + configs.push_back(p); + + std::shared_ptr< + framework::AlgorithmsCache> + p2(new framework::AlgorithmsCache()); + configs.push_back(p2); + } + } +#endif + return type; } +class Conv2DGradMaker : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + std::unique_ptr Apply() const override { + auto* op = new framework::OpDesc(); + op->SetType(this->ForwardOpType() + "_grad"); + op->SetInput("Input", Input("Input")); + op->SetInput("Filter", Input("Filter")); + op->SetInput("Bias", Input("Bias")); + op->SetInput(framework::GradVarName("Output"), OutputGrad("Output")); + + op->SetOutput(framework::GradVarName("Input"), InputGrad("Input")); + op->SetOutput(framework::GradVarName("Filter"), InputGrad("Filter")); + op->SetOutput(framework::GradVarName("Bias"), InputGrad("Bias")); + op->SetAttrMap(Attrs()); + + return std::unique_ptr(op); + } +}; + +class Conv3DGradMaker : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + std::unique_ptr Apply() const override { + auto* op = new framework::OpDesc(); + op->SetType(this->ForwardOpType() + "_grad"); + op->SetInput("Input", Input("Input")); + op->SetInput("Filter", Input("Filter")); + op->SetInput(framework::GradVarName("Output"), OutputGrad("Output")); + + op->SetOutput(framework::GradVarName("Input"), InputGrad("Input")); + op->SetOutput(framework::GradVarName("Filter"), InputGrad("Filter")); + + if (ForwardOp().Inputs().count("ResidualData") != 0) { + op->SetInput("ResidualData", Input("ResidualData")); + } + + op->SetAttrMap(Attrs()); + + return std::unique_ptr(op); + } +}; + } // namespace operators } // namespace paddle namespace ops = paddle::operators; REGISTER_OPERATOR(conv2d, ops::ConvOp, ops::Conv2DOpMaker, - paddle::framework::DefaultGradOpDescMaker); + ops::ConvOpInferVarType, ops::Conv2DGradMaker); REGISTER_OPERATOR(conv2d_grad, ops::ConvOpGrad); // depthwise convolution op REGISTER_OPERATOR(depthwise_conv2d, ops::ConvOp, ops::Conv2DOpMaker, - paddle::framework::DefaultGradOpDescMaker); + ops::ConvOpInferVarType, ops::Conv2DGradMaker); REGISTER_OPERATOR(depthwise_conv2d_grad, ops::ConvOpGrad); + REGISTER_OPERATOR(conv3d, ops::ConvOp, ops::Conv3DOpMaker, - paddle::framework::DefaultGradOpDescMaker); + ops::ConvOpInferVarType, ops::Conv3DGradMaker); REGISTER_OPERATOR(conv3d_grad, ops::ConvOpGrad); // depthwise conv kernel diff --git a/paddle/fluid/operators/conv_op.h b/paddle/fluid/operators/conv_op.h index ef76106f17218a03d24ebc0eca43dbb0ae935093..797c665165975b4230c0edda460e3eebba8e400c 100644 --- a/paddle/fluid/operators/conv_op.h +++ b/paddle/fluid/operators/conv_op.h @@ -14,6 +14,7 @@ limitations under the License. */ #pragma once +#include #include #include "paddle/fluid/framework/eigen.h" #include "paddle/fluid/framework/op_registry.h" @@ -26,6 +27,9 @@ namespace paddle { namespace operators { using Tensor = framework::Tensor; +constexpr int kConvMKLDNNFP32 = 1; +constexpr int kConvMKLDNNINT8 = 2; +constexpr int MaxKeyLength = 256; // Base convolution operator definations for other conv // like operators to reuse the implementation. @@ -60,12 +64,27 @@ inline bool IsExpand(const std::vector& filter_dim, // operator implementations can reuse the code. class Conv2DOpMaker : public framework::OpProtoAndCheckerMaker { public: - void Make() override; + void Make() final; + + protected: + virtual void Apply() {} }; class Conv3DOpMaker : public framework::OpProtoAndCheckerMaker { public: - void Make() override; + void Make() final; + + protected: + virtual void Apply() {} +}; + +class ConvOpInferVarType : public framework::PassInDtypeAndVarTypeToOutput { + protected: + std::unordered_map GetInputOutputWithSameType() + const override { + return std::unordered_map{ + {"Input", /*->*/ "Output"}}; + } }; class ConvOp : public framework::OperatorWithKernel { @@ -105,6 +124,8 @@ class GemmConvKernel : public framework::OpKernel { std::vector paddings = context.Attr>("paddings"); std::vector dilations = context.Attr>("dilations"); + auto& dev_ctx = context.template device_context(); + const int batch_size = static_cast(input->dims()[0]); // filter_shape_vec: {k_o, k_i, k_h, k_w} or {k_o, k_i, k_d, k_h, k_w} @@ -137,13 +158,13 @@ class GemmConvKernel : public framework::OpKernel { // to call the matrix multiplication interface. Tensor col_matrix; if (is_expand) { - col.mutable_data(col_shape, context.GetPlace()); + col = context.AllocateTmpTensor(col_shape, dev_ctx); col_matrix.ShareDataWith(col); col_matrix.Resize(col_matrix_shape); } - framework::DDim input_shape = framework::slice_ddim( - input->dims(), 1, static_cast(input->dims().size())); + framework::DDim input_shape = + framework::slice_ddim(input->dims(), 1, input->dims().size()); framework::DDim filter_matrix_shape = {filter.dims()[0], filter.numel() / filter.dims()[0]}; @@ -160,7 +181,6 @@ class GemmConvKernel : public framework::OpKernel { math::Vol2ColFunctor vol2col; math::Im2ColFunctor im2col; - auto& dev_ctx = context.template device_context(); auto blas = math::GetBlas(dev_ctx); for (int i = 0; i < batch_size; i++) { Tensor in_batch = input->Slice(i, i + 1).Resize(input_shape); @@ -219,6 +239,8 @@ class GemmConvGradKernel : public framework::OpKernel { const int batch_size = static_cast(input->dims()[0]); + auto& dev_ctx = context.template device_context(); + // filter_shape_vec: {k_o, k_i, k_h, k_w} or {k_o, k_i, k_d, k_h, k_w} std::vector filter_shape_vec(framework::vectorize(filter.dims())); // output_shape_vec: {o_n, o_c, o_h, o_w} or {o_n, o_c, o_d, o_h, o_w} @@ -244,8 +266,8 @@ class GemmConvGradKernel : public framework::OpKernel { framework::DDim col_matrix_shape = framework::flatten_to_2d(col_shape, data_dim + 1); - framework::DDim input_shape = framework::slice_ddim( - input->dims(), 1, static_cast(input->dims().size())); + framework::DDim input_shape = + framework::slice_ddim(input->dims(), 1, input->dims().size()); framework::DDim filter_matrix_shape = {filter.dims()[0], filter.numel() / filter.dims()[0]}; @@ -268,13 +290,12 @@ class GemmConvGradKernel : public framework::OpKernel { // to call the matrix multiplication interface. Tensor col_matrix; if (is_expand) { - col.mutable_data(col_shape, context.GetPlace()); + col = context.AllocateTmpTensor(col_shape, dev_ctx); col_matrix.ShareDataWith(col); col_matrix.Resize(col_matrix_shape); } math::SetConstant set_zero; - auto& dev_ctx = context.template device_context(); auto blas = math::GetBlas(dev_ctx); if (input_grad) { @@ -376,12 +397,18 @@ class DepthwiseConvKernel : public framework::OpKernel { std::vector strides = context.Attr>("strides"); std::vector paddings = context.Attr>("paddings"); std::vector dilations = context.Attr>("dilations"); - - math::DepthwiseConvFunctor depthwiseConv; - + bool fuse_relu = context.Attr("fuse_relu_before_depthwise_conv"); auto& dev_ctx = context.template device_context(); - depthwiseConv(dev_ctx, *input, filter, strides, paddings, dilations, - output); + + if (fuse_relu) { + math::DepthwiseConvFunctor depthwiseConv; + depthwiseConv(dev_ctx, *input, filter, strides, paddings, dilations, + output); + } else { + math::DepthwiseConvFunctor depthwiseConv; + depthwiseConv(dev_ctx, *input, filter, strides, paddings, dilations, + output); + } } }; @@ -403,27 +430,42 @@ class DepthwiseConvGradKernel : public framework::OpKernel { std::vector strides = context.Attr>("strides"); std::vector paddings = context.Attr>("paddings"); std::vector dilations = context.Attr>("dilations"); + bool fuse_relu = context.Attr("fuse_relu_before_depthwise_conv"); math::SetConstant set_zero; auto& dev_ctx = context.template device_context(); - math::DepthwiseConvInputGradFunctor - depthwiseConvInputGrad; - math::DepthwiseConvFilterGradFunctor - depthwiseConvFilterGrad; - if (input_grad) { input_grad->mutable_data(context.GetPlace()); set_zero(dev_ctx, input_grad, static_cast(0)); - depthwiseConvInputGrad(dev_ctx, *input, filter, *output_grad, strides, - paddings, dilations, input_grad); + + if (fuse_relu) { + math::DepthwiseConvInputGradFunctor + depthwiseConvInputGrad; + depthwiseConvInputGrad(dev_ctx, *input, filter, *output_grad, strides, + paddings, dilations, input_grad); + } else { + math::DepthwiseConvInputGradFunctor + depthwiseConvInputGrad; + depthwiseConvInputGrad(dev_ctx, *input, filter, *output_grad, strides, + paddings, dilations, input_grad); + } } if (filter_grad) { filter_grad->mutable_data(context.GetPlace()); set_zero(dev_ctx, filter_grad, static_cast(0)); - depthwiseConvFilterGrad(dev_ctx, *input, *output_grad, strides, paddings, - dilations, filter_grad); + if (fuse_relu) { + math::DepthwiseConvFilterGradFunctor + depthwiseConvFilterGrad; + depthwiseConvFilterGrad(dev_ctx, *input, *output_grad, strides, + paddings, dilations, filter_grad); + } else { + math::DepthwiseConvFilterGradFunctor + depthwiseConvFilterGrad; + depthwiseConvFilterGrad(dev_ctx, *input, *output_grad, strides, + paddings, dilations, filter_grad); + } } } }; diff --git a/paddle/fluid/operators/conv_transpose_cudnn_op.cu.cc b/paddle/fluid/operators/conv_transpose_cudnn_op.cu.cc index 73831611d01b8c5b8d2d9f7f15634a0094e4a608..f44094ca6b7b7f23f2e7593ad79e4e2a6f0d3070 100644 --- a/paddle/fluid/operators/conv_transpose_cudnn_op.cu.cc +++ b/paddle/fluid/operators/conv_transpose_cudnn_op.cu.cc @@ -104,6 +104,7 @@ class CUDNNConvTransposeOpKernel : public framework::OpKernel { int output_offset = output->numel() / output->dims()[0] / groups; int filter_offset = filter->numel() / groups; T alpha = 1.0f, beta = 0.0f; + auto workspace_handle = dev_ctx.cudnn_workspace_handle(); for (int g = 0; g < groups; g++) { auto cudnn_func = [&](void* cudnn_workspace) { CUDNN_ENFORCE(platform::dynload::cudnnConvolutionBackwardData( @@ -112,7 +113,7 @@ class CUDNNConvTransposeOpKernel : public framework::OpKernel { algo, cudnn_workspace, workspace_size_in_bytes, &beta, cudnn_output_desc, output_data + output_offset * g)); }; - dev_ctx.RunCudnnFuncWithWorkspace(cudnn_func, workspace_size_in_bytes); + workspace_handle.RunFunc(cudnn_func, workspace_size_in_bytes); } } }; @@ -208,6 +209,7 @@ class CUDNNConvTransposeGradOpKernel : public framework::OpKernel { output_grad->numel() / output_grad->dims()[0] / groups; int filter_offset = filter->numel() / groups; T alpha = 1.0f, beta = 0.0f; + auto workspace_handle = dev_ctx.cudnn_workspace_handle(); if (input_grad) { T* input_grad_data = input_grad->mutable_data(ctx.GetPlace()); // Because beta is zero, it is unnecessary to reset input_grad. @@ -220,7 +222,7 @@ class CUDNNConvTransposeGradOpKernel : public framework::OpKernel { cudnn_workspace, workspace_size_in_bytes, &beta, cudnn_input_desc, input_grad_data + input_offset * g)); }; - dev_ctx.RunCudnnFuncWithWorkspace(cudnn_func, workspace_size_in_bytes); + workspace_handle.RunFunc(cudnn_func, workspace_size_in_bytes); } } @@ -238,7 +240,7 @@ class CUDNNConvTransposeGradOpKernel : public framework::OpKernel { cudnn_workspace, workspace_size_in_bytes, &beta, cudnn_filter_desc, filter_grad_data + filter_offset * g)); }; - dev_ctx.RunCudnnFuncWithWorkspace(cudnn_func, workspace_size_in_bytes); + workspace_handle.RunFunc(cudnn_func, workspace_size_in_bytes); } } } diff --git a/paddle/fluid/operators/conv_transpose_op.cc b/paddle/fluid/operators/conv_transpose_op.cc index a916dd3496ffaffa138529a8a2f7e20ef26fcc96..baa39c0f9926efc233f9a228e055e2eb2116dbcc 100644 --- a/paddle/fluid/operators/conv_transpose_op.cc +++ b/paddle/fluid/operators/conv_transpose_op.cc @@ -13,9 +13,14 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/operators/conv_transpose_op.h" +#include #include #include +#ifdef PADDLE_WITH_MKLDNN +#include "paddle/fluid/platform/mkldnn_helper.h" +#endif + namespace paddle { namespace operators { @@ -78,29 +83,37 @@ void ConvTransposeOp::InferShape(framework::InferShapeContext* ctx) const { framework::OpKernelType ConvTransposeOp::GetExpectedKernelType( const framework::ExecutionContext& ctx) const { + framework::LibraryType library_{framework::LibraryType::kPlain}; + std::string data_format = ctx.Attr("data_format"); + framework::DataLayout layout_ = framework::StringToDataLayout(data_format); bool use_cudnn = ctx.Attr("use_cudnn"); use_cudnn &= platform::is_gpu_place(ctx.GetPlace()); #ifdef PADDLE_WITH_CUDA if (platform::is_gpu_place(ctx.GetPlace())) { auto& dev_ctx = ctx.template device_context(); use_cudnn &= dev_ctx.cudnn_handle() != nullptr; + if (use_cudnn) { + library_ = framework::LibraryType::kCUDNN; + } } #endif - framework::LibraryType library_; - if (use_cudnn) { - library_ = framework::LibraryType::kCUDNN; - } else { - library_ = framework::LibraryType::kPlain; +#ifdef PADDLE_WITH_MKLDNN + if (library_ == framework::LibraryType::kPlain && + platform::CanMKLDNNBeUsed(ctx)) { + library_ = framework::LibraryType::kMKLDNN; + layout_ = framework::DataLayout::kMKLDNN; } +#endif - std::string data_format = ctx.Attr("data_format"); - framework::DataLayout layout_ = framework::StringToDataLayout(data_format); - return framework::OpKernelType( - framework::ToDataType(ctx.Input("Input")->type()), ctx.GetPlace(), - layout_, library_); + return framework::OpKernelType(ctx.Input("Input")->type(), + ctx.GetPlace(), layout_, library_); } void Conv2DTransposeOpMaker::Make() { + AddAttr("is_test", + "(bool, default false) Set to true for inference only, false " + "for training. Some layers may run faster when this is true.") + .SetDefault(false); AddInput( "Input", "(Tensor) The input tensor of convolution transpose operator. " @@ -115,6 +128,12 @@ void Conv2DTransposeOpMaker::Make() { "output feature channels," "H is the height of the filter, and W is the width of the filter. " "We enforce groups number == 1 in the convolution transpose scenario."); + AddInput("Bias", + "(Tensor) Bias to be added to each output of filter application." + "The format of output tensor is X (one-dimensional) of size equal" + "to the number of output channels. Only used with MKL-DNN.") + .AsDispensable(); + AddOutput("Output", "(Tensor) The output tensor of convolution transpose operator. " "The format of output tensor is also NCHW."); @@ -145,6 +164,11 @@ void Conv2DTransposeOpMaker::Make() { "use_cudnn", "(bool, default false) Only used in cudnn kernel, need install cudnn") .SetDefault(false); + AddAttr("use_mkldnn", + "(bool, default false) Only used in mkldnn kernel") + .SetDefault(false); + AddAttr("fuse_relu", "(bool, default false) Only used in mkldnn kernel") + .SetDefault(false); AddAttr( "data_format", "(string, default NCHW) Only used in " @@ -238,6 +262,9 @@ void Conv3DTransposeOpMaker::Make() { "use_cudnn", "(bool, default false) Only used in cudnn kernel, need install cudnn") .SetDefault(false); + AddAttr("use_mkldnn", + "(bool, default false) Only used in mkldnn kernel") + .SetDefault(false); AddAttr( "data_format", "(string, default NCHW) Only used in " @@ -314,11 +341,32 @@ framework::OpKernelType ConvTransposeOpGrad::GetExpectedKernelType( std::string data_format = ctx.Attr("data_format"); framework::DataLayout layout_ = framework::StringToDataLayout(data_format); - return framework::OpKernelType( - framework::ToDataType(ctx.Input("Input")->type()), ctx.GetPlace(), - layout_, library_); + return framework::OpKernelType(ctx.Input("Input")->type(), + ctx.GetPlace(), layout_, library_); } +class ConvTransposeGradOpDescMaker : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + protected: + std::unique_ptr Apply() const override { + std::unique_ptr op(new framework::OpDesc()); + op->SetType(ForwardOp().Type() + "_grad"); + op->SetInput("Input", Input("Input")); + op->SetInput("Filter", Input("Filter")); + op->SetOutput(framework::GradVarName("Input"), InputGrad("Input")); + op->SetOutput(framework::GradVarName("Filter"), InputGrad("Filter")); + if (ForwardOp().Inputs().count("Bias") > 0) { + op->SetInput("Bias", Input("Bias")); + op->SetOutput(framework::GradVarName("Bias"), InputGrad("Bias")); + } + op->SetInput(framework::GradVarName("Output"), OutputGrad("Output")); + op->SetAttrMap(Attrs()); + return op; + } +}; + } // namespace operators } // namespace paddle @@ -327,7 +375,7 @@ namespace ops = paddle::operators; // conv2d_transpose REGISTER_OPERATOR(conv2d_transpose, ops::ConvTransposeOp, ops::Conv2DTransposeOpMaker, - paddle::framework::DefaultGradOpDescMaker); + ops::ConvTransposeGradOpDescMaker); REGISTER_OPERATOR(conv2d_transpose_grad, ops::ConvTransposeOpGrad); REGISTER_OP_CPU_KERNEL( @@ -343,7 +391,7 @@ REGISTER_OP_CPU_KERNEL( // conv3d_transpose REGISTER_OPERATOR(conv3d_transpose, ops::ConvTransposeOp, ops::Conv3DTransposeOpMaker, - paddle::framework::DefaultGradOpDescMaker); + ops::ConvTransposeGradOpDescMaker); REGISTER_OPERATOR(conv3d_transpose_grad, ops::ConvTransposeOpGrad); REGISTER_OP_CPU_KERNEL( @@ -359,7 +407,7 @@ REGISTER_OP_CPU_KERNEL( // depthwise conv2d_transpose REGISTER_OPERATOR(depthwise_conv2d_transpose, ops::ConvTransposeOp, ops::Conv2DTransposeOpMaker, - paddle::framework::DefaultGradOpDescMaker); + ops::ConvTransposeGradOpDescMaker); REGISTER_OPERATOR(depthwise_conv2d_transpose_grad, ops::ConvTransposeOpGrad); REGISTER_OP_CPU_KERNEL( diff --git a/paddle/fluid/operators/cos_sim_op.cc b/paddle/fluid/operators/cos_sim_op.cc index 8f3644039f9950a8a70e2fd66c20837a5f52bd7f..30ec74d8442d2f42510220b825988b340f79d0a2 100644 --- a/paddle/fluid/operators/cos_sim_op.cc +++ b/paddle/fluid/operators/cos_sim_op.cc @@ -74,6 +74,9 @@ class CosSimOpMaker : public framework::OpProtoAndCheckerMaker { "Norm of the second input, reduced along the 1st " "dimension.") .AsIntermediate(); + AddAttr(framework::kAllKernelsMustComputeRuntimeShape, + "Skip calling InferShape() function in the runtime.") + .SetDefault(true); AddComment(R"DOC( **Cosine Similarity Operator** diff --git a/paddle/fluid/operators/cos_sim_op.cu b/paddle/fluid/operators/cos_sim_op.cu index 82205e9c75402e368a2d1e161d471e35ff7356ea..3d144ca29d9989ad2cbb438a950860eaac873d07 100644 --- a/paddle/fluid/operators/cos_sim_op.cu +++ b/paddle/fluid/operators/cos_sim_op.cu @@ -11,8 +11,6 @@ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ - -#define EIGEN_USE_GPU #include "paddle/fluid/operators/cos_sim_op.h" namespace ops = paddle::operators; diff --git a/paddle/fluid/operators/cos_sim_op.h b/paddle/fluid/operators/cos_sim_op.h index 76cfc680518a3caaa68abc48cedf82ce7d21c8b8..0b4e3f774674112ddc268ba911e1df317d5edcca 100644 --- a/paddle/fluid/operators/cos_sim_op.h +++ b/paddle/fluid/operators/cos_sim_op.h @@ -28,17 +28,21 @@ class CosSimKernel : public framework::OpKernel { public: void Compute(const framework::ExecutionContext& context) const override { // get Tensor - auto* in_x = context.Input("X"); + auto* in_x = context.Input("X"); auto* in_y = context.Input("Y"); - auto* out_z = context.Output("Out"); + auto* out_z = context.Output("Out"); auto* out_x_norm = context.Output("XNorm"); auto* out_y_norm = context.Output("YNorm"); - out_z->mutable_data(context.GetPlace()); - out_x_norm->mutable_data(context.GetPlace()); - out_y_norm->mutable_data(context.GetPlace()); int rows_x = in_x->dims()[0]; int rows_y = in_y->dims()[0]; + out_z->Resize({rows_x, 1}); + out_x_norm->Resize({rows_x, 1}); + out_y_norm->Resize({rows_y, 1}); + out_z->mutable_data(context.GetPlace()); + out_x_norm->mutable_data(context.GetPlace()); + out_y_norm->mutable_data(context.GetPlace()); + out_z->set_lod(in_x->lod()); int cols = framework::product(in_x->dims()) / rows_x; @@ -81,6 +85,7 @@ class CosSimGradKernel : public framework::OpKernel { if (rows_x == rows_y) { if (out_grad_x) { + out_grad_x->Resize(in_x->dims()); math::CosSimGradFunctor functor( in_x_norm->data(), in_y_norm->data(), in_x->data(), in_y->data(), in_z->data(), in_grad_z->data(), @@ -91,6 +96,7 @@ class CosSimGradKernel : public framework::OpKernel { for_range(functor); } if (out_grad_y) { + out_grad_y->Resize(in_y->dims()); math::CosSimGradFunctor functor( in_y_norm->data(), in_x_norm->data(), in_y->data(), in_x->data(), in_z->data(), in_grad_z->data(), @@ -102,6 +108,7 @@ class CosSimGradKernel : public framework::OpKernel { } } else { if (out_grad_x) { + out_grad_x->Resize(in_x->dims()); math::CosSimDxFunctor functor( in_x_norm->data(), in_y_norm->data(), in_x->data(), in_y->data(), in_z->data(), in_grad_z->data(), @@ -112,6 +119,7 @@ class CosSimGradKernel : public framework::OpKernel { for_range(functor); } if (out_grad_y) { + out_grad_y->Resize(in_y->dims()); out_grad_y->mutable_data(context.GetPlace()); math::SetConstant set_zero; auto& dev_ctx = context.template device_context(); diff --git a/paddle/fluid/operators/crf_decoding_op.cc b/paddle/fluid/operators/crf_decoding_op.cc index c27befe1143baa68add4b56f3572eab75272c3a5..e053ae57739d3d96209e9ca180cc041f8b55396e 100644 --- a/paddle/fluid/operators/crf_decoding_op.cc +++ b/paddle/fluid/operators/crf_decoding_op.cc @@ -84,12 +84,12 @@ class CRFDecodingOp : public framework::OperatorWithKernel { "Output(ViterbiPath) should be not null."); auto emission_dims = ctx->GetInputDim("Emission"); - PADDLE_ENFORCE_EQ(emission_dims.size(), 2UL, + PADDLE_ENFORCE_EQ(emission_dims.size(), 2, "The Input(Emission) should be a 2-D tensor."); PADDLE_ENFORCE(emission_dims[0], "An empty mini-batch is not allowed."); auto transition_dims = ctx->GetInputDim("Transition"); - PADDLE_ENFORCE_EQ(transition_dims.size(), 2UL, + PADDLE_ENFORCE_EQ(transition_dims.size(), 2, "The Input(Transition) should be a 2-D tensor."); PADDLE_ENFORCE_EQ( transition_dims[0] - 2, transition_dims[1], @@ -118,9 +118,8 @@ class CRFDecodingOp : public framework::OperatorWithKernel { protected: framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext& ctx) const override { - return framework::OpKernelType( - framework::ToDataType(ctx.Input("Emission")->type()), - platform::CPUPlace()); + return framework::OpKernelType(ctx.Input("Emission")->type(), + platform::CPUPlace()); } }; } // namespace operators diff --git a/paddle/fluid/operators/crf_decoding_op.h b/paddle/fluid/operators/crf_decoding_op.h index 8181897c3d3844bda5574e85a08b2af038fcd664..d6b54038ec5648c72d606a6c7b9c8356cb74521b 100644 --- a/paddle/fluid/operators/crf_decoding_op.h +++ b/paddle/fluid/operators/crf_decoding_op.h @@ -16,6 +16,7 @@ limitations under the License. */ #include #include "paddle/fluid/framework/eigen.h" #include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/jit/kernels.h" #include "paddle/fluid/operators/math/math_function.h" namespace paddle { @@ -69,9 +70,6 @@ class CRFDecodingOpKernel : public framework::OpKernel { auto emission_dims = emission_weights.dims(); const size_t seq_len = emission_dims[0]; const size_t tag_num = emission_dims[1]; - - const size_t state_trans_base_idx = 2; - const T* x = emission_weights.data(); const T* w = transition_weights.data(); int64_t* path = decoded_path->data(); @@ -84,221 +82,10 @@ class CRFDecodingOpKernel : public framework::OpKernel { Tensor track; int* track_value = track.mutable_data(emission_dims, platform::CPUPlace()); - -#ifdef __AVX__ -// It use the AVX or AVX512 instruction to deal the data as the vector of 8 or -// 16 elements per iteration. Then it can implement the parallel processing. -// Only optimize for float type. -#ifdef __AVX512F__ - size_t step_size = 16; -#else - size_t step_size = 8; -#endif - if (std::is_same::value && (tag_num >= step_size)) { - size_t steps = tag_num / step_size; - size_t remain = tag_num % step_size; - int last_offset = static_cast(remain) - static_cast(step_size); - - // Setup the alpha initial value. - size_t i_offset = 0; - for (size_t i = 0; i <= steps; ++i) { -#ifdef __AVX512F__ - // Declare the variable for the content of weights, input and alpha - // values. - __m512 w_content, x_content, alpha_content; - - // Load the relevant data into the variables from un-aligned address. - w_content = _mm512_loadu_ps((const float*)(w + i_offset)); - x_content = _mm512_loadu_ps((const float*)(x + i_offset)); - alpha_content = _mm512_add_ps(w_content, x_content); - - // Save the alpha value. - _mm512_storeu_ps(reinterpret_cast(alpha_value + i_offset), - alpha_content); -#else - // Declare the variable for the content of weights, input and alpha - // values. - __m256 w_content, x_content, alpha_content; - - // Load the relevant data into the variables from un-aligned address. - w_content = _mm256_loadu_ps((const float*)(w + i_offset)); - x_content = _mm256_loadu_ps((const float*)(x + i_offset)); - alpha_content = _mm256_add_ps(w_content, x_content); - - // Save the alpha value. - _mm256_storeu_ps(reinterpret_cast(alpha_value + i_offset), - alpha_content); -#endif - i_offset += step_size; - if (i == steps - 1) { - if (remain > 0) { - i_offset += last_offset; - } else { - break; - } - } - } - - // Use the column-major strategy to get the location of maximum score. - size_t seq_offset = 0; - for (size_t k = 1; k < seq_len; ++k) { - size_t j_offset = 0; - for (size_t j = 0; j <= steps; ++j) { -#ifdef __AVX512F__ - // Initialize the variables of maximum score and location. - __m512 max_score = _mm512_set1_ps(-std::numeric_limits::max()); - __m512i max_j = _mm512_setzero_si512(); -#else - // Initialize the variables of maximum score and location. - __m256 max_score = _mm256_set1_ps(-std::numeric_limits::max()); - __m256i max_j = _mm256_set1_epi32(0); -#endif - // Calculate the offset of transition_weights. - size_t trans_offset = state_trans_base_idx * tag_num + j_offset; - for (size_t i = 0; i < tag_num; ++i) { -#ifdef __AVX512F__ - // Initalize the content of alpha variable with related offset. - __m512 alpha_content = - _mm512_set1_ps(*(const float*)(alpha_value + seq_offset + i)); - // Obtain the content of weights from un-aligned address. - __m512 w_content = - _mm512_loadu_ps((const float*)(w + trans_offset)); - - __m512 score_v = _mm512_add_ps(alpha_content, w_content); - - __mmask16 mask = _mm512_cmp_ps_mask(score_v, max_score, _CMP_GT_OS); - - // According to the mask value, it update the index of the max_score - // location. - max_j = _mm512_mask_set1_epi32(max_j, mask, i); - - // Update the max_score value. - max_score = _mm512_max_ps(max_score, score_v); -#else - // Initalize the content of alpha variable with related offset. - __m256 alpha_content = _mm256_broadcast_ss( - (const float*)(alpha_value + seq_offset + i)); - // Obtain the content of weights from un-aligned address. - __m256 w_content = - _mm256_loadu_ps((const float*)(w + trans_offset)); - __m256 score_v = _mm256_add_ps(alpha_content, w_content); - - __m256 mask = _mm256_cmp_ps(score_v, max_score, _CMP_GT_OS); - -#ifdef __AVX2__ - // According to the mask value, it update the index of the max_score - // location. - max_j = _mm256_or_si256( - _mm256_andnot_si256((__m256i)mask, max_j), - _mm256_and_si256((__m256i)mask, _mm256_set1_epi32(i))); -#else - __m128i lo_max_j = _mm256_extractf128_si256(max_j, 0); - __m128i hi_max_j = _mm256_extractf128_si256(max_j, 1); - __m128i lo_mask = _mm256_extractf128_si256((__m256i)mask, 0); - __m128i hi_mask = _mm256_extractf128_si256((__m256i)mask, 1); - - lo_max_j = _mm_andnot_si128(lo_mask, lo_max_j); - hi_max_j = _mm_andnot_si128(hi_mask, hi_max_j); - lo_mask = _mm_and_si128(lo_mask, _mm_set1_epi32(i)); - hi_mask = _mm_and_si128(hi_mask, _mm_set1_epi32(i)); - - lo_max_j = _mm_or_si128(lo_mask, lo_max_j); - hi_max_j = _mm_or_si128(hi_mask, hi_max_j); - - // According to the mask value, it update the index of the max_score - // location. - max_j = _mm256_insertf128_si256(max_j, lo_max_j, 0); - max_j = _mm256_insertf128_si256(max_j, hi_max_j, 1); -#endif - - // Update the max_score value. - max_score = _mm256_max_ps(max_score, score_v); -#endif - trans_offset += tag_num; - } - -#ifdef __AVX512F__ - // Update the alpha and track values. - __m512 x_content = _mm512_loadu_ps( - (const float*)(x + seq_offset + tag_num + j_offset)); - max_score = _mm512_add_ps(max_score, x_content); - _mm512_storeu_ps(reinterpret_cast(alpha_value + seq_offset + - tag_num + j_offset), - max_score); - _mm512_storeu_si512( - reinterpret_cast<__m512i*>(track_value + seq_offset + tag_num + - j_offset), - max_j); -#else - // Update the alpha and track values. - __m256 x_content = _mm256_loadu_ps( - (const float*)(x + seq_offset + tag_num + j_offset)); - max_score = _mm256_add_ps(max_score, x_content); - _mm256_storeu_ps(reinterpret_cast(alpha_value + seq_offset + - tag_num + j_offset), - max_score); - _mm256_storeu_si256( - reinterpret_cast<__m256i*>(track_value + seq_offset + tag_num + - j_offset), - max_j); -#endif - - // Calculate the offset of next step - j_offset += step_size; - if (j == steps - 1) { - if (remain > 0) { - j_offset += last_offset; - } else { - break; - } - } - } - - seq_offset += tag_num; - } - } else { - for (size_t i = 0; i < tag_num; ++i) alpha_value[i] = w[i] + x[i]; - - for (size_t k = 1; k < seq_len; ++k) { - for (size_t i = 0; i < tag_num; ++i) { - T max_score = -std::numeric_limits::max(); - int max_j = 0; - for (size_t j = 0; j < tag_num; ++j) { - T score = alpha_value[(k - 1) * tag_num + j] + - w[(j + state_trans_base_idx) * tag_num + i]; - if (score > max_score) { - max_score = score; - max_j = j; - } - } - - alpha_value[k * tag_num + i] = max_score + x[k * tag_num + i]; - track_value[k * tag_num + i] = max_j; - } - } - } -#else - for (size_t i = 0; i < tag_num; ++i) alpha_value[i] = w[i] + x[i]; - - for (size_t k = 1; k < seq_len; ++k) { - for (size_t i = 0; i < tag_num; ++i) { - T max_score = -std::numeric_limits::max(); - int max_j = 0; - for (size_t j = 0; j < tag_num; ++j) { - T score = alpha_value[(k - 1) * tag_num + j] + - w[(j + state_trans_base_idx) * tag_num + i]; - if (score > max_score) { - max_score = score; - max_j = j; - } - } - - alpha_value[k * tag_num + i] = max_score + x[k * tag_num + i]; - track_value[k * tag_num + i] = max_j; - } - } - -#endif + auto ker = + jit::KernelFuncs, platform::CPUPlace>::Cache() + .At(tag_num); + ker(static_cast(seq_len), x, w, alpha_value, track_value, tag_num); T max_score = -std::numeric_limits::max(); int max_i = 0; for (size_t i = 0; i < tag_num; ++i) { diff --git a/paddle/fluid/operators/crop_op.cc b/paddle/fluid/operators/crop_op.cc index a2a871efa850df5101be7c27ebd81456acace7e1..78fcd07e1df8d590ad2a4508bbc82477d928c6e9 100644 --- a/paddle/fluid/operators/crop_op.cc +++ b/paddle/fluid/operators/crop_op.cc @@ -13,7 +13,9 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/operators/crop_op.h" -#include +#include +#include +#include namespace paddle { namespace operators { @@ -51,9 +53,8 @@ class CropOp : public framework::OperatorWithKernel { framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext& ctx) const override { - return framework::OpKernelType( - framework::ToDataType(ctx.Input("X")->type()), - ctx.device_context()); + return framework::OpKernelType(ctx.Input("X")->type(), + ctx.device_context()); } }; @@ -174,19 +175,36 @@ class CropOpGrad : public framework::OperatorWithKernel { framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext& ctx) const override { return framework::OpKernelType( - framework::ToDataType( - ctx.Input(framework::GradVarName("Out")) - ->type()), + ctx.Input(framework::GradVarName("Out"))->type(), ctx.device_context()); } }; +class CropGradOpDescMaker : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + protected: + std::unique_ptr Apply() const override { + std::unique_ptr op(new framework::OpDesc()); + op->SetType("crop_grad"); + op->SetInput(framework::GradVarName("Out"), OutputGrad("Out")); + op->SetInput("X", Input("X")); + if (ForwardOp().Inputs().count("Offsets") > 0) { + op->SetInput("Offsets", Input("Offsets")); + } + op->SetOutput(framework::GradVarName("X"), InputGrad("X")); + op->SetAttrMap(Attrs()); + return op; + } +}; + } // namespace operators } // namespace paddle namespace ops = paddle::operators; REGISTER_OPERATOR(crop, ops::CropOp, ops::CropOpMaker, - paddle::framework::DefaultGradOpDescMaker); + ops::CropGradOpDescMaker); REGISTER_OPERATOR(crop_grad, ops::CropOpGrad); REGISTER_OP_CPU_KERNEL( crop, ops::CropKernel); diff --git a/paddle/fluid/operators/crop_op.cu b/paddle/fluid/operators/crop_op.cu index b75678217e36aa2297c68a7f8e2a9dfafadaca72..66cb5c452de4b2107693127ce414daf9fb7cd7d8 100644 --- a/paddle/fluid/operators/crop_op.cu +++ b/paddle/fluid/operators/crop_op.cu @@ -11,8 +11,6 @@ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ - -#define EIGEN_USE_GPU #include "paddle/fluid/operators/crop_op.h" namespace ops = paddle::operators; diff --git a/paddle/fluid/operators/crop_op.h b/paddle/fluid/operators/crop_op.h index 2d7d33bd4f9b42b644444912570375bad92ba6c2..cfc2cac7beb8a13526cefc94c127ffc2aea533df 100644 --- a/paddle/fluid/operators/crop_op.h +++ b/paddle/fluid/operators/crop_op.h @@ -68,7 +68,6 @@ void CropFunction(const framework::ExecutionContext& context) { } out->mutable_data(out_dims, context.GetPlace()); auto x_stride = framework::stride(x->dims()); - auto out_stride = framework::stride(out->dims()); auto offsets = GetOffsets(context); int64_t offset = 0; for (size_t i = 0; i < offsets.size(); ++i) { diff --git a/paddle/fluid/operators/cross_entropy_op.cc b/paddle/fluid/operators/cross_entropy_op.cc index 66f19fe7ecfa51b2ce917f0c5fcb6d486f1a7307..ad32de53e7019b438b7106ddd031a8f00bd79b5d 100644 --- a/paddle/fluid/operators/cross_entropy_op.cc +++ b/paddle/fluid/operators/cross_entropy_op.cc @@ -13,17 +13,21 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/operators/cross_entropy_op.h" +#include +#include +#include namespace paddle { namespace operators { -class CrossEntropyOp : public framework::OperatorWithKernel { +class CrossEntropyOpBase : public framework::OperatorWithKernel { public: using framework::OperatorWithKernel::OperatorWithKernel; void InferShape(framework::InferShapeContext* ctx) const override { PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should be not null."); PADDLE_ENFORCE(ctx->HasInput("Label"), "Input(Label) should be not null."); + PADDLE_ENFORCE(ctx->HasOutput("Y"), "Output(Y) should be not null."); auto x_dims = ctx->GetInputDim("X"); @@ -31,14 +35,24 @@ class CrossEntropyOp : public framework::OperatorWithKernel { int rank = x_dims.size(); PADDLE_ENFORCE_EQ(rank, label_dims.size(), "Input(X) and Input(Label) shall have the same rank."); - PADDLE_ENFORCE_EQ(framework::slice_ddim(x_dims, 0, rank - 1), - framework::slice_ddim(label_dims, 0, rank - 1), - "Input(X) and Input(Label) shall have the same shape " - "except the last dimension."); - if (ctx->Attrs().Get("soft_label")) { - PADDLE_ENFORCE_EQ(x_dims[rank - 1], label_dims[rank - 1], - "If Attr(soft_label) == true, the last dimension of " - "Input(X) and Input(Label) should be equal."); + bool check = true; + if ((!ctx->IsRuntime()) && (framework::product(x_dims) <= 0 || + framework::product(label_dims) <= 0)) { + check = false; + } + if (check) { + PADDLE_ENFORCE_EQ(framework::slice_ddim(x_dims, 0, rank - 1), + framework::slice_ddim(label_dims, 0, rank - 1), + "Input(X) and Input(Label) shall have the same shape " + "except the last dimension."); + } + + if (IsSoftLabel(ctx)) { + if (check) { + PADDLE_ENFORCE_EQ(x_dims[rank - 1], label_dims[rank - 1], + "If Attr(soft_label) == true, the last dimension of " + "Input(X) and Input(Label) should be equal."); + } } else { PADDLE_ENFORCE_EQ(label_dims[rank - 1], 1UL, "If Attr(softLabel) == false, the last dimension of " @@ -56,25 +70,27 @@ class CrossEntropyOp : public framework::OperatorWithKernel { // is determined by its input "X". framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext& ctx) const override { - return framework::OpKernelType( - framework::ToDataType(ctx.Input("X")->type()), - ctx.device_context()); + return framework::OpKernelType(ctx.Input("X")->type(), + ctx.device_context()); + } + + virtual bool IsSoftLabel(framework::InferShapeContext* ctx) const { + return ctx->Attrs().Get("soft_label"); } }; -class CrossEntropyGradientOp : public framework::OperatorWithKernel { +class CrossEntropyGradientOpBase : public framework::OperatorWithKernel { public: using framework::OperatorWithKernel::OperatorWithKernel; - void InferShape(framework::InferShapeContext* ctx) const override { - PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should be not null."); + void InferShape(framework::InferShapeContext* ctx) const { PADDLE_ENFORCE(ctx->HasInput("Label"), "Input(Label) should be not null."); PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Y")), "Input(Y@GRAD) shoudl be not null."); PADDLE_ENFORCE(ctx->HasOutput(framework::GradVarName("X")), "Output(X@GRAD) should be not null."); - auto x_dims = ctx->GetInputDim("X"); + auto x_dims = GetXDim(ctx); auto label_dims = ctx->GetInputDim("Label"); auto dy_dims = ctx->GetInputDim(framework::GradVarName("Y")); int rank = x_dims.size(); @@ -82,27 +98,40 @@ class CrossEntropyGradientOp : public framework::OperatorWithKernel { "Input(Y@Grad) and Input(X) should have the same rank."); PADDLE_ENFORCE_EQ(label_dims.size(), rank, "Input(Label) and Input(X) should have the same rank."); - PADDLE_ENFORCE_EQ(framework::slice_ddim(x_dims, 0, rank - 1), - framework::slice_ddim(label_dims, 0, rank - 1), - "The Input(X) and Input(Label) should have the same " - "shape except the last dimension."); - PADDLE_ENFORCE_EQ(framework::slice_ddim(x_dims, 0, rank - 1), - framework::slice_ddim(dy_dims, 0, rank - 1), - "The Input(X) and Input(Y@Grad) should have the same " - "shape except the last dimension."); - PADDLE_ENFORCE_EQ(dy_dims[rank - 1], 1, - "The last dimension of Input(Y@Grad) should be 1."); - if (ctx->Attrs().Get("soft_label")) { - PADDLE_ENFORCE_EQ(x_dims[rank - 1], label_dims[rank - 1], - "When Attr(soft_label) == true, the last dimension of " - "Input(X) and Input(Label) should be equal."); + + bool check = true; + if ((!ctx->IsRuntime()) && (framework::product(x_dims) <= 0 || + framework::product(label_dims) <= 0)) { + check = false; + } + + if (check) { + PADDLE_ENFORCE_EQ(framework::slice_ddim(x_dims, 0, rank - 1), + framework::slice_ddim(label_dims, 0, rank - 1), + "The Input(X) and Input(Label) should have the same " + "shape except the last dimension."); + PADDLE_ENFORCE_EQ(framework::slice_ddim(x_dims, 0, rank - 1), + framework::slice_ddim(dy_dims, 0, rank - 1), + "The Input(X) and Input(Y@Grad) should have the same " + "shape except the last dimension."); + } + if (IsSoftLabel(ctx)) { + if (check) { + PADDLE_ENFORCE_EQ( + x_dims[rank - 1], label_dims[rank - 1], + "When Attr(soft_label) == true, the last dimension of " + "Input(X) and Input(Label) should be equal."); + } } else { PADDLE_ENFORCE_EQ(label_dims[rank - 1], 1, "When Attr(soft_label) == false, the last dimension of " "Input(Label) should be 1."); } ctx->SetOutputDim(framework::GradVarName("X"), x_dims); - ctx->ShareLoD("X", framework::GradVarName("X")); + PADDLE_ENFORCE_EQ(dy_dims[rank - 1], 1, + "The last dimension of Input(Y@Grad) should be 1."); + ctx->SetOutputDim(framework::GradVarName("X"), x_dims); + ctx->ShareLoD(VarNameWithXLoD(), framework::GradVarName("X")); } protected: @@ -111,9 +140,28 @@ class CrossEntropyGradientOp : public framework::OperatorWithKernel { framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext& ctx) const override { return framework::OpKernelType( - framework::ToDataType(ctx.Input("X")->type()), + ctx.Input(framework::GradVarName("Y"))->type(), ctx.device_context()); } + + virtual framework::DDim GetXDim(framework::InferShapeContext* ctx) const { + return ctx->GetInputDim("X"); + } + + virtual const char* VarNameWithXLoD() const { return "X"; } + + virtual bool IsSoftLabel(framework::InferShapeContext* ctx) const { + return ctx->Attrs().Get("soft_label"); + } +}; + +class CrossEntropyOpInferVarType + : public framework::PassInDtypeAndVarTypeToOutput { + protected: + std::unordered_map GetInputOutputWithSameType() + const override { + return std::unordered_map{{"X", /*->*/ "Y"}}; + } }; class CrossEntropyOpMaker : public framework::OpProtoAndCheckerMaker { @@ -179,17 +227,165 @@ or not. But the output only shares the LoD information with input X. )DOC"); } }; + +class CrossEntropyGradientOp : public CrossEntropyGradientOpBase { + public: + using CrossEntropyGradientOpBase::CrossEntropyGradientOpBase; + + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should be not null."); + CrossEntropyGradientOpBase::InferShape(ctx); + } +}; + +class CrossEntropyGradOpDescMaker : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + protected: + std::unique_ptr Apply() const override { + std::unique_ptr op(new framework::OpDesc()); + op->SetType("cross_entropy_grad"); + op->SetInput("X", Input("X")); + op->SetInput("Label", Input("Label")); + op->SetInput(framework::GradVarName("Y"), OutputGrad("Y")); + op->SetOutput(framework::GradVarName("X"), InputGrad("X")); + op->SetAttrMap(Attrs()); + return op; + } +}; + +class CrossEntropyOp2 : public CrossEntropyOpBase { + public: + using CrossEntropyOpBase::CrossEntropyOpBase; + + void InferShape(framework::InferShapeContext* ctx) const override { + CrossEntropyOpBase::InferShape(ctx); + + PADDLE_ENFORCE(ctx->HasOutput("XShape"), + "Output(XShape) should be not null."); + + PADDLE_ENFORCE(ctx->HasOutput("MatchX"), + "Output(MatchX) should be not null."); + auto x_dims = ctx->GetInputDim("X"); + auto x_dims_vec = framework::vectorize(x_dims); + x_dims_vec.push_back(0); + ctx->SetOutputDim("XShape", framework::make_ddim(x_dims_vec)); + x_dims[x_dims.size() - 1] = 1; + ctx->SetOutputDim("MatchX", x_dims); + ctx->ShareLoD("X", /*->*/ "XShape"); + } + + protected: + bool IsSoftLabel(framework::InferShapeContext* ctx) const override { + return false; + } +}; + +class CrossEntropyGradientOp2 : public CrossEntropyGradientOpBase { + public: + using CrossEntropyGradientOpBase::CrossEntropyGradientOpBase; + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("MatchX"), "Input(MatchX) must exist"); + CrossEntropyGradientOpBase::InferShape(ctx); + } + + protected: + virtual framework::DDim GetXDim(framework::InferShapeContext* ctx) const { + auto x_shape = ctx->GetInputDim("XShape"); + return framework::DDim(x_shape.Get(), x_shape.size() - 1); + } + + virtual const char* VarNameWithXLoD() const { return "XShape"; } + + virtual bool IsSoftLabel(framework::InferShapeContext* ctx) const { + return false; + } +}; + +class CrossEntropyOpMaker2 : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("X", + "(Tensor, default Tensor), a tensor whose last dimension " + "size is equal to the number of classes. This input is a " + "probability computed by the previous operator, which is almost " + "always the result of a softmax operator."); + AddInput( + "Label", + "(Tensor), the tensor which represents the ground truth. It has the " + "same shape with 'X' except the last dimension. One hot Tensor."); + AddOutput("Y", + "(Tensor, default Tensor), a tensor whose shape is same " + "with 'X' except that the last dimension size is 1. It " + "represents the cross entropy loss."); + AddOutput("XShape", "Temporaily variable to save shape and LoD of X."); + AddOutput("MatchX", + "X value that matches label, used for gradient computation."); + AddAttr("ignore_index", + "(int, default -100), Specifies a target value that is" + "ignored and does not contribute to the input gradient." + "Only valid if soft_label is set to False") + .SetDefault(-100); + AddComment(R"DOC( +Hard-label CrossEntropy Operator. + +The input 'X' and 'Label' will first be logically flattened to 2-D matrixs. +The matrix's second dimension(row length) is as same as the original last +dimension, and the first dimension(column length) is the product of all other +original dimensions. Then the softmax computation will take palce on each raw +of flattened matrixs. + +Only support hard label. + +Both the input X and Label can carry the LoD (Level of Details) information, +or not. But the output only shares the LoD information with input X. + +)DOC"); + } +}; + +class CrossEntropyGradOpDescMaker2 : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + protected: + std::unique_ptr Apply() const override { + std::unique_ptr op(new framework::OpDesc()); + op->SetType("cross_entropy_grad2"); + op->SetInput("Label", Input("Label")); + op->SetInput("MatchX", Output("MatchX")); + op->SetInput("XShape", Output("XShape")); + op->SetInput(framework::GradVarName("Y"), OutputGrad("Y")); + op->SetOutput(framework::GradVarName("X"), InputGrad("X")); + op->SetAttrMap(Attrs()); + return op; + } +}; + } // namespace operators } // namespace paddle namespace ops = paddle::operators; using CPUCtx = paddle::platform::CPUDeviceContext; -REGISTER_OPERATOR(cross_entropy, ops::CrossEntropyOp, ops::CrossEntropyOpMaker, - paddle::framework::DefaultGradOpDescMaker); +REGISTER_OPERATOR(cross_entropy, ops::CrossEntropyOpBase, + ops::CrossEntropyOpMaker, ops::CrossEntropyOpInferVarType, + ops::CrossEntropyGradOpDescMaker); REGISTER_OPERATOR(cross_entropy_grad, ops::CrossEntropyGradientOp); REGISTER_OP_CPU_KERNEL(cross_entropy, ops::CrossEntropyOpKernel, ops::CrossEntropyOpKernel); REGISTER_OP_CPU_KERNEL(cross_entropy_grad, ops::CrossEntropyGradientOpKernel, ops::CrossEntropyGradientOpKernel); + +REGISTER_OPERATOR(cross_entropy2, ops::CrossEntropyOp2, + ops::CrossEntropyOpMaker2, ops::CrossEntropyOpInferVarType, + ops::CrossEntropyGradOpDescMaker2); +REGISTER_OPERATOR(cross_entropy_grad2, ops::CrossEntropyGradientOp2); +REGISTER_OP_CPU_KERNEL(cross_entropy2, + ops::CrossEntropyOpKernel2, + ops::CrossEntropyOpKernel2); +REGISTER_OP_CPU_KERNEL(cross_entropy_grad2, + ops::CrossEntropyGradientOpKernel2, + ops::CrossEntropyGradientOpKernel2); diff --git a/paddle/fluid/operators/cross_entropy_op.cu b/paddle/fluid/operators/cross_entropy_op.cu index 30dbd5bd3d39dd2992c3dd91364003bb7715a2eb..243e7f52c1e3c4c210e91f708ae5d6de97e4afbc 100644 --- a/paddle/fluid/operators/cross_entropy_op.cu +++ b/paddle/fluid/operators/cross_entropy_op.cu @@ -13,12 +13,27 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/operators/cross_entropy_op.h" +#include "paddle/fluid/platform/float16.h" +namespace plat = paddle::platform; namespace ops = paddle::operators; using CUDACtx = paddle::platform::CUDADeviceContext; REGISTER_OP_CUDA_KERNEL(cross_entropy, ops::CrossEntropyOpKernel, - ops::CrossEntropyOpKernel); -REGISTER_OP_CUDA_KERNEL(cross_entropy_grad, - ops::CrossEntropyGradientOpKernel, - ops::CrossEntropyGradientOpKernel); + ops::CrossEntropyOpKernel, + ops::CrossEntropyOpKernel); + +REGISTER_OP_CUDA_KERNEL( + cross_entropy_grad, ops::CrossEntropyGradientOpKernel, + ops::CrossEntropyGradientOpKernel, + ops::CrossEntropyGradientOpKernel); + +REGISTER_OP_CUDA_KERNEL(cross_entropy2, + ops::CrossEntropyOpKernel2, + ops::CrossEntropyOpKernel2, + ops::CrossEntropyOpKernel2); + +REGISTER_OP_CUDA_KERNEL( + cross_entropy_grad2, ops::CrossEntropyGradientOpKernel2, + ops::CrossEntropyGradientOpKernel2, + ops::CrossEntropyGradientOpKernel2); diff --git a/paddle/fluid/operators/cross_entropy_op.h b/paddle/fluid/operators/cross_entropy_op.h index f123e11542d85c904a81fe2a87f59ab52511cc15..7eb663773ed072760c47a2914377b5306ceeb7af 100644 --- a/paddle/fluid/operators/cross_entropy_op.h +++ b/paddle/fluid/operators/cross_entropy_op.h @@ -15,6 +15,7 @@ limitations under the License. */ #pragma once #include "paddle/fluid/framework/eigen.h" #include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/math.h" #include "paddle/fluid/operators/math/cross_entropy.h" #include "paddle/fluid/operators/math/math_function.h" #include "paddle/fluid/platform/for_range.h" @@ -137,5 +138,124 @@ class CrossEntropyGradientOpKernel : public framework::OpKernel { } }; +template +struct HardLabelCrossEntropyForwardFunctor { + HardLabelCrossEntropyForwardFunctor(const T* x, T* y, T* match_x, + const int64_t* label, + int64_t ignore_index, + int64_t feature_size) + : x_(x), + y_(y), + match_x_(match_x), + label_(label), + ignore_index_(ignore_index), + feature_size_(feature_size) {} + + HOSTDEVICE void operator()(int64_t idx) const { + auto label = label_[idx]; + if (label != ignore_index_) { + auto match_x = x_[idx * feature_size_ + label]; + y_[idx] = -math::TolerableValue()(real_log(match_x)); + match_x_[idx] = match_x; + } else { + y_[idx] = 0; + match_x_[idx] = 0; // any value is ok + } + } + + const T* x_; + T* y_; + T* match_x_; + const int64_t* label_; + int64_t ignore_index_; + int64_t feature_size_; +}; + +template +struct HardLabelCrossEntropyBackwardFunctor { + HardLabelCrossEntropyBackwardFunctor(T* dx, const T* dy, const T* match_x, + const int64_t* label, + int64_t ignore_index, + int64_t feature_size) + : dx_(dx), + dy_(dy), + match_x_(match_x), + label_(label), + ignore_index_(ignore_index), + feature_size_(feature_size) {} + + HOSTDEVICE void operator()(int64_t idx) const { + auto row_idx = idx / feature_size_; + auto col_idx = idx % feature_size_; + auto label = label_[row_idx]; + if (label == col_idx && label != ignore_index_) { + dx_[idx] = -dy_[row_idx] / match_x_[row_idx]; + } else { + dx_[idx] = 0; + } + } + + T* dx_; + const T* dy_; + const T* match_x_; + const int64_t* label_; + int64_t ignore_index_; + int64_t feature_size_; +}; + +template +class CrossEntropyOpKernel2 : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* x = ctx.Input("X"); + auto* label = ctx.Input("Label"); + auto* y = ctx.Output("Y"); + auto* match_x = ctx.Output("MatchX"); + + auto& x_dims = x->dims(); + auto feature_size = x_dims[x_dims.size() - 1]; + auto batch_size = framework::product(x->dims()) / feature_size; + + auto* p_x = x->data(); + auto* p_label = label->data(); + auto* p_y = y->mutable_data(ctx.GetPlace()); + auto* p_match_x = match_x->mutable_data(ctx.GetPlace()); + + auto ignore_index = ctx.Attr("ignore_index"); + + platform::ForRange for_range( + ctx.template device_context(), batch_size); + for_range(HardLabelCrossEntropyForwardFunctor( + p_x, p_y, p_match_x, p_label, ignore_index, feature_size)); + } +}; + +template +class CrossEntropyGradientOpKernel2 : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* dx = ctx.Output(framework::GradVarName("X")); + auto* dy = ctx.Input(framework::GradVarName("Y")); + auto* match_x = ctx.Input("MatchX"); + auto* label = ctx.Input("Label"); + + auto* p_dx = dx->mutable_data(ctx.GetPlace()); + auto* p_dy = dy->data(); + auto* p_match_x = match_x->data(); + auto* p_label = label->data(); + + int64_t ignore_index = ctx.Attr("ignore_index"); + int rank = dx->dims().size(); + int64_t feature_size = dx->dims()[rank - 1]; + int64_t batch_size = framework::product(dx->dims()) / feature_size; + + platform::ForRange for_range( + ctx.template device_context(), + batch_size * feature_size); + for_range(HardLabelCrossEntropyBackwardFunctor( + p_dx, p_dy, p_match_x, p_label, ignore_index, feature_size)); + } +}; + } // namespace operators } // namespace paddle diff --git a/paddle/fluid/operators/csp/CMakeLists.txt b/paddle/fluid/operators/csp/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..5d468316e8eacb73c4a4ce81c784880bb5e46c2d --- /dev/null +++ b/paddle/fluid/operators/csp/CMakeLists.txt @@ -0,0 +1,2 @@ +include(operators) +register_operators() diff --git a/paddle/fluid/operators/go_op.cc b/paddle/fluid/operators/csp/go_op.cc similarity index 100% rename from paddle/fluid/operators/go_op.cc rename to paddle/fluid/operators/csp/go_op.cc diff --git a/paddle/fluid/operators/ctc_align_op.cc b/paddle/fluid/operators/ctc_align_op.cc index d2b440d9d2e50340af7a7bb4e76e55beea1bcb46..e7c472f8c0ce2cfe70b24be3c6930093922b0e27 100644 --- a/paddle/fluid/operators/ctc_align_op.cc +++ b/paddle/fluid/operators/ctc_align_op.cc @@ -36,9 +36,8 @@ class CTCAlignOp : public framework::OperatorWithKernel { protected: framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext& ctx) const override { - return framework::OpKernelType( - framework::ToDataType(ctx.Input("Input")->type()), - ctx.device_context()); + return framework::OpKernelType(ctx.Input("Input")->type(), + ctx.device_context()); } }; diff --git a/paddle/fluid/operators/cudnn_lstm_op.cc b/paddle/fluid/operators/cudnn_lstm_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..134f84d59cafa661fce727adc3303444c4ef483e --- /dev/null +++ b/paddle/fluid/operators/cudnn_lstm_op.cc @@ -0,0 +1,243 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { + +class CudnnLSTMOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("Input"), + "Input(Input) of LSTM should not be null."); + PADDLE_ENFORCE(ctx->HasInput("W"), + "Input(Weight) of LSTM should not be null."); + + PADDLE_ENFORCE(ctx->HasInput("InitH"), + "Input(init_h) of LSTM should not be null."); + PADDLE_ENFORCE(ctx->HasInput("InitC"), + "Input(init_c) of LSTM should not be null."); + PADDLE_ENFORCE(ctx->HasInput("Cache"), + "Input(Cache) of LSTM should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("Out"), + "Output(Out) of LSTM should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("last_h"), + "Output(last_h) of LSTM should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("last_c"), + "Output(last_c) of LSTM should not be null."); + + auto in_dims = ctx->GetInputDim("Input"); + PADDLE_ENFORCE_EQ(in_dims.size(), 3, "Input(X)'s rank must be 3."); + + ctx->SetOutputDim("Out", ctx->GetInputDim("Input")); + ctx->SetOutputDim("last_h", ctx->GetInputDim("InitH")); + ctx->SetOutputDim("last_c", ctx->GetInputDim("InitC")); + } +}; + +class CudnnLSTMOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput( + "Input", + "(Tensor) RNN input tensor, which support variable-time length input " + "sequence." + "The shape of the Tensor MUST be ( seq_len * batch_size * input_size)" + "seq_len is the total time step in this mini-batch (CAN be change in " + "different batch)" + "batch_size is the instance number of this batch" + "input_size is the hidden size of the input." + "input_hidden_size and the hidden_size in the next may not be same"); + AddInput("InitH", + "(Tensor) the initial hidden state of the LSTM" + "input. This is a tensor with shape (num_layers x batch_size x " + "hidden_size)" + "and When is_bidirec is True, the shape will be (num_layers*2 x " + "batch_size x hidden_size)"); + AddInput("InitC", + "(Tensor) the initial cell state of the LSTm " + "input. This is a tensor with shape (num_layers x batch_size x " + "hidden_size)" + "and When is_bidirec is True, the shape will be (num_layers*2 x " + "batch_size x hidden_size)"); + AddInput("W", + "(Tensor) the learnable hidden-hidden weights." + " The shape is (N), where N is total weight size of the LSTM. " + " cudnn concatenate all the weight to one Tensor"); + AddInput("Cache", + "The cache of dropout op, a RAW type variable including random " + "number generator states and some descriptors, which is used in " + "cudnn kernel.") + .AsDispensable(); + AddOutput("Out", + "(Tensor) the hidden state of LSTM operator. " + "The shape is ( seq_len x batch_size x hidden_size) if " + "is_bidirec is False" + "and When is_bidirec is True, the shape will be ( seq_len x " + "batch_size x hidden_size * 2) "); + AddOutput("last_h", + "(Tensor) the hidden state of the last step. " + "The shape is ( num_layers x batch_size x hidden_size) if " + "is_bidirec is False" + "and When is_bidirec is True, the shape will be (num_layers*2 x " + "batch_size x hidden_size)"); + AddOutput("last_c", + "(Tensor) the cell state of the last step" + "The shape is ( num_layers x batch_size x hidden_size) if " + "is_bidirec is False" + "and When is_bidirect is True, the shape will be (num_layers*2 x " + "batch_size x hidden_size*2)"); + AddAttr("max_len", + "max length of the LSTM op" + "the first dim of the Input can NOT be greater than max_len") + .SetDefault(20); + AddAttr( + "dropout_prob", + "dropout prob of the dropout op" + "the dropout ONLY work between lstm layers, not between time steps" + "There is no dropout work on the Out tensor") + .SetDefault(0.0); + AddAttr("is_bidirec", + "is_bidirec" + "if it is bidirection rnn" + "The will affect the shape of the Out, last_h, and last_c") + .SetDefault(false); + AddAttr("input_size", "input size ot the Input Tensor").SetDefault(10); + AddAttr("hidden_size", "hidden size of the LSTM").SetDefault(100); + AddAttr("num_layers", "the total layer number of the LSTM") + .SetDefault(1); + AddAttr("is_test", "True if in test phase.").SetDefault(false); + AddAttr("seed", "seed to used if fix_seed is True").SetDefault(-1); + AddComment(R"DOC( +CUDNN LSTM implementation + +A four-gate Long Short-Term Memory network with no peephole connections. +In the forward pass the output ht and cell output ct for a given iteration can be computed from the recurrent input ht-1, +the cell input ct-1 and the previous layer input xt given matrices W, R and biases bW, bR from the following equations: + +$$ i_t = sigmoid(W_{ix}x_{t} + W_{ih}h_{t-1} + bx_i + bh_i) $$ + +$$ f_t = sigmoid(W_{fx}x_{t} + W_{fh}h_{t-1} + bx_f + bh_f) $$ + +$$ o_t = sigmoid(W_{ox}x_{t} + W_{oh}h_{t-1} + bx_o + bh_o) $$ + +$$ \\tilde{c_t} = tanh(W_{cx}x_t + W_{ch}h_{t-1} + bx_c + bh_c) $$ + +$$ c_t = f_t \\odot c_{t-1} + i_t \\odot \\tilde{c_t} $$ + +$$ h_t = o_t \\odot tanh(c_t) $$ + +- W terms denote weight matrices (e.g. $W_{ix}$ is the matrix + of weights from the input gate to the input) +- The b terms denote bias vectors ($bx_i$ and $bh_i$ are the input gate bias vector). +- sigmoid is the logistic sigmoid function. +- $i, f, o$ and $c$ are the input gate, forget gate, output gate, + and cell activation vectors, respectively, all of which have the same size as + the cell output activation vector $h$. +- The $\odot$ is the element-wise product of the vectors. +- `tanh` is the activation functions. +- $\tilde{c_t}$ is also called candidate hidden state, + which is computed based on the current input and the previous hidden state. + +Where sigmoid is the sigmoid operator: sigmoid(x) = 1 / (1 + e^-x), * represents a point-wise multiplication, +X represensts a matrix multiplication + + +)DOC"); + } +}; + +class CudnnLSTMGradOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("Input"), + "Input(Input) of LSTM should not be null."); + PADDLE_ENFORCE(ctx->HasInput("W"), "Input(W) of LSTM should not be null."); + PADDLE_ENFORCE(ctx->HasInput("Cache"), + "Input(last_c) of LSTM should not be null."); + PADDLE_ENFORCE(ctx->HasInput("InitH"), + "Input(init_h) of LSTM should not be null."); + + PADDLE_ENFORCE(ctx->HasInput("InitC"), + "Input(init_c) of LSTM should not be null."); + + auto SetOutGradDim = [&ctx](const std::string& name) { + auto g_name = framework::GradVarName(name); + if (ctx->HasOutput(g_name)) { + ctx->SetOutputDim(g_name, ctx->GetInputDim(name)); + } + }; + + SetOutGradDim("Input"); + SetOutGradDim("W"); + SetOutGradDim("InitH"); + SetOutGradDim("InitC"); + } +}; + +class CudnnLSTMGradOpDescMaker : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + protected: + std::unique_ptr Apply() const override { + std::unique_ptr op(new framework::OpDesc()); + op->SetType("cudnn_lstm_grad"); + op->SetInput("Input", Input("Input")); + op->SetInput("InitH", Input("InitH")); + op->SetInput("InitC", Input("InitC")); + op->SetInput("W", Input("W")); + if (ForwardOp().Inputs().count("Cache") > 0) { + op->SetInput("Cache", Input("Cache")); + } + op->SetInput("Out", Output("Out")); + op->SetInput(framework::GradVarName("Out"), OutputGrad("Out")); + op->SetInput(framework::GradVarName("last_c"), OutputGrad("last_c")); + op->SetInput(framework::GradVarName("last_h"), OutputGrad("last_h")); + + op->SetOutput(framework::GradVarName("Input"), InputGrad("Input")); + op->SetOutput(framework::GradVarName("W"), InputGrad("W")); + op->SetOutput(framework::GradVarName("InitH"), InputGrad("InitH")); + op->SetOutput(framework::GradVarName("InitC"), InputGrad("InitC")); + op->SetAttrMap(Attrs()); + return op; + } +}; + +template +class NotImpleKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + PADDLE_THROW( + "CPU is not support for this kernel now. Will be add in the future"); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(cudnn_lstm, ops::CudnnLSTMOp, ops::CudnnLSTMOpMaker, + ops::CudnnLSTMGradOpDescMaker); +REGISTER_OPERATOR(cudnn_lstm_grad, ops::CudnnLSTMGradOp); + +REGISTER_OP_CPU_KERNEL(cudnn_lstm, ops::NotImpleKernel); +REGISTER_OP_CPU_KERNEL(cudnn_lstm_grad, ops::NotImpleKernel); diff --git a/paddle/fluid/operators/cudnn_lstm_op.cu.cc b/paddle/fluid/operators/cudnn_lstm_op.cu.cc new file mode 100644 index 0000000000000000000000000000000000000000..1bf41ed948b5bd4fbd49587f072f5debfa81d77c --- /dev/null +++ b/paddle/fluid/operators/cudnn_lstm_op.cu.cc @@ -0,0 +1,261 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/cudnn_rnn_cache.h" +#include "paddle/fluid/operators/math/math_function.h" + +namespace paddle { +namespace operators { + +using LoDTensor = framework::LoDTensor; +using Tensor = framework::Tensor; + +template +class CudnnLSTMGPUKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext &ctx) const override { + const Tensor *x = ctx.Input("Input"); + const Tensor *init_h = ctx.Input("InitH"); + const Tensor *init_c = ctx.Input("InitC"); + + auto w = ctx.Input("W"); + + Tensor *out = ctx.Output("Out"); + Tensor *last_h = ctx.Output("last_h"); + Tensor *last_c = ctx.Output("last_c"); + + const T *x_data = x->data(); + const T *init_h_data = init_h->data(); + const T *init_c_data = init_c->data(); + + const T *w_data = w->data(); + + T *out_data = out->mutable_data(ctx.GetPlace()); + T *last_h_data = last_h->mutable_data(ctx.GetPlace()); + T *last_c_data = last_c->mutable_data(ctx.GetPlace()); + + size_t max_len = ctx.Attr("max_len"); + float dropout_prob = ctx.Attr("dropout_prob"); + bool is_bidirec = ctx.Attr("is_bidirec"); + int input_size = ctx.Attr("input_size"); + int hidden_size = ctx.Attr("hidden_size"); + int num_layers = ctx.Attr("num_layers"); + bool is_test = ctx.Attr("is_test"); + + auto &dev_ctx = ctx.template device_context(); + auto handle = dev_ctx.cudnn_handle(); + auto *cache_var = ctx.InputVar("Cache"); + if (!cache_var) { + // The RAW type cache variable wouldn't be created and broadcasted on + // multi-devices before the first running. + // use parent scope to make cache persistable + auto *scope = const_cast(ctx.scope().parent()); + auto cache_var_name = ctx.Inputs("Cache")[0]; + cache_var = scope->Var(cache_var_name); + } + CudnnRNNCache *cudnn_rnn_cache = nullptr; + if (cache_var->IsInitialized()) { + // const_cast is usually bad. + cudnn_rnn_cache = const_cast(cache_var) + ->GetMutable(); + } else { + // const_cast is usually bad. + cudnn_rnn_cache = const_cast(cache_var) + ->GetMutable(); + std::random_device rnd; + int seed = ctx.Attr("seed"); + if (seed == -1) { + seed = rnd(); + } + + auto input_w_numel = w->numel(); + auto batch_size = x->dims()[1]; + cudnn_rnn_cache->init(handle, ctx.GetPlace(), max_len, batch_size, + input_size, hidden_size, num_layers, dropout_prob, + is_bidirec, seed, input_w_numel); + } + + auto run_seq_len = x->dims()[0]; + + if (is_test) { + // for inference + CUDNN_ENFORCE(platform::dynload::cudnnRNNForwardInference( + handle, cudnn_rnn_cache->rnn_desc_, run_seq_len, + cudnn_rnn_cache->x_desc_, x_data, cudnn_rnn_cache->hx_desc_, + init_h_data, cudnn_rnn_cache->cx_desc_, init_c_data, + cudnn_rnn_cache->w_desc_, w_data, cudnn_rnn_cache->y_desc_, out_data, + cudnn_rnn_cache->hy_desc_, last_h_data, cudnn_rnn_cache->cy_desc_, + last_c_data, cudnn_rnn_cache->workspace_data_.data(), + cudnn_rnn_cache->workspace_size_)); + } else { + // for train + CUDNN_ENFORCE(platform::dynload::cudnnRNNForwardTraining( + handle, cudnn_rnn_cache->rnn_desc_, run_seq_len, + cudnn_rnn_cache->x_desc_, x_data, cudnn_rnn_cache->hx_desc_, + init_h_data, cudnn_rnn_cache->cx_desc_, init_c_data, + cudnn_rnn_cache->w_desc_, w_data, cudnn_rnn_cache->y_desc_, out_data, + cudnn_rnn_cache->hy_desc_, last_h_data, cudnn_rnn_cache->cy_desc_, + last_c_data, cudnn_rnn_cache->workspace_data_.data(), + cudnn_rnn_cache->workspace_size_, + cudnn_rnn_cache->reserve_data_.data(), + cudnn_rnn_cache->reserve_size_)); + } + } +}; + +template +class CudnnLSTMGPUGradKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext &ctx) const override { + auto *input = ctx.Input("Input"); + auto *weight = ctx.Input("W"); + auto *init_h = ctx.Input("InitH"); + auto *init_c = ctx.Input("InitC"); + // auto * last_h = ctx.Input("last_h"); + // auto * last_c = ctx.Input("last_c"); + auto *out = ctx.Input("Out"); + auto *out_grad = ctx.Input(framework::GradVarName("Out")); + auto *last_h_grad = ctx.Input(framework::GradVarName("last_h")); + auto *last_c_grad = ctx.Input(framework::GradVarName("last_c")); + + // auto* init_h = ctx.Input("init_h"); + // auto* init_c = ctx.Input("init_c"); + + auto *in_grad = ctx.Output(framework::GradVarName("Input")); + auto *weight_grad = ctx.Output(framework::GradVarName("W")); + auto *init_h_grad = ctx.Output(framework::GradVarName("InitH")); + auto *init_c_grad = ctx.Output(framework::GradVarName("InitC")); + + auto &dev_ctx = ctx.template device_context(); + auto handle = dev_ctx.cudnn_handle(); + auto *cache_var = ctx.InputVar("Cache"); + PADDLE_ENFORCE(cache_var->IsInitialized()); + CudnnRNNCache *cudnn_rnn_cache = + const_cast(cache_var) + ->GetMutable(); + + auto input_dims = input->dims(); + auto init_h_dims = init_h->dims(); + auto init_c_dims = init_c->dims(); + in_grad->mutable_data(ctx.GetPlace()); + weight_grad->mutable_data(ctx.GetPlace()); + math::SetConstant zero; + zero(dev_ctx, in_grad, static_cast(0.0)); + zero(dev_ctx, weight_grad, static_cast(0.0)); + + T *init_h_grad_data = NULL; + if (init_h_grad == nullptr) { + Tensor init_h_grad_temp; + init_h_grad_temp.mutable_data(init_h_dims, ctx.GetPlace()); + zero(dev_ctx, &init_h_grad_temp, static_cast(0.0)); + + init_h_grad_data = init_h_grad_temp.data(); + } else { + init_h_grad->mutable_data(init_h_dims, ctx.GetPlace()); + zero(dev_ctx, init_h_grad, static_cast(0.0)); + init_h_grad_data = init_h_grad->data(); + } + + T *init_c_grad_data = NULL; + if (init_c_grad == nullptr) { + Tensor init_c_grad_temp; + init_c_grad_temp.mutable_data(init_c_dims, ctx.GetPlace()); + zero(dev_ctx, &init_c_grad_temp, static_cast(0.0)); + + init_c_grad_data = init_c_grad_temp.data(); + } else { + init_c_grad->mutable_data(init_c_dims, ctx.GetPlace()); + zero(dev_ctx, init_c_grad, static_cast(0.0)); + init_c_grad_data = init_c_grad->data(); + } + + const T *last_h_grad_data = NULL; + if (last_h_grad == nullptr) { + Tensor last_h_grad_temp; + last_h_grad_temp.mutable_data(init_h_dims, ctx.GetPlace()); + zero(dev_ctx, &last_h_grad_temp, static_cast(0.0)); + + last_h_grad_data = (const T *)last_h_grad_temp.data(); + } else { + last_h_grad_data = last_h_grad->data(); + } + + const T *last_c_grad_data = NULL; + if (last_c_grad == nullptr) { + Tensor last_c_grad_temp; + last_c_grad_temp.mutable_data(init_c_dims, ctx.GetPlace()); + zero(dev_ctx, &last_c_grad_temp, static_cast(0.0)); + + last_c_grad_data = (const T *)last_c_grad_temp.data(); + } else { + last_c_grad_data = last_c_grad->data(); + } + + const T *out_grad_data = NULL; + if (out_grad == nullptr) { + Tensor out_grad_temp; + out_grad_temp.mutable_data(out->dims(), ctx.GetPlace()); + zero(dev_ctx, &out_grad_temp, static_cast(0.0)); + + out_grad_data = (const T *)out_grad_temp.data(); + } else { + out_grad_data = out_grad->data(); + } + + // zero( dev_ctx, last_h_grad, static_cast(0.0)); + // zero( dev_ctx, last_c_grad, static_cast(0.0)); + + auto out_data = out->data(); + // auto out_grad_data = out_grad->data(); + auto weight_data = weight->data(); + auto init_h_data = init_h->data(); + auto init_c_data = init_c->data(); + auto in_grad_data = in_grad->data(); + + auto work_data = cudnn_rnn_cache->workspace_data_.data(); + auto reserve_data = cudnn_rnn_cache->reserve_data_.data(); + + auto run_seq_len = input_dims[0]; + PADDLE_ENFORCE_LE((size_t)run_seq_len, cudnn_rnn_cache->max_length_, + "cudnn running seq_len CAN not greater max_lengh"); + CUDNN_ENFORCE(platform::dynload::cudnnRNNBackwardData( + handle, cudnn_rnn_cache->rnn_desc_, run_seq_len, + cudnn_rnn_cache->y_desc_, out_data, cudnn_rnn_cache->dy_desc_, + out_grad_data, cudnn_rnn_cache->dhy_desc_, last_h_grad_data, + cudnn_rnn_cache->dcy_desc_, last_c_grad_data, cudnn_rnn_cache->w_desc_, + weight_data, cudnn_rnn_cache->hx_desc_, init_h_data, + cudnn_rnn_cache->cx_desc_, init_c_data, cudnn_rnn_cache->dx_desc_, + in_grad_data, cudnn_rnn_cache->dhx_desc_, init_h_grad_data, + cudnn_rnn_cache->dcx_desc_, init_c_grad_data, work_data, + cudnn_rnn_cache->workspace_size_, reserve_data, + cudnn_rnn_cache->reserve_size_)); + + CUDNN_ENFORCE(platform::dynload::cudnnRNNBackwardWeights( + handle, cudnn_rnn_cache->rnn_desc_, run_seq_len, + cudnn_rnn_cache->x_desc_, input->data(), cudnn_rnn_cache->hx_desc_, + init_h->data(), cudnn_rnn_cache->y_desc_, out->data(), + cudnn_rnn_cache->workspace_data_.data(), + cudnn_rnn_cache->workspace_size_, cudnn_rnn_cache->dw_desc_, + weight_grad->data(), cudnn_rnn_cache->reserve_data_.data(), + cudnn_rnn_cache->reserve_size_)); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL(cudnn_lstm, ops::CudnnLSTMGPUKernel); +REGISTER_OP_CUDA_KERNEL(cudnn_lstm_grad, ops::CudnnLSTMGPUGradKernel); diff --git a/paddle/fluid/operators/cudnn_rnn_cache.h b/paddle/fluid/operators/cudnn_rnn_cache.h new file mode 100644 index 0000000000000000000000000000000000000000..7f18b839271a29523cc06c999c28cc0394717397 --- /dev/null +++ b/paddle/fluid/operators/cudnn_rnn_cache.h @@ -0,0 +1,255 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "paddle/fluid/framework/tensor.h" +#include "paddle/fluid/platform/cudnn_helper.h" + +namespace paddle { +namespace operators { + +struct CudnnRNNCache { + CudnnRNNCache() { + x_desc_ = NULL; + y_desc_ = NULL; + dx_desc_ = NULL; + dy_desc_ = NULL; + } + ~CudnnRNNCache() { release(); } + + cudnnRNNDescriptor_t rnn_desc_; + cudnnTensorDescriptor_t *x_desc_; + cudnnTensorDescriptor_t *y_desc_; + cudnnTensorDescriptor_t *dx_desc_; + cudnnTensorDescriptor_t *dy_desc_; + + cudnnTensorDescriptor_t hx_desc_; + cudnnTensorDescriptor_t cx_desc_; + cudnnTensorDescriptor_t hy_desc_; + cudnnTensorDescriptor_t cy_desc_; + + cudnnTensorDescriptor_t dhx_desc_; + cudnnTensorDescriptor_t dcx_desc_; + cudnnTensorDescriptor_t dhy_desc_; + cudnnTensorDescriptor_t dcy_desc_; + + cudnnTensorDescriptor_t output_x_desc_; + cudnnTensorDescriptor_t output_y_desc_; + + cudnnDropoutDescriptor_t dropout_desc_; + + size_t weights_size_; + cudnnFilterDescriptor_t w_desc_; + cudnnFilterDescriptor_t dw_desc_; + + size_t workspace_size_; + size_t reserve_size_; + framework::Tensor reserve_data_; + framework::Tensor workspace_data_; + + framework::Tensor dropout_state_; + + size_t max_length_; + + float dropout_prob_; + bool is_bidirec_; + + int batch_size_; + int input_size_; + int hidden_size_; + int num_layers_; + int seed_; + + void init(cudnnHandle_t handle, const platform::Place &place, size_t max_len, + int batch_size, int input_size, int hidden_size, int num_layers, + float dropout_prob, bool is_bidirec, int seed, int weight_numel) { + max_length_ = max_len; + batch_size_ = batch_size; + input_size_ = input_size; + hidden_size_ = hidden_size; + num_layers_ = num_layers; + dropout_prob_ = dropout_prob; + is_bidirec_ = is_bidirec; + seed_ = seed; + + x_desc_ = new cudnnTensorDescriptor_t[max_length_]; + y_desc_ = new cudnnTensorDescriptor_t[max_length_]; + dx_desc_ = new cudnnTensorDescriptor_t[max_length_]; + dy_desc_ = new cudnnTensorDescriptor_t[max_length_]; + int dim_a[3]; + int stride_a[3]; + + for (size_t i = 0; i < max_length_; ++i) { + CUDNN_ENFORCE( + platform::dynload::cudnnCreateTensorDescriptor(&x_desc_[i])); + CUDNN_ENFORCE( + platform::dynload::cudnnCreateTensorDescriptor(&y_desc_[i])); + CUDNN_ENFORCE( + platform::dynload::cudnnCreateTensorDescriptor(&dx_desc_[i])); + CUDNN_ENFORCE( + platform::dynload::cudnnCreateTensorDescriptor(&dy_desc_[i])); + dim_a[0] = batch_size_; + dim_a[1] = input_size_; + dim_a[2] = 1; + + stride_a[0] = dim_a[2] * dim_a[1]; + stride_a[1] = dim_a[2]; + stride_a[2] = 1; + CUDNN_ENFORCE(platform::dynload::cudnnSetTensorNdDescriptor( + x_desc_[i], CUDNN_DATA_FLOAT, 3, dim_a, stride_a)); + CUDNN_ENFORCE(platform::dynload::cudnnSetTensorNdDescriptor( + dx_desc_[i], CUDNN_DATA_FLOAT, 3, dim_a, stride_a)); + + dim_a[0] = batch_size_; + dim_a[1] = is_bidirec_ ? hidden_size_ * 2 : hidden_size_; + dim_a[2] = 1; + + stride_a[0] = dim_a[2] * dim_a[1]; + stride_a[1] = dim_a[2]; + stride_a[2] = 1; + + CUDNN_ENFORCE(platform::dynload::cudnnSetTensorNdDescriptor( + y_desc_[i], CUDNN_DATA_FLOAT, 3, dim_a, stride_a)); + CUDNN_ENFORCE(platform::dynload::cudnnSetTensorNdDescriptor( + dy_desc_[i], CUDNN_DATA_FLOAT, 3, dim_a, stride_a)); + } + + dim_a[0] = num_layers_ * (is_bidirec_ ? 2 : 1); + dim_a[1] = batch_size_; + dim_a[2] = hidden_size_; + + stride_a[0] = dim_a[2] * dim_a[1]; + stride_a[1] = dim_a[2]; + stride_a[2] = 1; + + CUDNN_ENFORCE(platform::dynload::cudnnCreateTensorDescriptor(&hx_desc_)); + CUDNN_ENFORCE(platform::dynload::cudnnCreateTensorDescriptor(&cx_desc_)); + CUDNN_ENFORCE(platform::dynload::cudnnCreateTensorDescriptor(&hy_desc_)); + CUDNN_ENFORCE(platform::dynload::cudnnCreateTensorDescriptor(&cy_desc_)); + CUDNN_ENFORCE(platform::dynload::cudnnCreateTensorDescriptor(&dhx_desc_)); + CUDNN_ENFORCE(platform::dynload::cudnnCreateTensorDescriptor(&dcx_desc_)); + CUDNN_ENFORCE(platform::dynload::cudnnCreateTensorDescriptor(&dhy_desc_)); + CUDNN_ENFORCE(platform::dynload::cudnnCreateTensorDescriptor(&dcy_desc_)); + + CUDNN_ENFORCE(platform::dynload::cudnnSetTensorNdDescriptor( + hx_desc_, CUDNN_DATA_FLOAT, 3, dim_a, stride_a)); + CUDNN_ENFORCE(platform::dynload::cudnnSetTensorNdDescriptor( + cx_desc_, CUDNN_DATA_FLOAT, 3, dim_a, stride_a)); + CUDNN_ENFORCE(platform::dynload::cudnnSetTensorNdDescriptor( + hy_desc_, CUDNN_DATA_FLOAT, 3, dim_a, stride_a)); + CUDNN_ENFORCE(platform::dynload::cudnnSetTensorNdDescriptor( + cy_desc_, CUDNN_DATA_FLOAT, 3, dim_a, stride_a)); + CUDNN_ENFORCE(platform::dynload::cudnnSetTensorNdDescriptor( + dhx_desc_, CUDNN_DATA_FLOAT, 3, dim_a, stride_a)); + CUDNN_ENFORCE(platform::dynload::cudnnSetTensorNdDescriptor( + dcx_desc_, CUDNN_DATA_FLOAT, 3, dim_a, stride_a)); + CUDNN_ENFORCE(platform::dynload::cudnnSetTensorNdDescriptor( + dhy_desc_, CUDNN_DATA_FLOAT, 3, dim_a, stride_a)); + CUDNN_ENFORCE(platform::dynload::cudnnSetTensorNdDescriptor( + dcy_desc_, CUDNN_DATA_FLOAT, 3, dim_a, stride_a)); + + CUDNN_ENFORCE( + platform::dynload::cudnnCreateDropoutDescriptor(&dropout_desc_)); + + size_t state_size; + CUDNN_ENFORCE( + platform::dynload::cudnnDropoutGetStatesSize(handle, &state_size); + dropout_state_.Resize({static_cast(state_size)})); + auto *dropout_state_data = dropout_state_.mutable_data(place); + CUDNN_ENFORCE(platform::dynload::cudnnSetDropoutDescriptor( + dropout_desc_, handle, dropout_prob_, dropout_state_data, state_size, + seed_)); + + CUDNN_ENFORCE(platform::dynload::cudnnCreateRNNDescriptor(&rnn_desc_)); + +#if CUDNN_VERSION >= 6000 + CUDNN_ENFORCE(platform::dynload::cudnnSetRNNDescriptor_v6( + handle, rnn_desc_, hidden_size_, num_layers_, dropout_desc_, + CUDNN_LINEAR_INPUT, + is_bidirec_ ? CUDNN_BIDIRECTIONAL : CUDNN_UNIDIRECTIONAL, CUDNN_LSTM, + CUDNN_RNN_ALGO_STANDARD, CUDNN_DATA_FLOAT)); +#else + CUDNN_ENFORCE(platform::dynload::cudnnSetRNNDescriptor( + rnn_desc_, hidden_size_, num_layers_, dropout_desc_, CUDNN_LINEAR_INPUT, + is_bidirec_ ? CUDNN_BIDIRECTIONAL : CUDNN_UNIDIRECTIONAL, CUDNN_LSTM, + CUDNN_DATA_FLOAT)); +#endif + + CUDNN_ENFORCE(platform::dynload::cudnnCreateFilterDescriptor(&w_desc_)); + CUDNN_ENFORCE(platform::dynload::cudnnCreateFilterDescriptor(&dw_desc_)); + + CUDNN_ENFORCE(platform::dynload::cudnnGetRNNParamsSize( + handle, rnn_desc_, x_desc_[0], &weights_size_, CUDNN_DATA_FLOAT)); + + PADDLE_ENFORCE_EQ(weights_size_, sizeof(float) * weight_numel, + "cudnn lstm weight size should be SAME"); + int dim_w[3]; + dim_w[0] = weights_size_ / sizeof(float); + dim_w[1] = 1; + dim_w[2] = 1; + CUDNN_ENFORCE(platform::dynload::cudnnSetFilterNdDescriptor( + w_desc_, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 3, dim_w)); + CUDNN_ENFORCE(platform::dynload::cudnnSetFilterNdDescriptor( + dw_desc_, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 3, dim_w)); + + CUDNN_ENFORCE(platform::dynload::cudnnGetRNNWorkspaceSize( + handle, rnn_desc_, max_length_, x_desc_, &workspace_size_)); + CUDNN_ENFORCE(platform::dynload::cudnnGetRNNTrainingReserveSize( + handle, rnn_desc_, max_length_, x_desc_, &reserve_size_)); + + reserve_data_.Resize({static_cast(reserve_size_)}); + reserve_data_.mutable_data(place); + + workspace_data_.Resize({static_cast(workspace_size_)}); + workspace_data_.mutable_data(place); + } + + void release() { + for (size_t i = 0; i < max_length_; ++i) { + CUDNN_ENFORCE( + platform::dynload::cudnnDestroyTensorDescriptor(x_desc_[i])); + CUDNN_ENFORCE( + platform::dynload::cudnnDestroyTensorDescriptor(y_desc_[i])); + CUDNN_ENFORCE( + platform::dynload::cudnnDestroyTensorDescriptor(dx_desc_[i])); + CUDNN_ENFORCE( + platform::dynload::cudnnDestroyTensorDescriptor(dy_desc_[i])); + } + + delete[] x_desc_; + delete[] y_desc_; + delete[] dx_desc_; + delete[] dy_desc_; + + CUDNN_ENFORCE(platform::dynload::cudnnDestroyTensorDescriptor(hx_desc_)); + CUDNN_ENFORCE(platform::dynload::cudnnDestroyTensorDescriptor(cx_desc_)); + CUDNN_ENFORCE(platform::dynload::cudnnDestroyTensorDescriptor(hy_desc_)); + CUDNN_ENFORCE(platform::dynload::cudnnDestroyTensorDescriptor(cy_desc_)); + CUDNN_ENFORCE(platform::dynload::cudnnDestroyTensorDescriptor(dhx_desc_)); + CUDNN_ENFORCE(platform::dynload::cudnnDestroyTensorDescriptor(dcx_desc_)); + CUDNN_ENFORCE(platform::dynload::cudnnDestroyTensorDescriptor(dhy_desc_)); + CUDNN_ENFORCE(platform::dynload::cudnnDestroyTensorDescriptor(dcy_desc_)); + + CUDNN_ENFORCE( + platform::dynload::cudnnDestroyDropoutDescriptor(dropout_desc_)); + CUDNN_ENFORCE(platform::dynload::cudnnDestroyRNNDescriptor(rnn_desc_)); + + CUDNN_ENFORCE(platform::dynload::cudnnDestroyFilterDescriptor(w_desc_)); + CUDNN_ENFORCE(platform::dynload::cudnnDestroyFilterDescriptor(dw_desc_)); + } +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/cum_op.h b/paddle/fluid/operators/cum_op.h index 999fdcff90784ed089cd620a4f0a908f196bcdda..7c0fda4169b5e1cf663d04b78b6425d73965c292 100644 --- a/paddle/fluid/operators/cum_op.h +++ b/paddle/fluid/operators/cum_op.h @@ -13,6 +13,8 @@ See the License for the specific language governing permissions and limitations under the License. */ #pragma once + +#include #include "paddle/fluid/framework/eigen.h" #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/framework/operator.h" diff --git a/paddle/fluid/operators/data_norm_op.cc b/paddle/fluid/operators/data_norm_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..45bce6e5203f8c1dbb744e0f954f7f0a71c53372 --- /dev/null +++ b/paddle/fluid/operators/data_norm_op.cc @@ -0,0 +1,406 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/data_norm_op.h" +#include +#include "paddle/fluid/framework/data_layout.h" +#ifdef PADDLE_WITH_MKLDNN +#include "paddle/fluid/platform/mkldnn_helper.h" +#endif + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; +using LoDTensor = framework::LoDTensor; +using DataLayout = framework::DataLayout; + +template +using EigenArrayMap = + Eigen::Map>; +template +using ConstEigenArrayMap = + Eigen::Map>; +template +using EigenVectorArrayMap = Eigen::Map>; +template +using ConstEigenVectorArrayMap = + Eigen::Map>; + +class DataNormOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext *ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), ""); + PADDLE_ENFORCE(ctx->HasInput("BatchSize"), ""); + PADDLE_ENFORCE(ctx->HasInput("BatchSum"), ""); + PADDLE_ENFORCE(ctx->HasInput("BatchSquareSum"), ""); + PADDLE_ENFORCE(ctx->HasOutput("Means"), ""); + PADDLE_ENFORCE(ctx->HasOutput("Scales"), ""); + PADDLE_ENFORCE(ctx->HasOutput("Y"), ""); + + const auto x_dims = ctx->GetInputDim("X"); + const DataLayout data_layout = framework::StringToDataLayout( + ctx->Attrs().Get("data_layout")); + + PADDLE_ENFORCE(x_dims.size() >= 2 && x_dims.size() <= 5, + "Input X must have 2 to 5 dimensions."); + + const int64_t C = + (data_layout == DataLayout::kNCHW ? x_dims[1] + : x_dims[x_dims.size() - 1]); + + PADDLE_ENFORCE_EQ(ctx->GetInputDim("BatchSize").size(), 1UL); + PADDLE_ENFORCE_EQ(ctx->GetInputDim("BatchSum").size(), 1UL); + PADDLE_ENFORCE_EQ(ctx->GetInputDim("BatchSquareSum").size(), 1UL); + PADDLE_ENFORCE_EQ(ctx->GetInputDim("BatchSize")[0], C); + PADDLE_ENFORCE_EQ(ctx->GetInputDim("BatchSum")[0], C); + PADDLE_ENFORCE_EQ(ctx->GetInputDim("BatchSquareSum")[0], C); + + ctx->SetOutputDim("Y", x_dims); + ctx->SetOutputDim("Means", {C}); + ctx->SetOutputDim("Scales", {C}); + ctx->ShareLoD("X", "Y"); + } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext &ctx) const override { + auto input_data_type = ctx.Input("X")->type(); + // By default, the type of the scale, bias, mean, + // and var tensors should both be float. (For float or float16 input tensor) + // or double (For double input tensor). + auto dn_param_type = framework::proto::VarType::FP32; + if (input_data_type == framework::proto::VarType::FP64) { + dn_param_type = framework::proto::VarType::FP64; + } + PADDLE_ENFORCE_EQ(dn_param_type, ctx.Input("BatchSize")->type(), + "BatchSize input should be of float type"); + PADDLE_ENFORCE_EQ(dn_param_type, ctx.Input("BatchSum")->type(), + "BatchSum input should be of float type"); + PADDLE_ENFORCE_EQ(dn_param_type, + ctx.Input("BatchSquareSum")->type(), + "BatchSquareSum input should be of float type"); + + // TODO(pzelazko-intel): enable MKLDNN layout when it's ready + framework::LibraryType library = framework::LibraryType::kPlain; + framework::DataLayout layout = framework::DataLayout::kAnyLayout; +#ifdef PADDLE_WITH_MKLDNN + if (library == framework::LibraryType::kPlain && + platform::CanMKLDNNBeUsed(ctx)) { + library = framework::LibraryType::kMKLDNN; + layout = framework::DataLayout::kMKLDNN; + } +#endif + + return framework::OpKernelType(input_data_type, ctx.GetPlace(), layout, + library); + } +}; + +class DataNormOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + // AddAttr("is_test", "").SetDefault(false); + AddAttr("epsilon", "") + .SetDefault(1e-4) + .AddCustomChecker([](const float &epsilon) { + PADDLE_ENFORCE(epsilon >= 0.0f && epsilon <= 0.001f, + "'epsilon' should be between 0.0 and 0.001."); + }); + AddAttr("data_layout", "").SetDefault("NCHW"); + AddInput("X", "The input tensor"); + AddInput("BatchSize", + "BatchSize is a 1-dimensional tensor of size C " + "that is applied to the output"); + AddInput("BatchSum", + "BatchSum is a 1-dimensional tensor of size C " + "that is applied to the output"); + AddInput("BatchSquareSum", + "The global BatchSquareSum (for training) or " + "estimated BatchSquareSum (for testing)"); + AddOutput("Y", "result after normalization"); + AddOutput("Means", + "Mean of the history data batch, " + "will apply to output when training") + .AsIntermediate(); + AddOutput("Scales", + "Scales of the history data batch, " + "will apply to output when training") + .AsIntermediate(); + AddComment(R"DOC( +Data Normalization. + +Can be used as a normalizer function for data +The required data format for this layer is one of the following: +1. NHWC `[batch, in_height, in_width, in_channels]` +2. NCHW `[batch, in_channels, in_height, in_width]` + +)DOC"); + } +}; + +template +class DataNormKernel + : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext &ctx) const override { + // const bool is_test = ctx.Attr("is_test"); + const std::string data_layout_str = ctx.Attr("data_layout"); + const DataLayout data_layout = + framework::StringToDataLayout(data_layout_str); + + const auto *x = ctx.Input("X"); + const auto &x_dims = x->dims(); + PADDLE_ENFORCE(x_dims.size() == 2, "The Input dim size should be 2"); + const int N = x_dims[0]; + const int C = + (data_layout == DataLayout::kNCHW ? x_dims[1] + : x_dims[x_dims.size() - 1]); + auto *y = ctx.Output("Y"); + auto *mean_out = ctx.Output("Means"); + auto *scales = ctx.Output("Scales"); + + // alloc memory + y->mutable_data(ctx.GetPlace()); + + Eigen::Array inv_std(C); + ConstEigenVectorArrayMap b_size_arr( + ctx.Input("BatchSize")->data(), C); + ConstEigenVectorArrayMap b_sum_arr( + ctx.Input("BatchSum")->data(), C); + ConstEigenVectorArrayMap b_square_sum_arr( + ctx.Input("BatchSquareSum")->data(), C); + EigenVectorArrayMap means_arr(mean_out->mutable_data(ctx.GetPlace()), + C); + EigenVectorArrayMap scales_arr(scales->mutable_data(ctx.GetPlace()), + C); + means_arr = b_sum_arr / b_size_arr; + scales_arr = (b_size_arr / b_square_sum_arr).sqrt(); + + switch (data_layout) { + case DataLayout::kNCHW: // because it's two dimensions, so make no + // difference + case DataLayout::kNHWC: { + EigenArrayMap(y->mutable_data(ctx.GetPlace()), C, N) = + (ConstEigenArrayMap(x->data(), C, N).colwise() - means_arr) + .colwise() * + scales_arr; + break; + } + default: + PADDLE_THROW("Unknown storage order: %d", data_layout); + } + } +}; + +class DataNormGradOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext *ctx) const override { + // check input + PADDLE_ENFORCE(ctx->HasInput("X")); + PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Y")), ""); + PADDLE_ENFORCE(ctx->HasInput("BatchSize"), ""); + PADDLE_ENFORCE(ctx->HasInput("BatchSum"), ""); + PADDLE_ENFORCE(ctx->HasInput("BatchSquareSum"), ""); + PADDLE_ENFORCE(ctx->HasInput("Means"), ""); + PADDLE_ENFORCE(ctx->HasInput("Scales"), ""); + + // check output + PADDLE_ENFORCE(ctx->HasOutput(framework::GradVarName("X")), ""); + PADDLE_ENFORCE(ctx->HasOutput(framework::GradVarName("BatchSize")), ""); + PADDLE_ENFORCE(ctx->HasOutput(framework::GradVarName("BatchSum")), ""); + PADDLE_ENFORCE(ctx->HasOutput(framework::GradVarName("BatchSquareSum")), + ""); + + const auto x_dims = ctx->GetInputDim("X"); + const DataLayout data_layout = framework::StringToDataLayout( + ctx->Attrs().Get("data_layout")); + const int C = + (data_layout == DataLayout::kNCHW ? x_dims[1] + : x_dims[x_dims.size() - 1]); + + ctx->SetOutputDim(framework::GradVarName("X"), x_dims); + ctx->SetOutputDim(framework::GradVarName("BatchSize"), {C}); + ctx->SetOutputDim(framework::GradVarName("BatchSum"), {C}); + ctx->SetOutputDim(framework::GradVarName("BatchSquareSum"), {C}); + } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext &ctx) const override { + const auto *var = ctx.InputVar(framework::GradVarName("Y")); + if (var == nullptr) { + PADDLE_THROW("can't find Y@GRAD"); + } + const Tensor *t = nullptr; + if (var->IsType()) { + t = &var->Get(); + } else if (var->IsType()) { + t = &var->Get(); + } + if (t == nullptr) { + PADDLE_THROW("can't find Y@GRAD"); + } + + // TODO(pzelazko-intel): enable MKLDNN layout when it's ready + framework::LibraryType library = framework::LibraryType::kPlain; + framework::DataLayout layout = framework::DataLayout::kAnyLayout; + +#ifdef PADDLE_WITH_MKLDNN + if (library == framework::LibraryType::kPlain && + platform::CanMKLDNNBeUsed(ctx)) { + library = framework::LibraryType::kMKLDNN; + layout = framework::DataLayout::kMKLDNN; + } +#endif + + return framework::OpKernelType(ctx.Input("X")->type(), + ctx.GetPlace(), layout, library); + } +}; + +template +class DataNormGradKernel + : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext &ctx) const override { + const auto *x = ctx.Input("X"); + const auto *d_y = ctx.Input(framework::GradVarName("Y")); + const auto *batch_size = ctx.Input("BatchSize"); + const auto *batch_sum = ctx.Input("BatchSum"); + const auto *batch_square_sum = ctx.Input("BatchSquareSum"); + const auto *scales = ctx.Input("Scales"); + const auto *means = ctx.Input("Means"); + + const std::string data_layout_str = ctx.Attr("data_layout"); + const DataLayout data_layout = + framework::StringToDataLayout(data_layout_str); + + // Get the size for each dimension. + // NCHW [batch_size, in_channels, in_height, in_width] + const auto &x_dims = x->dims(); + PADDLE_ENFORCE(x_dims.size() == 2, "The Input dim size should be 2"); + const int N = x_dims[0]; + const int C = + (data_layout == DataLayout::kNCHW ? x_dims[1] + : x_dims[x_dims.size() - 1]); + + // init output + auto *d_x = ctx.Output(framework::GradVarName("X")); + auto *d_batch_size = + ctx.Output(framework::GradVarName("BatchSize")); + auto *d_batch_sum = ctx.Output(framework::GradVarName("BatchSum")); + auto *d_batch_square_sum = + ctx.Output(framework::GradVarName("BatchSquareSum")); + + EigenVectorArrayMap d_batch_size_arr( + d_batch_size->mutable_data(ctx.GetPlace()), C); + EigenVectorArrayMap d_batch_sum_arr( + d_batch_sum->mutable_data(ctx.GetPlace()), C); + EigenVectorArrayMap d_batch_square_sum_arr( + d_batch_square_sum->mutable_data(ctx.GetPlace()), C); + + d_batch_size_arr.setZero(); + d_batch_sum_arr.setZero(); + d_batch_square_sum_arr.setZero(); + + const float epsilon = ctx.Attr("epsilon"); + switch ( + data_layout) { // because it's two dimensions, so make no difference + case DataLayout::kNCHW: + case DataLayout::kNHWC: { + ConstEigenVectorArrayMap scales_arr(scales->data(), C); + ConstEigenVectorArrayMap means_arr(means->data(), C); + ConstEigenArrayMap x_arr(x->data(), C, N); + ConstEigenArrayMap d_y_arr(d_y->data(), C, N); + EigenArrayMap d_x_arr(d_x->mutable_data(ctx.GetPlace()), C, N); + d_x_arr.setZero(); + for (int nc = 0; nc < N; ++nc) { + d_x_arr.col(nc) = d_y_arr.col(nc) * scales_arr; + } + + // calculate data sum and squre sum + ConstEigenVectorArrayMap batch_size_arr(batch_size->data(), C); + ConstEigenVectorArrayMap batch_sum_arr(batch_sum->data(), C); + ConstEigenVectorArrayMap batch_square_sum_arr( + batch_square_sum->data(), C); + Eigen::Array sample_sum(C); + Eigen::Array sample_square_sum(C); + // calculate data sample sum and square sum + sample_sum.setZero(); + sample_square_sum.setZero(); + for (int nc = 0; nc < N; ++nc) { + sample_sum += x_arr.col(nc); + sample_square_sum += (x_arr.col(nc) - means_arr).square(); + } + // calculate gradient + d_batch_size_arr.setConstant(N); + d_batch_sum_arr = sample_sum; + d_batch_square_sum_arr = sample_square_sum + d_batch_size_arr * epsilon; + break; + } + default: + PADDLE_THROW("Unknown storage order: %s", data_layout_str); + } + } +}; + +class DataNormGradMaker : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + protected: + std::unique_ptr Apply() const override { + auto *op = new framework::OpDesc(); + op->SetType("data_norm_grad"); + op->SetInput("X", Input("X")); + op->SetInput(framework::GradVarName("Y"), OutputGrad("Y")); + + op->SetInput("BatchSize", Input("BatchSize")); + op->SetInput("BatchSum", Input("BatchSum")); + op->SetInput("BatchSquareSum", Input("BatchSquareSum")); + op->SetInput("Scales", Output("Scales")); + op->SetInput("Means", Output("Means")); + + op->SetAttrMap(Attrs()); + + op->SetOutput(framework::GradVarName("X"), InputGrad("X")); + op->SetOutput(framework::GradVarName("BatchSize"), InputGrad("BatchSize")); + op->SetOutput(framework::GradVarName("BatchSum"), InputGrad("BatchSum")); + op->SetOutput(framework::GradVarName("BatchSquareSum"), + InputGrad("BatchSquareSum")); + + return std::unique_ptr(op); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(data_norm, ops::DataNormOp, ops::DataNormOpMaker, + ops::DataNormGradMaker); +REGISTER_OPERATOR(data_norm_grad, ops::DataNormGradOp); + +REGISTER_OP_CPU_KERNEL( + data_norm, ops::DataNormKernel, + ops::DataNormKernel); +REGISTER_OP_CPU_KERNEL( + data_norm_grad, + ops::DataNormGradKernel, + ops::DataNormGradKernel); diff --git a/paddle/fluid/operators/data_norm_op.h b/paddle/fluid/operators/data_norm_op.h new file mode 100644 index 0000000000000000000000000000000000000000..63451214bcf649d0a7a949f391db9b651d237d22 --- /dev/null +++ b/paddle/fluid/operators/data_norm_op.h @@ -0,0 +1,35 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include "paddle/fluid/framework/eigen.h" +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { + +template +class DataNormKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override; +}; + +template +class DataNormGradKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override; +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/decayed_adagrad_op.cc b/paddle/fluid/operators/decayed_adagrad_op.cc deleted file mode 100644 index d73ae9e2721b388212cb6efa354eb4b480df9cad..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/decayed_adagrad_op.cc +++ /dev/null @@ -1,117 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/operators/decayed_adagrad_op.h" - -namespace paddle { -namespace operators { - -using Tensor = framework::Tensor; -class DecayedAdagradOp : public framework::OperatorWithKernel { - public: - using framework::OperatorWithKernel::OperatorWithKernel; - - void InferShape(framework::InferShapeContext *ctx) const override { - PADDLE_ENFORCE(ctx->HasInput("Param"), - "Input(Param) of DecayedAdagradOp should not be null."); - PADDLE_ENFORCE(ctx->HasInput("Grad"), - "Input(Grad) of DecayedAdagradOp should not be null."); - PADDLE_ENFORCE(ctx->HasInput("Moment"), - "Input(Moment) of DecayedAdagradOp should not be null."); - PADDLE_ENFORCE( - ctx->HasInput("LearningRate"), - "Input(LearningRate) of DecayedAdagradOp should not be null."); - PADDLE_ENFORCE( - ctx->GetInputsVarType("Param").front() == - framework::proto::VarType::LOD_TENSOR, - "The input var's type should be LoDTensor, but the received is %s", - ctx->Inputs("Param").front(), ctx->GetInputsVarType("Param").front()); - PADDLE_ENFORCE( - ctx->GetInputsVarType("Grad").front() == - framework::proto::VarType::LOD_TENSOR, - "The input var's type should be LoDTensor, but the received is %s", - ctx->Inputs("Grad").front(), ctx->GetInputsVarType("Grad").front()); - - PADDLE_ENFORCE(ctx->HasOutput("ParamOut"), - "Output(ParamOut) of DecayedAdagradOp should not be null."); - PADDLE_ENFORCE(ctx->HasOutput("MomentOut"), - "Output(MomentOut) of DecayedAdagradOp should not be null."); - - auto lr_dims = ctx->GetInputDim("LearningRate"); - PADDLE_ENFORCE_EQ(framework::product(lr_dims), 1, - "LearningRate should have one element"); - auto param_dims = ctx->GetInputDim("Param"); - PADDLE_ENFORCE_EQ(param_dims, ctx->GetInputDim("Grad"), - "Param and Grad input of DecayedAdagradOp should have " - "the same dimension."); - PADDLE_ENFORCE_EQ(param_dims, ctx->GetInputDim("Moment"), - "Param and Moment input of DecayedAdagradOp should have " - "the same dimension."); - - ctx->SetOutputDim("ParamOut", param_dims); - ctx->SetOutputDim("MomentOut", param_dims); - } - framework::OpKernelType GetExpectedKernelType( - const framework::ExecutionContext &ctx) const override { - auto input_data_type = - framework::ToDataType(ctx.Input("Param")->type()); - return framework::OpKernelType(input_data_type, ctx.GetPlace()); - } -}; - -class DecayedAdagradOpMaker : public framework::OpProtoAndCheckerMaker { - public: - void Make() override { - AddInput("Param", "(Tensor) Input parameter"); - AddInput("Grad", "(Tensor) Input gradient"); - AddInput("Moment", "(Tensor) Second moment"); - AddInput("LearningRate", "(Tensor) Learning rate"); - - AddOutput("ParamOut", "(Tensor) Output parameter"); - AddOutput("MomentOut", "(Tensor) Output second moment"); - - AddAttr("decay", - "(float, default 0.95) " - "Discounting factor for coming gradient") - .SetDefault(0.95); - AddAttr("epsilon", - "(float, default 1.0e-6) " - "Constant for numerical stability") - .SetDefault(1.0e-6f); - AddComment(R"DOC( -Decayed Adagrad Optimizer. - -The update is done as follows: - -$$ -moment\_out = decay * moment + (1 - decay) * grad * grad \\ -param\_out = param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + epsilon} -$$ - -The original paper(http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) -does not have an epsilon attribute. It is added here for numerical -stability to avoid the division by zero error. - -)DOC"); - } -}; -} // namespace operators -} // namespace paddle - -namespace ops = paddle::operators; -REGISTER_OP_WITHOUT_GRADIENT(decayed_adagrad, ops::DecayedAdagradOp, - ops::DecayedAdagradOpMaker); -REGISTER_OP_CPU_KERNEL( - decayed_adagrad, - ops::DecayedAdagradOpKernel); diff --git a/paddle/fluid/operators/decayed_adagrad_op.cu b/paddle/fluid/operators/decayed_adagrad_op.cu deleted file mode 100644 index 7da16acf05eefc21cbe3dd0540dcbf69022431de..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/decayed_adagrad_op.cu +++ /dev/null @@ -1,21 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#define EIGEN_USE_GPU -#include "paddle/fluid/operators/decayed_adagrad_op.h" - -namespace ops = paddle::operators; -REGISTER_OP_CUDA_KERNEL( - decayed_adagrad, - ops::DecayedAdagradOpKernel); diff --git a/paddle/fluid/operators/decayed_adagrad_op.h b/paddle/fluid/operators/decayed_adagrad_op.h deleted file mode 100644 index 5df43d33ef9f720fd20d57c53ff37cc85440b24e..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/decayed_adagrad_op.h +++ /dev/null @@ -1,67 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#pragma once -#include "paddle/fluid/framework/eigen.h" -#include "paddle/fluid/framework/op_registry.h" - -namespace paddle { -namespace operators { - -template -class DecayedAdagradOpKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& ctx) const override { - const auto* param_var = ctx.InputVar("Param"); - PADDLE_ENFORCE(param_var->IsType(), - "The Var(%s)'s type should be LoDTensor, " - "but the received is %s", - ctx.Inputs("Param").front(), param_var->Type().name()); - const auto* grad_var = ctx.InputVar("Grad"); - PADDLE_ENFORCE(grad_var->IsType(), - "The Var(%s)'s type should be LoDTensor, " - "but the received is %s", - ctx.Inputs("Grad").front(), grad_var->Type().name()); - - auto param_out_tensor = ctx.Output("ParamOut"); - auto moment_out_tensor = ctx.Output("MomentOut"); - - param_out_tensor->mutable_data(ctx.GetPlace()); - moment_out_tensor->mutable_data(ctx.GetPlace()); - - float decay = ctx.Attr("decay"); - float epsilon = ctx.Attr("epsilon"); - - auto param = framework::EigenVector::Flatten( - *ctx.Input("Param")); - auto grad = framework::EigenVector::Flatten( - *ctx.Input("Grad")); - auto moment = framework::EigenVector::Flatten( - *ctx.Input("Moment")); - auto lr = framework::EigenVector::Flatten( - *ctx.Input("LearningRate")); - - auto param_out = framework::EigenVector::Flatten(*param_out_tensor); - auto moment_out = framework::EigenVector::Flatten(*moment_out_tensor); - auto& place = *ctx.template device_context().eigen_device(); - - moment_out.device(place) = decay * moment + (1 - decay) * grad * grad; - Eigen::DSizes m_dsize(moment_out_tensor->numel()); - param_out.device(place) = - param - lr.broadcast(m_dsize) * grad / (moment_out.sqrt() + epsilon); - } -}; - -} // namespace operators -} // namespace paddle diff --git a/paddle/fluid/operators/delete_var_op.cc b/paddle/fluid/operators/delete_var_op.cc index d7a9bfbc437dbf4c723b9c87ff62ec6b62c38638..89416f7ab5d07ddac5b540b9bb361f831c1ef360 100644 --- a/paddle/fluid/operators/delete_var_op.cc +++ b/paddle/fluid/operators/delete_var_op.cc @@ -32,6 +32,11 @@ class DeleteVarOp : public framework::OperatorBase { } }; +class DeleteVarOpShapeInference : public framework::InferShapeBase { + public: + void operator()(framework::InferShapeContext *ctx) const override {} +}; + class DeleteVarOpInfoMaker : public framework::OpProtoAndCheckerMaker { public: void Make() override { @@ -48,4 +53,5 @@ It should not be configured by users directly. REGISTER_OPERATOR(delete_var, paddle::operators::DeleteVarOp, paddle::framework::EmptyGradOpMaker, - paddle::operators::DeleteVarOpInfoMaker); + paddle::operators::DeleteVarOpInfoMaker, + paddle::operators::DeleteVarOpShapeInference); diff --git a/paddle/fluid/operators/dequantize_op.cc b/paddle/fluid/operators/dequantize_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..38159f84a0d56f45cfef233a3c70c3c6cef17d9f --- /dev/null +++ b/paddle/fluid/operators/dequantize_op.cc @@ -0,0 +1,45 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/dequantize_op.h" +#ifdef PADDLE_WITH_MKLDNN +#include "paddle/fluid/platform/mkldnn_helper.h" +#endif + +namespace paddle { +namespace operators { + +framework::OpKernelType DeQuantOp::GetExpectedKernelType( + const framework::ExecutionContext& ctx) const { + framework::LibraryType library_ = framework::LibraryType::kMKLDNN; + framework::DataLayout layout_ = framework::DataLayout::kMKLDNN; + + return framework::OpKernelType(ctx.Input("Input")->type(), + ctx.GetPlace(), layout_, library_); +} + +void DeQuantOpMaker::Make() { + AddInput("Input", "input data"); + AddOutput("Output", "output data"); + AddAttr("Scale", "scale data").SetDefault({1.0f}); + AddComment(R"DOC(This op will dequantize data from INT8 to FP32)DOC"); +} + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; + +REGISTER_OPERATOR(dequantize, ops::DeQuantOp, ops::DeQuantOpMaker, + paddle::framework::DefaultGradOpDescMaker); diff --git a/paddle/fluid/operators/dequantize_op.h b/paddle/fluid/operators/dequantize_op.h new file mode 100644 index 0000000000000000000000000000000000000000..75c27a06c210f2d0e4d7cf52aa16f4c123f8ad8e --- /dev/null +++ b/paddle/fluid/operators/dequantize_op.h @@ -0,0 +1,54 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { + +using framework::OpKernelType; +using framework::Tensor; + +class DeQuantOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override { + ctx->SetOutputDim("Output", ctx->GetInputDim("Input")); + ctx->ShareLoD("Input", /*->*/ "Output"); + } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override; +}; + +class DeQuantOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override; +}; + +class DeQuantGradOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override {} +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/detail/macros.h b/paddle/fluid/operators/detail/macros.h deleted file mode 100644 index 6f4a15caa5542a45cd8e26a72b055ca8948069d0..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/detail/macros.h +++ /dev/null @@ -1,35 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#pragma once - -#ifdef PADDLE_WITH_DISTRIBUTE - -#ifdef PADDLE_WITH_GRPC - -#include "paddle/fluid/operators/distributed/grpc_client.h" -#include "paddle/fluid/operators/distributed/grpc_server.h" -#define RPCSERVER_T paddle::operators::distributed::AsyncGRPCServer -#define RPCCLIENT_T paddle::operators::distributed::GRPCClient - -#else // PADDLE_WITH_GRPC - -#include "paddle/fluid/operators/distributed/brpc_client.h" -#include "paddle/fluid/operators/distributed/brpc_server.h" -#define RPCSERVER_T paddle::operators::distributed::AsyncBRPCServer -#define RPCCLIENT_T paddle::operators::distributed::BRPCClient - -#endif // PADDLE_WITH_GRPC - -#endif // PADDLE_WITH_DISTRIBUTE diff --git a/paddle/fluid/operators/detail/safe_ref.h b/paddle/fluid/operators/detail/safe_ref.h index a800d5df0a7cbc668a0217350098bce2bfdcfa70..8660bc219c12fe8676e3be646c4b878a39700201 100644 --- a/paddle/fluid/operators/detail/safe_ref.h +++ b/paddle/fluid/operators/detail/safe_ref.h @@ -25,7 +25,7 @@ namespace detail { */ template inline T& Ref(T* ptr, ARGS&&... args) { - PADDLE_ENFORCE(ptr != nullptr, args...); + PADDLE_ENFORCE(ptr != nullptr, ::paddle::string::Sprintf(args...)); return *ptr; } diff --git a/paddle/fluid/operators/detail/strided_memcpy.h b/paddle/fluid/operators/detail/strided_memcpy.h index 0b7c470fe72eb4270b8d5b2d227642d85683c16d..94419d1f9a4ba654952e0aedb46ab94ea8d5c0a8 100644 --- a/paddle/fluid/operators/detail/strided_memcpy.h +++ b/paddle/fluid/operators/detail/strided_memcpy.h @@ -27,8 +27,8 @@ struct StridedMemcpyFunctor; template struct StridedMemcpyFunctor { void operator()(const platform::DeviceContext& dev_ctx, const T* src, - framework::Dim<0> src_stride, framework::Dim<0> dst_dim, - framework::Dim<0> dst_stride, T* dst) const { + const int64_t* src_stride, const int64_t* dst_dim, + const int64_t* dst_stride, T* dst) const { auto place = dev_ctx.GetPlace(); if (platform::is_cpu_place(place)) { auto& cpu_place = boost::get(place); @@ -50,18 +50,18 @@ struct StridedMemcpyFunctor { template struct StridedMemcpyFunctor { void operator()(const platform::DeviceContext& dev_ctx, const T* src, - framework::Dim<1> src_stride, framework::Dim<1> dst_dim, - framework::Dim<1> dst_stride, T* dst) const { + const int64_t* src_stride, const int64_t* dst_dim, + const int64_t* dst_stride, T* dst) const { auto place = dev_ctx.GetPlace(); if (platform::is_cpu_place(place)) { auto& cpu_place = boost::get(place); - memory::Copy(cpu_place, dst, cpu_place, src, sizeof(T) * dst_dim.head); + memory::Copy(cpu_place, dst, cpu_place, src, sizeof(T) * dst_dim[0]); } else { #ifdef PADDLE_WITH_CUDA auto& gpu_place = boost::get(place); auto& cuda_ctx = reinterpret_cast(dev_ctx); - memory::Copy(gpu_place, dst, gpu_place, src, sizeof(T) * dst_dim.head, + memory::Copy(gpu_place, dst, gpu_place, src, sizeof(T) * dst_dim[0], cuda_ctx.stream()); #else PADDLE_THROW("Paddle is not compiled with GPU"); @@ -73,19 +73,19 @@ struct StridedMemcpyFunctor { template struct StridedMemcpyFunctor { void operator()(const platform::DeviceContext& dev_ctx, const T* src, - framework::Dim src_stride, framework::Dim dst_dim, - framework::Dim dst_stride, T* dst) const { - for (int64_t i = 0; i < dst_dim.head; ++i) { + const int64_t* src_stride, const int64_t* dst_dim, + const int64_t* dst_stride, T* dst) const { + for (int64_t i = 0; i < dst_dim[0]; ++i) { StridedMemcpyFunctor func; - func(dev_ctx, src, src_stride.tail, dst_dim.tail, dst_stride.tail, dst); - src += src_stride.head; - dst += dst_stride.head; + func(dev_ctx, src, src_stride + 1, dst_dim + 1, dst_stride + 1, dst); + src += src_stride[0]; + dst += dst_stride[0]; } } }; template -struct StridedCopyDimVisitor : public boost::static_visitor { +struct StridedCopyDimVisitor { StridedCopyDimVisitor(const platform::DeviceContext& dev_ctx, const T* src, const framework::DDim& src_stride, const framework::DDim& dst_stride, T* dst) @@ -95,13 +95,11 @@ struct StridedCopyDimVisitor : public boost::static_visitor { dst_stride_(dst_stride), dst_(dst) {} - template - void operator()(Dim dst_dim) const { - Dim src_stride = boost::get(src_stride_); - Dim dst_stride = boost::get(dst_stride_); - constexpr int dim = Dim::dimensions; - StridedMemcpyFunctor functor; - functor(dev_ctx_, src_, src_stride, dst_dim, dst_stride, dst_); + template + void operator()(const framework::Dim& dst_dim) const { + StridedMemcpyFunctor functor; + functor(dev_ctx_, src_, src_stride_.Get(), dst_dim.Get(), dst_stride_.Get(), + dst_); } const platform::DeviceContext& dev_ctx_; diff --git a/paddle/fluid/operators/detection/CMakeLists.txt b/paddle/fluid/operators/detection/CMakeLists.txt index aa8ed502fc94bd0970dfe5dbf00ef090e799ad30..94a2016aa53212c3ae5af6d86cccb117855cc3b4 100644 --- a/paddle/fluid/operators/detection/CMakeLists.txt +++ b/paddle/fluid/operators/detection/CMakeLists.txt @@ -20,8 +20,9 @@ detection_library(box_coder_op SRCS box_coder_op.cc box_coder_op.cu) detection_library(iou_similarity_op SRCS iou_similarity_op.cc iou_similarity_op.cu) detection_library(mine_hard_examples_op SRCS mine_hard_examples_op.cc) -detection_library(multiclass_nms_op SRCS multiclass_nms_op.cc) +detection_library(multiclass_nms_op SRCS multiclass_nms_op.cc poly_util.cc gpc.cc) detection_library(prior_box_op SRCS prior_box_op.cc prior_box_op.cu) +detection_library(density_prior_box_op SRCS density_prior_box_op.cc density_prior_box_op.cu) detection_library(anchor_generator_op SRCS anchor_generator_op.cc anchor_generator_op.cu) detection_library(target_assign_op SRCS target_assign_op.cc @@ -30,13 +31,27 @@ detection_library(polygon_box_transform_op SRCS polygon_box_transform_op.cc polygon_box_transform_op.cu) detection_library(rpn_target_assign_op SRCS rpn_target_assign_op.cc) detection_library(generate_proposal_labels_op SRCS generate_proposal_labels_op.cc) +detection_library(box_clip_op SRCS box_clip_op.cc box_clip_op.cu) +detection_library(yolov3_loss_op SRCS yolov3_loss_op.cc) +detection_library(yolo_box_op SRCS yolo_box_op.cc yolo_box_op.cu) +detection_library(box_decoder_and_assign_op SRCS box_decoder_and_assign_op.cc box_decoder_and_assign_op.cu) if(WITH_GPU) detection_library(generate_proposals_op SRCS generate_proposals_op.cc generate_proposals_op.cu DEPS memory cub) + detection_library(distribute_fpn_proposals_op SRCS distribute_fpn_proposals_op.cc distribute_fpn_proposals_op.cu DEPS memory cub) else() detection_library(generate_proposals_op SRCS generate_proposals_op.cc) + detection_library(distribute_fpn_proposals_op SRCS distribute_fpn_proposals_op.cc) endif() detection_library(roi_perspective_transform_op SRCS roi_perspective_transform_op.cc roi_perspective_transform_op.cu) #Export local libraries to parent -set(DETECTION_LIBRARY ${LOCAL_DETECTION_LIBS} PARENT_SCOPE) +# set(DETECTION_LIBRARY ${LOCAL_DETECTION_LIBS} PARENT_SCOPE) + +foreach(src ${LOCAL_DETECTION_LIBS}) + set(OP_LIBRARY ${src} ${OP_LIBRARY} CACHE INTERNAL "op libs") +endforeach() + +cc_library(mask_util SRCS mask_util.cc DEPS memory) +cc_test(mask_util_test SRCS mask_util_test.cc DEPS memory mask_util) +detection_library(generate_mask_labels_op SRCS generate_mask_labels_op.cc DEPS mask_util) diff --git a/paddle/fluid/operators/detection/anchor_generator_op.cc b/paddle/fluid/operators/detection/anchor_generator_op.cc index 0c0155a0a977846b1300d93b4c3fef0e71fc1d26..4a333b559f82e6d39d2d4345c8ad58bc8d430c69 100644 --- a/paddle/fluid/operators/detection/anchor_generator_op.cc +++ b/paddle/fluid/operators/detection/anchor_generator_op.cc @@ -53,8 +53,7 @@ class AnchorGeneratorOp : public framework::OperatorWithKernel { framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext& ctx) const override { return framework::OpKernelType( - framework::ToDataType(ctx.Input("Input")->type()), - ctx.device_context()); + ctx.Input("Input")->type(), ctx.device_context()); } }; @@ -86,7 +85,7 @@ class AnchorGeneratorOpMaker : public framework::OpProtoAndCheckerMaker { " For instance, the anchor size of 64 means the area of this anchor " "equals to 64**2.") .AddCustomChecker([](const std::vector& anchor_sizes) { - PADDLE_ENFORCE_GT(anchor_sizes.size(), 0, + PADDLE_ENFORCE_GT(anchor_sizes.size(), 0UL, "Size of anchor_sizes must be at least 1."); for (size_t i = 0; i < anchor_sizes.size(); ++i) { PADDLE_ENFORCE_GT(anchor_sizes[i], 0.0, @@ -104,7 +103,7 @@ class AnchorGeneratorOpMaker : public framework::OpProtoAndCheckerMaker { "(vector) List of variances to be used " "in box regression deltas") .AddCustomChecker([](const std::vector& variances) { - PADDLE_ENFORCE_EQ(variances.size(), 4, + PADDLE_ENFORCE_EQ(variances.size(), 4UL, "Must and only provide 4 variance."); for (size_t i = 0; i < variances.size(); ++i) { PADDLE_ENFORCE_GT(variances[i], 0.0, @@ -118,7 +117,7 @@ class AnchorGeneratorOpMaker : public framework::OpProtoAndCheckerMaker { .SetDefault(std::vector(2, 16.0)) .AddCustomChecker([](const std::vector& stride) { PADDLE_ENFORCE_EQ( - stride.size(), 2, + stride.size(), 2UL, "Must and only provide 2 stride for width and height."); for (size_t i = 0; i < stride.size(); ++i) { PADDLE_ENFORCE_GT(stride[i], 0.0, diff --git a/paddle/fluid/operators/detection/bbox_util.h b/paddle/fluid/operators/detection/bbox_util.h index 6abeca1da443248d6ad3c1bcc64dd775d77f4ed8..a7bc3e027229884e78721d29428a8ab3f08a6ebc 100644 --- a/paddle/fluid/operators/detection/bbox_util.h +++ b/paddle/fluid/operators/detection/bbox_util.h @@ -1,13 +1,17 @@ /* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 + Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ + #pragma once #include #include "paddle/fluid/framework/eigen.h" @@ -88,7 +92,33 @@ void BboxOverlaps(const framework::Tensor& r_boxes, inter_w = std::max(x_max - x_min + 1, zero); inter_h = std::max(y_max - y_min + 1, zero); inter_area = inter_w * inter_h; - overlaps_et(i, j) = inter_area / (r_box_area + c_box_area - inter_area); + overlaps_et(i, j) = + (inter_area == 0.) ? 0 : inter_area / + (r_box_area + c_box_area - inter_area); + } + } +} + +template +void ClipTiledBoxes(const platform::DeviceContext& ctx, + const framework::Tensor& im_info, + const framework::Tensor& input_boxes, + framework::Tensor* out) { + T* out_data = out->mutable_data(ctx.GetPlace()); + const T* im_info_data = im_info.data(); + const T* input_boxes_data = input_boxes.data(); + T zero(0); + T im_w = round(im_info_data[1] / im_info_data[2]); + T im_h = round(im_info_data[0] / im_info_data[2]); + for (int64_t i = 0; i < input_boxes.numel(); ++i) { + if (i % 4 == 0) { + out_data[i] = std::max(std::min(input_boxes_data[i], im_w - 1), zero); + } else if (i % 4 == 1) { + out_data[i] = std::max(std::min(input_boxes_data[i], im_h - 1), zero); + } else if (i % 4 == 2) { + out_data[i] = std::max(std::min(input_boxes_data[i], im_w - 1), zero); + } else { + out_data[i] = std::max(std::min(input_boxes_data[i], im_h - 1), zero); } } } diff --git a/paddle/fluid/operators/detection/bipartite_match_op.cc b/paddle/fluid/operators/detection/bipartite_match_op.cc index c23b65fe4dead3ca01a447d03877e3359b19e656..b7da1261a8f9780028bf2d36903e54d7e270bec0 100644 --- a/paddle/fluid/operators/detection/bipartite_match_op.cc +++ b/paddle/fluid/operators/detection/bipartite_match_op.cc @@ -45,9 +45,8 @@ class BipartiteMatchOp : public framework::OperatorWithKernel { protected: framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext& ctx) const override { - return framework::OpKernelType( - framework::ToDataType(ctx.Input("DistMat")->type()), - platform::CPUPlace()); + return framework::OpKernelType(ctx.Input("DistMat")->type(), + platform::CPUPlace()); } }; diff --git a/paddle/fluid/operators/detection/box_clip_op.cc b/paddle/fluid/operators/detection/box_clip_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..3aa766559a530bc31fbb277f2bcd474da776e63b --- /dev/null +++ b/paddle/fluid/operators/detection/box_clip_op.cc @@ -0,0 +1,86 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/detection/box_clip_op.h" +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { + +class BoxClipOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + protected: + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("Input"), + "Input(Input) of BoxClipOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput("ImInfo"), + "Input(ImInfo) of BoxClipOp should not be null."); + + auto input_box_dims = ctx->GetInputDim("Input"); + auto im_info_dims = ctx->GetInputDim("ImInfo"); + + if (ctx->IsRuntime()) { + auto input_box_size = input_box_dims.size(); + PADDLE_ENFORCE_EQ(input_box_dims[input_box_size - 1], 4, + "The last dimension of Input must be 4"); + PADDLE_ENFORCE_EQ(im_info_dims.size(), 2, + "The rank of Input(Input) in BoxClipOp must be 2"); + PADDLE_ENFORCE_EQ(im_info_dims[1], 3, + "The last dimension of ImInfo must be 3"); + } + ctx->ShareDim("Input", /*->*/ "Output"); + ctx->ShareLoD("Input", /*->*/ "Output"); + } +}; + +class BoxClipOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("Input", + "(LoDTensor) " + "Input is a LoDTensor with shape [..., 4] holds 4 points" + "in last dimension in format [xmin, ymin, xmax, ymax]"); + AddInput("ImInfo", + "(Tensor) Information for image reshape is in shape (N, 3), " + "in format (height, width, im_scale)"); + AddOutput("Output", + "(LoDTensor) " + "Output is a LoDTensor with the same shape as Input" + "and it is the result after clip"); + AddComment(R"DOC( +This operator clips input boxes to original input images. + +For each input box, The formula is given as follows: + + $$xmin = \max(\min(xmin, im_w - 1), 0)$$ + $$ymin = \max(\min(ymin, im_h - 1), 0)$$ + $$xmax = \max(\min(xmax, im_w - 1), 0)$$ + $$ymax = \max(\min(ymax, im_h - 1), 0)$$ + +where im_w and im_h are computed from ImInfo, the formula is given as follows: + + $$im_w = \round(width / im_scale)$$ + $$im_h = \round(height / im_scale)$$ +)DOC"); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(box_clip, ops::BoxClipOp, ops::BoxClipOpMaker, + paddle::framework::EmptyGradOpMaker); +REGISTER_OP_CPU_KERNEL( + box_clip, ops::BoxClipKernel, + ops::BoxClipKernel); diff --git a/paddle/fluid/operators/detection/box_clip_op.cu b/paddle/fluid/operators/detection/box_clip_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..b727da5f7b736b6f22407d1dfbca708ed0cf04d9 --- /dev/null +++ b/paddle/fluid/operators/detection/box_clip_op.cu @@ -0,0 +1,74 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/detection/box_clip_op.h" +#include "paddle/fluid/operators/math/math_function.h" +#include "paddle/fluid/platform/cuda_primitives.h" +#include "paddle/fluid/platform/hostdevice.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; +using LoDTenso = framework::LoDTensor; + +static constexpr int ImInfoSize = 3; + +template +static __global__ void GPUBoxClip(const T *input, const size_t *lod, + const size_t width, const T *im_info, + T *output) { + T im_w = round(im_info[blockIdx.x * ImInfoSize + 1] / + im_info[blockIdx.x * ImInfoSize + 2]); + T im_h = round(im_info[blockIdx.x * ImInfoSize] / + im_info[blockIdx.x * ImInfoSize + 2]); + for (int i = threadIdx.x; i < (lod[blockIdx.x + 1] - lod[blockIdx.x]) * width; + i += BlockSize) { + int idx = lod[blockIdx.x] * width + i; + T im_size = (idx % 2 == 0) ? im_w : im_h; + output[idx] = max(min(input[idx], im_size - 1), T(0.)); + } +} + +template +class GPUBoxClipKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext &context) const override { + PADDLE_ENFORCE(platform::is_gpu_place(context.GetPlace()), + "This kernel only runs on GPU device."); + auto *input = context.Input("Input"); + auto *im_info = context.Input("ImInfo"); + auto *output = context.Output("Output"); + const int64_t num = input->dims()[0]; + const int64_t bbox_width = input->numel() / num; + auto lod = input->lod(); + framework::LoD abs_offset_lod = framework::ToAbsOffset(lod); + auto &dev_ctx = context.template device_context(); + auto stream = dev_ctx.stream(); + const size_t batch_size = lod.back().size() - 1; + T *output_data = output->mutable_data(dev_ctx.GetPlace()); + GPUBoxClip<<>>( + input->data(), abs_offset_lod[0].CUDAMutableData(dev_ctx.GetPlace()), + bbox_width, im_info->data(), output_data); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL( + box_clip, ops::GPUBoxClipKernel, + ops::GPUBoxClipKernel); diff --git a/paddle/fluid/operators/detection/box_clip_op.h b/paddle/fluid/operators/detection/box_clip_op.h new file mode 100644 index 0000000000000000000000000000000000000000..74e1f88f8d8b28e490d170934760bd9bffc807bc --- /dev/null +++ b/paddle/fluid/operators/detection/box_clip_op.h @@ -0,0 +1,50 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/detection/bbox_util.h" +#include "paddle/fluid/operators/math/math_function.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; +using LoDTensor = framework::LoDTensor; + +template +class BoxClipKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& context) const override { + auto* input_box = context.Input("Input"); + auto* im_info = context.Input("ImInfo"); + auto* output_box = context.Output("Output"); + auto& dev_ctx = + context.template device_context(); + output_box->mutable_data(context.GetPlace()); + if (input_box->lod().size()) { + PADDLE_ENFORCE_EQ(input_box->lod().size(), 1UL, + "Only support 1 level of LoD."); + } + auto box_lod = input_box->lod().back(); + int64_t n = static_cast(box_lod.size() - 1); + for (int i = 0; i < n; ++i) { + Tensor im_info_slice = im_info->Slice(i, i + 1); + Tensor box_slice = input_box->Slice(box_lod[i], box_lod[i + 1]); + Tensor output_slice = output_box->Slice(box_lod[i], box_lod[i + 1]); + ClipTiledBoxes(dev_ctx, im_info_slice, box_slice, &output_slice); + } + } +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/detection/box_coder_op.cc b/paddle/fluid/operators/detection/box_coder_op.cc index d0f95f727fdbc82777147e3e8ada6ad4f7a35e60..de3612677440596387f313e1ff59184cb3fdb7ae 100644 --- a/paddle/fluid/operators/detection/box_coder_op.cc +++ b/paddle/fluid/operators/detection/box_coder_op.cc @@ -10,6 +10,7 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/operators/detection/box_coder_op.h" +#include namespace paddle { namespace operators { @@ -30,31 +31,52 @@ class BoxCoderOp : public framework::OperatorWithKernel { auto prior_box_dims = ctx->GetInputDim("PriorBox"); auto target_box_dims = ctx->GetInputDim("TargetBox"); - PADDLE_ENFORCE_EQ(prior_box_dims.size(), 2, - "The rank of Input of PriorBoxVar must be 2"); - PADDLE_ENFORCE_EQ(prior_box_dims[1], 4, "The shape of PriorBox is [N, 4]"); - if (ctx->HasInput("PriorBoxVar")) { - auto prior_box_var_dims = ctx->GetInputDim("PriorBoxVar"); - PADDLE_ENFORCE_EQ(prior_box_dims, prior_box_var_dims); + if (ctx->IsRuntime()) { + PADDLE_ENFORCE_EQ(prior_box_dims.size(), 2, + "The rank of Input PriorBox must be 2"); + PADDLE_ENFORCE_EQ(prior_box_dims[1], 4, + "The shape of PriorBox is [N, 4]"); + if (ctx->HasInput("PriorBoxVar")) { + auto prior_box_var_dims = ctx->GetInputDim("PriorBoxVar"); + PADDLE_ENFORCE(prior_box_var_dims.size() == 2, + "Input(PriorBoxVar) of BoxCoderOp should be 2."); + PADDLE_ENFORCE_EQ( + prior_box_dims, prior_box_var_dims, + "The dimension of Input(PriorBoxVar) should be equal to" + "the dimension of Input(PriorBox) when the rank is 2."); + } } auto code_type = GetBoxCodeType(ctx->Attrs().Get("code_type")); + int axis = ctx->Attrs().Get("axis"); if (code_type == BoxCodeType::kEncodeCenterSize) { PADDLE_ENFORCE_EQ(target_box_dims.size(), 2, - "The rank of Input of TargetBox must be 2"); + "The rank of Input TargetBox must be 2"); PADDLE_ENFORCE_EQ(target_box_dims[1], 4, "The shape of TargetBox is [M, 4]"); + ctx->SetOutputDim( + "OutputBox", + framework::make_ddim({target_box_dims[0], prior_box_dims[0], 4})); } else if (code_type == BoxCodeType::kDecodeCenterSize) { PADDLE_ENFORCE_EQ(target_box_dims.size(), 3, - "The rank of Input of TargetBox must be 3"); - PADDLE_ENFORCE_EQ(target_box_dims[1], prior_box_dims[0]); - PADDLE_ENFORCE_EQ(target_box_dims[2], prior_box_dims[1]); + "The rank of Input TargetBox must be 3"); + PADDLE_ENFORCE(axis == 0 || axis == 1, "axis must be 0 or 1"); + if (ctx->IsRuntime()) { + if (axis == 0) { + PADDLE_ENFORCE_EQ(target_box_dims[1], prior_box_dims[0]); + } else if (axis == 1) { + PADDLE_ENFORCE_EQ(target_box_dims[0], prior_box_dims[0]); + } + PADDLE_ENFORCE_EQ(target_box_dims[2], prior_box_dims[1]); + } + ctx->ShareDim("TargetBox", /*->*/ "OutputBox"); } - ctx->SetOutputDim( - "OutputBox", - framework::make_ddim({target_box_dims[0], prior_box_dims[0], 4})); - ctx->ShareLoD("TargetBox", /*->*/ "OutputBox"); + if (code_type == BoxCodeType::kDecodeCenterSize && axis == 1) { + ctx->ShareLoD("PriorBox", /*->*/ "OutputBox"); + } else { + ctx->ShareLoD("TargetBox", /*->*/ "OutputBox"); + } } }; @@ -97,6 +119,21 @@ class BoxCoderOpMaker : public framework::OpProtoAndCheckerMaker { "(bool, default true) " "whether treat the priorbox as a noramlized box") .SetDefault(true); + AddAttr("axis", + "(int, default 0)" + "which axis in PriorBox to broadcast for box decode," + "for example, if axis is 0 and TargetBox has shape" + "[N, M, 4] and PriorBox has shape [M, 4], then PriorBox " + "will broadcast to [N, M, 4] for decoding. It is only valid" + "when code type is decode_center_size") + .SetDefault(0) + .InEnum({0, 1}); + AddAttr>( + "variance", + "(vector, default {})," + "variance of prior box with shape [4]. PriorBoxVar and variance can" + "not be provided at the same time.") + .SetDefault(std::vector{}); AddOutput("OutputBox", "(LoDTensor or Tensor) " "When code_type is 'encode_center_size', the output tensor of " @@ -135,7 +172,11 @@ where `tx`, `ty`, `tw`, `th` denote the target box's center coordinates, width and height respectively. Similarly, `px`, `py`, `pw`, `ph` denote the priorbox's (anchor) center coordinates, width and height. `pxv`, `pyv`, `pwv`, `phv` denote the variance of the priorbox and `ox`, `oy`, `ow`, `oh` denote the -encoded/decoded coordinates, width and height. +encoded/decoded coordinates, width and height. + +During Box Decoding, two modes for broadcast are supported. Say target box has +shape [N, M, 4], and the shape of prior box can be [N, 4] or [M, 4]. Then prior +box will broadcast to target box along the assigned axis. )DOC"); } }; diff --git a/paddle/fluid/operators/detection/box_coder_op.cu b/paddle/fluid/operators/detection/box_coder_op.cu index a7af111f63d654319dd1d90d2032956951dfe49e..19a5bb90fa828899ad6270c051090dd3662aeed8 100644 --- a/paddle/fluid/operators/detection/box_coder_op.cu +++ b/paddle/fluid/operators/detection/box_coder_op.cu @@ -9,6 +9,9 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ +#include +#include +#include "paddle/fluid/memory/memcpy.h" #include "paddle/fluid/operators/detection/box_coder_op.h" #include "paddle/fluid/platform/cuda_primitives.h" @@ -16,11 +19,11 @@ namespace paddle { namespace operators { template -__global__ void EncodeCenterSizeKernel(const T* prior_box_data, - const T* prior_box_var_data, - const T* target_box_data, const int row, - const int col, const int len, - const bool normalized, T* output) { +__global__ void EncodeCenterSizeKernel( + const T* prior_box_data, const T* prior_box_var_data, + const T* target_box_data, const int row, const int col, const int len, + const bool normalized, const T prior_box_var_size, const float* variance, + const int var_size, T* output) { const int idx = threadIdx.x + blockIdx.x * blockDim.x; if (idx < row * col) { const int row_idx = idx / col; @@ -30,11 +33,9 @@ __global__ void EncodeCenterSizeKernel(const T* prior_box_data, T prior_box_height = prior_box_data[col_idx * len + 3] - prior_box_data[col_idx * len + 1] + (normalized == false); - T prior_box_center_x = - (prior_box_data[col_idx * len + 2] + prior_box_data[col_idx * len]) / 2; - T prior_box_center_y = (prior_box_data[col_idx * len + 3] + - prior_box_data[col_idx * len + 1]) / - 2; + T prior_box_center_x = prior_box_data[col_idx * len] + prior_box_width / 2; + T prior_box_center_y = + prior_box_data[col_idx * len + 1] + prior_box_height / 2; T target_box_center_x = (target_box_data[row_idx * len + 2] + target_box_data[row_idx * len]) / @@ -55,58 +56,67 @@ __global__ void EncodeCenterSizeKernel(const T* prior_box_data, output[idx * len + 2] = log(fabs(target_box_width / prior_box_width)); output[idx * len + 3] = log(fabs(target_box_height / prior_box_height)); if (prior_box_var_data) { - output[idx * len] /= prior_box_var_data[col_idx * len]; - output[idx * len + 1] /= prior_box_var_data[col_idx * len + 1]; - output[idx * len + 2] /= prior_box_var_data[col_idx * len + 2]; - output[idx * len + 3] /= prior_box_var_data[col_idx * len + 3]; + int prior_var_offset = col_idx * len; + output[idx * len] /= prior_box_var_data[prior_var_offset]; + output[idx * len + 1] /= prior_box_var_data[prior_var_offset + 1]; + output[idx * len + 2] /= prior_box_var_data[prior_var_offset + 2]; + output[idx * len + 3] /= prior_box_var_data[prior_var_offset + 3]; + } else if (var_size == 4) { + for (int k = 0; k < 4; ++k) { + output[idx * len + k] /= static_cast(variance[k]); + } } } } template -__global__ void DecodeCenterSizeKernel(const T* prior_box_data, - const T* prior_box_var_data, - const T* target_box_data, const int row, - const int col, const int len, - const bool normalized, T* output) { +__global__ void DecodeCenterSizeKernel( + const T* prior_box_data, const T* prior_box_var_data, + const T* target_box_data, const int row, const int col, const int len, + const bool normalized, const T prior_box_var_size, const float* variance, + const int var_size, const int axis, T* output) { const int idx = threadIdx.x + blockIdx.x * blockDim.x; + int prior_box_offset = 0; if (idx < row * col) { const int col_idx = idx % col; - T prior_box_width = prior_box_data[col_idx * len + 2] - - prior_box_data[col_idx * len] + (normalized == false); - T prior_box_height = prior_box_data[col_idx * len + 3] - - prior_box_data[col_idx * len + 1] + + const int row_idx = idx / col; + prior_box_offset = axis == 0 ? col_idx * len : row_idx * len; + T prior_box_width = prior_box_data[prior_box_offset + 2] - + prior_box_data[prior_box_offset] + + (normalized == false); + T prior_box_height = prior_box_data[prior_box_offset + 3] - + prior_box_data[prior_box_offset + 1] + (normalized == false); T prior_box_center_x = - (prior_box_data[col_idx * len + 2] + prior_box_data[col_idx * len]) / 2; - T prior_box_center_y = (prior_box_data[col_idx * len + 3] + - prior_box_data[col_idx * len + 1]) / - 2; + prior_box_data[prior_box_offset] + prior_box_width / 2; + T prior_box_center_y = + prior_box_data[prior_box_offset + 1] + prior_box_height / 2; T target_box_width, target_box_height; T target_box_center_x, target_box_center_y; + T box_var_x = T(1), box_var_y = T(1); + T box_var_w = T(1), box_var_h = T(1); if (prior_box_var_data) { - target_box_width = exp(prior_box_var_data[col_idx * len + 2] * - target_box_data[idx * len + 2]) * - prior_box_width; - target_box_height = exp(prior_box_var_data[col_idx * len + 3] * - target_box_data[idx * len + 3]) * - prior_box_height; - target_box_center_x = prior_box_var_data[col_idx * len] * - target_box_data[idx * len] * prior_box_width + - prior_box_center_x; - target_box_center_y = prior_box_var_data[col_idx * len + 1] * - target_box_data[idx * len + 1] * - prior_box_height + - prior_box_center_y; - } else { - target_box_width = exp(target_box_data[idx * len + 2]) * prior_box_width; - target_box_height = - exp(target_box_data[idx * len + 3]) * prior_box_height; - target_box_center_x = - target_box_data[idx * len] * prior_box_width + prior_box_center_x; - target_box_center_y = target_box_data[idx * len + 1] * prior_box_height + - prior_box_center_y; + int prior_var_offset = axis == 0 ? col_idx * len : row_idx * len; + box_var_x = prior_box_var_data[prior_var_offset]; + box_var_y = prior_box_var_data[prior_var_offset + 1]; + box_var_w = prior_box_var_data[prior_var_offset + 2]; + box_var_h = prior_box_var_data[prior_var_offset + 3]; + } else if (var_size == 4) { + box_var_x = static_cast(variance[0]); + box_var_y = static_cast(variance[1]); + box_var_w = static_cast(variance[2]); + box_var_h = static_cast(variance[3]); } + target_box_width = + exp(box_var_w * target_box_data[idx * len + 2]) * prior_box_width; + target_box_height = + exp(box_var_h * target_box_data[idx * len + 3]) * prior_box_height; + target_box_center_x = + box_var_x * target_box_data[idx * len] * prior_box_width + + prior_box_center_x; + target_box_center_y = + box_var_y * target_box_data[idx * len + 1] * prior_box_height + + prior_box_center_y; output[idx * len] = target_box_center_x - target_box_width / 2; output[idx * len + 1] = target_box_center_y - target_box_height / 2; @@ -127,36 +137,64 @@ class BoxCoderCUDAKernel : public framework::OpKernel { auto* prior_box_var = context.Input("PriorBoxVar"); auto* target_box = context.Input("TargetBox"); auto* output_box = context.Output("OutputBox"); - + std::vector variance = context.Attr>("variance"); const T* prior_box_data = prior_box->data(); const T* target_box_data = target_box->data(); const T* prior_box_var_data = nullptr; - if (prior_box_var) prior_box_var_data = prior_box_var->data(); + auto prior_box_var_size = 0; + if (prior_box_var) { + PADDLE_ENFORCE(variance.empty(), + "Input 'PriorBoxVar' and attribute 'variance' should not" + "be used at the same time."); + prior_box_var_data = prior_box_var->data(); + prior_box_var_size = prior_box_var->dims().size(); + } + if (!(variance.empty())) { + PADDLE_ENFORCE(static_cast(variance.size()) == 4, + "Size of attribute 'variance' should be 4"); + } if (target_box->lod().size()) { PADDLE_ENFORCE_EQ(target_box->lod().size(), 1, "Only support 1 level of LoD."); } + const int var_size = static_cast(variance.size()); + + auto code_type = GetBoxCodeType(context.Attr("code_type")); + bool normalized = context.Attr("box_normalized"); + int axis = context.Attr("axis"); + auto row = target_box->dims()[0]; auto col = prior_box->dims()[0]; + if (code_type == BoxCodeType::kDecodeCenterSize) { + col = target_box->dims()[1]; + } auto len = prior_box->dims()[1]; int block = 512; int grid = (row * col + block - 1) / block; auto& device_ctx = context.cuda_device_context(); + auto& allocator = + platform::DeviceTemporaryAllocator::Instance().Get(device_ctx); + int bytes = var_size * sizeof(float); + auto dev_var = allocator.Allocate(bytes); + float* dev_var_data = reinterpret_cast(dev_var->ptr()); + auto cplace = platform::CPUPlace(); + const auto gplace = boost::get(context.GetPlace()); + memory::Copy(gplace, dev_var_data, cplace, &variance[0], bytes, + device_ctx.stream()); + output_box->mutable_data({row, col, len}, context.GetPlace()); T* output = output_box->data(); - auto code_type = GetBoxCodeType(context.Attr("code_type")); - bool normalized = context.Attr("box_normalized"); if (code_type == BoxCodeType::kEncodeCenterSize) { EncodeCenterSizeKernel<<>>( prior_box_data, prior_box_var_data, target_box_data, row, col, len, - normalized, output); + normalized, prior_box_var_size, dev_var_data, var_size, output); } else if (code_type == BoxCodeType::kDecodeCenterSize) { DecodeCenterSizeKernel<<>>( prior_box_data, prior_box_var_data, target_box_data, row, col, len, - normalized, output); + normalized, prior_box_var_size, dev_var_data, var_size, axis, output); } } }; diff --git a/paddle/fluid/operators/detection/box_coder_op.h b/paddle/fluid/operators/detection/box_coder_op.h index 5ed8520acddfa8fe2105a7c1615bcb3243cb130f..d4c7e8cf7723bf83d3cd8bf36b9ae6c5f1c35b10 100644 --- a/paddle/fluid/operators/detection/box_coder_op.h +++ b/paddle/fluid/operators/detection/box_coder_op.h @@ -11,6 +11,7 @@ limitations under the License. */ #pragma once #include +#include #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/operators/math/math_function.h" @@ -19,7 +20,7 @@ namespace operators { enum class BoxCodeType { kEncodeCenterSize = 0, kDecodeCenterSize = 1 }; -inline BoxCodeType GetBoxCodeType(const std::string& type) { +inline BoxCodeType GetBoxCodeType(const std::string &type) { if (type == "encode_center_size") { return BoxCodeType::kEncodeCenterSize; } else if (type == "decode_center_size") { @@ -31,29 +32,31 @@ inline BoxCodeType GetBoxCodeType(const std::string& type) { template class BoxCoderKernel : public framework::OpKernel { public: - void EncodeCenterSize(const framework::Tensor* target_box, - const framework::Tensor* prior_box, - const framework::Tensor* prior_box_var, - const bool normalized, T* output) const { + void EncodeCenterSize(const framework::Tensor *target_box, + const framework::Tensor *prior_box, + const framework::Tensor *prior_box_var, + const bool normalized, + const std::vector variance, T *output) const { int64_t row = target_box->dims()[0]; int64_t col = prior_box->dims()[0]; int64_t len = prior_box->dims()[1]; - auto* target_box_data = target_box->data(); - auto* prior_box_data = prior_box->data(); - const T* prior_box_var_data = nullptr; - if (prior_box_var) prior_box_var_data = prior_box_var->data(); +#ifdef PADDLE_WITH_MKLML +#pragma omp parallel for collapse(2) +#endif for (int64_t i = 0; i < row; ++i) { for (int64_t j = 0; j < col; ++j) { + auto *target_box_data = target_box->data(); + auto *prior_box_data = prior_box->data(); + size_t offset = i * col * len + j * len; T prior_box_width = prior_box_data[j * len + 2] - prior_box_data[j * len] + (normalized == false); T prior_box_height = prior_box_data[j * len + 3] - prior_box_data[j * len + 1] + (normalized == false); - T prior_box_center_x = - (prior_box_data[j * len + 2] + prior_box_data[j * len]) / 2; + T prior_box_center_x = prior_box_data[j * len] + prior_box_width / 2; T prior_box_center_y = - (prior_box_data[j * len + 3] + prior_box_data[j * len + 1]) / 2; + prior_box_data[j * len + 1] + prior_box_height / 2; T target_box_center_x = (target_box_data[i * len + 2] + target_box_data[i * len]) / 2; @@ -65,7 +68,6 @@ class BoxCoderKernel : public framework::OpKernel { target_box_data[i * len + 1] + (normalized == false); - size_t offset = i * col * len + j * len; output[offset] = (target_box_center_x - prior_box_center_x) / prior_box_width; output[offset + 1] = @@ -74,67 +76,96 @@ class BoxCoderKernel : public framework::OpKernel { std::log(std::fabs(target_box_width / prior_box_width)); output[offset + 3] = std::log(std::fabs(target_box_height / prior_box_height)); - if (prior_box_var) { - output[offset] /= prior_box_var_data[j * len]; - output[offset + 1] /= prior_box_var_data[j * len + 1]; - output[offset + 2] /= prior_box_var_data[j * len + 2]; - output[offset + 3] /= prior_box_var_data[j * len + 3]; + } + } + + if (prior_box_var) { + const T *prior_box_var_data = prior_box_var->data(); +#ifdef PADDLE_WITH_MKLML +#pragma omp parallel for collapse(3) +#endif + for (int64_t i = 0; i < row; ++i) { + for (int64_t j = 0; j < col; ++j) { + for (int k = 0; k < 4; ++k) { + size_t offset = i * col * len + j * len; + int prior_var_offset = j * len; + output[offset + k] /= prior_box_var_data[prior_var_offset + k]; + } + } + } + } else if (!(variance.empty())) { +#ifdef PADDLE_WITH_MKLML +#pragma omp parallel for collapse(3) +#endif + for (int64_t i = 0; i < row; ++i) { + for (int64_t j = 0; j < col; ++j) { + for (int k = 0; k < 4; ++k) { + size_t offset = i * col * len + j * len; + output[offset + k] /= static_cast(variance[k]); + } } } } } - void DecodeCenterSize(const framework::Tensor* target_box, - const framework::Tensor* prior_box, - const framework::Tensor* prior_box_var, - const bool normalized, T* output) const { - int64_t row = target_box->dims()[0]; - int64_t col = prior_box->dims()[0]; - int64_t len = prior_box->dims()[1]; - auto* target_box_data = target_box->data(); - auto* prior_box_data = prior_box->data(); - const T* prior_box_var_data = nullptr; - if (prior_box_var) prior_box_var_data = prior_box_var->data(); + template + void DecodeCenterSize(const framework::Tensor *target_box, + const framework::Tensor *prior_box, + const framework::Tensor *prior_box_var, + const bool normalized, std::vector variance, + T *output) const { + int64_t row = target_box->dims()[0]; + int64_t col = target_box->dims()[1]; + int64_t len = target_box->dims()[2]; +#ifdef PADDLE_WITH_MKLML +#pragma omp parallel for collapse(2) +#endif for (int64_t i = 0; i < row; ++i) { for (int64_t j = 0; j < col; ++j) { + auto *target_box_data = target_box->data(); + auto *prior_box_data = prior_box->data(); + + T var_data[4] = {1., 1., 1., 1.}; + T *var_ptr = var_data; size_t offset = i * col * len + j * len; - T prior_box_width = prior_box_data[j * len + 2] - - prior_box_data[j * len] + (normalized == false); - T prior_box_height = prior_box_data[j * len + 3] - - prior_box_data[j * len + 1] + + int prior_box_offset = axis == 0 ? j * len : i * len; + + T prior_box_width = prior_box_data[prior_box_offset + 2] - + prior_box_data[prior_box_offset] + + (normalized == false); + T prior_box_height = prior_box_data[prior_box_offset + 3] - + prior_box_data[prior_box_offset + 1] + (normalized == false); T prior_box_center_x = - (prior_box_data[j * len + 2] + prior_box_data[j * len]) / 2; + prior_box_data[prior_box_offset] + prior_box_width / 2; T prior_box_center_y = - (prior_box_data[j * len + 3] + prior_box_data[j * len + 1]) / 2; + prior_box_data[prior_box_offset + 1] + prior_box_height / 2; T target_box_center_x = 0, target_box_center_y = 0; T target_box_width = 0, target_box_height = 0; - if (prior_box_var) { - target_box_center_x = prior_box_var_data[j * len] * - target_box_data[offset] * prior_box_width + - prior_box_center_x; - target_box_center_y = prior_box_var_data[j * len + 1] * - target_box_data[offset + 1] * - prior_box_height + - prior_box_center_y; - target_box_width = std::exp(prior_box_var_data[j * len + 2] * - target_box_data[offset + 2]) * - prior_box_width; - target_box_height = std::exp(prior_box_var_data[j * len + 3] * - target_box_data[offset + 3]) * - prior_box_height; - } else { - target_box_center_x = - target_box_data[offset] * prior_box_width + prior_box_center_x; - target_box_center_y = target_box_data[offset + 1] * prior_box_height + - prior_box_center_y; - target_box_width = - std::exp(target_box_data[offset + 2]) * prior_box_width; - target_box_height = - std::exp(target_box_data[offset + 3]) * prior_box_height; + int prior_var_offset = axis == 0 ? j * len : i * len; + if (var_size == 2) { + std::memcpy(var_ptr, prior_box_var->data() + prior_var_offset, + 4 * sizeof(T)); + } else if (var_size == 1) { + var_ptr = reinterpret_cast(variance.data()); } + T box_var_x = *var_ptr; + T box_var_y = *(var_ptr + 1); + T box_var_w = *(var_ptr + 2); + T box_var_h = *(var_ptr + 3); + + target_box_center_x = + box_var_x * target_box_data[offset] * prior_box_width + + prior_box_center_x; + target_box_center_y = + box_var_y * target_box_data[offset + 1] * prior_box_height + + prior_box_center_y; + target_box_width = + std::exp(box_var_w * target_box_data[offset + 2]) * prior_box_width; + target_box_height = std::exp(box_var_h * target_box_data[offset + 3]) * + prior_box_height; output[offset] = target_box_center_x - target_box_width / 2; output[offset + 1] = target_box_center_y - target_box_height / 2; @@ -146,31 +177,68 @@ class BoxCoderKernel : public framework::OpKernel { } } - void Compute(const framework::ExecutionContext& context) const override { - auto* prior_box = context.Input("PriorBox"); - auto* prior_box_var = context.Input("PriorBoxVar"); - auto* target_box = context.Input("TargetBox"); - auto* output_box = context.Output("OutputBox"); - + void Compute(const framework::ExecutionContext &context) const override { + auto *prior_box = context.Input("PriorBox"); + auto *prior_box_var = context.Input("PriorBoxVar"); + auto *target_box = context.Input("TargetBox"); + auto *output_box = context.Output("OutputBox"); + std::vector variance = context.Attr>("variance"); + const int axis = context.Attr("axis"); if (target_box->lod().size()) { PADDLE_ENFORCE_EQ(target_box->lod().size(), 1UL, "Only support 1 level of LoD."); } + if (prior_box_var) { + PADDLE_ENFORCE(variance.empty(), + "Input 'PriorBoxVar' and attribute 'variance' should not" + "be used at the same time."); + } + if (!(variance.empty())) { + PADDLE_ENFORCE(static_cast(variance.size()) == 4, + "Size of attribute 'variance' should be 4"); + } + auto code_type = GetBoxCodeType(context.Attr("code_type")); + bool normalized = context.Attr("box_normalized"); + auto row = target_box->dims()[0]; auto col = prior_box->dims()[0]; + if (code_type == BoxCodeType::kDecodeCenterSize) { + col = target_box->dims()[1]; + } auto len = prior_box->dims()[1]; output_box->mutable_data({row, col, len}, context.GetPlace()); - auto code_type = GetBoxCodeType(context.Attr("code_type")); - bool normalized = context.Attr("box_normalized"); - T* output = output_box->data(); + T *output = output_box->data(); if (code_type == BoxCodeType::kEncodeCenterSize) { EncodeCenterSize(target_box, prior_box, prior_box_var, normalized, - output); + variance, output); } else if (code_type == BoxCodeType::kDecodeCenterSize) { - DecodeCenterSize(target_box, prior_box, prior_box_var, normalized, - output); + if (prior_box_var) { + if (axis == 0) { + DecodeCenterSize<0, 2>(target_box, prior_box, prior_box_var, + normalized, variance, output); + } else { + DecodeCenterSize<1, 2>(target_box, prior_box, prior_box_var, + normalized, variance, output); + } + } else if (!(variance.empty())) { + if (axis == 0) { + DecodeCenterSize<0, 1>(target_box, prior_box, prior_box_var, + normalized, variance, output); + } else { + DecodeCenterSize<1, 1>(target_box, prior_box, prior_box_var, + normalized, variance, output); + } + } else { + if (axis == 0) { + DecodeCenterSize<0, 0>(target_box, prior_box, prior_box_var, + normalized, variance, output); + } else { + DecodeCenterSize<1, 0>(target_box, prior_box, prior_box_var, + normalized, variance, output); + } + } } } }; diff --git a/paddle/fluid/operators/detection/box_decoder_and_assign_op.cc b/paddle/fluid/operators/detection/box_decoder_and_assign_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..945d575a6446429a0ec34a603356c2c99263a776 --- /dev/null +++ b/paddle/fluid/operators/detection/box_decoder_and_assign_op.cc @@ -0,0 +1,169 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/detection/box_decoder_and_assign_op.h" + +namespace paddle { +namespace operators { + +using LoDTensor = framework::LoDTensor; + +class BoxDecoderAndAssignOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + protected: + void InferShape(framework::InferShapeContext *ctx) const override { + PADDLE_ENFORCE( + ctx->HasInput("PriorBox"), + "Input(PriorBox) of BoxDecoderAndAssignOp should not be null."); + PADDLE_ENFORCE( + ctx->HasInput("PriorBoxVar"), + "Input(PriorBoxVar) of BoxDecoderAndAssignOp should not be null."); + PADDLE_ENFORCE( + ctx->HasInput("TargetBox"), + "Input(TargetBox) of BoxDecoderAndAssignOp should not be null."); + PADDLE_ENFORCE( + ctx->HasInput("BoxScore"), + "Input(BoxScore) of BoxDecoderAndAssignOp should not be null."); + PADDLE_ENFORCE( + ctx->HasOutput("DecodeBox"), + "Output(DecodeBox) of BoxDecoderAndAssignOp should not be null."); + PADDLE_ENFORCE( + ctx->HasOutput("OutputAssignBox"), + "Output(OutputAssignBox) of BoxDecoderAndAssignOp should not be null."); + + auto prior_box_dims = ctx->GetInputDim("PriorBox"); + auto prior_box_var_dims = ctx->GetInputDim("PriorBoxVar"); + auto target_box_dims = ctx->GetInputDim("TargetBox"); + auto box_score_dims = ctx->GetInputDim("BoxScore"); + + PADDLE_ENFORCE_EQ(prior_box_dims.size(), 2, + "The rank of Input of PriorBox must be 2"); + PADDLE_ENFORCE_EQ(prior_box_dims[1], 4, "The shape of PriorBox is [N, 4]"); + PADDLE_ENFORCE_EQ(prior_box_var_dims.size(), 1, + "The rank of Input of PriorBoxVar must be 1"); + PADDLE_ENFORCE_EQ(prior_box_var_dims[0], 4, + "The shape of PriorBoxVar is [4]"); + PADDLE_ENFORCE_EQ(target_box_dims.size(), 2, + "The rank of Input of TargetBox must be 2"); + PADDLE_ENFORCE_EQ(box_score_dims.size(), 2, + "The rank of Input of BoxScore must be 2"); + PADDLE_ENFORCE_EQ(prior_box_dims[0], target_box_dims[0], + "The first dim of prior_box and target_box is roi nums " + "and should be same!"); + PADDLE_ENFORCE_EQ(prior_box_dims[0], box_score_dims[0], + "The first dim of prior_box and box_score is roi nums " + "and should be same!"); + PADDLE_ENFORCE_EQ(target_box_dims[1], box_score_dims[1] * prior_box_dims[1], + "The shape of target_box is [N, classnum * 4], The shape " + "of box_score is [N, classnum], The shape of prior_box " + "is [N, 4]"); + + ctx->SetOutputDim("DecodeBox", framework::make_ddim({target_box_dims[0], + target_box_dims[1]})); + ctx->ShareLoD("TargetBox", /*->*/ "DecodeBox"); + ctx->SetOutputDim( + "OutputAssignBox", + framework::make_ddim({prior_box_dims[0], prior_box_dims[1]})); + ctx->ShareLoD("PriorBox", /*->*/ "OutputAssignBox"); + } +}; + +class BoxDecoderAndAssignOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput( + "PriorBox", + "(Tensor, default Tensor) " + "Box list PriorBox is a 2-D Tensor with shape [N, 4] which holds N " + "boxes and each box is represented as [xmin, ymin, xmax, ymax], " + "[xmin, ymin] is the left top coordinate of the anchor box, " + "if the input is image feature map, they are close to the origin " + "of the coordinate system. [xmax, ymax] is the right bottom " + "coordinate of the anchor box."); + AddInput("PriorBoxVar", + "(Tensor, default Tensor, optional) " + "PriorBoxVar is a 2-D Tensor with shape [N, 4] which holds N " + "group of variance. PriorBoxVar will set all elements to 1 by " + "default.") + .AsDispensable(); + AddInput("TargetBox", + "(LoDTensor or Tensor) " + "This input can be a 2-D LoDTensor with shape " + "[N, classnum*4]. It holds N targets for N boxes."); + AddInput("BoxScore", + "(LoDTensor or Tensor) " + "This input can be a 2-D LoDTensor with shape " + "[N, classnum], each box is represented as [classnum] which is " + "the classification probabilities."); + AddAttr("box_clip", + "(float, default 4.135, np.log(1000. / 16.)) " + "clip box to prevent overflowing") + .SetDefault(4.135f); + AddOutput("DecodeBox", + "(LoDTensor or Tensor) " + "the output tensor of op with shape [N, classnum * 4] " + "representing the result of N target boxes decoded with " + "M Prior boxes and variances for each class."); + AddOutput("OutputAssignBox", + "(LoDTensor or Tensor) " + "the output tensor of op with shape [N, 4] " + "representing the result of N target boxes decoded with " + "M Prior boxes and variances with the best non-background class " + "by BoxScore."); + AddComment(R"DOC( + +Bounding Box Coder. + +Decode the target bounding box with the prior_box information. + +The Decoding schema is described below: + + $$ + ox = (pw \\times pxv \\times tx + px) - \\frac{tw}{2} + $$ + $$ + oy = (ph \\times pyv \\times ty + py) - \\frac{th}{2} + $$ + $$ + ow = \\exp (pwv \\times tw) \\times pw + \\frac{tw}{2} + $$ + $$ + oh = \\exp (phv \\times th) \\times ph + \\frac{th}{2} + $$ + +where `tx`, `ty`, `tw`, `th` denote the target box's center coordinates, width +and height respectively. Similarly, `px`, `py`, `pw`, `ph` denote the +prior_box's (anchor) center coordinates, width and height. `pxv`, `pyv`, `pwv`, +`phv` denote the variance of the prior_box and `ox`, `oy`, `ow`, `oh` denote the +decoded coordinates, width and height in decode_box. + +decode_box is obtained after box decode, then assigning schema is described below: + +For each prior_box, use the best non-background class's decoded values to +update the prior_box locations and get output_assign_box. So, the shape of +output_assign_box is the same as PriorBox. +)DOC"); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(box_decoder_and_assign, ops::BoxDecoderAndAssignOp, + ops::BoxDecoderAndAssignOpMaker, + paddle::framework::EmptyGradOpMaker); +REGISTER_OP_CPU_KERNEL( + box_decoder_and_assign, + ops::BoxDecoderAndAssignKernel, + ops::BoxDecoderAndAssignKernel); diff --git a/paddle/fluid/operators/detection/box_decoder_and_assign_op.cu b/paddle/fluid/operators/detection/box_decoder_and_assign_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..25e6545eb59bde5e080dc907f9ecd4281062413f --- /dev/null +++ b/paddle/fluid/operators/detection/box_decoder_and_assign_op.cu @@ -0,0 +1,147 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/memory/memcpy.h" +#include "paddle/fluid/operators/detection/box_decoder_and_assign_op.h" +#include "paddle/fluid/platform/cuda_primitives.h" + +namespace paddle { +namespace operators { + +template +__global__ void DecodeBoxKernel(const T* prior_box_data, + const T* prior_box_var_data, + const T* target_box_data, const int roi_num, + const int class_num, const T box_clip, + T* output_box_data) { + const int idx = threadIdx.x + blockIdx.x * blockDim.x; + if (idx < roi_num * class_num) { + int i = idx / class_num; + int j = idx % class_num; + T prior_box_width = prior_box_data[i * 4 + 2] - prior_box_data[i * 4] + 1; + T prior_box_height = + prior_box_data[i * 4 + 3] - prior_box_data[i * 4 + 1] + 1; + T prior_box_center_x = prior_box_data[i * 4] + prior_box_width / 2; + T prior_box_center_y = prior_box_data[i * 4 + 1] + prior_box_height / 2; + + int offset = i * class_num * 4 + j * 4; + T dw = prior_box_var_data[2] * target_box_data[offset + 2]; + T dh = prior_box_var_data[3] * target_box_data[offset + 3]; + if (dw > box_clip) { + dw = box_clip; + } + if (dh > box_clip) { + dh = box_clip; + } + T target_box_center_x = 0, target_box_center_y = 0; + T target_box_width = 0, target_box_height = 0; + target_box_center_x = + prior_box_var_data[0] * target_box_data[offset] * prior_box_width + + prior_box_center_x; + target_box_center_y = + prior_box_var_data[1] * target_box_data[offset + 1] * prior_box_height + + prior_box_center_y; + target_box_width = expf(dw) * prior_box_width; + target_box_height = expf(dh) * prior_box_height; + + output_box_data[offset] = target_box_center_x - target_box_width / 2; + output_box_data[offset + 1] = target_box_center_y - target_box_height / 2; + output_box_data[offset + 2] = + target_box_center_x + target_box_width / 2 - 1; + output_box_data[offset + 3] = + target_box_center_y + target_box_height / 2 - 1; + } +} + +template +__global__ void AssignBoxKernel(const T* prior_box_data, + const T* box_score_data, T* output_box_data, + const int roi_num, const int class_num, + T* output_assign_box_data) { + const int idx = threadIdx.x + blockIdx.x * blockDim.x; + if (idx < roi_num) { + int i = idx; + T max_score = -1; + int max_j = -1; + for (int j = 0; j < class_num; ++j) { + T score = box_score_data[i * class_num + j]; + if (score > max_score && j > 0) { + max_score = score; + max_j = j; + } + } + if (max_j > 0) { + for (int pno = 0; pno < 4; pno++) { + output_assign_box_data[i * 4 + pno] = + output_box_data[i * class_num * 4 + max_j * 4 + pno]; + } + } else { + for (int pno = 0; pno < 4; pno++) { + output_assign_box_data[i * 4 + pno] = prior_box_data[i * 4 + pno]; + } + } + } +} + +template +class BoxDecoderAndAssignCUDAKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& context) const override { + PADDLE_ENFORCE(platform::is_gpu_place(context.GetPlace()), + "This kernel only runs on GPU device."); + auto* prior_box = context.Input("PriorBox"); + auto* prior_box_var = context.Input("PriorBoxVar"); + auto* target_box = context.Input("TargetBox"); + auto* box_score = context.Input("BoxScore"); + auto* output_box = context.Output("DecodeBox"); + auto* output_assign_box = + context.Output("OutputAssignBox"); + + auto roi_num = target_box->dims()[0]; + auto class_num = box_score->dims()[1]; + auto* target_box_data = target_box->data(); + auto* prior_box_data = prior_box->data(); + auto* prior_box_var_data = prior_box_var->data(); + auto* box_score_data = box_score->data(); + output_box->mutable_data({roi_num, class_num * 4}, context.GetPlace()); + output_assign_box->mutable_data({roi_num, 4}, context.GetPlace()); + T* output_box_data = output_box->data(); + T* output_assign_box_data = output_assign_box->data(); + + int block = 512; + int grid = (roi_num * class_num + block - 1) / block; + auto& device_ctx = context.cuda_device_context(); + + const T box_clip = context.Attr("box_clip"); + + DecodeBoxKernel<<>>( + prior_box_data, prior_box_var_data, target_box_data, roi_num, class_num, + box_clip, output_box_data); + + context.device_context().Wait(); + int assign_grid = (roi_num + block - 1) / block; + AssignBoxKernel<<>>( + prior_box_data, box_score_data, output_box_data, roi_num, class_num, + output_assign_box_data); + context.device_context().Wait(); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL( + box_decoder_and_assign, + ops::BoxDecoderAndAssignCUDAKernel, + ops::BoxDecoderAndAssignCUDAKernel); diff --git a/paddle/fluid/operators/detection/box_decoder_and_assign_op.h b/paddle/fluid/operators/detection/box_decoder_and_assign_op.h new file mode 100644 index 0000000000000000000000000000000000000000..e66a8351f4761fc805dbd2e44f237c751642d816 --- /dev/null +++ b/paddle/fluid/operators/detection/box_decoder_and_assign_op.h @@ -0,0 +1,103 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include +#include +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/math/math_function.h" + +namespace paddle { +namespace operators { + +template +class BoxDecoderAndAssignKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& context) const override { + auto* prior_box = context.Input("PriorBox"); + auto* prior_box_var = context.Input("PriorBoxVar"); + auto* target_box = context.Input("TargetBox"); + auto* box_score = context.Input("BoxScore"); + auto* output_box = context.Output("DecodeBox"); + auto* output_assign_box = + context.Output("OutputAssignBox"); + int roi_num = target_box->dims()[0]; + int class_num = box_score->dims()[1]; + auto* target_box_data = target_box->data(); + auto* prior_box_data = prior_box->data(); + auto* prior_box_var_data = prior_box_var->data(); + auto* box_score_data = box_score->data(); + output_box->mutable_data({roi_num, class_num * 4}, context.GetPlace()); + output_assign_box->mutable_data({roi_num, 4}, context.GetPlace()); + T* output_box_data = output_box->data(); + T* output_assign_box_data = output_assign_box->data(); + const T bbox_clip = context.Attr("box_clip"); + + for (int i = 0; i < roi_num; ++i) { + T prior_box_width = prior_box_data[i * 4 + 2] - prior_box_data[i * 4] + 1; + T prior_box_height = + prior_box_data[i * 4 + 3] - prior_box_data[i * 4 + 1] + 1; + T prior_box_center_x = prior_box_data[i * 4] + prior_box_width / 2; + T prior_box_center_y = prior_box_data[i * 4 + 1] + prior_box_height / 2; + for (int j = 0; j < class_num; ++j) { + int64_t offset = i * class_num * 4 + j * 4; + T dw = std::min(prior_box_var_data[2] * target_box_data[offset + 2], + bbox_clip); + T dh = std::min(prior_box_var_data[3] * target_box_data[offset + 3], + bbox_clip); + T target_box_center_x = 0, target_box_center_y = 0; + T target_box_width = 0, target_box_height = 0; + target_box_center_x = + prior_box_var_data[0] * target_box_data[offset] * prior_box_width + + prior_box_center_x; + target_box_center_y = prior_box_var_data[1] * + target_box_data[offset + 1] * + prior_box_height + + prior_box_center_y; + target_box_width = std::exp(dw) * prior_box_width; + target_box_height = std::exp(dh) * prior_box_height; + + output_box_data[offset] = target_box_center_x - target_box_width / 2; + output_box_data[offset + 1] = + target_box_center_y - target_box_height / 2; + output_box_data[offset + 2] = + target_box_center_x + target_box_width / 2 - 1; + output_box_data[offset + 3] = + target_box_center_y + target_box_height / 2 - 1; + } + + T max_score = -1; + int max_j = -1; + for (int j = 0; j < class_num; ++j) { + T score = box_score_data[i * class_num + j]; + if (score > max_score && j > 0) { + max_score = score; + max_j = j; + } + } + + if (max_j > 0) { + for (int pno = 0; pno < 4; pno++) { + output_assign_box_data[i * 4 + pno] = + output_box_data[i * class_num * 4 + max_j * 4 + pno]; + } + } else { + for (int pno = 0; pno < 4; pno++) { + output_assign_box_data[i * 4 + pno] = prior_box_data[i * 4 + pno]; + } + } + } + } +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/detection/density_prior_box_op.cc b/paddle/fluid/operators/detection/density_prior_box_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..cacd47ed4a80489c59cdd80747d69c70bd5ea286 --- /dev/null +++ b/paddle/fluid/operators/detection/density_prior_box_op.cc @@ -0,0 +1,180 @@ +/*Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/detection/density_prior_box_op.h" + +namespace paddle { +namespace operators { + +class DensityPriorBoxOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("Input"), + "Input(Input) of DensityPriorBoxOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput("Image"), + "Input(Image) of DensityPriorBoxOp should not be null."); + + auto image_dims = ctx->GetInputDim("Image"); + auto input_dims = ctx->GetInputDim("Input"); + PADDLE_ENFORCE(image_dims.size() == 4, "The layout of image is NCHW."); + PADDLE_ENFORCE(input_dims.size() == 4, "The layout of input is NCHW."); + + PADDLE_ENFORCE_LT(input_dims[2], image_dims[2], + "The height of input must smaller than image."); + + PADDLE_ENFORCE_LT(input_dims[3], image_dims[3], + "The width of input must smaller than image."); + auto variances = ctx->Attrs().Get>("variances"); + + auto fixed_sizes = ctx->Attrs().Get>("fixed_sizes"); + auto fixed_ratios = ctx->Attrs().Get>("fixed_ratios"); + auto densities = ctx->Attrs().Get>("densities"); + bool flatten = ctx->Attrs().Get("flatten_to_2d"); + + PADDLE_ENFORCE_EQ(fixed_sizes.size(), densities.size(), + "The number of fixed_sizes and densities must be equal."); + size_t num_priors = 0; + for (size_t i = 0; i < densities.size(); ++i) { + num_priors += (fixed_ratios.size()) * (pow(densities[i], 2)); + } + if (!flatten) { + std::vector dim_vec(4); + dim_vec[0] = input_dims[2]; + dim_vec[1] = input_dims[3]; + dim_vec[2] = num_priors; + dim_vec[3] = 4; + ctx->SetOutputDim("Boxes", framework::make_ddim(dim_vec)); + ctx->SetOutputDim("Variances", framework::make_ddim(dim_vec)); + } else { + int64_t dim0 = input_dims[2] * input_dims[3] * num_priors; + ctx->SetOutputDim("Boxes", {dim0, 4}); + ctx->SetOutputDim("Variances", {dim0, 4}); + } + } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + return framework::OpKernelType( + ctx.Input("Input")->type(), ctx.GetPlace()); + } +}; + +class DensityPriorBoxOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput( + "Input", + "(Tensor, default Tensor), " + "the input feature data of DensityPriorBoxOp, the layout is NCHW."); + AddInput("Image", + "(Tensor, default Tensor), " + "the input image data of DensityPriorBoxOp, the layout is NCHW."); + AddOutput("Boxes", + "(Tensor, default Tensor), the output prior boxes of " + "DensityPriorBoxOp. The layout is [H, W, num_priors, 4]. " + "H is the height of input, W is the width of input, num_priors " + "is the box count of each position."); + AddOutput("Variances", + "(Tensor, default Tensor), the expanded variances of " + "DensityPriorBoxOp. The layout is [H, W, num_priors, 4]. " + "H is the height of input, W is the width of input, num_priors " + "is the box count of each position."); + AddAttr>("variances", + "(vector) List of variances to be " + "encoded in density prior boxes.") + .AddCustomChecker([](const std::vector& variances) { + PADDLE_ENFORCE_EQ(variances.size(), 4, + "Must and only provide 4 variance."); + for (size_t i = 0; i < variances.size(); ++i) { + PADDLE_ENFORCE_GT(variances[i], 0.0, + "variance[%d] must be greater than 0.", i); + } + }); + AddAttr("clip", "(bool) Whether to clip out-of-boundary boxes.") + .SetDefault(true); + AddAttr("flatten_to_2d", + "(bool) Whether to flatten to 2D and " + "the second dim is 4.") + .SetDefault(false); + AddAttr( + "step_w", + "Density prior boxes step across width, 0.0 for auto calculation.") + .SetDefault(0.0) + .AddCustomChecker([](const float& step_w) { + PADDLE_ENFORCE_GE(step_w, 0.0, "step_w should be larger than 0."); + }); + AddAttr( + "step_h", + "Density prior boxes step across height, 0.0 for auto calculation.") + .SetDefault(0.0) + .AddCustomChecker([](const float& step_h) { + PADDLE_ENFORCE_GE(step_h, 0.0, "step_h should be larger than 0."); + }); + + AddAttr("offset", + "(float) " + "Density prior boxes center offset.") + .SetDefault(0.5); + AddAttr>("fixed_sizes", + "(vector) List of fixed sizes " + "of generated density prior boxes.") + .SetDefault(std::vector{}) + .AddCustomChecker([](const std::vector& fixed_sizes) { + for (size_t i = 0; i < fixed_sizes.size(); ++i) { + PADDLE_ENFORCE_GT(fixed_sizes[i], 0.0, + "fixed_sizes[%d] should be larger than 0.", i); + } + }); + + AddAttr>("fixed_ratios", + "(vector) List of fixed ratios " + "of generated density prior boxes.") + .SetDefault(std::vector{}) + .AddCustomChecker([](const std::vector& fixed_ratios) { + for (size_t i = 0; i < fixed_ratios.size(); ++i) { + PADDLE_ENFORCE_GT(fixed_ratios[i], 0.0, + "fixed_ratios[%d] should be larger than 0.", i); + } + }); + + AddAttr>("densities", + "(vector) List of densities " + "of generated density prior boxes.") + .SetDefault(std::vector{}) + .AddCustomChecker([](const std::vector& densities) { + for (size_t i = 0; i < densities.size(); ++i) { + PADDLE_ENFORCE_GT(densities[i], 0, + "densities[%d] should be larger than 0.", i); + } + }); + AddComment(R"DOC( + Density Prior box operator + Each position of the input produce N density prior boxes, N is determined by + the count of fixed_ratios, densities, the calculation of N is as follows: + for density in densities: + N += size(fixed_ratios)*density^2 + )DOC"); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(density_prior_box, ops::DensityPriorBoxOp, + ops::DensityPriorBoxOpMaker, + paddle::framework::EmptyGradOpMaker); + +REGISTER_OP_CPU_KERNEL(density_prior_box, ops::DensityPriorBoxOpKernel, + ops::DensityPriorBoxOpKernel); diff --git a/paddle/fluid/operators/detection/density_prior_box_op.cu b/paddle/fluid/operators/detection/density_prior_box_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..6337a4837a64cef2ce0e7bae70d8ba5b8994958e --- /dev/null +++ b/paddle/fluid/operators/detection/density_prior_box_op.cu @@ -0,0 +1,172 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/detection/density_prior_box_op.h" + +namespace paddle { +namespace operators { + +template +static __device__ inline T Clip(T in) { + return min(max(in, 0.), 1.); +} + +template +static __global__ void GenDensityPriorBox( + const int height, const int width, const int im_height, const int im_width, + const T offset, const T step_width, const T step_height, + const int num_priors, const T* ratios_shift, bool is_clip, const T var_xmin, + const T var_ymin, const T var_xmax, const T var_ymax, T* out, T* var) { + int gidx = blockIdx.x * blockDim.x + threadIdx.x; + int gidy = blockIdx.y * blockDim.y + threadIdx.y; + int step_x = blockDim.x * gridDim.x; + int step_y = blockDim.y * gridDim.y; + + const T* width_ratio = ratios_shift; + const T* height_ratio = ratios_shift + num_priors; + const T* width_shift = ratios_shift + 2 * num_priors; + const T* height_shift = ratios_shift + 3 * num_priors; + + for (int j = gidy; j < height; j += step_y) { + for (int i = gidx; i < width * num_priors; i += step_x) { + int h = j; + int w = i / num_priors; + int k = i % num_priors; + + T center_x = (w + offset) * step_width; + T center_y = (h + offset) * step_height; + + T center_x_temp = center_x + width_shift[k]; + T center_y_temp = center_y + height_shift[k]; + + T box_width_ratio = width_ratio[k] / 2.; + T box_height_ratio = height_ratio[k] / 2.; + + T xmin = max((center_x_temp - box_width_ratio) / im_width, 0.); + T ymin = max((center_y_temp - box_height_ratio) / im_height, 0.); + T xmax = min((center_x_temp + box_width_ratio) / im_width, 1.); + T ymax = min((center_y_temp + box_height_ratio) / im_height, 1.); + + int out_offset = (j * width * num_priors + i) * 4; + out[out_offset] = is_clip ? Clip(xmin) : xmin; + out[out_offset + 1] = is_clip ? Clip(ymin) : ymin; + out[out_offset + 2] = is_clip ? Clip(xmax) : xmax; + out[out_offset + 3] = is_clip ? Clip(ymax) : ymax; + + var[out_offset] = var_xmin; + var[out_offset + 1] = var_ymin; + var[out_offset + 2] = var_xmax; + var[out_offset + 3] = var_ymax; + } + } +} + +template +class DensityPriorBoxOpCUDAKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* input = ctx.Input("Input"); + auto* image = ctx.Input("Image"); + auto* boxes = ctx.Output("Boxes"); + auto* vars = ctx.Output("Variances"); + + auto variances = ctx.Attr>("variances"); + auto is_clip = ctx.Attr("clip"); + + auto fixed_sizes = ctx.Attr>("fixed_sizes"); + auto fixed_ratios = ctx.Attr>("fixed_ratios"); + auto densities = ctx.Attr>("densities"); + + T step_w = static_cast(ctx.Attr("step_w")); + T step_h = static_cast(ctx.Attr("step_h")); + T offset = static_cast(ctx.Attr("offset")); + + auto img_width = image->dims()[3]; + auto img_height = image->dims()[2]; + + auto feature_width = input->dims()[3]; + auto feature_height = input->dims()[2]; + + T step_width, step_height; + if (step_w == 0 || step_h == 0) { + step_width = static_cast(img_width) / feature_width; + step_height = static_cast(img_height) / feature_height; + } else { + step_width = step_w; + step_height = step_h; + } + + int num_priors = 0; + for (size_t i = 0; i < densities.size(); ++i) { + num_priors += (fixed_ratios.size()) * (pow(densities[i], 2)); + } + int step_average = static_cast((step_width + step_height) * 0.5); + + framework::Tensor h_temp; + T* tdata = h_temp.mutable_data({num_priors * 4}, platform::CPUPlace()); + int idx = 0; + for (size_t s = 0; s < fixed_sizes.size(); ++s) { + auto fixed_size = fixed_sizes[s]; + int density = densities[s]; + for (size_t r = 0; r < fixed_ratios.size(); ++r) { + float ar = fixed_ratios[r]; + int shift = step_average / density; + float box_width_ratio = fixed_size * sqrt(ar); + float box_height_ratio = fixed_size / sqrt(ar); + for (int di = 0; di < density; ++di) { + for (int dj = 0; dj < density; ++dj) { + float center_x_temp = shift / 2. + dj * shift - step_average / 2.; + float center_y_temp = shift / 2. + di * shift - step_average / 2.; + tdata[idx] = box_width_ratio; + tdata[num_priors + idx] = box_height_ratio; + tdata[2 * num_priors + idx] = center_x_temp; + tdata[3 * num_priors + idx] = center_y_temp; + idx++; + } + } + } + } + + boxes->mutable_data(ctx.GetPlace()); + vars->mutable_data(ctx.GetPlace()); + + framework::Tensor d_temp; + framework::TensorCopy(h_temp, ctx.GetPlace(), &d_temp); + + // At least use 32 threads, at most 512 threads. + // blockx is multiple of 32. + int blockx = std::min( + static_cast(((feature_width * num_priors + 31) >> 5) << 5), + static_cast(512L)); + int gridx = (feature_width * num_priors + blockx - 1) / blockx; + dim3 threads(blockx, 1); + dim3 grids(gridx, feature_height); + + auto stream = + ctx.template device_context().stream(); + GenDensityPriorBox<<>>( + feature_height, feature_width, img_height, img_width, offset, + step_width, step_height, num_priors, d_temp.data(), is_clip, + variances[0], variances[1], variances[2], variances[3], + boxes->data(), vars->data()); + } +}; // namespace operators + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL(density_prior_box, + ops::DensityPriorBoxOpCUDAKernel, + ops::DensityPriorBoxOpCUDAKernel); diff --git a/paddle/fluid/operators/detection/density_prior_box_op.h b/paddle/fluid/operators/detection/density_prior_box_op.h new file mode 100644 index 0000000000000000000000000000000000000000..42137215e21af1a529563ecc995a54d610120beb --- /dev/null +++ b/paddle/fluid/operators/detection/density_prior_box_op.h @@ -0,0 +1,154 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include +#include "paddle/fluid/operators/detection/prior_box_op.h" + +namespace paddle { +namespace operators { + +template +class DensityPriorBoxOpKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* input = ctx.Input("Input"); + auto* image = ctx.Input("Image"); + auto* boxes = ctx.Output("Boxes"); + auto* vars = ctx.Output("Variances"); + + auto variances = ctx.Attr>("variances"); + auto clip = ctx.Attr("clip"); + + auto fixed_sizes = ctx.Attr>("fixed_sizes"); + auto fixed_ratios = ctx.Attr>("fixed_ratios"); + auto densities = ctx.Attr>("densities"); + + T step_w = static_cast(ctx.Attr("step_w")); + T step_h = static_cast(ctx.Attr("step_h")); + T offset = static_cast(ctx.Attr("offset")); + + auto img_width = image->dims()[3]; + auto img_height = image->dims()[2]; + + auto feature_width = input->dims()[3]; + auto feature_height = input->dims()[2]; + + T step_width, step_height; + if (step_w == 0 || step_h == 0) { + step_width = static_cast(img_width) / feature_width; + step_height = static_cast(img_height) / feature_height; + } else { + step_width = step_w; + step_height = step_h; + } + int num_priors = 0; + +#ifdef PADDLE_WITH_MKLML +#pragma omp parallel for reduction(+ : num_priors) +#endif + for (size_t i = 0; i < densities.size(); ++i) { + num_priors += (fixed_ratios.size()) * (pow(densities[i], 2)); + } + + boxes->mutable_data(ctx.GetPlace()); + vars->mutable_data(ctx.GetPlace()); + + auto box_dim = vars->dims(); + boxes->Resize({feature_height, feature_width, num_priors, 4}); + auto e_boxes = framework::EigenTensor::From(*boxes).setConstant(0.0); + int step_average = static_cast((step_width + step_height) * 0.5); + + std::vector sqrt_fixed_ratios; +#ifdef PADDLE_WITH_MKLML +#pragma omp parallel for +#endif + for (size_t i = 0; i < fixed_ratios.size(); i++) { + sqrt_fixed_ratios.push_back(sqrt(fixed_ratios[i])); + } + +#ifdef PADDLE_WITH_MKLML +#pragma omp parallel for collapse(2) +#endif + for (int h = 0; h < feature_height; ++h) { + for (int w = 0; w < feature_width; ++w) { + T center_x = (w + offset) * step_width; + T center_y = (h + offset) * step_height; + int idx = 0; + // Generate density prior boxes with fixed sizes. + for (size_t s = 0; s < fixed_sizes.size(); ++s) { + auto fixed_size = fixed_sizes[s]; + int density = densities[s]; + int shift = step_average / density; + // Generate density prior boxes with fixed ratios. + for (size_t r = 0; r < fixed_ratios.size(); ++r) { + float box_width_ratio = fixed_size * sqrt_fixed_ratios[r]; + float box_height_ratio = fixed_size / sqrt_fixed_ratios[r]; + float density_center_x = center_x - step_average / 2. + shift / 2.; + float density_center_y = center_y - step_average / 2. + shift / 2.; + for (int di = 0; di < density; ++di) { + for (int dj = 0; dj < density; ++dj) { + float center_x_temp = density_center_x + dj * shift; + float center_y_temp = density_center_y + di * shift; + e_boxes(h, w, idx, 0) = std::max( + (center_x_temp - box_width_ratio / 2.) / img_width, 0.); + e_boxes(h, w, idx, 1) = std::max( + (center_y_temp - box_height_ratio / 2.) / img_height, 0.); + e_boxes(h, w, idx, 2) = std::min( + (center_x_temp + box_width_ratio / 2.) / img_width, 1.); + e_boxes(h, w, idx, 3) = std::min( + (center_y_temp + box_height_ratio / 2.) / img_height, 1.); + idx++; + } + } + } + } + } + } + if (clip) { + T* dt = boxes->data(); + std::transform(dt, dt + boxes->numel(), dt, [](T v) -> T { + return std::min(std::max(v, 0.), 1.); + }); + } + framework::Tensor var_t; + var_t.mutable_data( + framework::make_ddim({1, static_cast(variances.size())}), + ctx.GetPlace()); + + auto var_et = framework::EigenTensor::From(var_t); + + for (size_t i = 0; i < variances.size(); ++i) { + var_et(0, i) = variances[i]; + } + + int box_num = feature_height * feature_width * num_priors; + auto var_dim = vars->dims(); + vars->Resize({box_num, static_cast(variances.size())}); + + auto e_vars = framework::EigenMatrix::From(*vars); +#ifdef PADDLE_WITH_MKLML +#pragma omp parallel for collapse(2) +#endif + for (int i = 0; i < box_num; ++i) { + for (size_t j = 0; j < variances.size(); ++j) { + e_vars(i, j) = variances[j]; + } + } + + vars->Resize(var_dim); + boxes->Resize(box_dim); + } +}; // namespace operators + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/detection/distribute_fpn_proposals_op.cc b/paddle/fluid/operators/detection/distribute_fpn_proposals_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..6d36876efd747d9e6f90c0d0200a9e9610a5318c --- /dev/null +++ b/paddle/fluid/operators/detection/distribute_fpn_proposals_op.cc @@ -0,0 +1,93 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/detection/distribute_fpn_proposals_op.h" + +namespace paddle { +namespace operators { + +class DistributeFpnProposalsOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("FpnRois"), + "Input(FpnRois) shouldn't be null"); + PADDLE_ENFORCE_GE( + ctx->Outputs("MultiFpnRois").size(), 1UL, + "Outputs(MultiFpnRois) of DistributeOp should not be empty"); + size_t min_level = static_cast(ctx->Attrs().Get("min_level")); + size_t max_level = static_cast(ctx->Attrs().Get("max_level")); + PADDLE_ENFORCE_GE(max_level, min_level, + "max_level must not lower than min_level"); + // Set the output shape + size_t num_out_rois = max_level - min_level + 1; + std::vector outs_dims; + outs_dims.reserve(num_out_rois); + for (size_t i = 0; i < num_out_rois; ++i) { + framework::DDim out_dim = {-1, 4}; + outs_dims.push_back(out_dim); + } + ctx->SetOutputsDim("MultiFpnRois", outs_dims); + ctx->SetOutputDim("RestoreIndex", {1, -1}); + } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + auto data_type = framework::GetDataTypeOfVar(ctx.InputVar("FpnRois")); + return framework::OpKernelType(data_type, platform::CPUPlace()); + } +}; + +class DistributeFpnProposalsOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("FpnRois", "(LoDTensor) The rois at all levels in shape (-1, 4)"); + AddOutput("MultiFpnRois", "(LoDTensor) Output with distribute operator") + .AsDuplicable(); + AddOutput("RestoreIndex", + "(Tensor) An array of positive number which is " + "used to restore the order of FpnRois"); + AddAttr("min_level", + "The lowest level of FPN layer where the" + " proposals come from"); + AddAttr("max_level", + "The highest level of FPN layer where the" + " proposals come from"); + AddAttr("refer_level", + "The referring level of FPN layer with" + " specified scale"); + AddAttr("refer_scale", + "The referring scale of FPN layer with" + " specified level"); + AddComment(R"DOC( +This operator distribute all proposals into different fpn level, + with respect to scale of the proposals, the referring scale and + the referring level. Besides, to restore the order of proposals, +we return an array which indicate the original index of rois in + current proposals. +)DOC"); + } +}; +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(distribute_fpn_proposals, ops::DistributeFpnProposalsOp, + ops::DistributeFpnProposalsOpMaker, + paddle::framework::EmptyGradOpMaker); +REGISTER_OP_CPU_KERNEL(distribute_fpn_proposals, + ops::DistributeFpnProposalsOpKernel, + ops::DistributeFpnProposalsOpKernel); diff --git a/paddle/fluid/operators/detection/distribute_fpn_proposals_op.cu b/paddle/fluid/operators/detection/distribute_fpn_proposals_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..9cbb969158386547485fad54120510595eb92804 --- /dev/null +++ b/paddle/fluid/operators/detection/distribute_fpn_proposals_op.cu @@ -0,0 +1,221 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "cub/cub.cuh" +#include "paddle/fluid/memory/memcpy.h" +#include "paddle/fluid/operators/detection/distribute_fpn_proposals_op.h" +#include "paddle/fluid/operators/gather.cu.h" +#include "paddle/fluid/platform/cuda_primitives.h" +#include "paddle/fluid/platform/for_range.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; +using LoDTensor = framework::LoDTensor; + +static constexpr int kNumCUDAThreads = 512; +static constexpr int kNumMaxinumNumBlocks = 4096; + +#define CUDA_1D_KERNEL_LOOP(i, n) \ + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < (n); \ + i += blockDim.x * gridDim.x) + +int const BBoxSize = 4; + +struct RangeInitFunctor { + int start_; + int delta_; + int* out_; + __device__ void operator()(size_t i) { out_[i] = start_ + i * delta_; } +}; + +static inline int NumBlocks(const int N) { + return std::min((N + kNumCUDAThreads - 1) / kNumCUDAThreads, + kNumMaxinumNumBlocks); +} + +static inline void TransLoD(const int* length_lod, const int lod_size, + int* offset_lod) { + int offset = 0; + for (int i = 0; i < lod_size; ++i) { + offset_lod[i] = offset; + offset += length_lod[i]; + } +} + +template +static __device__ inline T RoIArea(const T* box, bool normalized) { + if (box[2] < box[0] || box[3] < box[1]) { + // If coordinate values are is invalid + // (e.g. xmax < xmin or ymax < ymin), return 0. + return static_cast(0.); + } else { + const T w = box[2] - box[0]; + const T h = box[3] - box[1]; + if (normalized) { + return w * h; + } else { + // If coordinate values are not within range [0, 1]. + return (w + 1) * (h + 1); + } + } +} + +template +static __global__ void GPUDistFpnProposalsHelper( + const int nthreads, const T* rois, const int lod_size, + const int refer_level, const int refer_scale, const int max_level, + const int min_level, int* roi_batch_id_data, int* sub_lod_list, + int* target_lvls) { + CUDA_1D_KERNEL_LOOP(i, nthreads) { + const T* offset_roi = rois + i * BBoxSize; + int roi_batch_ind = roi_batch_id_data[i]; + // get the target level of current rois + T roi_area = RoIArea(offset_roi, false); + T roi_scale = sqrt(roi_area); + int tgt_lvl = floor(log2(roi_scale / refer_scale) + refer_level); + tgt_lvl = min(max_level, max(tgt_lvl, min_level)); + target_lvls[i] = tgt_lvl; + // compute number of rois in the same batch and same target level + platform::CudaAtomicAdd(sub_lod_list + tgt_lvl * lod_size + roi_batch_ind, + 1); + } +} + +template +class GPUDistributeFpnProposalsOpKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* fpn_rois = ctx.Input("FpnRois"); + + auto multi_fpn_rois = ctx.MultiOutput("MultiFpnRois"); + auto* restore_index = ctx.Output("RestoreIndex"); + + const int min_level = ctx.Attr("min_level"); + const int max_level = ctx.Attr("max_level"); + const int refer_level = ctx.Attr("refer_level"); + const int refer_scale = ctx.Attr("refer_scale"); + int num_level = max_level - min_level + 1; + + // check that the fpn_rois is not empty + PADDLE_ENFORCE_EQ(fpn_rois->lod().size(), 1UL, + "DistributeFpnProposalsOp need 1 level of LoD"); + + auto fpn_rois_lod = fpn_rois->lod().back(); + int lod_size = fpn_rois_lod.size() - 1; + int roi_num = fpn_rois_lod[lod_size]; + + auto& dev_ctx = ctx.template device_context(); + + // get batch id by lod in CPU + Tensor roi_batch_id_list; + roi_batch_id_list.Resize({roi_num}); + int* roi_batch_id_data = + roi_batch_id_list.mutable_data(platform::CPUPlace()); + for (int n = 0; n < lod_size; ++n) { + for (size_t i = fpn_rois_lod[n]; i < fpn_rois_lod[n + 1]; ++i) { + roi_batch_id_data[i] = n; + } + } + // copy batch id list to GPU + Tensor roi_batch_id_list_gpu; + framework::TensorCopySync(roi_batch_id_list, dev_ctx.GetPlace(), + &roi_batch_id_list_gpu); + + Tensor sub_lod_list; + sub_lod_list.Resize({num_level, lod_size}); + int* sub_lod_list_data = sub_lod_list.mutable_data(dev_ctx.GetPlace()); + Tensor target_lvls; + target_lvls.Resize({roi_num}); + int* target_lvls_data = target_lvls.mutable_data(dev_ctx.GetPlace()); + + int blocks = NumBlocks(roi_num); + int threads = kNumCUDAThreads; + + // get target levels and sub_lod list + GPUDistFpnProposalsHelper<<>>( + roi_num, fpn_rois->data(), lod_size, refer_level, refer_scale, + max_level, min_level, roi_batch_id_list_gpu.data(), + sub_lod_list_data, target_lvls_data); + + Tensor index_in_t; + int* idx_in = index_in_t.mutable_data({roi_num}, dev_ctx.GetPlace()); + platform::ForRange for_range(dev_ctx, roi_num); + for_range(RangeInitFunctor{0, 1, idx_in}); + + Tensor keys_out_t; + int* keys_out = keys_out_t.mutable_data({roi_num}, dev_ctx.GetPlace()); + Tensor index_out_t; + int* idx_out = index_out_t.mutable_data({roi_num}, dev_ctx.GetPlace()); + + // Determine temporary device storage requirements + size_t temp_storage_bytes = 0; + cub::DeviceRadixSort::SortPairsDescending( + nullptr, temp_storage_bytes, target_lvls_data, keys_out, idx_in, + idx_out, roi_num); + // Allocate temporary storage + auto place = boost::get(dev_ctx.GetPlace()); + auto d_temp_storage = memory::Alloc(place, temp_storage_bytes, + memory::Allocator::kScratchpad); + + // Run sorting operation + // sort target level to get corresponding index + cub::DeviceRadixSort::SortPairsDescending( + d_temp_storage->ptr(), temp_storage_bytes, target_lvls_data, keys_out, + idx_in, idx_out, roi_num); + + int* restore_idx_data = + restore_index->mutable_data({roi_num, 1}, dev_ctx.GetPlace()); + // sort current index to get restore index + cub::DeviceRadixSort::SortPairsDescending( + d_temp_storage->ptr(), temp_storage_bytes, idx_out, keys_out, idx_in, + restore_idx_data, roi_num); + + Tensor offset_lod; + int* offset_lod_data = + offset_lod.mutable_data({lod_size + 1}, dev_ctx.GetPlace()); + for (int i = 0; i < num_level; ++i) { + Tensor sub_lod = sub_lod_list.Slice(i, i + 1); + int* sub_lod_data = sub_lod.data(); + // transfer length-based lod to offset-based lod + TransLoD(sub_lod_data, lod_size + 1, offset_lod_data); + int sub_rois_num = offset_lod_data[lod_size]; + Tensor sub_idx = index_out_t.Slice(0, sub_rois_num); + + multi_fpn_rois[i]->mutable_data({sub_rois_num, kBoxDim}, + dev_ctx.GetPlace()); + + GPUGather(dev_ctx, *fpn_rois, sub_idx, multi_fpn_rois[i]); + framework::LoD lod; + std::vector offset; + memory::Copy(platform::CPUPlace(), offset.data(), place, offset_lod_data, + sizeof(int) * (lod_size + 1), 0); + lod.emplace_back(offset); + multi_fpn_rois[i]->set_lod(lod); + } + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL( + distribute_fpn_proposals, + ops::GPUDistributeFpnProposalsOpKernel, + ops::GPUDistributeFpnProposalsOpKernel); diff --git a/paddle/fluid/operators/detection/distribute_fpn_proposals_op.h b/paddle/fluid/operators/detection/distribute_fpn_proposals_op.h new file mode 100644 index 0000000000000000000000000000000000000000..f63e856626d64ec13476c3f967a085624a007c3a --- /dev/null +++ b/paddle/fluid/operators/detection/distribute_fpn_proposals_op.h @@ -0,0 +1,147 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include +#include +#include +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/detail/safe_ref.h" +#include "paddle/fluid/operators/gather.h" +#include "paddle/fluid/operators/math/math_function.h" + +namespace paddle { +namespace operators { + +const int kBoxDim = 4; + +template +static inline T BBoxArea(const T* box, bool normalized) { + if (box[2] < box[0] || box[3] < box[1]) { + // If coordinate values are is invalid + // (e.g. xmax < xmin or ymax < ymin), return 0. + return static_cast(0.); + } else { + const T w = box[2] - box[0]; + const T h = box[3] - box[1]; + if (normalized) { + return w * h; + } else { + // If coordinate values are not within range [0, 1]. + return (w + 1) * (h + 1); + } + } +} + +template +class DistributeFpnProposalsOpKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& context) const override { + auto* fpn_rois = context.Input("FpnRois"); + + auto multi_fpn_rois = + context.MultiOutput("MultiFpnRois"); + + auto* restore_index = + context.Output("RestoreIndex"); + + const int min_level = context.Attr("min_level"); + const int max_level = context.Attr("max_level"); + const int refer_level = context.Attr("refer_level"); + const int refer_scale = context.Attr("refer_scale"); + const int num_level = max_level - min_level + 1; + + // check that the fpn_rois is not empty + PADDLE_ENFORCE_EQ(fpn_rois->lod().size(), 1UL, + "DistributeFpnProposalsOp need 1 level of LoD"); + + auto fpn_rois_lod = fpn_rois->lod().back(); + int fpn_rois_num = fpn_rois_lod[fpn_rois_lod.size() - 1]; + std::vector target_level; + // std::vector target_level(fpn_rois_num, -1); + // record the number of rois in each level + std::vector num_rois_level(num_level, 0); + std::vector num_rois_level_integral(num_level + 1, 0); + for (int i = 0; i < fpn_rois_lod.size() - 1; ++i) { + Tensor fpn_rois_slice = + fpn_rois->Slice(fpn_rois_lod[i], fpn_rois_lod[i + 1]); + const T* rois_data = fpn_rois_slice.data(); + for (int j = 0; j < fpn_rois_slice.dims()[0]; ++j) { + // get the target level of current rois + T roi_scale = std::sqrt(BBoxArea(rois_data, false)); + int tgt_lvl = + std::floor(std::log2(roi_scale / refer_scale) + refer_level); + tgt_lvl = std::min(max_level, std::max(tgt_lvl, min_level)); + target_level.push_back(tgt_lvl); + num_rois_level[tgt_lvl - min_level]++; + rois_data += kBoxDim; + } + } + // define the output rois + // pointer which point to each level fpn rois + std::vector multi_fpn_rois_data(num_level); + // lod0 which will record the offset information of each level rois + std::vector> multi_fpn_rois_lod0; + for (int i = 0; i < num_level; ++i) { + // allocate memory for each level rois + multi_fpn_rois[i]->mutable_data({num_rois_level[i], kBoxDim}, + context.GetPlace()); + multi_fpn_rois_data[i] = multi_fpn_rois[i]->data(); + std::vector lod0(1, 0); + multi_fpn_rois_lod0.push_back(lod0); + // statistic start point for each level rois + num_rois_level_integral[i + 1] = + num_rois_level_integral[i] + num_rois_level[i]; + } + restore_index->mutable_data({1, fpn_rois_num}, context.GetPlace()); + int* restore_index_data = restore_index->data(); + std::vector restore_index_inter(fpn_rois_num, -1); + // distribute the rois into different fpn level by target level + for (int i = 0; i < fpn_rois_lod.size() - 1; ++i) { + Tensor fpn_rois_slice = + fpn_rois->Slice(fpn_rois_lod[i], fpn_rois_lod[i + 1]); + const T* rois_data = fpn_rois_slice.data(); + size_t cur_offset = fpn_rois_lod[i]; + // std::vector lod_offset[num_level]; + for (int j = 0; j < num_level; j++) { + multi_fpn_rois_lod0[j].push_back(multi_fpn_rois_lod0[j][i]); + } + for (int j = 0; j < fpn_rois_slice.dims()[0]; ++j) { + int lvl = target_level[cur_offset + j]; + memcpy(multi_fpn_rois_data[lvl - min_level], rois_data, + kBoxDim * sizeof(T)); + multi_fpn_rois_data[lvl - min_level] += kBoxDim; + int index_in_shuffle = num_rois_level_integral[lvl - min_level] + + multi_fpn_rois_lod0[lvl - min_level][i + 1]; + restore_index_inter[index_in_shuffle] = cur_offset + j; + multi_fpn_rois_lod0[lvl - min_level][i + 1]++; + rois_data += kBoxDim; + } + } + for (int i = 0; i < fpn_rois_num; ++i) { + restore_index_data[restore_index_inter[i]] = i; + } + // merge lod information into LoDTensor + for (int i = 0; i < num_level; ++i) { + framework::LoD lod; + lod.emplace_back(multi_fpn_rois_lod0[i]); + multi_fpn_rois[i]->set_lod(lod); + } + } +}; +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/detection/generate_mask_labels_op.cc b/paddle/fluid/operators/detection/generate_mask_labels_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..46727c29de13c1213694540e6614a05f9008d232 --- /dev/null +++ b/paddle/fluid/operators/detection/generate_mask_labels_op.cc @@ -0,0 +1,437 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include +#include +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/detection/bbox_util.h" +#include "paddle/fluid/operators/detection/mask_util.h" +#include "paddle/fluid/operators/gather.h" +#include "paddle/fluid/operators/math/concat_and_split.h" +#include "paddle/fluid/operators/math/math_function.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; +using LoDTensor = framework::LoDTensor; +const int kBoxDim = 4; + +template +void AppendMask(LoDTensor* out, int64_t offset, Tensor* to_add) { + auto* out_data = out->data(); + auto* to_add_data = to_add->data(); + memcpy(out_data + offset, to_add_data, to_add->numel() * sizeof(T)); +} + +class GenerateMaskLabelsOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("ImInfo"), "Input(ImInfo) shouldn't be null."); + PADDLE_ENFORCE(ctx->HasInput("GtClasses"), + "Input(GtClasses) shouldn't be null."); + PADDLE_ENFORCE(ctx->HasInput("IsCrowd"), + "Input(IsCrowd) shouldn't be null."); + PADDLE_ENFORCE(ctx->HasInput("GtSegms"), + "Input(GtSegms) shouldn't be null."); + PADDLE_ENFORCE(ctx->HasInput("Rois"), "Input(Rois) shouldn't be null."); + PADDLE_ENFORCE(ctx->HasInput("LabelsInt32"), + "Input(LabelsInt32) shouldn't be null."); + + PADDLE_ENFORCE( + ctx->HasOutput("MaskRois"), + "Output(MaskRois) of GenerateMaskLabelsOp should not be null"); + PADDLE_ENFORCE( + ctx->HasOutput("RoiHasMaskInt32"), + "Output(RoiHasMaskInt32) of GenerateMaskLabelsOp should not be null"); + PADDLE_ENFORCE( + ctx->HasOutput("MaskInt32"), + "Output(MaskInt32) of GenerateMaskLabelsOp should not be null"); + + auto im_info_dims = ctx->GetInputDim("ImInfo"); + auto gt_segms_dims = ctx->GetInputDim("GtSegms"); + PADDLE_ENFORCE_EQ(im_info_dims.size(), 2, + "The rank of Input(ImInfo) must be 2."); + PADDLE_ENFORCE_EQ(gt_segms_dims.size(), 2, + "The rank of Input(GtSegms) must be 2."); + PADDLE_ENFORCE_EQ(gt_segms_dims[1], 2, + "The second dim of Input(GtSegms) must be 2."); + int num_classes = ctx->Attrs().Get("num_classes"); + int resolution = ctx->Attrs().Get("resolution"); + + ctx->SetOutputDim("MaskRois", {-1, 4}); + ctx->SetOutputDim("RoiHasMaskInt32", {-1, 1}); + ctx->SetOutputDim("MaskInt32", {-1, num_classes * resolution * resolution}); + } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + auto data_type = framework::GetDataTypeOfVar(ctx.InputVar("Rois")); + return framework::OpKernelType(data_type, platform::CPUPlace()); + } +}; + +/* + * Expand masks from shape (#masks, M ** 2) to (#masks, #classes * M ** 2) + * to encode class specific mask targets. + */ +template +static inline void ExpandMaskTarget(const platform::CPUDeviceContext& ctx, + const Tensor& masks, + const Tensor& mask_class_labels, + const int resolution, const int num_classes, + Tensor* mask_targets) { + const uint8_t* masks_data = masks.data(); + int64_t num_mask = masks.dims()[0]; + const int* mask_class_labels_data = mask_class_labels.data(); + const int M = resolution * resolution; + const int mask_dim = M * num_classes; + + int* mask_targets_data = + mask_targets->mutable_data({num_mask, mask_dim}, ctx.GetPlace()); + math::set_constant(ctx, mask_targets, -1); + for (int64_t mask_id = 0; mask_id < num_mask; ++mask_id) { + int cls = mask_class_labels_data[mask_id]; + int start = M * cls; + if (cls > 0) { + for (int i = 0; i < M; ++i) { + mask_targets_data[mask_id * mask_dim + start + i] = + static_cast(masks_data[mask_id * M + i]); + } + } + } +} + +template +std::vector SampleMaskForOneImage( + const platform::CPUDeviceContext& ctx, const Tensor& im_info, + const Tensor& gt_classes, const Tensor& is_crowd, const Tensor& gt_segms, + const Tensor& rois, const Tensor& label_int32, const int num_classes, + const int resolution, const framework::LoD& segm_length) { + // Prepare the mask targets by associating one gt mask to each training roi + // that has a fg (non-bg) class label. + const int64_t gt_size = static_cast(gt_classes.dims()[0]); + const int64_t roi_size = static_cast(rois.dims()[0]); + const int* gt_classes_data = gt_classes.data(); + const int* is_crowd_data = is_crowd.data(); + const int* label_int32_data = label_int32.data(); + PADDLE_ENFORCE_EQ(roi_size, label_int32.dims()[0]); + + std::vector mask_gt_inds, fg_inds; + std::vector>> gt_polys; + + auto polys_num = segm_length[1]; + auto segm_lod_offset = framework::ConvertToOffsetBasedLoD(segm_length); + auto lod1 = segm_lod_offset[1]; + auto lod2 = segm_lod_offset[2]; + const T* polys_data = gt_segms.data(); + for (int64_t i = 0; i < gt_size; ++i) { + if ((gt_classes_data[i] > 0) && (is_crowd_data[i] == 0)) { + mask_gt_inds.emplace_back(i); + + // slice fg segmentation polys + int poly_num = polys_num[i]; + std::vector> polys; + int s_idx = lod1[i]; + for (int j = 0; j < poly_num; ++j) { + int s = lod2[s_idx + j]; + int e = lod2[s_idx + j + 1]; + PADDLE_ENFORCE_NE(s, e); + std::vector plts(polys_data + s * 2, polys_data + e * 2); + polys.push_back(plts); + } + gt_polys.push_back(polys); + } + } + for (int64_t i = 0; i < roi_size; ++i) { + if (label_int32_data[i] > 0) { + fg_inds.emplace_back(i); + } + } + int gt_num = mask_gt_inds.size(); + int fg_num = fg_inds.size(); + + Tensor boxes_from_polys; + boxes_from_polys.mutable_data({gt_num, 4}, platform::CPUPlace()); + Poly2Boxes(gt_polys, boxes_from_polys.data()); + + std::vector roi_has_mask = + std::vector(fg_inds.begin(), fg_inds.end()); + Tensor mask_class_labels; + Tensor masks; + Tensor rois_fg; + + auto im_scale = im_info.data()[2]; + if (fg_num > 0) { + // Class labels for the foreground rois + mask_class_labels.mutable_data({fg_num, 1}, ctx.GetPlace()); + Gather(label_int32_data, 1, fg_inds.data(), fg_inds.size(), + mask_class_labels.data()); + + uint8_t* masks_data = masks.mutable_data( + {fg_num, resolution * resolution}, ctx.GetPlace()); + + // Find overlap between all foreground rois and the bounding boxes + // enclosing each segmentation + T* rois_fg_data = rois_fg.mutable_data({fg_num, 4}, ctx.GetPlace()); + Gather(rois.data(), 4, fg_inds.data(), fg_inds.size(), + rois_fg.data()); + + for (int k = 0; k < rois_fg.numel(); ++k) { + rois_fg_data[k] = rois_fg_data[k] / im_scale; + } + + Tensor overlaps_bbfg_bbpolys; + overlaps_bbfg_bbpolys.mutable_data({fg_num, gt_num}, ctx.GetPlace()); + BboxOverlaps(rois_fg, boxes_from_polys, &overlaps_bbfg_bbpolys); + + // Map from each fg rois to the index of the mask with highest overlap + // (measured by bbox overlap) + T* overlaps_bbfg_bbpolys_data = overlaps_bbfg_bbpolys.data(); + std::vector fg_masks_inds; + for (int64_t i = 0; i < fg_num; ++i) { + const T* v = overlaps_bbfg_bbpolys_data + i * gt_num; + T max_overlap = std::numeric_limits::min(); + int id = 0; + for (int64_t j = 0; j < gt_num; ++j) { + if (v[j] > max_overlap) { + max_overlap = v[j]; + id = j; + } + } + fg_masks_inds.push_back(id); + } + + // add fg targets + for (int64_t i = 0; i < fg_num; ++i) { + int fg_polys_ind = fg_masks_inds[i]; + T* roi_fg = rois_fg_data + i * 4; + uint8_t* mask = masks_data + i * resolution * resolution; + Polys2MaskWrtBox(gt_polys[fg_polys_ind], roi_fg, resolution, mask); + } + } else { + // The network cannot handle empty blobs, so we must provide a mask + // We simply take the first bg roi, given it an all -1's mask (ignore + // label), and label it with class zero (bg). + int bg_num = 1; + T* rois_fg_data = rois_fg.mutable_data({bg_num, 4}, ctx.GetPlace()); + const T* rois_data = rois.data(); + std::vector bg_inds; + for (int64_t i = 0; i < roi_size; ++i) { + if (label_int32_data[i] == 0) { + bg_inds.emplace_back(i); + rois_fg_data[0] = rois_data[0] / im_scale; + rois_fg_data[1] = rois_data[1] / im_scale; + rois_fg_data[2] = rois_data[2] / im_scale; + rois_fg_data[3] = rois_data[3] / im_scale; + break; + } + } + masks.mutable_data({bg_num, resolution * resolution}, + ctx.GetPlace()); + math::set_constant(ctx, &masks, -1); + int* mask_class_labels_data = + mask_class_labels.mutable_data({bg_num, 1}, ctx.GetPlace()); + mask_class_labels_data[0] = 0; + roi_has_mask = std::vector(bg_inds.begin(), bg_inds.end()); + } + + Tensor masks_expand; + ExpandMaskTarget(ctx, masks, mask_class_labels, resolution, num_classes, + &masks_expand); + + T* rois_fg_data = rois_fg.data(); + for (int k = 0; k < rois_fg.numel(); ++k) { + rois_fg_data[k] = rois_fg_data[k] * im_scale; + } + + Tensor roi_has_mask_t; + int roi_has_mask_size = roi_has_mask.size(); + int* roi_has_mask_data = + roi_has_mask_t.mutable_data({roi_has_mask_size, 1}, ctx.GetPlace()); + std::copy(roi_has_mask.begin(), roi_has_mask.end(), roi_has_mask_data); + + std::vector res; + res.emplace_back(rois_fg); + res.emplace_back(roi_has_mask_t); + res.emplace_back(masks_expand); + return res; +} + +template +class GenerateMaskLabelsKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* im_info = ctx.Input("ImInfo"); + auto* gt_classes = ctx.Input("GtClasses"); + auto* is_crowd = ctx.Input("IsCrowd"); + auto* gt_segms = ctx.Input("GtSegms"); + auto* rois = ctx.Input("Rois"); + auto* label_int32 = ctx.Input("LabelsInt32"); + + auto* mask_rois = ctx.Output("MaskRois"); + auto* roi_has_mask_int32 = ctx.Output("RoiHasMaskInt32"); + auto* mask_int32 = ctx.Output("MaskInt32"); + + int num_classes = ctx.Attr("num_classes"); + int resolution = ctx.Attr("resolution"); + + PADDLE_ENFORCE_EQ(gt_classes->lod().size(), 1UL, + "GenerateMaskLabelsOp gt_classes needs 1 level of LoD"); + PADDLE_ENFORCE_EQ(is_crowd->lod().size(), 1UL, + "GenerateMaskLabelsOp is_crowd needs 1 level of LoD"); + PADDLE_ENFORCE_EQ(rois->lod().size(), 1UL, + "GenerateMaskLabelsOp rois needs 1 level of LoD"); + PADDLE_ENFORCE_EQ(label_int32->lod().size(), 1UL, + "GenerateMaskLabelsOp label_int32 needs 1 level of LoD"); + + PADDLE_ENFORCE_EQ(gt_segms->lod().size(), 3UL); + + int64_t n = static_cast(gt_classes->lod().back().size() - 1); + PADDLE_ENFORCE_EQ(gt_segms->lod()[0].size() - 1, n); + + int mask_dim = num_classes * resolution * resolution; + + mask_rois->mutable_data({rois->numel(), kBoxDim}, ctx.GetPlace()); + roi_has_mask_int32->mutable_data({rois->numel(), 1}, ctx.GetPlace()); + mask_int32->mutable_data({rois->numel(), mask_dim}, ctx.GetPlace()); + + framework::LoD lod; + std::vector lod0(1, 0); + + int64_t num_mask = 0; + auto& dev_ctx = ctx.device_context(); + + auto gt_classes_lod = gt_classes->lod().back(); + auto is_crowd_lod = is_crowd->lod().back(); + auto rois_lod = rois->lod().back(); + auto label_int32_lod = label_int32->lod().back(); + auto gt_segms_lod = gt_segms->lod(); + + for (int i = 0; i < n; ++i) { + Tensor im_info_slice = im_info->Slice(i, i + 1); + Tensor gt_classes_slice = + gt_classes->Slice(gt_classes_lod[i], gt_classes_lod[i + 1]); + Tensor is_crowd_slice = + is_crowd->Slice(is_crowd_lod[i], is_crowd_lod[i + 1]); + Tensor label_int32_slice = + label_int32->Slice(label_int32_lod[i], label_int32_lod[i + 1]); + Tensor rois_slice = rois->Slice(rois_lod[i], rois_lod[i + 1]); + + auto sub_lod_and_offset = + framework::GetSubLoDAndAbsoluteOffset(gt_segms_lod, i, i + 1, 0); + auto lod_length = sub_lod_and_offset.first; + size_t s = sub_lod_and_offset.second.first; + size_t e = sub_lod_and_offset.second.second; + Tensor gt_segms_slice = gt_segms->Slice(s, e); + + std::vector tensor_output = SampleMaskForOneImage( + dev_ctx, im_info_slice, gt_classes_slice, is_crowd_slice, + gt_segms_slice, rois_slice, label_int32_slice, num_classes, + resolution, lod_length); + + Tensor sampled_mask_rois = tensor_output[0]; + Tensor sampled_roi_has_mask_int32 = tensor_output[1]; + Tensor sampled_mask_int32 = tensor_output[2]; + + AppendMask(mask_rois, kBoxDim * num_mask, &sampled_mask_rois); + AppendMask(roi_has_mask_int32, num_mask, + &sampled_roi_has_mask_int32); + AppendMask(mask_int32, mask_dim * num_mask, &sampled_mask_int32); + + num_mask += sampled_mask_rois.dims()[0]; + lod0.emplace_back(num_mask); + } + + lod.emplace_back(lod0); + mask_rois->set_lod(lod); + roi_has_mask_int32->set_lod(lod); + mask_int32->set_lod(lod); + mask_rois->Resize({num_mask, kBoxDim}); + roi_has_mask_int32->Resize({num_mask, 1}); + mask_int32->Resize({num_mask, mask_dim}); + } +}; + +class GenerateMaskLabelsOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("ImInfo", + "(Tensor), This input is a 2D Tensor with shape [B, 3]. " + "B is the number of input images, " + "each element consists of im_height, im_width, im_scale."); + AddInput("GtClasses", + "(LoDTensor), This input is a 2D LoDTensor with shape [M, 1]. " + "M is the number of groundtruth, " + "each element is a class label of groundtruth."); + AddInput( + "IsCrowd", + "(LoDTensor), This input is a 2D LoDTensor with shape [M, 1]. " + "M is the number of groundtruth, " + "each element is a flag indicates whether a groundtruth is crowd."); + AddInput( + "GtSegms", + "(LoDTensor), This input is a 2D LoDTensor with shape [S, 2], it's LoD " + "level is 3. The LoD[0] represents the gt objects number of each " + "instance. LoD[1] represents the segmentation counts of each objects. " + "LoD[2] represents the polygons number of each segmentation. S the " + "total number of polygons coordinate points. Each element is (x, y) " + "coordinate points."); + AddInput( + "Rois", + "(LoDTensor), This input is a 2D LoDTensor with shape [R, 4]. " + "R is the number of rois which is the output of " + "generate_proposal_labels, " + "each element is a bounding box with (xmin, ymin, xmax, ymax) format."); + AddInput("LabelsInt32", + "(LoDTensor), This intput is a 2D LoDTensor with shape [R, 1], " + "each element repersents a class label of a roi"); + AddOutput( + "MaskRois", + "(LoDTensor), This output is a 2D LoDTensor with shape [P, 4]. " + "P is the number of mask, " + "each element is a bounding box with [xmin, ymin, xmax, ymax] format."); + AddOutput("RoiHasMaskInt32", + "(LoDTensor), This output is a 2D LoDTensor with shape [P, 1], " + "each element repersents the output mask rois index with regard " + "to input rois"); + AddOutput("MaskInt32", + "(LoDTensor), This output is a 4D LoDTensor with shape [P, Q], " + "Q equal to num_classes * resolution * resolution"); + + AddAttr("num_classes", "Class number."); + AddAttr("resolution", "Resolution of mask."); + + AddComment(R"DOC( +This operator can be, for given the RoIs and corresponding labels, +to sample foreground RoIs. This mask branch also has +a :math: `K \\times M^{2}` dimensional output targets for each foreground +RoI, which encodes K binary masks of resolution M x M, one for each of the +K classes. This mask targets are used to compute loss of mask branch. + )DOC"); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(generate_mask_labels, ops::GenerateMaskLabelsOp, + ops::GenerateMaskLabelsOpMaker, + paddle::framework::EmptyGradOpMaker); +REGISTER_OP_CPU_KERNEL(generate_mask_labels, + ops::GenerateMaskLabelsKernel); diff --git a/paddle/fluid/operators/detection/generate_proposal_labels_op.cc b/paddle/fluid/operators/detection/generate_proposal_labels_op.cc index d7a53f1bef98ecda3ba7b36323678a11a632a15c..5b2e571baf390bfa9b4bdfa6e0f151102de709fc 100644 --- a/paddle/fluid/operators/detection/generate_proposal_labels_op.cc +++ b/paddle/fluid/operators/detection/generate_proposal_labels_op.cc @@ -16,7 +16,7 @@ limitations under the License. */ #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/operators/detection/bbox_util.h" #include "paddle/fluid/operators/gather.h" -#include "paddle/fluid/operators/math/concat.h" +#include "paddle/fluid/operators/math/concat_and_split.h" #include "paddle/fluid/operators/math/math_function.h" namespace paddle { @@ -48,24 +48,23 @@ class GenerateProposalLabelsOp : public framework::OperatorWithKernel { "Input(GtBoxes) shouldn't be null."); PADDLE_ENFORCE(ctx->HasInput("ImInfo"), "Input(ImInfo) shouldn't be null."); - PADDLE_ENFORCE(ctx->HasOutput("Rois"), - "Output(Rois) of RpnTargetAssignOp should not be null"); + PADDLE_ENFORCE( + ctx->HasOutput("Rois"), + "Output(Rois) of GenerateProposalLabelsOp should not be null"); PADDLE_ENFORCE( ctx->HasOutput("LabelsInt32"), - "Output(LabelsInt32) of RpnTargetAssignOp should not be null"); + "Output(LabelsInt32) of GenerateProposalLabelsOp should not be null"); PADDLE_ENFORCE( ctx->HasOutput("BboxTargets"), - "Output(BboxTargets) of RpnTargetAssignOp should not be null"); - PADDLE_ENFORCE( - ctx->HasOutput("BboxInsideWeights"), - "Output(BboxInsideWeights) of RpnTargetAssignOp should not be null"); - PADDLE_ENFORCE( - ctx->HasOutput("BboxOutsideWeights"), - "Output(BboxOutsideWeights) of RpnTargetAssignOp should not be null"); + "Output(BboxTargets) of GenerateProposalLabelsOp should not be null"); + PADDLE_ENFORCE(ctx->HasOutput("BboxInsideWeights"), + "Output(BboxInsideWeights) of GenerateProposalLabelsOp " + "should not be null"); + PADDLE_ENFORCE(ctx->HasOutput("BboxOutsideWeights"), + "Output(BboxOutsideWeights) of GenerateProposalLabelsOp " + "should not be null"); auto rpn_rois_dims = ctx->GetInputDim("RpnRois"); - auto gt_classes_dims = ctx->GetInputDim("GtClasses"); - auto is_crowd_dims = ctx->GetInputDim("IsCrowd"); auto gt_boxes_dims = ctx->GetInputDim("GtBoxes"); auto im_info_dims = ctx->GetInputDim("ImInfo"); @@ -227,30 +226,36 @@ void GatherBoxesLabels(const platform::CPUDeviceContext& context, template std::vector SampleRoisForOneImage( - const platform::CPUDeviceContext& context, Tensor* rpn_rois, - Tensor* gt_classes, Tensor* is_crowd, Tensor* gt_boxes, Tensor* im_info, - const int batch_size_per_im, const float fg_fraction, const float fg_thresh, - const float bg_thresh_hi, const float bg_thresh_lo, + const platform::CPUDeviceContext& context, const Tensor& rpn_rois_in, + const Tensor& gt_classes, const Tensor& is_crowd, const Tensor& gt_boxes, + const Tensor& im_info, const int batch_size_per_im, const float fg_fraction, + const float fg_thresh, const float bg_thresh_hi, const float bg_thresh_lo, const std::vector& bbox_reg_weights, const int class_nums, std::minstd_rand engine, bool use_random) { - auto rpn_rois_et = framework::EigenTensor::From(*rpn_rois); - auto im_scale = im_info->data()[2]; - rpn_rois_et = rpn_rois_et / im_scale; + auto im_scale = im_info.data()[2]; + + Tensor rpn_rois; + rpn_rois.mutable_data(rpn_rois_in.dims(), context.GetPlace()); + T* rpn_rois_dt = rpn_rois.data(); + const T* rpn_rois_in_dt = rpn_rois_in.data(); + for (int i = 0; i < rpn_rois.numel(); ++i) { + rpn_rois_dt[i] = rpn_rois_in_dt[i] / im_scale; + } Tensor boxes; - int proposals_num = gt_boxes->dims()[0] + rpn_rois->dims()[0]; + int proposals_num = gt_boxes.dims()[0] + rpn_rois.dims()[0]; boxes.mutable_data({proposals_num, kBoxDim}, context.GetPlace()); - Concat(context, *gt_boxes, *rpn_rois, &boxes); + Concat(context, gt_boxes, rpn_rois, &boxes); // Overlaps Tensor proposal_to_gt_overlaps; - proposal_to_gt_overlaps.mutable_data({proposals_num, gt_boxes->dims()[0]}, + proposal_to_gt_overlaps.mutable_data({proposals_num, gt_boxes.dims()[0]}, context.GetPlace()); - BboxOverlaps(boxes, *gt_boxes, &proposal_to_gt_overlaps); + BboxOverlaps(boxes, gt_boxes, &proposal_to_gt_overlaps); // Generate proposal index std::vector> fg_bg_gt = SampleFgBgGt( - context, &proposal_to_gt_overlaps, *is_crowd, batch_size_per_im, + context, &proposal_to_gt_overlaps, is_crowd, batch_size_per_im, fg_fraction, fg_thresh, bg_thresh_hi, bg_thresh_lo, engine, use_random); std::vector fg_inds = fg_bg_gt[0]; std::vector bg_inds = fg_bg_gt[1]; @@ -265,7 +270,7 @@ std::vector SampleRoisForOneImage( sampled_boxes.mutable_data(bbox_dim, context.GetPlace()); sampled_labels.mutable_data({boxes_num}, context.GetPlace()); sampled_gts.mutable_data({fg_num, kBoxDim}, context.GetPlace()); - GatherBoxesLabels(context, boxes, *gt_boxes, *gt_classes, fg_inds, bg_inds, + GatherBoxesLabels(context, boxes, gt_boxes, gt_classes, fg_inds, bg_inds, gt_inds, &sampled_boxes, &sampled_labels, &sampled_gts); // Compute targets @@ -399,8 +404,8 @@ class GenerateProposalLabelsKernel : public framework::OpKernel { gt_boxes->Slice(gt_boxes_lod[i], gt_boxes_lod[i + 1]); Tensor im_info_slice = im_info->Slice(i, i + 1); std::vector tensor_output = SampleRoisForOneImage( - dev_ctx, &rpn_rois_slice, >_classes_slice, &is_crowd_slice, - >_boxes_slice, &im_info_slice, batch_size_per_im, fg_fraction, + dev_ctx, rpn_rois_slice, gt_classes_slice, is_crowd_slice, + gt_boxes_slice, im_info_slice, batch_size_per_im, fg_fraction, fg_thresh, bg_thresh_hi, bg_thresh_lo, bbox_reg_weights, class_nums, engine, use_random); Tensor sampled_rois = tensor_output[0]; @@ -439,31 +444,88 @@ class GenerateProposalLabelsKernel : public framework::OpKernel { class GenerateProposalLabelsOpMaker : public framework::OpProtoAndCheckerMaker { public: void Make() override { - // TODO(buxingyuan): Add Document - AddInput("RpnRois", "RpnRois."); - AddInput("GtClasses", "GtClasses."); - AddInput("IsCrowd", "IsCrowd."); - AddInput("GtBoxes", "GtBoxes."); - AddInput("ImInfo", "ImInfo."); - - AddOutput("Rois", "Rois."); - AddOutput("LabelsInt32", "LabelsInt32."); - AddOutput("BboxTargets", "BboxTargets."); - AddOutput("BboxInsideWeights", "BboxInsideWeights."); - AddOutput("BboxOutsideWeights", "BboxOutsideWeights."); - - AddAttr("batch_size_per_im", "batch_size_per_im"); - AddAttr("fg_fraction", "fg_fraction"); - AddAttr("fg_thresh", "fg_thresh"); - AddAttr("bg_thresh_hi", "bg_thresh_hi"); - AddAttr("bg_thresh_lo", "bg_thresh_lo"); - AddAttr>("bbox_reg_weights", "bbox_reg_weights"); - AddAttr("class_nums", "class_nums"); - AddAttr("use_random", "use_random").SetDefault(true); + AddInput( + "RpnRois", + "(LoDTensor), This input is a 2D LoDTensor with shape [N, 4]. " + "N is the number of the GenerateProposalOp's output, " + "each element is a bounding box with [xmin, ymin, xmax, ymax] format."); + AddInput("GtClasses", + "(LoDTensor), This input is a 2D LoDTensor with shape [M, 1]. " + "M is the number of groundtruth, " + "each element is a class label of groundtruth."); + AddInput( + "IsCrowd", + "(LoDTensor), This input is a 2D LoDTensor with shape [M, 1]. " + "M is the number of groundtruth, " + "each element is a flag indicates whether a groundtruth is crowd."); + AddInput( + "GtBoxes", + "(LoDTensor), This input is a 2D LoDTensor with shape [M, 4]. " + "M is the number of groundtruth, " + "each element is a bounding box with [xmin, ymin, xmax, ymax] format."); + AddInput("ImInfo", + "(Tensor), This input is a 2D Tensor with shape [B, 3]. " + "B is the number of input images, " + "each element consists of im_height, im_width, im_scale."); + + AddOutput( + "Rois", + "(LoDTensor), This output is a 2D LoDTensor with shape [P, 4]. " + "P usuall equal to batch_size_per_im * batch_size, " + "each element is a bounding box with [xmin, ymin, xmax, ymax] format."); + AddOutput("LabelsInt32", + "(LoDTensor), This output is a 2D LoDTensor with shape [P, 1], " + "each element repersents a class label of a roi"); + AddOutput("BboxTargets", + "(LoDTensor), This output is a 2D LoDTensor with shape [P, 4 * " + "class_nums], " + "each element repersents a box label of a roi"); + AddOutput( + "BboxInsideWeights", + "(LoDTensor), This output is a 2D LoDTensor with shape [P, 4 * " + "class_nums], " + "each element indicates whether a box should contribute to loss."); + AddOutput( + "BboxOutsideWeights", + "(LoDTensor), This output is a 2D LoDTensor with shape [P, 4 * " + "class_nums], " + "each element indicates whether a box should contribute to loss."); + + AddAttr("batch_size_per_im", "Batch size of rois per images."); + AddAttr("fg_fraction", + "Foreground fraction in total batch_size_per_im."); + AddAttr( + "fg_thresh", + "Overlap threshold which is used to chose foreground sample."); + AddAttr("bg_thresh_hi", + "Overlap threshold upper bound which is used to chose " + "background sample."); + AddAttr("bg_thresh_lo", + "Overlap threshold lower bound which is used to chose " + "background sample."); + AddAttr>("bbox_reg_weights", "Box regression weights."); + AddAttr("class_nums", "Class number."); + AddAttr( + "use_random", + "Use random sampling to choose foreground and background boxes.") + .SetDefault(true); AddComment(R"DOC( -Generate Proposals Labels Operator. -)DOC"); +This operator can be, for given the GenerateProposalOp output bounding boxes and groundtruth, +to sample foreground boxes and background boxes, and compute loss target. + +RpnRois is the output boxes of RPN and was processed by generate_proposal_op, these boxes +were combined with groundtruth boxes and sampled according to batch_size_per_im and fg_fraction, +If an instance with a groundtruth overlap greater than fg_thresh, then it was considered as a foreground sample. +If an instance with a groundtruth overlap greater than bg_thresh_lo and lower than bg_thresh_hi, +then it was considered as a background sample. +After all foreground and background boxes are chosen (so called Rois), +then we apply random sampling to make sure +the number of foreground boxes is no more than batch_size_per_im * fg_fraction. + +For each box in Rois, we assign the classification (class label) and regression targets (box label) to it. +Finally BboxInsideWeights and BboxOutsideWeights are used to specify whether it would contribute to training loss. + )DOC"); } }; diff --git a/paddle/fluid/operators/detection/generate_proposals_op.cc b/paddle/fluid/operators/detection/generate_proposals_op.cc index 818d58ea9ee327fd99182ad2f8cbeed07e6aaea2..06e48f1262a74dfdfd6d38e71cd02116f3e6eca5 100644 --- a/paddle/fluid/operators/detection/generate_proposals_op.cc +++ b/paddle/fluid/operators/detection/generate_proposals_op.cc @@ -12,10 +12,12 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ +#include +#include #include #include #include "paddle/fluid/framework/op_registry.h" -#include "paddle/fluid/framework/var_type.h" +#include "paddle/fluid/operators/detail/safe_ref.h" #include "paddle/fluid/operators/gather.h" #include "paddle/fluid/operators/math/math_function.h" @@ -25,21 +27,17 @@ namespace operators { using Tensor = framework::Tensor; using LoDTensor = framework::LoDTensor; -struct AppendProposalsFunctor { - LoDTensor *out_; - int64_t offset_; - Tensor *to_add_; +static const double kBBoxClipDefault = std::log(1000.0 / 16.0); - AppendProposalsFunctor(LoDTensor *out, int64_t offset, Tensor *to_add) - : out_(out), offset_(offset), to_add_(to_add) {} - - template - void apply() const { - auto *out_data = out_->data(); - auto *to_add_data = to_add_->data(); - memcpy(out_data + offset_, to_add_data, to_add_->numel() * sizeof(T)); - } -}; +static void AppendProposals(Tensor *dst, int64_t offset, const Tensor &src) { + auto *out_data = dst->data(); + auto *to_add_data = src.data(); + size_t size_of_t = framework::SizeOfType(src.type()); + offset *= size_of_t; + std::memcpy( + reinterpret_cast(reinterpret_cast(out_data) + offset), + to_add_data, src.numel() * size_of_t); +} class GenerateProposalsOp : public framework::OperatorWithKernel { public: @@ -55,12 +53,6 @@ class GenerateProposalsOp : public framework::OperatorWithKernel { PADDLE_ENFORCE(ctx->HasInput("Variances"), "Input(Variances) shouldn't be null."); - auto scores_dims = ctx->GetInputDim("Scores"); - auto bbox_deltas_dims = ctx->GetInputDim("BboxDeltas"); - auto im_info_dims = ctx->GetInputDim("ImInfo"); - auto anchors_dims = ctx->GetInputDim("Anchors"); - auto variances_dims = ctx->GetInputDim("Variances"); - ctx->SetOutputDim("RpnRois", {-1, 4}); ctx->SetOutputDim("RpnRoiProbs", {-1, 1}); } @@ -68,15 +60,15 @@ class GenerateProposalsOp : public framework::OperatorWithKernel { protected: framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext &ctx) const override { - return framework::OpKernelType( - framework::ToDataType(ctx.Input("Anchors")->type()), - ctx.device_context()); + return framework::OpKernelType(ctx.Input("Anchors")->type(), + ctx.device_context()); } }; template -void BoxCoder(const platform::DeviceContext &ctx, Tensor *all_anchors, - Tensor *bbox_deltas, Tensor *variances, Tensor *proposals) { +static inline void BoxCoder(const platform::DeviceContext &ctx, + Tensor *all_anchors, Tensor *bbox_deltas, + Tensor *variances, Tensor *proposals) { T *proposals_data = proposals->mutable_data(ctx.GetPlace()); int64_t row = all_anchors->dims()[0]; @@ -108,11 +100,11 @@ void BoxCoder(const platform::DeviceContext &ctx, Tensor *all_anchors, anchor_center_y; bbox_width = std::exp(std::min(variances_data[i * len + 2] * bbox_deltas_data[i * len + 2], - std::log(1000.0 / 16.0))) * + kBBoxClipDefault)) * anchor_width; bbox_height = std::exp(std::min(variances_data[i * len + 3] * bbox_deltas_data[i * len + 3], - std::log(1000.0 / 16.0))) * + kBBoxClipDefault)) * anchor_height; } else { bbox_center_x = @@ -120,10 +112,10 @@ void BoxCoder(const platform::DeviceContext &ctx, Tensor *all_anchors, bbox_center_y = bbox_deltas_data[i * len + 1] * anchor_height + anchor_center_y; bbox_width = std::exp(std::min(bbox_deltas_data[i * len + 2], - std::log(1000.0 / 16.0))) * + kBBoxClipDefault)) * anchor_width; bbox_height = std::exp(std::min(bbox_deltas_data[i * len + 3], - std::log(1000.0 / 16.0))) * + kBBoxClipDefault)) * anchor_height; } @@ -136,30 +128,32 @@ void BoxCoder(const platform::DeviceContext &ctx, Tensor *all_anchors, } template -void ClipTiledBoxes(const platform::DeviceContext &ctx, const Tensor &im_info, - Tensor *boxes) { +static inline void ClipTiledBoxes(const platform::DeviceContext &ctx, + const Tensor &im_info, Tensor *boxes) { T *boxes_data = boxes->mutable_data(ctx.GetPlace()); const T *im_info_data = im_info.data(); + T zero(0); for (int64_t i = 0; i < boxes->numel(); ++i) { if (i % 4 == 0) { boxes_data[i] = - std::max(std::min(boxes_data[i], im_info_data[1] - 1), 0.0f); + std::max(std::min(boxes_data[i], im_info_data[1] - 1), zero); } else if (i % 4 == 1) { boxes_data[i] = - std::max(std::min(boxes_data[i], im_info_data[0] - 1), 0.0f); + std::max(std::min(boxes_data[i], im_info_data[0] - 1), zero); } else if (i % 4 == 2) { boxes_data[i] = - std::max(std::min(boxes_data[i], im_info_data[1] - 1), 0.0f); + std::max(std::min(boxes_data[i], im_info_data[1] - 1), zero); } else { boxes_data[i] = - std::max(std::min(boxes_data[i], im_info_data[0] - 1), 0.0f); + std::max(std::min(boxes_data[i], im_info_data[0] - 1), zero); } } } template -void FilterBoxes(const platform::DeviceContext &ctx, Tensor *boxes, - float min_size, const Tensor &im_info, Tensor *keep) { +static inline void FilterBoxes(const platform::DeviceContext &ctx, + Tensor *boxes, float min_size, + const Tensor &im_info, Tensor *keep) { const T *im_info_data = im_info.data(); T *boxes_data = boxes->mutable_data(ctx.GetPlace()); T im_scale = im_info_data[2]; @@ -185,24 +179,24 @@ void FilterBoxes(const platform::DeviceContext &ctx, Tensor *boxes, keep->Resize({keep_len}); } -bool SortScorePairDescend(const std::pair &pair1, - const std::pair &pair2) { - return pair1.first > pair2.first; -} - template -void GetMaxScoreIndex(const std::vector &scores, - std::vector> *sorted_indices) { +static inline std::vector> GetSortedScoreIndex( + const std::vector &scores) { + std::vector> sorted_indices; + sorted_indices.reserve(scores.size()); for (size_t i = 0; i < scores.size(); ++i) { - sorted_indices->push_back(std::make_pair(scores[i], i)); + sorted_indices.emplace_back(scores[i], i); } // Sort the score pair according to the scores in descending order - std::stable_sort(sorted_indices->begin(), sorted_indices->end(), - SortScorePairDescend); + std::stable_sort(sorted_indices.begin(), sorted_indices.end(), + [](const std::pair &a, const std::pair &b) { + return a.first < b.first; + }); + return sorted_indices; } template -T BBoxArea(const T *box, const bool normalized) { +static inline T BBoxArea(const T *box, bool normalized) { if (box[2] < box[0] || box[3] < box[1]) { // If coordinate values are is invalid // (e.g. xmax < xmin or ymax < ymin), return 0. @@ -220,7 +214,7 @@ T BBoxArea(const T *box, const bool normalized) { } template -T JaccardOverlap(const T *box1, const T *box2, const bool normalized) { +static inline T JaccardOverlap(const T *box1, const T *box2, bool normalized) { if (box2[0] > box1[2] || box2[2] < box1[0] || box2[1] > box1[3] || box2[3] < box1[1]) { return static_cast(0.); @@ -229,8 +223,8 @@ T JaccardOverlap(const T *box1, const T *box2, const bool normalized) { const T inter_ymin = std::max(box1[1], box2[1]); const T inter_xmax = std::min(box1[2], box2[2]); const T inter_ymax = std::min(box1[3], box2[3]); - const T inter_w = std::max(0.0f, inter_xmax - inter_xmin + 1); - const T inter_h = std::max(0.0f, inter_ymax - inter_ymin + 1); + const T inter_w = std::max(T(0), inter_xmax - inter_xmin + 1); + const T inter_h = std::max(T(0), inter_ymax - inter_ymin + 1); const T inter_area = inter_w * inter_h; const T bbox1_area = BBoxArea(box1, normalized); const T bbox2_area = BBoxArea(box2, normalized); @@ -238,9 +232,21 @@ T JaccardOverlap(const T *box1, const T *box2, const bool normalized) { } } +template +static inline Tensor VectorToTensor(const std::vector &selected_indices, + int selected_num) { + Tensor keep_nms; + keep_nms.Resize({selected_num}); + auto *keep_data = keep_nms.mutable_data(platform::CPUPlace()); + for (int i = 0; i < selected_num; ++i) { + keep_data[i] = selected_indices[i]; + } + return keep_nms; +} + template -Tensor NMS(const platform::DeviceContext &ctx, Tensor *bbox, Tensor *scores, - const T nms_threshold, const float eta) { +static inline Tensor NMS(const platform::DeviceContext &ctx, Tensor *bbox, + Tensor *scores, T nms_threshold, float eta) { PADDLE_ENFORCE_NOT_NULL(bbox); int64_t num_boxes = bbox->dims()[0]; // 4: [xmin ymin xmax ymax] @@ -248,20 +254,18 @@ Tensor NMS(const platform::DeviceContext &ctx, Tensor *bbox, Tensor *scores, std::vector scores_data(num_boxes); std::copy_n(scores->data(), num_boxes, scores_data.begin()); - std::vector> sorted_indices; - GetMaxScoreIndex(scores_data, &sorted_indices); + std::vector> sorted_indices = + GetSortedScoreIndex(scores_data); std::vector selected_indices; int selected_num = 0; T adaptive_threshold = nms_threshold; const T *bbox_data = bbox->data(); - bool flag; while (sorted_indices.size() != 0) { - int idx = sorted_indices.front().second; - flag = true; - for (size_t k = 0; k < selected_indices.size(); ++k) { + int idx = sorted_indices.back().second; + bool flag = true; + for (int kept_idx : selected_indices) { if (flag) { - const int kept_idx = selected_indices[k]; T overlap = JaccardOverlap(bbox_data + idx * box_size, bbox_data + kept_idx * box_size, false); flag = (overlap <= adaptive_threshold); @@ -271,32 +275,29 @@ Tensor NMS(const platform::DeviceContext &ctx, Tensor *bbox, Tensor *scores, } if (flag) { selected_indices.push_back(idx); - selected_num++; + ++selected_num; } - sorted_indices.erase(sorted_indices.begin()); + sorted_indices.erase(sorted_indices.end() - 1); if (flag && eta < 1 && adaptive_threshold > 0.5) { adaptive_threshold *= eta; } } - Tensor keep_nms; - keep_nms.Resize({selected_num}); - int *keep_data = keep_nms.mutable_data(ctx.GetPlace()); - for (int i = 0; i < selected_num; ++i) { - keep_data[i] = selected_indices[i]; - } - - return keep_nms; + return VectorToTensor(selected_indices, selected_num); } -template +template class GenerateProposalsKernel : public framework::OpKernel { public: void Compute(const framework::ExecutionContext &context) const override { auto *scores = context.Input("Scores"); auto *bbox_deltas = context.Input("BboxDeltas"); auto *im_info = context.Input("ImInfo"); - auto *anchors = context.Input("Anchors"); - auto *variances = context.Input("Variances"); + auto anchors = detail::Ref(context.Input("Anchors"), + "Cannot find input Anchors(%s) in scope", + context.Inputs("Anchors")[0]); + auto variances = detail::Ref(context.Input("Variances"), + "Cannot find input Variances(%s) in scope", + context.Inputs("Variances")[0]); auto *rpn_rois = context.Output("RpnRois"); auto *rpn_roi_probs = context.Output("RpnRoiProbs"); @@ -307,15 +308,16 @@ class GenerateProposalsKernel : public framework::OpKernel { float min_size = context.Attr("min_size"); float eta = context.Attr("eta"); - auto &dev_ctx = context.template device_context(); + auto &dev_ctx = + context.template device_context(); - auto scores_dim = scores->dims(); + auto &scores_dim = scores->dims(); int64_t num = scores_dim[0]; int64_t c_score = scores_dim[1]; int64_t h_score = scores_dim[2]; int64_t w_score = scores_dim[3]; - auto bbox_dim = bbox_deltas->dims(); + auto &bbox_dim = bbox_deltas->dims(); int64_t c_bbox = bbox_dim[1]; int64_t h_bbox = bbox_dim[2]; int64_t w_bbox = bbox_dim[3]; @@ -330,17 +332,17 @@ class GenerateProposalsKernel : public framework::OpKernel { scores_swap.mutable_data({num, h_score, w_score, c_score}, dev_ctx.GetPlace()); - math::Transpose trans; + math::Transpose trans; std::vector axis = {0, 2, 3, 1}; trans(dev_ctx, *bbox_deltas, &bbox_deltas_swap, axis); trans(dev_ctx, *scores, &scores_swap, axis); framework::LoD lod; - std::vector lod0(1, 0); - Tensor *anchor = const_cast(anchors); - anchor->Resize({anchors->numel() / 4, 4}); - Tensor *var = const_cast(variances); - var->Resize({var->numel() / 4, 4}); + lod.resize(1); + auto &lod0 = lod[0]; + lod0.push_back(0); + anchors.Resize({anchors.numel() / 4, 4}); + variances.Resize({variances.numel() / 4, 4}); int64_t num_proposals = 0; for (int64_t i = 0; i < num; ++i) { @@ -352,24 +354,17 @@ class GenerateProposalsKernel : public framework::OpKernel { scores_slice.Resize({h_score * w_score * c_score, 1}); std::pair tensor_pair = - ProposalForOneImage(dev_ctx, im_info_slice, *anchor, *var, + ProposalForOneImage(dev_ctx, im_info_slice, anchors, variances, bbox_deltas_slice, scores_slice, pre_nms_top_n, post_nms_top_n, nms_thresh, min_size, eta); - Tensor proposals = tensor_pair.first; - Tensor scores = tensor_pair.second; - - framework::VisitDataType( - framework::ToDataType(rpn_rois->type()), - AppendProposalsFunctor(rpn_rois, 4 * num_proposals, &proposals)); - framework::VisitDataType( - framework::ToDataType(rpn_roi_probs->type()), - AppendProposalsFunctor(rpn_roi_probs, num_proposals, &scores)); + Tensor &proposals = tensor_pair.first; + Tensor &scores = tensor_pair.second; + AppendProposals(rpn_rois, 4 * num_proposals, proposals); + AppendProposals(rpn_roi_probs, num_proposals, scores); num_proposals += proposals.dims()[0]; - lod0.emplace_back(num_proposals); + lod0.push_back(num_proposals); } - - lod.emplace_back(lod0); rpn_rois->set_lod(lod); rpn_roi_probs->set_lod(lod); rpn_rois->Resize({num_proposals, 4}); @@ -377,7 +372,7 @@ class GenerateProposalsKernel : public framework::OpKernel { } std::pair ProposalForOneImage( - const DeviceContext &ctx, const Tensor &im_info_slice, + const platform::CPUDeviceContext &ctx, const Tensor &im_info_slice, const Tensor &anchors, const Tensor &variances, const Tensor &bbox_deltas_slice, // [M, 4] const Tensor &scores_slice, // [N, 1] @@ -392,10 +387,9 @@ class GenerateProposalsKernel : public framework::OpKernel { for (int i = 0; i < scores_slice.numel(); ++i) { index[i] = i; } - std::function compare = - [scores_data](const int64_t &i, const int64_t &j) { - return scores_data[i] > scores_data[j]; - }; + auto compare = [scores_data](const int64_t &i, const int64_t &j) { + return scores_data[i] > scores_data[j]; + }; if (pre_nms_top_n <= 0 || pre_nms_top_n >= scores_slice.numel()) { std::sort(index, index + scores_slice.numel(), compare); @@ -452,33 +446,45 @@ class GenerateProposalsKernel : public framework::OpKernel { class GenerateProposalsOpMaker : public framework::OpProtoAndCheckerMaker { public: void Make() override { - AddInput("Scores", "The scores of anchors should be foreground."); - AddInput("BboxDeltas", "bbox_deltas."); - AddInput("ImInfo", "Information for image reshape."); - AddInput("Anchors", "All anchors."); - AddInput("Variances", " variances"); - - AddOutput("RpnRois", "Anchors."); - AddOutput("RpnRoiProbs", "Anchors."); - AddAttr("pre_nms_topN", "pre_nms_topN"); - AddAttr("post_nms_topN", "post_nms_topN"); - AddAttr("nms_thresh", "nms_thres"); - AddAttr("min_size", "min size"); + AddInput("Scores", + "(Tensor) The scores from conv is in shape (N, A, H, W), " + "N is batch size, A is number of anchors, " + "H and W are height and width of the feature map"); + AddInput("BboxDeltas", + "(Tensor) Bounding box deltas from conv is in " + "shape (N, 4*A, H, W)."); + AddInput("ImInfo", + "(Tensor) Information for image reshape is in shape (N, 3), " + "in format (height, width, scale)"); + AddInput("Anchors", + "(Tensor) Bounding box anchors from anchor_generator_op " + "is in shape (A, H, W, 4)."); + AddInput("Variances", + "(Tensor) Bounding box variances with same shape as `Anchors`."); + + AddOutput("RpnRois", + "(LoDTensor), Output proposals with shape (rois_num, 4)."); + AddOutput("RpnRoiProbs", + "(LoDTensor) Scores of proposals with shape (rois_num, 1)."); + AddAttr("pre_nms_topN", + "Number of top scoring RPN proposals to keep before " + "applying NMS."); + AddAttr("post_nms_topN", + "Number of top scoring RPN proposals to keep after " + "applying NMS"); + AddAttr("nms_thresh", "NMS threshold used on RPN proposals."); + AddAttr("min_size", + "Proposal height and width both need to be greater " + "than this min_size."); AddAttr("eta", "The parameter for adaptive NMS."); AddComment(R"DOC( -Generate Proposals OP - -This operator proposes rois according to each box with their probability to be a foreground object and -the box can be calculated by anchors. Bbox_deltais and scores are the output of RPN. Final proposals -could be used to train detection net. - -Scores is the probability for each box to be an object. In format of (N, A, H, W) where N is batch size, A is number -of anchors, H and W are height and width of the feature map. -BboxDeltas is the differece between predicted box locatoin and anchor location. In format of (N, 4*A, H, W) +This operator Generate bounding box proposals for Faster RCNN. +The propoasls are generated for a list of images based on image +score 'Scores', bounding box regression result 'BboxDeltas' as +well as predefined bounding box shapes 'anchors'. Greedy +non-maximum suppression is applied to generate the final bounding +boxes. -For generating proposals, this operator transposes and resizes scores and bbox_deltas in size of (H*W*A, 1) and (H*W*A, 4) and - calculate box locations as proposals candidates. Then clip boxes to image and remove predicted boxes with small area. -Finally, apply nms to get final proposals as output. )DOC"); } }; @@ -490,6 +496,5 @@ namespace ops = paddle::operators; REGISTER_OPERATOR(generate_proposals, ops::GenerateProposalsOp, ops::GenerateProposalsOpMaker, paddle::framework::EmptyGradOpMaker); -REGISTER_OP_CPU_KERNEL( - generate_proposals, - ops::GenerateProposalsKernel); +REGISTER_OP_CPU_KERNEL(generate_proposals, ops::GenerateProposalsKernel, + ops::GenerateProposalsKernel); diff --git a/paddle/fluid/operators/detection/generate_proposals_op.cu b/paddle/fluid/operators/detection/generate_proposals_op.cu index 6146ff509d768c0317a5c65ed22af1a3075977a2..a0b99377109aef4776fadd68101d011a9191b1cc 100644 --- a/paddle/fluid/operators/detection/generate_proposals_op.cu +++ b/paddle/fluid/operators/detection/generate_proposals_op.cu @@ -12,14 +12,18 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ +#include #include #include #include #include "cub/cub.cuh" +#include "paddle/fluid/framework/mixed_vector.h" #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/memory/memory.h" +#include "paddle/fluid/operators/detail/safe_ref.h" #include "paddle/fluid/operators/gather.cu.h" #include "paddle/fluid/operators/math/math_function.h" +#include "paddle/fluid/platform/for_range.h" namespace paddle { namespace operators { @@ -36,62 +40,67 @@ namespace { int const kThreadsPerBlock = sizeof(uint64_t) * 8; -template -__global__ void RangeInitKernel(const T start, const T delta, const int size, - T *out) { - CUDA_1D_KERNEL_LOOP(i, size) { out[i] = start + i * delta; } -} +static const double kBBoxClipDefault = std::log(1000.0 / 16.0); + +struct RangeInitFunctor { + int start_; + int delta_; + int *out_; + __device__ void operator()(size_t i) { out_[i] = start_ + i * delta_; } +}; template -void SortDescending(const platform::CUDADeviceContext &ctx, const Tensor &value, - Tensor *value_out, Tensor *index_out) { - int num = value.numel(); +static void SortDescending(const platform::CUDADeviceContext &ctx, + const Tensor &value, Tensor *value_out, + Tensor *index_out) { + int num = static_cast(value.numel()); Tensor index_in_t; int *idx_in = index_in_t.mutable_data({num}, ctx.GetPlace()); - int block = 512; - auto stream = ctx.stream(); - RangeInitKernel<<>>(0, 1, num, idx_in); + platform::ForRange for_range(ctx, num); + for_range(RangeInitFunctor{0, 1, idx_in}); + int *idx_out = index_out->mutable_data({num}, ctx.GetPlace()); const T *keys_in = value.data(); T *keys_out = value_out->mutable_data({num}, ctx.GetPlace()); // Determine temporary device storage requirements - void *d_temp_storage = NULL; size_t temp_storage_bytes = 0; cub::DeviceRadixSort::SortPairsDescending( - d_temp_storage, temp_storage_bytes, keys_in, keys_out, idx_in, idx_out, - num); - + nullptr, temp_storage_bytes, keys_in, keys_out, idx_in, idx_out, num); // Allocate temporary storage auto place = boost::get(ctx.GetPlace()); - d_temp_storage = memory::Alloc(place, temp_storage_bytes); + auto d_temp_storage = + memory::Alloc(place, temp_storage_bytes, memory::Allocator::kScratchpad); // Run sorting operation cub::DeviceRadixSort::SortPairsDescending( - d_temp_storage, temp_storage_bytes, keys_in, keys_out, idx_in, idx_out, - num); - - memory::Free(place, d_temp_storage); -} - -template -__device__ __forceinline__ T Min(T x, T y) { - return x < y ? x : y; + d_temp_storage->ptr(), temp_storage_bytes, keys_in, keys_out, idx_in, + idx_out, num); } template -__device__ __forceinline__ T Max(T x, T y) { - return x > y ? x : y; -} - -template -__global__ void BoxDecodeAndClipKernel(const T *anchor, const T *deltas, - const T *var, const int *index, - const T *im_info, const int num, - T *proposals) { - T kBBoxClipDefault = log(1000.0 / 16.0); - CUDA_1D_KERNEL_LOOP(i, num) { +struct BoxDecodeAndClipFunctor { + const T *anchor; + const T *deltas; + const T *var; + const int *index; + const T *im_info; + + T *proposals; + + BoxDecodeAndClipFunctor(const T *anchor, const T *deltas, const T *var, + const int *index, const T *im_info, T *proposals) + : anchor(anchor), + deltas(deltas), + var(var), + index(index), + im_info(im_info), + proposals(proposals) {} + + T bbox_clip_default{static_cast(kBBoxClipDefault)}; + + __device__ void operator()(size_t i) { int k = index[i] * 4; T axmin = anchor[k]; T aymin = anchor[k + 1]; @@ -108,17 +117,17 @@ __global__ void BoxDecodeAndClipKernel(const T *anchor, const T *deltas, T dxmax = deltas[k + 2]; T dymax = deltas[k + 3]; - T d_cx = 0., d_cy = 0., d_w = 0., d_h = 0.; + T d_cx, d_cy, d_w, d_h; if (var) { d_cx = cx + dxmin * w * var[k]; d_cy = cy + dymin * h * var[k + 1]; - d_w = exp(Min(dxmax * var[k + 2], kBBoxClipDefault)) * w; - d_h = exp(Min(dymax * var[k + 3], kBBoxClipDefault)) * h; + d_w = exp(Min(dxmax * var[k + 2], bbox_clip_default)) * w; + d_h = exp(Min(dymax * var[k + 3], bbox_clip_default)) * h; } else { d_cx = cx + dxmin * w; d_cy = cy + dymin * h; - d_w = exp(Min(dxmax, kBBoxClipDefault)) * w; - d_h = exp(Min(dymax, kBBoxClipDefault)) * h; + d_w = exp(Min(dxmax, bbox_clip_default)) * w; + d_h = exp(Min(dymax, bbox_clip_default)) * h; } T oxmin = d_cx - d_w * 0.5; @@ -126,17 +135,21 @@ __global__ void BoxDecodeAndClipKernel(const T *anchor, const T *deltas, T oxmax = d_cx + d_w * 0.5 - 1.; T oymax = d_cy + d_h * 0.5 - 1.; - proposals[i * 4] = Max(Min(oxmin, im_info[1] - 1.), 0.); - proposals[i * 4 + 1] = Max(Min(oymin, im_info[0] - 1.), 0.); - proposals[i * 4 + 2] = Max(Min(oxmax, im_info[1] - 1.), 0.); - proposals[i * 4 + 3] = Max(Min(oymax, im_info[0] - 1.), 0.); + proposals[i * 4] = Max(Min(oxmin, im_info[1] - 1.), 0.); + proposals[i * 4 + 1] = Max(Min(oymin, im_info[0] - 1.), 0.); + proposals[i * 4 + 2] = Max(Min(oxmax, im_info[1] - 1.), 0.); + proposals[i * 4 + 3] = Max(Min(oymax, im_info[0] - 1.), 0.); } -} + + __device__ __forceinline__ T Min(T a, T b) const { return a > b ? b : a; } + + __device__ __forceinline__ T Max(T a, T b) const { return a > b ? a : b; } +}; template -__global__ void FilterBBoxes(const T *bboxes, const T *im_info, - const T min_size, const int num, int *keep_num, - int *keep) { +static __global__ void FilterBBoxes(const T *bboxes, const T *im_info, + const T min_size, const int num, + int *keep_num, int *keep) { T im_h = im_info[0]; T im_w = im_info[1]; T im_scale = im_info[2]; @@ -181,7 +194,7 @@ __global__ void FilterBBoxes(const T *bboxes, const T *im_info, } } -__device__ inline float IoU(const float *a, const float *b) { +static __device__ inline float IoU(const float *a, const float *b) { float left = max(a[0], b[0]), right = min(a[2], b[2]); float top = max(a[1], b[1]), bottom = min(a[3], b[3]); float width = max(right - left + 1, 0.f), height = max(bottom - top + 1, 0.f); @@ -191,8 +204,9 @@ __device__ inline float IoU(const float *a, const float *b) { return inter_s / (s_a + s_b - inter_s); } -__global__ void NMSKernel(const int n_boxes, const float nms_overlap_thresh, - const float *dev_boxes, uint64_t *dev_mask) { +static __global__ void NMSKernel(const int n_boxes, + const float nms_overlap_thresh, + const float *dev_boxes, uint64_t *dev_mask) { const int row_start = blockIdx.y; const int col_start = blockIdx.x; @@ -234,9 +248,9 @@ __global__ void NMSKernel(const int n_boxes, const float nms_overlap_thresh, } template -void NMS(const platform::CUDADeviceContext &ctx, const Tensor &proposals, - const Tensor &sorted_indices, const T nms_threshold, - Tensor *keep_out) { +static void NMS(const platform::CUDADeviceContext &ctx, const Tensor &proposals, + const Tensor &sorted_indices, const T nms_threshold, + Tensor *keep_out) { int boxes_num = proposals.dims()[0]; PADDLE_ENFORCE_EQ(boxes_num, sorted_indices.dims()[0]); @@ -247,13 +261,10 @@ void NMS(const platform::CUDADeviceContext &ctx, const Tensor &proposals, const T *boxes = proposals.data(); auto place = boost::get(ctx.GetPlace()); - int size_bytes = boxes_num * col_blocks * sizeof(uint64_t); - uint64_t *d_mask = - reinterpret_cast(memory::Alloc(place, size_bytes)); - NMSKernel<<>>(boxes_num, nms_threshold, boxes, d_mask); - uint64_t *h_mask = reinterpret_cast( - memory::Alloc(platform::CPUPlace(), size_bytes)); - memory::Copy(platform::CPUPlace(), h_mask, place, d_mask, size_bytes, 0); + framework::Vector mask(boxes_num * col_blocks); + NMSKernel<<>>( + boxes_num, nms_threshold, boxes, + mask.CUDAMutableData(boost::get(ctx.GetPlace()))); std::vector remv(col_blocks); memset(&remv[0], 0, sizeof(uint64_t) * col_blocks); @@ -267,7 +278,7 @@ void NMS(const platform::CUDADeviceContext &ctx, const Tensor &proposals, if (!(remv[nblock] & (1ULL << inblock))) { ++num_to_keep; keep_vec.push_back(i); - uint64_t *p = &h_mask[0] + i * col_blocks; + uint64_t *p = &mask[0] + i * col_blocks; for (int j = nblock; j < col_blocks; j++) { remv[j] |= p[j]; } @@ -276,12 +287,10 @@ void NMS(const platform::CUDADeviceContext &ctx, const Tensor &proposals, int *keep = keep_out->mutable_data({num_to_keep}, ctx.GetPlace()); memory::Copy(place, keep, platform::CPUPlace(), keep_vec.data(), sizeof(int) * num_to_keep, 0); - memory::Free(place, d_mask); - memory::Free(platform::CPUPlace(), h_mask); } template -std::pair ProposalForOneImage( +static std::pair ProposalForOneImage( const platform::CUDADeviceContext &ctx, const Tensor &im_info, const Tensor &anchors, const Tensor &variances, const Tensor &bbox_deltas, // [M, 4] @@ -300,18 +309,20 @@ std::pair ProposalForOneImage( // 2. box decode and clipping Tensor proposals; proposals.mutable_data({pre_nms_num, 4}, ctx.GetPlace()); - int block = 512; - auto stream = ctx.stream(); - BoxDecodeAndClipKernel<<>>( - anchors.data(), bbox_deltas.data(), variances.data(), - index_sort.data(), im_info.data(), pre_nms_num, - proposals.data()); + + { + platform::ForRange for_range(ctx, pre_nms_num); + for_range(BoxDecodeAndClipFunctor{ + anchors.data(), bbox_deltas.data(), variances.data(), + index_sort.data(), im_info.data(), proposals.data()}); + } // 3. filter Tensor keep_index, keep_num_t; keep_index.mutable_data({pre_nms_num}, ctx.GetPlace()); keep_num_t.mutable_data({1}, ctx.GetPlace()); min_size = std::max(min_size, 1.0f); + auto stream = ctx.stream(); FilterBBoxes<<<1, 512, 0, stream>>>( proposals.data(), im_info.data(), min_size, pre_nms_num, keep_num_t.data(), keep_index.data()); @@ -355,8 +366,12 @@ class CUDAGenerateProposalsKernel : public framework::OpKernel { auto *scores = context.Input("Scores"); auto *bbox_deltas = context.Input("BboxDeltas"); auto *im_info = context.Input("ImInfo"); - auto *anchors = context.Input("Anchors"); - auto *variances = context.Input("Variances"); + auto anchors = detail::Ref(context.Input("Anchors"), + "Cannot find input Anchors(%s) in scope", + context.Inputs("Anchors")[0]); + auto variances = detail::Ref(context.Input("Variances"), + "Cannot find input Variances(%s) in scope", + context.Inputs("Variances")[0]); auto *rpn_rois = context.Output("RpnRois"); auto *rpn_roi_probs = context.Output("RpnRoiProbs"); @@ -392,10 +407,8 @@ class CUDAGenerateProposalsKernel : public framework::OpKernel { trans(dev_ctx, *bbox_deltas, &bbox_deltas_swap, axis); trans(dev_ctx, *scores, &scores_swap, axis); - Tensor *anchor = const_cast(anchors); - anchor->Resize({anchors->numel() / 4, 4}); - Tensor *var = const_cast(variances); - var->Resize({var->numel() / 4, 4}); + anchors.Resize({anchors.numel() / 4, 4}); + variances.Resize({variances.numel() / 4, 4}); rpn_rois->mutable_data({bbox_deltas->numel() / 4, 4}, context.GetPlace()); @@ -417,12 +430,12 @@ class CUDAGenerateProposalsKernel : public framework::OpKernel { scores_slice.Resize({h_score * w_score * c_score, 1}); std::pair box_score_pair = - ProposalForOneImage(dev_ctx, im_info_slice, *anchor, *var, + ProposalForOneImage(dev_ctx, im_info_slice, anchors, variances, bbox_deltas_slice, scores_slice, pre_nms_top_n, post_nms_top_n, nms_thresh, min_size, eta); - Tensor proposals = box_score_pair.first; - Tensor scores = box_score_pair.second; + Tensor &proposals = box_score_pair.first; + Tensor &scores = box_score_pair.second; memory::Copy(place, rpn_rois_data + num_proposals * 4, place, proposals.data(), sizeof(T) * proposals.numel(), 0); diff --git a/paddle/fluid/operators/detection/gpc.cc b/paddle/fluid/operators/detection/gpc.cc new file mode 100644 index 0000000000000000000000000000000000000000..7c0823c0487d39eece5be08322e7d182b931ba3c --- /dev/null +++ b/paddle/fluid/operators/detection/gpc.cc @@ -0,0 +1,2201 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +/** + * @file src/gpc.cpp + * @author huhan02(com@baidu.com) + * @date 2015/12/18 14:17:30 + * @brief + * + * @modified by sunyipeng + * @email sunyipeng@baidu.com + * @date 2018/6/12 + **/ + +#include "paddle/fluid/operators/detection/gpc.h" + +namespace gpc { + +typedef struct lmt_shape { /* Local minima table */ + double y; /* Y coordinate at local minimum */ + edge_node *first_bound; /* Pointer to bound list */ + struct lmt_shape *next; /* Pointer to next local minimum */ +} lmt_node; + +typedef struct sbt_t_shape { /* Scanbeam tree */ + double y; /* Scanbeam node y value */ + struct sbt_t_shape *less; /* Pointer to nodes with lower y */ + struct sbt_t_shape *more; /* Pointer to nodes with higher y */ +} sb_tree; + +typedef struct it_shape { /* Intersection table */ + edge_node *ie[2]; /* Intersecting edge (bundle) pair */ + gpc_vertex point; /* Point of intersection */ + struct it_shape *next; /* The next intersection table node */ +} it_node; + +typedef struct st_shape { /* Sorted edge table */ + edge_node *edge; /* Pointer to AET edge */ + double xb; /* Scanbeam bottom x coordinate */ + double xt; /* Scanbeam top x coordinate */ + double dx; /* Change in x for a unit y increase */ + struct st_shape *prev; /* Previous edge in sorted list */ +} st_node; + +typedef struct bbox_shape { /* Contour axis-aligned bounding box */ + double xmin; /* Minimum x coordinate */ + double ymin; /* Minimum y coordinate */ + double xmax; /* Maximum x coordinate */ + double ymax; /* Maximum y coordinate */ +} bbox; + +/* +=========================================================================== + Global Data +=========================================================================== +*/ + +/* Horizontal edge state transitions within scanbeam boundary */ +const h_state next_h_state[3][6] = { + /* ABOVE BELOW CROSS */ + /* L R L R L R */ + /* NH */ + {BH, TH, TH, BH, NH, NH}, + /* BH */ + {NH, NH, NH, NH, TH, TH}, + /* TH */ + {NH, NH, NH, NH, BH, BH}}; + +/* +=========================================================================== + Private Functions +=========================================================================== +*/ + +static void reset_it(it_node **it) { + it_node *itn; + + while (*it) { + itn = (*it)->next; + gpc_free(*it); + *it = itn; + } +} + +static void reset_lmt(lmt_node **lmt) { + lmt_node *lmtn; + + while (*lmt) { + lmtn = (*lmt)->next; + gpc_free(*lmt); + *lmt = lmtn; + } +} + +static void insert_bound(edge_node **b, edge_node *e) { + edge_node *existing_bound = NULL; + + if (!*b) { + /* Link node e to the tail of the list */ + *b = e; + } else { + /* Do primary sort on the x field */ + if (e[0].bot.x < (*b)[0].bot.x) { + /* Insert a new node mid-list */ + existing_bound = *b; + *b = e; + (*b)->next_bound = existing_bound; + } else { + if (e[0].bot.x == (*b)[0].bot.x) { + /* Do secondary sort on the dx field */ + if (e[0].dx < (*b)[0].dx) { + /* Insert a new node mid-list */ + existing_bound = *b; + *b = e; + (*b)->next_bound = existing_bound; + } else { + /* Head further down the list */ + insert_bound(&((*b)->next_bound), e); + } + } else { + /* Head further down the list */ + insert_bound(&((*b)->next_bound), e); + } + } + } +} + +static edge_node **bound_list(lmt_node **lmt, double y) { + lmt_node *existing_node; + + if (!*lmt) { + /* Add node onto the tail end of the LMT */ + gpc_malloc(*lmt, sizeof(lmt_node), + const_cast("LMT insertion")); + (*lmt)->y = y; + (*lmt)->first_bound = NULL; + (*lmt)->next = NULL; + return &((*lmt)->first_bound); + } else if (y < (*lmt)->y) { + /* Insert a new LMT node before the current node */ + existing_node = *lmt; + gpc_malloc(*lmt, sizeof(lmt_node), + const_cast("LMT insertion")); + (*lmt)->y = y; + (*lmt)->first_bound = NULL; + (*lmt)->next = existing_node; + return &((*lmt)->first_bound); + } else { + if (y > (*lmt)->y) { + /* Head further up the LMT */ + return bound_list(&((*lmt)->next), y); + } else { + /* Use this existing LMT node */ + return &((*lmt)->first_bound); + } + } +} + +static void add_to_sbtree(int *entries, sb_tree **sbtree, double y) { + if (!*sbtree) { + /* Add a new tree node here */ + gpc_malloc(*sbtree, sizeof(sb_tree), + const_cast("scanbeam tree insertion")); + (*sbtree)->y = y; + (*sbtree)->less = NULL; + (*sbtree)->more = NULL; + (*entries)++; + } else { + if ((*sbtree)->y > y) { + /* Head into the 'less' sub-tree */ + add_to_sbtree(entries, &((*sbtree)->less), y); + } else { + if ((*sbtree)->y < y) { + /* Head into the 'more' sub-tree */ + add_to_sbtree(entries, &((*sbtree)->more), y); + } + } + } +} + +static void build_sbt(int *entries, double *sbt, sb_tree *sbtree) { + if (sbtree->less) { + build_sbt(entries, sbt, sbtree->less); + } + sbt[*entries] = sbtree->y; + (*entries)++; + if (sbtree->more) { + build_sbt(entries, sbt, sbtree->more); + } +} + +static void free_sbtree(sb_tree **sbtree) { + if (*sbtree) { + free_sbtree(&((*sbtree)->less)); + free_sbtree(&((*sbtree)->more)); + gpc_free(*sbtree); + } +} + +static int count_optimal_vertices(gpc_vertex_list c) { + int result = 0; + int i = 0; + + /* Ignore non-contributing contours */ + if (c.num_vertices > 0) { + for (i = 0; i < c.num_vertices; i++) { + /* Ignore superfluous vertices embedded in horizontal edges */ + if (gpc_optimal(c.vertex, i, c.num_vertices)) { + result++; + } + } + } + return result; +} + +static edge_node *build_lmt(lmt_node **lmt, sb_tree **sbtree, int *sbt_entries, + gpc_polygon *p, int type, gpc_op op) { + int c = 0; + int i = 0; + int min = 0; + int max = 0; + int num_edges = 0; + int v = 0; + int num_vertices = 0; + int total_vertices = 0; + int e_index = 0; + edge_node *e = NULL; + edge_node *edge_table = NULL; + + for (c = 0; c < p->num_contours; c++) { + total_vertices += count_optimal_vertices(p->contour[c]); + } + + /* Create the entire input polygon edge table in one go */ + gpc_malloc(edge_table, total_vertices * sizeof(edge_node), + const_cast("edge table creation")); + + for (c = 0; c < p->num_contours; c++) { + if (p->contour[c].num_vertices < 0) { + /* Ignore the non-contributing contour and repair the vertex count */ + p->contour[c].num_vertices = -p->contour[c].num_vertices; + } else { + /* Perform contour optimisation */ + num_vertices = 0; + for (i = 0; i < p->contour[c].num_vertices; i++) { + if (gpc_optimal(p->contour[c].vertex, i, p->contour[c].num_vertices)) { + edge_table[num_vertices].vertex.x = p->contour[c].vertex[i].x; + edge_table[num_vertices].vertex.y = p->contour[c].vertex[i].y; + + /* Record vertex in the scanbeam table */ + add_to_sbtree(sbt_entries, sbtree, edge_table[num_vertices].vertex.y); + + num_vertices++; + } + } + + /* Do the contour forward pass */ + for (min = 0; min < num_vertices; min++) { + /* If a forward local minimum... */ + if (gpc_fwd_min(edge_table, min, num_vertices)) { + /* Search for the next local maximum... */ + num_edges = 1; + max = gpc_next_index(min, num_vertices); + while (gpc_not_fmax(edge_table, max, num_vertices)) { + num_edges++; + max = gpc_next_index(max, num_vertices); + } + + /* Build the next edge list */ + e = &edge_table[e_index]; + e_index += num_edges; + v = min; + e[0].bstate[BELOW] = UNBUNDLED; + e[0].bundle[BELOW][CLIP] = 0; + e[0].bundle[BELOW][SUBJ] = 0; + for (i = 0; i < num_edges; i++) { + e[i].xb = edge_table[v].vertex.x; + e[i].bot.x = edge_table[v].vertex.x; + e[i].bot.y = edge_table[v].vertex.y; + + v = gpc_next_index(v, num_vertices); + + e[i].top.x = edge_table[v].vertex.x; + e[i].top.y = edge_table[v].vertex.y; + e[i].dx = (edge_table[v].vertex.x - e[i].bot.x) / + (e[i].top.y - e[i].bot.y); + e[i].type = type; + e[i].outp[ABOVE] = NULL; + e[i].outp[BELOW] = NULL; + e[i].next = NULL; + e[i].prev = NULL; + e[i].succ = + ((num_edges > 1) && (i < (num_edges - 1))) ? &(e[i + 1]) : NULL; + e[i].pred = ((num_edges > 1) && (i > 0)) ? &(e[i - 1]) : NULL; + e[i].next_bound = NULL; + e[i].bside[CLIP] = (op == GPC_DIFF) ? RIGHT : LEFT; + e[i].bside[SUBJ] = LEFT; + } + insert_bound(bound_list(lmt, edge_table[min].vertex.y), e); + } + } + + /* Do the contour reverse pass */ + for (min = 0; min < num_vertices; min++) { + /* If a reverse local minimum... */ + if (gpc_rev_min(edge_table, min, num_vertices)) { + /* Search for the previous local maximum... */ + num_edges = 1; + max = gpc_prev_index(min, num_vertices); + while (gpc_not_rmax(edge_table, max, num_vertices)) { + num_edges++; + max = gpc_prev_index(max, num_vertices); + } + + /* Build the previous edge list */ + e = &edge_table[e_index]; + e_index += num_edges; + v = min; + e[0].bstate[BELOW] = UNBUNDLED; + e[0].bundle[BELOW][CLIP] = 0; + e[0].bundle[BELOW][SUBJ] = 0; + for (i = 0; i < num_edges; i++) { + e[i].xb = edge_table[v].vertex.x; + e[i].bot.x = edge_table[v].vertex.x; + e[i].bot.y = edge_table[v].vertex.y; + + v = gpc_prev_index(v, num_vertices); + + e[i].top.x = edge_table[v].vertex.x; + e[i].top.y = edge_table[v].vertex.y; + e[i].dx = (edge_table[v].vertex.x - e[i].bot.x) / + (e[i].top.y - e[i].bot.y); + e[i].type = type; + e[i].outp[ABOVE] = NULL; + e[i].outp[BELOW] = NULL; + e[i].next = NULL; + e[i].prev = NULL; + e[i].succ = + ((num_edges > 1) && (i < (num_edges - 1))) ? &(e[i + 1]) : NULL; + e[i].pred = ((num_edges > 1) && (i > 0)) ? &(e[i - 1]) : NULL; + e[i].next_bound = NULL; + e[i].bside[CLIP] = (op == GPC_DIFF) ? RIGHT : LEFT; + e[i].bside[SUBJ] = LEFT; + } + insert_bound(bound_list(lmt, edge_table[min].vertex.y), e); + } + } + } + } + return edge_table; +} // NOLINT + +static void add_edge_to_aet(edge_node **aet, edge_node *edge, edge_node *prev) { + if (!*aet) { + /* Append edge onto the tail end of the AET */ + *aet = edge; + edge->prev = prev; + edge->next = NULL; + } else { + /* Do primary sort on the xb field */ + if (edge->xb < (*aet)->xb) { + /* Insert edge here (before the AET edge) */ + edge->prev = prev; + edge->next = *aet; + (*aet)->prev = edge; + *aet = edge; + } else { + if (edge->xb == (*aet)->xb) { + /* Do secondary sort on the dx field */ + if (edge->dx < (*aet)->dx) { + /* Insert edge here (before the AET edge) */ + edge->prev = prev; + edge->next = *aet; + (*aet)->prev = edge; + *aet = edge; + } else { + /* Head further into the AET */ + add_edge_to_aet(&((*aet)->next), edge, *aet); + } + } else { + /* Head further into the AET */ + add_edge_to_aet(&((*aet)->next), edge, *aet); + } + } + } +} + +static void add_intersection(it_node **it, edge_node *edge0, edge_node *edge1, + double x, double y) { + it_node *existing_node; + + if (!*it) { + /* Append a new node to the tail of the list */ + gpc_malloc(*it, sizeof(it_node), + const_cast("IT insertion")); + (*it)->ie[0] = edge0; + (*it)->ie[1] = edge1; + (*it)->point.x = x; + (*it)->point.y = y; + (*it)->next = NULL; + } else { + if ((*it)->point.y > y) { + /* Insert a new node mid-list */ + existing_node = *it; + gpc_malloc(*it, sizeof(it_node), + const_cast("IT insertion")); + (*it)->ie[0] = edge0; + (*it)->ie[1] = edge1; + (*it)->point.x = x; + (*it)->point.y = y; + (*it)->next = existing_node; + } else { + /* Head further down the list */ + add_intersection(&((*it)->next), edge0, edge1, x, y); + } + } +} + +static void add_st_edge(st_node **st, it_node **it, edge_node *edge, + double dy) { + st_node *existing_node; + double den = 0.0; + double r = 0.0; + double x = 0.0; + double y = 0.0; + + if (!*st) { + /* Append edge onto the tail end of the ST */ + gpc_malloc(*st, sizeof(st_node), + const_cast("ST insertion")); + (*st)->edge = edge; + (*st)->xb = edge->xb; + (*st)->xt = edge->xt; + (*st)->dx = edge->dx; + (*st)->prev = NULL; + } else { + den = ((*st)->xt - (*st)->xb) - (edge->xt - edge->xb); + + /* If new edge and ST edge don't cross */ + if ((edge->xt >= (*st)->xt) || (edge->dx == (*st)->dx) || + (fabs(den) <= DBL_EPSILON)) { + /* No intersection - insert edge here (before the ST edge) */ + existing_node = *st; + gpc_malloc(*st, sizeof(st_node), + const_cast("ST insertion")); + (*st)->edge = edge; + (*st)->xb = edge->xb; + (*st)->xt = edge->xt; + (*st)->dx = edge->dx; + (*st)->prev = existing_node; + } else { + /* Compute intersection between new edge and ST edge */ + r = (edge->xb - (*st)->xb) / den; + x = (*st)->xb + r * ((*st)->xt - (*st)->xb); + y = r * dy; + + /* Insert the edge pointers and the intersection point in the IT */ + add_intersection(it, (*st)->edge, edge, x, y); + + /* Head further into the ST */ + add_st_edge(&((*st)->prev), it, edge, dy); + } + } +} + +static void build_intersection_table(it_node **it, edge_node *aet, double dy) { + st_node *st; + st_node *stp; + edge_node *edge = NULL; + + /* Build intersection table for the current scanbeam */ + reset_it(it); + st = NULL; + + /* Process each AET edge */ + for (edge = aet; edge; edge = edge->next) { + if ((edge->bstate[ABOVE] == BUNDLE_HEAD) || edge->bundle[ABOVE][CLIP] || + edge->bundle[ABOVE][SUBJ]) { + add_st_edge(&st, it, edge, dy); + } + } + + /* Free the sorted edge table */ + while (st) { + stp = st->prev; + gpc_free(st); + st = stp; + } +} + +static int count_contours(polygon_node *polygon) { + int nc = 0; + int nv = 0; + vertex_node *v = NULL; + vertex_node *nextv = NULL; + + for (nc = 0; polygon; polygon = polygon->next) { + if (polygon->active) { + /* Count the vertices in the current contour */ + nv = 0; + for (v = polygon->proxy->v[LEFT]; v; v = v->next) { + nv++; + } + + /* Record valid vertex counts in the active field */ + if (nv > 2) { + polygon->active = nv; + nc++; + } else { + /* Invalid contour: just free the heap */ + for (v = polygon->proxy->v[LEFT]; v; v = nextv) { + nextv = v->next; + gpc_free(v); + } + polygon->active = 0; + } + } + } + return nc; +} + +static void add_left(polygon_node *p, double x, double y) { + vertex_node *nv = NULL; + + /* Create a new vertex node and set its fields */ + gpc_malloc(nv, sizeof(vertex_node), + const_cast("vertex node creation")); + nv->x = x; + nv->y = y; + + /* Add vertex nv to the left end of the polygon's vertex list */ + nv->next = p->proxy->v[LEFT]; + + /* Update proxy->[LEFT] to point to nv */ + p->proxy->v[LEFT] = nv; +} + +static void merge_left(polygon_node *p, polygon_node *q, polygon_node *list) { + polygon_node *target = NULL; + + /* Label contour as a hole */ + q->proxy->hole = 1; + + if (p->proxy != q->proxy) { + /* Assign p's vertex list to the left end of q's list */ + p->proxy->v[RIGHT]->next = q->proxy->v[LEFT]; + q->proxy->v[LEFT] = p->proxy->v[LEFT]; + + /* Redirect any p->proxy references to q->proxy */ + + for (target = p->proxy; list; list = list->next) { + if (list->proxy == target) { + list->active = 0; + list->proxy = q->proxy; + } + } + } +} + +static void add_right(polygon_node *p, double x, double y) { + vertex_node *nv = NULL; + + /* Create a new vertex node and set its fields */ + gpc_malloc(nv, sizeof(vertex_node), + const_cast("vertex node creation")); + nv->x = x; + nv->y = y; + nv->next = NULL; + + /* Add vertex nv to the right end of the polygon's vertex list */ + p->proxy->v[RIGHT]->next = nv; + + /* Update proxy->v[RIGHT] to point to nv */ + p->proxy->v[RIGHT] = nv; +} + +static void merge_right(polygon_node *p, polygon_node *q, polygon_node *list) { + polygon_node *target = NULL; + + /* Label contour as external */ + q->proxy->hole = 0; + + if (p->proxy != q->proxy) { + /* Assign p's vertex list to the right end of q's list */ + q->proxy->v[RIGHT]->next = p->proxy->v[LEFT]; + q->proxy->v[RIGHT] = p->proxy->v[RIGHT]; + + /* Redirect any p->proxy references to q->proxy */ + for (target = p->proxy; list; list = list->next) { + if (list->proxy == target) { + list->active = 0; + list->proxy = q->proxy; + } + } + } +} + +static void add_local_min(polygon_node **p, edge_node *edge, double x, + double y) { + polygon_node *existing_min = NULL; + vertex_node *nv = NULL; + + existing_min = *p; + + gpc_malloc(*p, sizeof(polygon_node), + const_cast("polygon node creation")); + + /* Create a new vertex node and set its fields */ + gpc_malloc(nv, sizeof(vertex_node), + const_cast("vertex node creation")); + nv->x = x; + nv->y = y; + nv->next = NULL; + + /* Initialise proxy to point to p itself */ + (*p)->proxy = (*p); + (*p)->active = 1; + (*p)->next = existing_min; + + /* Make v[LEFT] and v[RIGHT] point to new vertex nv */ + (*p)->v[LEFT] = nv; + (*p)->v[RIGHT] = nv; + + /* Assign polygon p to the edge */ + edge->outp[ABOVE] = *p; +} + +static int count_tristrips(polygon_node *tn) { + int total = 0; + + for (total = 0; tn; tn = tn->next) { + if (tn->active > 2) { + total++; + } + } + return total; +} + +void add_vertex(vertex_node **t, double x, double y) { + if (!(*t)) { + gpc_malloc(*t, sizeof(vertex_node), + const_cast("tristrip vertex creation")); + (*t)->x = x; + (*t)->y = y; + (*t)->next = NULL; + } else { + /* Head further down the list */ + add_vertex(&((*t)->next), x, y); + } +} + +void gpc_vertex_create(edge_node *e, int p, int s, double x, double y) { + add_vertex(&(e->outp[p]->v[s]), x, y); + e->outp[p]->active++; +} + +static void new_tristrip(polygon_node **tn, edge_node *edge, double x, + double y) { + if (!(*tn)) { + gpc_malloc(*tn, sizeof(polygon_node), + const_cast("tristrip node creation")); + (*tn)->next = NULL; + (*tn)->v[LEFT] = NULL; + (*tn)->v[RIGHT] = NULL; + (*tn)->active = 1; + add_vertex(&((*tn)->v[LEFT]), x, y); + edge->outp[ABOVE] = *tn; + } else { + /* Head further down the list */ + new_tristrip(&((*tn)->next), edge, x, y); + } +} + +static bbox *create_contour_bboxes(gpc_polygon *p) { + bbox *box; + int c = 0; + int v = 0; + + gpc_malloc(box, p->num_contours * sizeof(bbox), + const_cast("Bounding box creation")); + + /* Construct contour bounding boxes */ + for (c = 0; c < p->num_contours; c++) { + /* Initialise bounding box extent */ + box[c].xmin = DBL_MAX; + box[c].ymin = DBL_MAX; + box[c].xmax = -DBL_MAX; + box[c].ymax = -DBL_MAX; + + for (v = 0; v < p->contour[c].num_vertices; v++) { + /* Adjust bounding box */ + if (p->contour[c].vertex[v].x < box[c].xmin) { + box[c].xmin = p->contour[c].vertex[v].x; + } + if (p->contour[c].vertex[v].y < box[c].ymin) { + box[c].ymin = p->contour[c].vertex[v].y; + } + if (p->contour[c].vertex[v].x > box[c].xmax) { + box[c].xmax = p->contour[c].vertex[v].x; + } + if (p->contour[c].vertex[v].y > box[c].ymax) { + box[c].ymax = p->contour[c].vertex[v].y; + } + } + } + return box; +} + +static void minimax_test(gpc_polygon *subj, gpc_polygon *clip, gpc_op op) { + bbox *s_bbox; + bbox *c_bbox; + int s = 0; + int c = 0; + int *o_table = NULL; + int overlap = 0; + + s_bbox = create_contour_bboxes(subj); + c_bbox = create_contour_bboxes(clip); + + gpc_malloc(o_table, + subj->num_contours * clip->num_contours * sizeof(int), + const_cast("overlap table creation")); + + /* Check all subject contour bounding boxes against clip boxes */ + for (s = 0; s < subj->num_contours; s++) { + for (c = 0; c < clip->num_contours; c++) { + o_table[c * subj->num_contours + s] = + (!((s_bbox[s].xmax < c_bbox[c].xmin) || + (s_bbox[s].xmin > c_bbox[c].xmax))) && + (!((s_bbox[s].ymax < c_bbox[c].ymin) || + (s_bbox[s].ymin > c_bbox[c].ymax))); + } + } + + /* For each clip contour, search for any subject contour overlaps */ + for (c = 0; c < clip->num_contours; c++) { + overlap = 0; + for (s = 0; (!overlap) && (s < subj->num_contours); s++) { + overlap = o_table[c * subj->num_contours + s]; + } + + if (!overlap) { + /* Flag non contributing status by negating vertex count */ + clip->contour[c].num_vertices = -clip->contour[c].num_vertices; + } + } + + if (op == GPC_INT) { + /* For each subject contour, search for any clip contour overlaps */ + for (s = 0; s < subj->num_contours; s++) { + overlap = 0; + for (c = 0; (!overlap) && (c < clip->num_contours); c++) { + overlap = o_table[c * subj->num_contours + s]; + } + + if (!overlap) { + /* Flag non contributing status by negating vertex count */ + subj->contour[s].num_vertices = -subj->contour[s].num_vertices; + } + } + } + + gpc_free(s_bbox); + gpc_free(c_bbox); + gpc_free(o_table); +} + +/* +=========================================================================== + Public Functions +=========================================================================== +*/ + +void gpc_free_polygon(gpc_polygon *p) { + int c = 0; + + for (c = 0; c < p->num_contours; c++) { + gpc_free(p->contour[c].vertex); + } + gpc_free(p->hole); + gpc_free(p->contour); + p->num_contours = 0; +} + +/* +void gpc_read_polygon(FILE *fp, int read_hole_flags, gpc_polygon *p) { + int c = 0; + int v = 0; + + fscanf(fp, "%d", &(p->num_contours)); + gpc_malloc(p->hole, p->num_contours * sizeof(int), + (char *)"hole flag array creation"); + gpc_malloc(p->contour, + p->num_contours * sizeof(gpc_vertex_list), + (char *)"contour creation"); + for (c = 0; c < p->num_contours; c++) { + fscanf(fp, "%d", &(p->contour[c].num_vertices)); + + if (read_hole_flags) { + fscanf(fp, "%d", &(p->hole[c])); + } else { + p->hole[c] = 0; // Assume all contours to be external + } + + gpc_malloc(p->contour[c].vertex, + p->contour[c].num_vertices * sizeof(gpc_vertex), + (char *)"vertex creation"); + for (v = 0; v < p->contour[c].num_vertices; v++) { + fscanf(fp, "%lf %lf", &(p->contour[c].vertex[v].x), + &(p->contour[c].vertex[v].y)); + } + } +} + +void gpc_write_polygon(FILE *fp, int write_hole_flags, gpc_polygon *p) { + int c = 0; + int v = 0; + + fprintf(fp, "%d\n", p->num_contours); + for (c = 0; c < p->num_contours; c++) { + fprintf(fp, "%d\n", p->contour[c].num_vertices); + + if (write_hole_flags) { + fprintf(fp, "%d\n", p->hole[c]); + } + + for (v = 0; v < p->contour[c].num_vertices; v++) { + fprintf(fp, "% .*lf % .*lf\n", DBL_DIG, p->contour[c].vertex[v].x, + DBL_DIG, p->contour[c].vertex[v].y); + } + } +} +*/ + +void gpc_add_contour(gpc_polygon *p, gpc_vertex_list *new_contour, int hole) { + int *extended_hole = NULL; + int c = 0; + int v = 0; + gpc_vertex_list *extended_contour = NULL; + + /* Create an extended hole array */ + gpc_malloc(extended_hole, (p->num_contours + 1) * sizeof(int), + const_cast("contour hole addition")); + + /* Create an extended contour array */ + gpc_malloc(extended_contour, + (p->num_contours + 1) * sizeof(gpc_vertex_list), + const_cast("contour addition")); + + /* Copy the old contour and hole data into the extended arrays */ + for (c = 0; c < p->num_contours; c++) { + extended_hole[c] = p->hole[c]; + extended_contour[c] = p->contour[c]; + } + + /* Copy the new contour and hole onto the end of the extended arrays */ + c = p->num_contours; + extended_hole[c] = hole; + extended_contour[c].num_vertices = new_contour->num_vertices; + gpc_malloc(extended_contour[c].vertex, + new_contour->num_vertices * sizeof(gpc_vertex), + const_cast("contour addition")); + for (v = 0; v < new_contour->num_vertices; v++) { + extended_contour[c].vertex[v] = new_contour->vertex[v]; + } + + /* Dispose of the old contour */ + gpc_free(p->contour); + gpc_free(p->hole); + + /* Update the polygon information */ + p->num_contours++; + p->hole = extended_hole; + p->contour = extended_contour; +} + +// gpc_polygon_clip +void gpc_polygon_clip(gpc_op op, gpc_polygon *subj, gpc_polygon *clip, + gpc_polygon *result) { + sb_tree *sbtree = NULL; + it_node *it = NULL; + it_node *intersect = NULL; + edge_node *edge = NULL; + edge_node *prev_edge = NULL; + edge_node *next_edge = NULL; + edge_node *succ_edge = NULL; + edge_node *e0 = NULL; + edge_node *e1 = NULL; + edge_node *aet = NULL; + edge_node *c_heap = NULL; + edge_node *s_heap = NULL; + lmt_node *lmt = NULL; + lmt_node *local_min = NULL; + polygon_node *out_poly = NULL; + polygon_node *p = NULL; + polygon_node *q = NULL; + polygon_node *poly = NULL; + polygon_node *npoly = NULL; + polygon_node *cf = NULL; + vertex_node *vtx = NULL; + vertex_node *nv = NULL; + h_state horiz[2]; + int in[2]; + int exists[2]; + int parity[2] = {LEFT, LEFT}; + int c = 0; + int v = 0; + int contributing = 0; + int search = 0; + int scanbeam = 0; + int sbt_entries = 0; + int vclass = 0; + int bl = 0; + int br = 0; + int tl = 0; + int tr = 0; + double *sbt = NULL; + double xb = 0.0; + double px = 0.0; + double yb = 0.0; + double yt = 0.0; + double dy = 0.0; + double ix = 0.0; + double iy = 0.0; + + /* Test for trivial NULL result cases */ + if (((subj->num_contours == 0) && (clip->num_contours == 0)) || + ((subj->num_contours == 0) && ((op == GPC_INT) || (op == GPC_DIFF))) || + ((clip->num_contours == 0) && (op == GPC_INT))) { + result->num_contours = 0; + result->hole = NULL; + result->contour = NULL; + return; + } + /* Identify potentialy contributing contours */ + if (((op == GPC_INT) || (op == GPC_DIFF)) && (subj->num_contours > 0) && + (clip->num_contours > 0)) { + minimax_test(subj, clip, op); + } + /* Build LMT */ + if (subj->num_contours > 0) { + s_heap = build_lmt(&lmt, &sbtree, &sbt_entries, subj, SUBJ, op); + } + if (clip->num_contours > 0) { + c_heap = build_lmt(&lmt, &sbtree, &sbt_entries, clip, CLIP, op); + } + /* Return a NULL result if no contours contribute */ + if (lmt == NULL) { + result->num_contours = 0; + result->hole = NULL; + result->contour = NULL; + reset_lmt(&lmt); + gpc_free(s_heap); + gpc_free(c_heap); + return; + } + + /* Build scanbeam table from scanbeam tree */ + gpc_malloc(sbt, sbt_entries * sizeof(double), + const_cast("sbt creation")); + build_sbt(&scanbeam, sbt, sbtree); + scanbeam = 0; + free_sbtree(&sbtree); + /* Allow pointer re-use without causing memory leak */ + if (subj == result) { + gpc_free_polygon(subj); + } + if (clip == result) { + gpc_free_polygon(clip); + } + /* Invert clip polygon for difference operation */ + if (op == GPC_DIFF) { + parity[CLIP] = RIGHT; + } + local_min = lmt; + + // Process each scanbeam + while (scanbeam < sbt_entries) { + /* Set yb and yt to the bottom and top of the scanbeam */ + yb = sbt[scanbeam++]; + if (scanbeam < sbt_entries) { + yt = sbt[scanbeam]; + dy = yt - yb; + } + /* === SCANBEAM BOUNDARY PROCESSING ================================ */ + /* If LMT node corresponding to yb exists */ + if (local_min) { + if (local_min->y == yb) { + /* Add edges starting at this local minimum to the AET */ + for (edge = local_min->first_bound; edge; edge = edge->next_bound) { + add_edge_to_aet(&aet, edge, NULL); + } + local_min = local_min->next; + } + } + /* Set dummy previous x value */ + px = -DBL_MAX; + /* Create bundles within AET */ + e0 = aet; + e1 = aet; + /* Set up bundle fields of first edge */ + aet->bundle[ABOVE][aet->type] = (aet->top.y != yb); + aet->bundle[ABOVE][!aet->type] = 0; + aet->bstate[ABOVE] = UNBUNDLED; + + for (next_edge = aet->next; next_edge; next_edge = next_edge->next) { + /* Set up bundle fields of next edge */ + next_edge->bundle[ABOVE][next_edge->type] = (next_edge->top.y != yb); + next_edge->bundle[ABOVE][!next_edge->type] = 0; + next_edge->bstate[ABOVE] = UNBUNDLED; + /* Bundle edges above the scanbeam boundary if they coincide */ + if (next_edge->bundle[ABOVE][next_edge->type]) { + if (gpc_eq(e0->xb, next_edge->xb) && gpc_eq(e0->dx, next_edge->dx) && + (e0->top.y != yb)) { + next_edge->bundle[ABOVE][next_edge->type] ^= + e0->bundle[ABOVE][next_edge->type]; + next_edge->bundle[ABOVE][!next_edge->type] = + e0->bundle[ABOVE][!next_edge->type]; + next_edge->bstate[ABOVE] = BUNDLE_HEAD; + e0->bundle[ABOVE][CLIP] = 0; + e0->bundle[ABOVE][SUBJ] = 0; + e0->bstate[ABOVE] = BUNDLE_TAIL; + } + e0 = next_edge; + } + } + horiz[CLIP] = NH; + horiz[SUBJ] = NH; + + // Process each edge at this scanbeam boundary + for (edge = aet; edge; edge = edge->next) { + exists[CLIP] = + edge->bundle[ABOVE][CLIP] + (edge->bundle[BELOW][CLIP] << 1); + exists[SUBJ] = + edge->bundle[ABOVE][SUBJ] + (edge->bundle[BELOW][SUBJ] << 1); + if (exists[CLIP] || exists[SUBJ]) { + /* Set bundle side */ + edge->bside[CLIP] = parity[CLIP]; + edge->bside[SUBJ] = parity[SUBJ]; + /* Determine contributing status and quadrant occupancies */ + switch (op) { + case GPC_DIFF: + case GPC_INT: + contributing = (exists[CLIP] && (parity[SUBJ] || horiz[SUBJ])) || + (exists[SUBJ] && (parity[CLIP] || horiz[CLIP])) || + (exists[CLIP] && exists[SUBJ] && + (parity[CLIP] == parity[SUBJ])); + br = (parity[CLIP]) && (parity[SUBJ]); + bl = (parity[CLIP] ^ edge->bundle[ABOVE][CLIP]) && + (parity[SUBJ] ^ edge->bundle[ABOVE][SUBJ]); + tr = (parity[CLIP] ^ (horiz[CLIP] != NH)) && + (parity[SUBJ] ^ (horiz[SUBJ] != NH)); + tl = (parity[CLIP] ^ (horiz[CLIP] != NH) ^ + edge->bundle[BELOW][CLIP]) && + (parity[SUBJ] ^ (horiz[SUBJ] != NH) ^ + edge->bundle[BELOW][SUBJ]); + break; + case GPC_XOR: + contributing = exists[CLIP] || exists[SUBJ]; + br = (parity[CLIP]) ^ (parity[SUBJ]); + bl = (parity[CLIP] ^ edge->bundle[ABOVE][CLIP]) ^ + (parity[SUBJ] ^ edge->bundle[ABOVE][SUBJ]); + tr = (parity[CLIP] ^ (horiz[CLIP] != NH)) ^ + (parity[SUBJ] ^ (horiz[SUBJ] != NH)); + tl = (parity[CLIP] ^ (horiz[CLIP] != NH) ^ + edge->bundle[BELOW][CLIP]) ^ + (parity[SUBJ] ^ (horiz[SUBJ] != NH) ^ + edge->bundle[BELOW][SUBJ]); + break; + case GPC_UNION: + contributing = (exists[CLIP] && (!parity[SUBJ] || horiz[SUBJ])) || + (exists[SUBJ] && (!parity[CLIP] || horiz[CLIP])) || + (exists[CLIP] && exists[SUBJ] && + (parity[CLIP] == parity[SUBJ])); + br = (parity[CLIP]) || (parity[SUBJ]); + bl = (parity[CLIP] ^ edge->bundle[ABOVE][CLIP]) || + (parity[SUBJ] ^ edge->bundle[ABOVE][SUBJ]); + tr = (parity[CLIP] ^ (horiz[CLIP] != NH)) || + (parity[SUBJ] ^ (horiz[SUBJ] != NH)); + tl = (parity[CLIP] ^ (horiz[CLIP] != NH) ^ + edge->bundle[BELOW][CLIP]) || + (parity[SUBJ] ^ (horiz[SUBJ] != NH) ^ + edge->bundle[BELOW][SUBJ]); + break; + } + // Update parity + parity[CLIP] ^= edge->bundle[ABOVE][CLIP]; + parity[SUBJ] ^= edge->bundle[ABOVE][SUBJ]; + /* Update horizontal state */ + if (exists[CLIP]) { + horiz[CLIP] = next_h_state[horiz[CLIP]] + [((exists[CLIP] - 1) << 1) + parity[CLIP]]; + } + if (exists[SUBJ]) { + horiz[SUBJ] = next_h_state[horiz[SUBJ]] + [((exists[SUBJ] - 1) << 1) + parity[SUBJ]]; + } + vclass = tr + (tl << 1) + (br << 2) + (bl << 3); + if (contributing) { + xb = edge->xb; + switch (vclass) { + case EMN: + case IMN: + add_local_min(&out_poly, edge, xb, yb); + px = xb; + cf = edge->outp[ABOVE]; + break; + case ERI: + if (xb != px) { + add_right(cf, xb, yb); + px = xb; + } + edge->outp[ABOVE] = cf; + cf = NULL; + break; + case ELI: + add_left(edge->outp[BELOW], xb, yb); + px = xb; + cf = edge->outp[BELOW]; + break; + case EMX: + if (xb != px) { + add_left(cf, xb, yb); + px = xb; + } + merge_right(cf, edge->outp[BELOW], out_poly); + cf = NULL; + break; + case ILI: + if (xb != px) { + add_left(cf, xb, yb); + px = xb; + } + edge->outp[ABOVE] = cf; + cf = NULL; + break; + case IRI: + add_right(edge->outp[BELOW], xb, yb); + px = xb; + cf = edge->outp[BELOW]; + edge->outp[BELOW] = NULL; + break; + case IMX: + if (xb != px) { + add_right(cf, xb, yb); + px = xb; + } + merge_left(cf, edge->outp[BELOW], out_poly); + cf = NULL; + edge->outp[BELOW] = NULL; + break; + case IMM: + if (xb != px) { + add_right(cf, xb, yb); + px = xb; + } + merge_left(cf, edge->outp[BELOW], out_poly); + edge->outp[BELOW] = NULL; + add_local_min(&out_poly, edge, xb, yb); + cf = edge->outp[ABOVE]; + break; + case EMM: + if (xb != px) { + add_left(cf, xb, yb); + px = xb; + } + merge_right(cf, edge->outp[BELOW], out_poly); + edge->outp[BELOW] = NULL; + add_local_min(&out_poly, edge, xb, yb); + cf = edge->outp[ABOVE]; + break; + case LED: + if (edge->bot.y == yb) { + add_left(edge->outp[BELOW], xb, yb); + } + edge->outp[ABOVE] = edge->outp[BELOW]; + px = xb; + break; + case RED: + if (edge->bot.y == yb) { + add_right(edge->outp[BELOW], xb, yb); + } + edge->outp[ABOVE] = edge->outp[BELOW]; + px = xb; + break; + default: + break; + } /* End of switch */ + } /* End of contributing conditional */ + } /* End of edge exists conditional */ + } // End of AET loop + + /* Delete terminating edges from the AET, otherwise compute xt */ + for (edge = aet; edge; edge = edge->next) { + if (edge->top.y == yb) { + prev_edge = edge->prev; + next_edge = edge->next; + if (prev_edge) { + prev_edge->next = next_edge; + } else { + aet = next_edge; + } + if (next_edge) { + next_edge->prev = prev_edge; + } + /* Copy bundle head state to the adjacent tail edge if required */ + if ((edge->bstate[BELOW] == BUNDLE_HEAD) && prev_edge) { + if (prev_edge->bstate[BELOW] == BUNDLE_TAIL) { + prev_edge->outp[BELOW] = edge->outp[BELOW]; + prev_edge->bstate[BELOW] = UNBUNDLED; + if (prev_edge->prev) { + if (prev_edge->prev->bstate[BELOW] == BUNDLE_TAIL) { + prev_edge->bstate[BELOW] = BUNDLE_HEAD; + } + } + } + } + } else { + if (edge->top.y == yt) { + edge->xt = edge->top.x; + } else { + edge->xt = edge->bot.x + edge->dx * (yt - edge->bot.y); + } + } + } + + if (scanbeam < sbt_entries) { + /* === SCANBEAM INTERIOR PROCESSING ============================== */ + build_intersection_table(&it, aet, dy); + /* Process each node in the intersection table */ + for (intersect = it; intersect; intersect = intersect->next) { + e0 = intersect->ie[0]; + e1 = intersect->ie[1]; + /* Only generate output for contributing intersections */ + if ((e0->bundle[ABOVE][CLIP] || e0->bundle[ABOVE][SUBJ]) && + (e1->bundle[ABOVE][CLIP] || e1->bundle[ABOVE][SUBJ])) { + p = e0->outp[ABOVE]; + q = e1->outp[ABOVE]; + ix = intersect->point.x; + iy = intersect->point.y + yb; + + in[CLIP] = (e0->bundle[ABOVE][CLIP] && !e0->bside[CLIP]) || + (e1->bundle[ABOVE][CLIP] && e1->bside[CLIP]) || + (!e0->bundle[ABOVE][CLIP] && !e1->bundle[ABOVE][CLIP] && + e0->bside[CLIP] && e1->bside[CLIP]); + in[SUBJ] = (e0->bundle[ABOVE][SUBJ] && !e0->bside[SUBJ]) || + (e1->bundle[ABOVE][SUBJ] && e1->bside[SUBJ]) || + (!e0->bundle[ABOVE][SUBJ] && !e1->bundle[ABOVE][SUBJ] && + e0->bside[SUBJ] && e1->bside[SUBJ]); + + // Determine quadrant occupancies + switch (op) { + case GPC_DIFF: + case GPC_INT: + tr = (in[CLIP]) && (in[SUBJ]); + tl = (in[CLIP] ^ e1->bundle[ABOVE][CLIP]) && + (in[SUBJ] ^ e1->bundle[ABOVE][SUBJ]); + br = (in[CLIP] ^ e0->bundle[ABOVE][CLIP]) && + (in[SUBJ] ^ e0->bundle[ABOVE][SUBJ]); + bl = (in[CLIP] ^ e1->bundle[ABOVE][CLIP] ^ + e0->bundle[ABOVE][CLIP]) && + (in[SUBJ] ^ e1->bundle[ABOVE][SUBJ] ^ + e0->bundle[ABOVE][SUBJ]); + break; + case GPC_XOR: + tr = (in[CLIP]) ^ (in[SUBJ]); + tl = (in[CLIP] ^ e1->bundle[ABOVE][CLIP]) ^ + (in[SUBJ] ^ e1->bundle[ABOVE][SUBJ]); + br = (in[CLIP] ^ e0->bundle[ABOVE][CLIP]) ^ + (in[SUBJ] ^ e0->bundle[ABOVE][SUBJ]); + bl = (in[CLIP] ^ e1->bundle[ABOVE][CLIP] ^ + e0->bundle[ABOVE][CLIP]) ^ + (in[SUBJ] ^ e1->bundle[ABOVE][SUBJ] ^ + e0->bundle[ABOVE][SUBJ]); + break; + case GPC_UNION: + tr = (in[CLIP]) || (in[SUBJ]); + tl = (in[CLIP] ^ e1->bundle[ABOVE][CLIP]) || + (in[SUBJ] ^ e1->bundle[ABOVE][SUBJ]); + br = (in[CLIP] ^ e0->bundle[ABOVE][CLIP]) || + (in[SUBJ] ^ e0->bundle[ABOVE][SUBJ]); + bl = (in[CLIP] ^ e1->bundle[ABOVE][CLIP] ^ + e0->bundle[ABOVE][CLIP]) || + (in[SUBJ] ^ e1->bundle[ABOVE][SUBJ] ^ + e0->bundle[ABOVE][SUBJ]); + break; + } + vclass = tr + (tl << 1) + (br << 2) + (bl << 3); + switch (vclass) { + case EMN: + add_local_min(&out_poly, e0, ix, iy); + e1->outp[ABOVE] = e0->outp[ABOVE]; + break; + case ERI: + if (p) { + add_right(p, ix, iy); + e1->outp[ABOVE] = p; + e0->outp[ABOVE] = NULL; + } + break; + case ELI: + if (q) { + add_left(q, ix, iy); + e0->outp[ABOVE] = q; + e1->outp[ABOVE] = NULL; + } + break; + case EMX: + if (p && q) { + add_left(p, ix, iy); + merge_right(p, q, out_poly); + e0->outp[ABOVE] = NULL; + e1->outp[ABOVE] = NULL; + } + break; + case IMN: + add_local_min(&out_poly, e0, ix, iy); + e1->outp[ABOVE] = e0->outp[ABOVE]; + break; + case ILI: + if (p) { + add_left(p, ix, iy); + e1->outp[ABOVE] = p; + e0->outp[ABOVE] = NULL; + } + break; + case IRI: + if (q) { + add_right(q, ix, iy); + e0->outp[ABOVE] = q; + e1->outp[ABOVE] = NULL; + } + break; + case IMX: + if (p && q) { + add_right(p, ix, iy); + merge_left(p, q, out_poly); + e0->outp[ABOVE] = NULL; + e1->outp[ABOVE] = NULL; + } + break; + case IMM: + if (p && q) { + add_right(p, ix, iy); + merge_left(p, q, out_poly); + add_local_min(&out_poly, e0, ix, iy); + e1->outp[ABOVE] = e0->outp[ABOVE]; + } + break; + case EMM: + if (p && q) { + add_left(p, ix, iy); + merge_right(p, q, out_poly); + add_local_min(&out_poly, e0, ix, iy); + e1->outp[ABOVE] = e0->outp[ABOVE]; + } + break; + default: + break; + } // End of switch + } /* End of contributing intersection conditional */ + + /* Swap bundle sides in response to edge crossing */ + if (e0->bundle[ABOVE][CLIP]) { + e1->bside[CLIP] = !e1->bside[CLIP]; + } + if (e1->bundle[ABOVE][CLIP]) { + e0->bside[CLIP] = !e0->bside[CLIP]; + } + if (e0->bundle[ABOVE][SUBJ]) { + e1->bside[SUBJ] = !e1->bside[SUBJ]; + } + if (e1->bundle[ABOVE][SUBJ]) { + e0->bside[SUBJ] = !e0->bside[SUBJ]; + } + + /* Swap e0 and e1 bundles in the AET */ + prev_edge = e0->prev; + next_edge = e1->next; + if (next_edge) { + next_edge->prev = e0; + } + if (e0->bstate[ABOVE] == BUNDLE_HEAD) { + search = 1; + while (search) { + prev_edge = prev_edge->prev; + if (prev_edge) { + if (prev_edge->bstate[ABOVE] != BUNDLE_TAIL) { + search = 0; + } + } else { + search = 0; + } + } + } + if (!prev_edge) { + aet->prev = e1; + e1->next = aet; + aet = e0->next; + } else { + prev_edge->next->prev = e1; + e1->next = prev_edge->next; + prev_edge->next = e0->next; + } + e0->next->prev = prev_edge; + e1->next->prev = e1; + e0->next = next_edge; + } /* End of IT loop*/ + + // Prepare for next scanbeam + for (edge = aet; edge; edge = next_edge) { + next_edge = edge->next; + succ_edge = edge->succ; + if ((edge->top.y == yt) && succ_edge) { + /* Replace AET edge by its successor */ + succ_edge->outp[BELOW] = edge->outp[ABOVE]; + succ_edge->bstate[BELOW] = edge->bstate[ABOVE]; + succ_edge->bundle[BELOW][CLIP] = edge->bundle[ABOVE][CLIP]; + succ_edge->bundle[BELOW][SUBJ] = edge->bundle[ABOVE][SUBJ]; + prev_edge = edge->prev; + if (prev_edge) { + prev_edge->next = succ_edge; + } else { + aet = succ_edge; + } + if (next_edge) { + next_edge->prev = succ_edge; + } + succ_edge->prev = prev_edge; + succ_edge->next = next_edge; + } else { + /* Update this edge */ + edge->outp[BELOW] = edge->outp[ABOVE]; + edge->bstate[BELOW] = edge->bstate[ABOVE]; + edge->bundle[BELOW][CLIP] = edge->bundle[ABOVE][CLIP]; + edge->bundle[BELOW][SUBJ] = edge->bundle[ABOVE][SUBJ]; + edge->xb = edge->xt; + } + edge->outp[ABOVE] = NULL; + } + } + } /* === END OF SCANBEAM PROCESSING ================================== */ + // Generate result polygon from out_poly + result->contour = NULL; + result->hole = NULL; + result->num_contours = count_contours(out_poly); + if (result->num_contours > 0) { + gpc_malloc(result->hole, result->num_contours * sizeof(int), + const_cast("hole flag table creation")); + gpc_malloc(result->contour, + result->num_contours * sizeof(gpc_vertex_list), + const_cast("contour creation")); + + c = 0; + for (poly = out_poly; poly; poly = npoly) { + npoly = poly->next; + if (poly->active) { + result->hole[c] = poly->proxy->hole; + result->contour[c].num_vertices = poly->active; + gpc_malloc( + result->contour[c].vertex, + result->contour[c].num_vertices * sizeof(gpc_vertex), + const_cast("vertex creation")); + + v = result->contour[c].num_vertices - 1; + for (vtx = poly->proxy->v[LEFT]; vtx; vtx = nv) { + nv = vtx->next; + result->contour[c].vertex[v].x = vtx->x; + result->contour[c].vertex[v].y = vtx->y; + gpc_free(vtx); + v--; + } + c++; + } + gpc_free(poly); + } + } else { + for (poly = out_poly; poly; poly = npoly) { + npoly = poly->next; + gpc_free(poly); + } + } + + // Tidy up + reset_it(&it); + reset_lmt(&lmt); + gpc_free(c_heap); + gpc_free(s_heap); + gpc_free(sbt); +} // NOLINT + +void gpc_free_tristrip(gpc_tristrip *t) { + int s = 0; + for (s = 0; s < t->num_strips; s++) { + gpc_free(t->strip[s].vertex); + } + gpc_free(t->strip); + t->num_strips = 0; +} + +void gpc_polygon_to_tristrip(gpc_polygon *s, gpc_tristrip *t) { + gpc_polygon c; + c.num_contours = 0; + c.hole = NULL; + c.contour = NULL; + gpc_tristrip_clip(GPC_DIFF, s, &c, t); +} + +// gpc_tristrip_clip +void gpc_tristrip_clip(gpc_op op, gpc_polygon *subj, gpc_polygon *clip, + gpc_tristrip *result) { + sb_tree *sbtree = NULL; + it_node *it = NULL; + it_node *intersect = NULL; + edge_node *edge = NULL; + edge_node *prev_edge = NULL; + edge_node *next_edge = NULL; + edge_node *succ_edge = NULL; + edge_node *e0 = NULL; + edge_node *e1 = NULL; + edge_node *aet = NULL; + edge_node *c_heap = NULL; + edge_node *s_heap = NULL; + edge_node *cf = NULL; + lmt_node *lmt = NULL; + lmt_node *local_min = NULL; + polygon_node *tlist = NULL; + polygon_node *tn = NULL; + polygon_node *tnn = NULL; + polygon_node *p = NULL; + polygon_node *q = NULL; + vertex_node *lt = NULL; + vertex_node *ltn = NULL; + vertex_node *rt = NULL; + vertex_node *rtn = NULL; + h_state horiz[2]; + vertex_type cft = NUL; + int in[2]; + int exists[2]; + int parity[2] = {LEFT, LEFT}; + int s = 0; + int v = 0; + int contributing = 0; + int search = 0; + int scanbeam = 0; + int sbt_entries = 0; + int vclass = 0; + int bl = 0; + int br = 0; + int tl = 0; + int tr = 0; + double *sbt = NULL; + double xb = 0.0; + double px = 0.0; + double nx = 0.0; + double yb = 0.0; + double yt = 0.0; + double dy = 0.0; + double ix = 0.0; + double iy = 0.0; + + /* Test for trivial NULL result cases */ + if (((subj->num_contours == 0) && (clip->num_contours == 0)) || + ((subj->num_contours == 0) && ((op == GPC_INT) || (op == GPC_DIFF))) || + ((clip->num_contours == 0) && (op == GPC_INT))) { + result->num_strips = 0; + result->strip = NULL; + return; + } + + /* Identify potentialy contributing contours */ + if (((op == GPC_INT) || (op == GPC_DIFF)) && (subj->num_contours > 0) && + (clip->num_contours > 0)) { + minimax_test(subj, clip, op); + } + /* Build LMT */ + if (subj->num_contours > 0) { + s_heap = build_lmt(&lmt, &sbtree, &sbt_entries, subj, SUBJ, op); + } + if (clip->num_contours > 0) { + c_heap = build_lmt(&lmt, &sbtree, &sbt_entries, clip, CLIP, op); + } + /* Return a NULL result if no contours contribute */ + if (lmt == NULL) { + result->num_strips = 0; + result->strip = NULL; + reset_lmt(&lmt); + gpc_free(s_heap); + gpc_free(c_heap); + return; + } + + /* Build scanbeam table from scanbeam tree */ + gpc_malloc(sbt, sbt_entries * sizeof(double), + const_cast("sbt creation")); + build_sbt(&scanbeam, sbt, sbtree); + scanbeam = 0; + free_sbtree(&sbtree); + + /* Invert clip polygon for difference operation */ + if (op == GPC_DIFF) { + parity[CLIP] = RIGHT; + } + local_min = lmt; + + // Process each scanbeam + while (scanbeam < sbt_entries) { + /* Set yb and yt to the bottom and top of the scanbeam */ + yb = sbt[scanbeam++]; + if (scanbeam < sbt_entries) { + yt = sbt[scanbeam]; + dy = yt - yb; + } + + /* === SCANBEAM BOUNDARY PROCESSING ================================ */ + /* If LMT node corresponding to yb exists */ + if (local_min) { + if (local_min->y == yb) { + /* Add edges starting at this local minimum to the AET */ + for (edge = local_min->first_bound; edge; edge = edge->next_bound) { + add_edge_to_aet(&aet, edge, NULL); + } + local_min = local_min->next; + } + } + /* Set dummy previous x value */ + /* Create bundles within AET */ + px = -DBL_MAX; + e0 = aet; + e1 = aet; + + /* Set up bundle fields of first edge */ + aet->bundle[ABOVE][aet->type] = (aet->top.y != yb); + aet->bundle[ABOVE][!aet->type] = 0; + aet->bstate[ABOVE] = UNBUNDLED; + + for (next_edge = aet->next; next_edge; next_edge = next_edge->next) { + /* Set up bundle fields of next edge */ + next_edge->bundle[ABOVE][next_edge->type] = (next_edge->top.y != yb); + next_edge->bundle[ABOVE][!next_edge->type] = 0; + next_edge->bstate[ABOVE] = UNBUNDLED; + + /* Bundle edges above the scanbeam boundary if they coincide */ + if (next_edge->bundle[ABOVE][next_edge->type]) { + if (gpc_eq(e0->xb, next_edge->xb) && gpc_eq(e0->dx, next_edge->dx) && + (e0->top.y != yb)) { + next_edge->bundle[ABOVE][next_edge->type] ^= + e0->bundle[ABOVE][next_edge->type]; + next_edge->bundle[ABOVE][!next_edge->type] = + e0->bundle[ABOVE][!next_edge->type]; + next_edge->bstate[ABOVE] = BUNDLE_HEAD; + e0->bundle[ABOVE][CLIP] = 0; + e0->bundle[ABOVE][SUBJ] = 0; + e0->bstate[ABOVE] = BUNDLE_TAIL; + } + e0 = next_edge; + } + } + horiz[CLIP] = NH; + horiz[SUBJ] = NH; + + /* Process each edge at this scanbeam boundary */ + for (edge = aet; edge; edge = edge->next) { + exists[CLIP] = + edge->bundle[ABOVE][CLIP] + (edge->bundle[BELOW][CLIP] << 1); + exists[SUBJ] = + edge->bundle[ABOVE][SUBJ] + (edge->bundle[BELOW][SUBJ] << 1); + + if (exists[CLIP] || exists[SUBJ]) { + /* Set bundle side */ + edge->bside[CLIP] = parity[CLIP]; + edge->bside[SUBJ] = parity[SUBJ]; + + /* Determine contributing status and quadrant occupancies */ + switch (op) { + case GPC_DIFF: + case GPC_INT: + contributing = (exists[CLIP] && (parity[SUBJ] || horiz[SUBJ])) || + (exists[SUBJ] && (parity[CLIP] || horiz[CLIP])) || + (exists[CLIP] && exists[SUBJ] && + (parity[CLIP] == parity[SUBJ])); + br = (parity[CLIP]) && (parity[SUBJ]); + bl = (parity[CLIP] ^ edge->bundle[ABOVE][CLIP]) && + (parity[SUBJ] ^ edge->bundle[ABOVE][SUBJ]); + tr = (parity[CLIP] ^ (horiz[CLIP] != NH)) && + (parity[SUBJ] ^ (horiz[SUBJ] != NH)); + tl = (parity[CLIP] ^ (horiz[CLIP] != NH) ^ + edge->bundle[BELOW][CLIP]) && + (parity[SUBJ] ^ (horiz[SUBJ] != NH) ^ + edge->bundle[BELOW][SUBJ]); + break; + case GPC_XOR: + contributing = exists[CLIP] || exists[SUBJ]; + br = (parity[CLIP]) ^ (parity[SUBJ]); + bl = (parity[CLIP] ^ edge->bundle[ABOVE][CLIP]) ^ + (parity[SUBJ] ^ edge->bundle[ABOVE][SUBJ]); + tr = (parity[CLIP] ^ (horiz[CLIP] != NH)) ^ + (parity[SUBJ] ^ (horiz[SUBJ] != NH)); + tl = (parity[CLIP] ^ (horiz[CLIP] != NH) ^ + edge->bundle[BELOW][CLIP]) ^ + (parity[SUBJ] ^ (horiz[SUBJ] != NH) ^ + edge->bundle[BELOW][SUBJ]); + break; + case GPC_UNION: + contributing = (exists[CLIP] && (!parity[SUBJ] || horiz[SUBJ])) || + (exists[SUBJ] && (!parity[CLIP] || horiz[CLIP])) || + (exists[CLIP] && exists[SUBJ] && + (parity[CLIP] == parity[SUBJ])); + br = (parity[CLIP]) || (parity[SUBJ]); + bl = (parity[CLIP] ^ edge->bundle[ABOVE][CLIP]) || + (parity[SUBJ] ^ edge->bundle[ABOVE][SUBJ]); + tr = (parity[CLIP] ^ (horiz[CLIP] != NH)) || + (parity[SUBJ] ^ (horiz[SUBJ] != NH)); + tl = (parity[CLIP] ^ (horiz[CLIP] != NH) ^ + edge->bundle[BELOW][CLIP]) || + (parity[SUBJ] ^ (horiz[SUBJ] != NH) ^ + edge->bundle[BELOW][SUBJ]); + break; + } + + // Update parity + parity[CLIP] ^= edge->bundle[ABOVE][CLIP]; + parity[SUBJ] ^= edge->bundle[ABOVE][SUBJ]; + + /* Update horizontal state */ + if (exists[CLIP]) { + horiz[CLIP] = next_h_state[horiz[CLIP]] + [((exists[CLIP] - 1) << 1) + parity[CLIP]]; + } + if (exists[SUBJ]) { + horiz[SUBJ] = next_h_state[horiz[SUBJ]] + [((exists[SUBJ] - 1) << 1) + parity[SUBJ]]; + } + vclass = tr + (tl << 1) + (br << 2) + (bl << 3); + + if (contributing) { + xb = edge->xb; + switch (vclass) { + case EMN: + new_tristrip(&tlist, edge, xb, yb); + cf = edge; + break; + case ERI: + edge->outp[ABOVE] = cf->outp[ABOVE]; + if (xb != cf->xb) { + gpc_vertex_create(edge, ABOVE, RIGHT, xb, yb); + } + cf = NULL; + break; + case ELI: + gpc_vertex_create(edge, BELOW, LEFT, xb, yb); + edge->outp[ABOVE] = NULL; + cf = edge; + break; + case EMX: + if (xb != cf->xb) { + gpc_vertex_create(edge, BELOW, RIGHT, xb, yb); + } + edge->outp[ABOVE] = NULL; + cf = NULL; + break; + case IMN: + if (cft == LED) { + if (cf->bot.y != yb) { + gpc_vertex_create(cf, BELOW, LEFT, cf->xb, yb); + } + new_tristrip(&tlist, cf, cf->xb, yb); + } + edge->outp[ABOVE] = cf->outp[ABOVE]; + gpc_vertex_create(edge, ABOVE, RIGHT, xb, yb); + break; + case ILI: + new_tristrip(&tlist, edge, xb, yb); + cf = edge; + cft = ILI; + break; + case IRI: + if (cft == LED) { + if (cf->bot.y != yb) { + gpc_vertex_create(cf, BELOW, LEFT, cf->xb, yb); + } + new_tristrip(&tlist, cf, cf->xb, yb); + } + gpc_vertex_create(edge, BELOW, RIGHT, xb, yb); + edge->outp[ABOVE] = NULL; + break; + case IMX: + gpc_vertex_create(edge, BELOW, LEFT, xb, yb); + edge->outp[ABOVE] = NULL; + cft = IMX; + break; + case IMM: + gpc_vertex_create(edge, BELOW, LEFT, xb, yb); + edge->outp[ABOVE] = cf->outp[ABOVE]; + if (xb != cf->xb) { + gpc_vertex_create(cf, ABOVE, RIGHT, xb, yb); + } + cf = edge; + break; + case EMM: + gpc_vertex_create(edge, BELOW, RIGHT, xb, yb); + edge->outp[ABOVE] = NULL; + new_tristrip(&tlist, edge, xb, yb); + cf = edge; + break; + case LED: + if (edge->bot.y == yb) { + gpc_vertex_create(edge, BELOW, LEFT, xb, yb); + } + edge->outp[ABOVE] = edge->outp[BELOW]; + cf = edge; + cft = LED; + break; + case RED: + edge->outp[ABOVE] = cf->outp[ABOVE]; + if (cft == LED) { + if (cf->bot.y == yb) { + gpc_vertex_create(edge, BELOW, RIGHT, xb, yb); + } else { + if (edge->bot.y == yb) { + gpc_vertex_create(cf, BELOW, LEFT, cf->xb, yb); + gpc_vertex_create(edge, BELOW, RIGHT, xb, yb); + } + } + } else { + gpc_vertex_create(edge, BELOW, RIGHT, xb, yb); + gpc_vertex_create(edge, ABOVE, RIGHT, xb, yb); + } + cf = NULL; + break; + default: + break; + } /* End of switch */ + } /* End of contributing conditional */ + } /* End of edge exists conditional */ + } // End of AET loop + + /* Delete terminating edges from the AET, otherwise compute xt */ + for (edge = aet; edge; edge = edge->next) { + if (edge->top.y == yb) { + prev_edge = edge->prev; + next_edge = edge->next; + if (prev_edge) { + prev_edge->next = next_edge; + } else { + aet = next_edge; + } + if (next_edge) { + next_edge->prev = prev_edge; + } + + /* Copy bundle head state to the adjacent tail edge if required */ + if ((edge->bstate[BELOW] == BUNDLE_HEAD) && prev_edge) { + if (prev_edge->bstate[BELOW] == BUNDLE_TAIL) { + prev_edge->outp[BELOW] = edge->outp[BELOW]; + prev_edge->bstate[BELOW] = UNBUNDLED; + if (prev_edge->prev) { + if (prev_edge->prev->bstate[BELOW] == BUNDLE_TAIL) { + prev_edge->bstate[BELOW] = BUNDLE_HEAD; + } + } + } + } + } else { + if (edge->top.y == yt) { + edge->xt = edge->top.x; + } else { + edge->xt = edge->bot.x + edge->dx * (yt - edge->bot.y); + } + } + } + + if (scanbeam < sbt_entries) { + /* === SCANBEAM INTERIOR PROCESSING ============================== */ + build_intersection_table(&it, aet, dy); + /* Process each node in the intersection table */ + for (intersect = it; intersect; intersect = intersect->next) { + e0 = intersect->ie[0]; + e1 = intersect->ie[1]; + + /* Only generate output for contributing intersections */ + if ((e0->bundle[ABOVE][CLIP] || e0->bundle[ABOVE][SUBJ]) && + (e1->bundle[ABOVE][CLIP] || e1->bundle[ABOVE][SUBJ])) { + p = e0->outp[ABOVE]; + q = e1->outp[ABOVE]; + ix = intersect->point.x; + iy = intersect->point.y + yb; + + in[CLIP] = (e0->bundle[ABOVE][CLIP] && !e0->bside[CLIP]) || + (e1->bundle[ABOVE][CLIP] && e1->bside[CLIP]) || + (!e0->bundle[ABOVE][CLIP] && !e1->bundle[ABOVE][CLIP] && + e0->bside[CLIP] && e1->bside[CLIP]); + in[SUBJ] = (e0->bundle[ABOVE][SUBJ] && !e0->bside[SUBJ]) || + (e1->bundle[ABOVE][SUBJ] && e1->bside[SUBJ]) || + (!e0->bundle[ABOVE][SUBJ] && !e1->bundle[ABOVE][SUBJ] && + e0->bside[SUBJ] && e1->bside[SUBJ]); + + switch (op) { // Determine quadrant occupancies + case GPC_DIFF: + case GPC_INT: + tr = (in[CLIP]) && (in[SUBJ]); + tl = (in[CLIP] ^ e1->bundle[ABOVE][CLIP]) && + (in[SUBJ] ^ e1->bundle[ABOVE][SUBJ]); + br = (in[CLIP] ^ e0->bundle[ABOVE][CLIP]) && + (in[SUBJ] ^ e0->bundle[ABOVE][SUBJ]); + bl = (in[CLIP] ^ e1->bundle[ABOVE][CLIP] ^ + e0->bundle[ABOVE][CLIP]) && + (in[SUBJ] ^ e1->bundle[ABOVE][SUBJ] ^ + e0->bundle[ABOVE][SUBJ]); + break; + case GPC_XOR: + tr = (in[CLIP]) ^ (in[SUBJ]); + tl = (in[CLIP] ^ e1->bundle[ABOVE][CLIP]) ^ + (in[SUBJ] ^ e1->bundle[ABOVE][SUBJ]); + br = (in[CLIP] ^ e0->bundle[ABOVE][CLIP]) ^ + (in[SUBJ] ^ e0->bundle[ABOVE][SUBJ]); + bl = (in[CLIP] ^ e1->bundle[ABOVE][CLIP] ^ + e0->bundle[ABOVE][CLIP]) ^ + (in[SUBJ] ^ e1->bundle[ABOVE][SUBJ] ^ + e0->bundle[ABOVE][SUBJ]); + break; + case GPC_UNION: + tr = (in[CLIP]) || (in[SUBJ]); + tl = (in[CLIP] ^ e1->bundle[ABOVE][CLIP]) || + (in[SUBJ] ^ e1->bundle[ABOVE][SUBJ]); + br = (in[CLIP] ^ e0->bundle[ABOVE][CLIP]) || + (in[SUBJ] ^ e0->bundle[ABOVE][SUBJ]); + bl = (in[CLIP] ^ e1->bundle[ABOVE][CLIP] ^ + e0->bundle[ABOVE][CLIP]) || + (in[SUBJ] ^ e1->bundle[ABOVE][SUBJ] ^ + e0->bundle[ABOVE][SUBJ]); + break; + } + + vclass = tr + (tl << 1) + (br << 2) + (bl << 3); + switch (vclass) { + case EMN: + new_tristrip(&tlist, e1, ix, iy); + e0->outp[ABOVE] = e1->outp[ABOVE]; + break; + case ERI: + if (p) { + gpc_p_edge(prev_edge, e0, ABOVE); + gpc_vertex_create(prev_edge, ABOVE, LEFT, px, iy); + gpc_vertex_create(e0, ABOVE, RIGHT, ix, iy); + e1->outp[ABOVE] = e0->outp[ABOVE]; + e0->outp[ABOVE] = NULL; + } + break; + case ELI: + if (q) { + gpc_n_edge(next_edge, e1, ABOVE); + gpc_vertex_create(e1, ABOVE, LEFT, ix, iy); + gpc_vertex_create(next_edge, ABOVE, RIGHT, nx, iy); + e0->outp[ABOVE] = e1->outp[ABOVE]; + e1->outp[ABOVE] = NULL; + } + break; + case EMX: + if (p && q) { + gpc_vertex_create(e0, ABOVE, LEFT, ix, iy); + e0->outp[ABOVE] = NULL; + e1->outp[ABOVE] = NULL; + } + break; + case IMN: + gpc_p_edge(prev_edge, e0, ABOVE); + gpc_vertex_create(prev_edge, ABOVE, LEFT, px, iy); + gpc_n_edge(next_edge, e1, ABOVE); + gpc_vertex_create(next_edge, ABOVE, RIGHT, nx, iy); + new_tristrip(&tlist, prev_edge, px, iy); + e1->outp[ABOVE] = prev_edge->outp[ABOVE]; + gpc_vertex_create(e1, ABOVE, RIGHT, ix, iy); + new_tristrip(&tlist, e0, ix, iy); + next_edge->outp[ABOVE] = e0->outp[ABOVE]; + gpc_vertex_create(next_edge, ABOVE, RIGHT, nx, iy); + break; + case ILI: + if (p) { + gpc_vertex_create(e0, ABOVE, LEFT, ix, iy); + gpc_n_edge(next_edge, e1, ABOVE); + gpc_vertex_create(next_edge, ABOVE, RIGHT, nx, iy); + e1->outp[ABOVE] = e0->outp[ABOVE]; + e0->outp[ABOVE] = NULL; + } + break; + case IRI: + if (q) { + gpc_vertex_create(e1, ABOVE, RIGHT, ix, iy); + gpc_p_edge(prev_edge, e0, ABOVE); + gpc_vertex_create(prev_edge, ABOVE, LEFT, px, iy); + e0->outp[ABOVE] = e1->outp[ABOVE]; + e1->outp[ABOVE] = NULL; + } + break; + case IMX: + if (p && q) { + gpc_vertex_create(e0, ABOVE, RIGHT, ix, iy); + gpc_vertex_create(e1, ABOVE, LEFT, ix, iy); + e0->outp[ABOVE] = NULL; + e1->outp[ABOVE] = NULL; + gpc_p_edge(prev_edge, e0, ABOVE); + gpc_vertex_create(prev_edge, ABOVE, LEFT, px, iy); + new_tristrip(&tlist, prev_edge, px, iy); + gpc_n_edge(next_edge, e1, ABOVE); + gpc_vertex_create(next_edge, ABOVE, RIGHT, nx, iy); + next_edge->outp[ABOVE] = prev_edge->outp[ABOVE]; + gpc_vertex_create(next_edge, ABOVE, RIGHT, nx, iy); + } + break; + case IMM: + if (p && q) { + gpc_vertex_create(e0, ABOVE, RIGHT, ix, iy); + gpc_vertex_create(e1, ABOVE, LEFT, ix, iy); + gpc_p_edge(prev_edge, e0, ABOVE); + gpc_vertex_create(prev_edge, ABOVE, LEFT, px, iy); + new_tristrip(&tlist, prev_edge, px, iy); + gpc_n_edge(next_edge, e1, ABOVE); + gpc_vertex_create(next_edge, ABOVE, RIGHT, nx, iy); + e1->outp[ABOVE] = prev_edge->outp[ABOVE]; + gpc_vertex_create(e1, ABOVE, RIGHT, ix, iy); + new_tristrip(&tlist, e0, ix, iy); + next_edge->outp[ABOVE] = e0->outp[ABOVE]; + gpc_vertex_create(next_edge, ABOVE, RIGHT, nx, iy); + } + break; + case EMM: + if (p && q) { + gpc_vertex_create(e0, ABOVE, LEFT, ix, iy); + new_tristrip(&tlist, e1, ix, iy); + e0->outp[ABOVE] = e1->outp[ABOVE]; + } + break; + default: + break; + } /* End of switch */ + } /* End of contributing intersection conditional */ + + // Swap bundle sides in response to edge crossing + if (e0->bundle[ABOVE][CLIP]) { + e1->bside[CLIP] = !e1->bside[CLIP]; + } + if (e1->bundle[ABOVE][CLIP]) { + e0->bside[CLIP] = !e0->bside[CLIP]; + } + if (e0->bundle[ABOVE][SUBJ]) { + e1->bside[SUBJ] = !e1->bside[SUBJ]; + } + if (e1->bundle[ABOVE][SUBJ]) { + e0->bside[SUBJ] = !e0->bside[SUBJ]; + } + + /* Swap e0 and e1 bundles in the AET */ + prev_edge = e0->prev; + next_edge = e1->next; + if (e1->next) { + e1->next->prev = e0; + } + + if (e0->bstate[ABOVE] == BUNDLE_HEAD) { + search = 1; + while (search) { + prev_edge = prev_edge->prev; + if (prev_edge) { + if (prev_edge->bundle[ABOVE][CLIP] || + prev_edge->bundle[ABOVE][SUBJ] || + (prev_edge->bstate[ABOVE] == BUNDLE_HEAD)) { + search = 0; + } + } else { + search = 0; + } + } + } + if (!prev_edge) { + e1->next = aet; + aet = e0->next; + } else { + e1->next = prev_edge->next; + prev_edge->next = e0->next; + } + e0->next->prev = prev_edge; + e1->next->prev = e1; + e0->next = next_edge; + } /* End of IT loop*/ + + /* Prepare for next scanbeam */ + for (edge = aet; edge; edge = next_edge) { + next_edge = edge->next; + succ_edge = edge->succ; + + if ((edge->top.y == yt) && succ_edge) { + /* Replace AET edge by its successor */ + succ_edge->outp[BELOW] = edge->outp[ABOVE]; + succ_edge->bstate[BELOW] = edge->bstate[ABOVE]; + succ_edge->bundle[BELOW][CLIP] = edge->bundle[ABOVE][CLIP]; + succ_edge->bundle[BELOW][SUBJ] = edge->bundle[ABOVE][SUBJ]; + prev_edge = edge->prev; + if (prev_edge) { + prev_edge->next = succ_edge; + } else { + aet = succ_edge; + } + if (next_edge) { + next_edge->prev = succ_edge; + } + succ_edge->prev = prev_edge; + succ_edge->next = next_edge; + } else { + /* Update this edge */ + edge->outp[BELOW] = edge->outp[ABOVE]; + edge->bstate[BELOW] = edge->bstate[ABOVE]; + edge->bundle[BELOW][CLIP] = edge->bundle[ABOVE][CLIP]; + edge->bundle[BELOW][SUBJ] = edge->bundle[ABOVE][SUBJ]; + edge->xb = edge->xt; + } + edge->outp[ABOVE] = NULL; + } + } + } /* === END OF SCANBEAM PROCESSING ================================== */ + + // Generate result tristrip from tlist + result->strip = NULL; + result->num_strips = count_tristrips(tlist); + if (result->num_strips > 0) { + gpc_malloc(result->strip, + result->num_strips * sizeof(gpc_vertex_list), + const_cast("tristrip list creation")); + + s = 0; + for (tn = tlist; tn; tn = tnn) { + tnn = tn->next; + if (tn->active > 2) { + /* Valid tristrip: copy the vertices and free the heap */ + result->strip[s].num_vertices = tn->active; + gpc_malloc(result->strip[s].vertex, + tn->active * sizeof(gpc_vertex), + const_cast("tristrip creation")); + v = 0; + if (0) { + lt = tn->v[RIGHT]; + rt = tn->v[LEFT]; + } else { + lt = tn->v[LEFT]; + rt = tn->v[RIGHT]; + } + while (lt || rt) { + if (lt) { + ltn = lt->next; + result->strip[s].vertex[v].x = lt->x; + result->strip[s].vertex[v].y = lt->y; + v++; + gpc_free(lt); + lt = ltn; + } + if (rt) { + rtn = rt->next; + result->strip[s].vertex[v].x = rt->x; + result->strip[s].vertex[v].y = rt->y; + v++; + gpc_free(rt); + rt = rtn; + } + } + s++; + } else { + /* Invalid tristrip: just free the heap */ + for (lt = tn->v[LEFT]; lt; lt = ltn) { + ltn = lt->next; + gpc_free(lt); + } + for (rt = tn->v[RIGHT]; rt; rt = rtn) { + rtn = rt->next; + gpc_free(rt); + } + } + gpc_free(tn); + } + } + // Tidy up + reset_it(&it); + reset_lmt(&lmt); + gpc_free(c_heap); + gpc_free(s_heap); + gpc_free(sbt); +} // NOLINT + +} // namespace gpc + +/* vim: set expandtab ts=4 sw=4 sts=4 tw=100: */ diff --git a/paddle/fluid/operators/detection/gpc.h b/paddle/fluid/operators/detection/gpc.h new file mode 100644 index 0000000000000000000000000000000000000000..ee86262ef2c486e4eaeeeaf56c2392d2a1c5851b --- /dev/null +++ b/paddle/fluid/operators/detection/gpc.h @@ -0,0 +1,246 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +/*************************************************************************** + * + * Copyright (c) 2015 Baidu.com, Inc. All Rights Reserved + * + **************************************************************************/ + +/** + * @file include/gpc.h + * @author huhan02(com@baidu.com) + * @date 2015/12/18 13:52:10 + * @brief + * + * @modified by sunyipeng + * @email sunyipeng@baidu.com + * @date 2018/6/12 + **/ + +#ifndef PADDLE_FLUID_OPERATORS_DETECTION_GPC_H_ // GPC_H_ +#define PADDLE_FLUID_OPERATORS_DETECTION_GPC_H_ // GPC_H_ + +#include +#include +#include +#include + +namespace gpc { + +typedef enum { // Set operation type + GPC_DIFF, // Difference + GPC_INT, // Intersection + GPC_XOR, // Exclusive or + GPC_UNION // Union +} gpc_op; + +typedef struct { // Polygon vertex structure + double x; // Vertex x component + double y; // vertex y component +} gpc_vertex; + +typedef struct { // Vertex list structure + int num_vertices; // Number of vertices in list + gpc_vertex *vertex; // Vertex array pointer +} gpc_vertex_list; + +typedef struct { // Polygon set structure + int num_contours; // Number of contours in polygon + int *hole; // Hole external contour flags + gpc_vertex_list *contour; // Contour array pointer +} gpc_polygon; + +typedef struct { // Tristrip set structure + int num_strips; // Number of tristrips + gpc_vertex_list *strip; // Tristrip array pointer +} gpc_tristrip; + +typedef enum { LEFT, RIGHT } gpc_left_right; + +typedef enum { ABOVE, BELOW } gpc_above_below; + +typedef enum { CLIP, SUBJ } gpc_clip_subj; + +typedef enum { /* Edge intersection classes */ + NUL, /* Empty non-intersection */ + EMX, /* External maximum */ + ELI, /* External left intermediate */ + TED, /* Top edge */ + ERI, /* External right intermediate */ + RED, /* Right edge */ + IMM, /* Internal maximum and minimum */ + IMN, /* Internal minimum */ + EMN, /* External minimum */ + EMM, /* External maximum and minimum */ + LED, /* Left edge */ + ILI, /* Internal left intermediate */ + BED, /* Bottom edge */ + IRI, /* Internal right intermediate */ + IMX, /* Internal maximum */ + FUL /* Full non-intersection */ +} vertex_type; + +typedef enum { /* Horizontal edge states */ + NH, /* No horizontal edge */ + BH, /* Bottom horizontal edge */ + TH /* Top horizontal edge */ +} h_state; + +typedef enum { /* Edge bundle state */ + UNBUNDLED, /* Isolated edge not within a bundle */ + BUNDLE_HEAD, /* Bundle head node */ + BUNDLE_TAIL /* Passive bundle tail node */ +} bundle_state; + +typedef struct v_shape { /* Internal vertex list datatype */ + double x; /* X coordinate component */ + double y; /* Y coordinate component */ + struct v_shape *next; /* Pointer to next vertex in list */ +} vertex_node; + +typedef struct p_shape { /* Internal contour / tristrip type */ + int active; /* Active flag / vertex count */ + int hole; /* Hole / external contour flag */ + vertex_node *v[2]; /* Left and right vertex list ptrs */ + struct p_shape *next; /* Pointer to next polygon contour */ + struct p_shape *proxy; /* Pointer to actual structure used */ +} polygon_node; + +typedef struct edge_shape { + gpc_vertex vertex; /* Piggy-backed contour vertex data */ + gpc_vertex bot; /* Edge lower (x, y) coordinate */ + gpc_vertex top; /* Edge upper (x, y) coordinate */ + double xb; /* Scanbeam bottom x coordinate */ + double xt; /* Scanbeam top x coordinate */ + double dx; /* Change in x for a unit y increase */ + int type; /* Clip / subject edge flag */ + int bundle[2][2]; /* Bundle edge flags */ + int bside[2]; /* Bundle left / right indicators */ + bundle_state bstate[2]; /* Edge bundle state */ + polygon_node *outp[2]; /* Output polygon / tristrip pointer */ + struct edge_shape *prev; /* Previous edge in the AET */ + struct edge_shape *next; /* Next edge in the AET */ + struct edge_shape *pred; /* Edge connected at the lower end */ + struct edge_shape *succ; /* Edge connected at the upper end */ + struct edge_shape *next_bound; /* Pointer to next bound in LMT */ +} edge_node; + +inline bool gpc_eq(float a, float b) { return (fabs(a - b) <= 1e-6); } + +inline bool gpc_prev_index(float a, float b) { return (fabs(a - b) <= 1e-6); } + +inline int gpc_prev_index(int i, int n) { return ((i - 1 + n) % n); } + +inline int gpc_next_index(int i, int n) { return ((i + 1) % n); } + +inline int gpc_optimal(gpc_vertex *v, int i, int n) { + return (v[(i + 1) % n].y != v[i].y || v[(i - 1 + n) % n].y != v[i].y); +} + +inline int gpc_fwd_min(edge_node *v, int i, int n) { + return (v[(i + 1) % n].vertex.y > v[i].vertex.y && + v[(i - 1 + n) % n].vertex.y >= v[i].vertex.y); +} + +inline int gpc_not_fmax(edge_node *v, int i, int n) { + return (v[(i + 1) % n].vertex.y > v[i].vertex.y); +} + +inline int gpc_rev_min(edge_node *v, int i, int n) { + return (v[(i + 1) % n].vertex.y >= v[i].vertex.y && + v[(i - 1 + n) % n].vertex.y > v[i].vertex.y); +} + +inline int gpc_not_rmax(edge_node *v, int i, int n) { + return (v[(i - 1 + n) % n].vertex.y > v[i].vertex.y); +} + +// inline void gpc_p_edge(edge_node *d, edge_node *e, int p, double i, double j) +// { +inline void gpc_p_edge(edge_node *d, edge_node *e, int p) { + d = e; + do { + d = d->prev; + } while (!d->outp[p]); + // i = d->bot.x + d->dx * (j - d->bot.y); +} + +// inline void gpc_n_edge(edge_node *d, edge_node *e, int p, double i, double j) +// { +inline void gpc_n_edge(edge_node *d, edge_node *e, int p) { + d = e; + do { + d = d->next; + } while (!d->outp[p]); + // i = d->bot.x + d->dx * (j - d->bot.y); +} + +template +void gpc_malloc(T *&p, int b, char *s) { + if (b > 0) { + p = (T *)malloc(b); + + if (!p) { + fprintf(stderr, "gpc malloc failure: %s\n", s); + exit(0); + } + } else { + p = NULL; + } +} +template +void gpc_free(T *&p) { + if (p) { + free(p); + p = NULL; + } +} + +/* +=========================================================================== + Public Function Prototypes +=========================================================================== +*/ + +void add_vertex(vertex_node **t, double x, double y); + +void gpc_vertex_create(edge_node *e, int p, int s, double x, double y); + +/* +void gpc_read_polygon(FILE *infile_ptr, int read_hole_flags, + gpc_polygon *polygon); + +void gpc_write_polygon(FILE *outfile_ptr, int write_hole_flags, + gpc_polygon *polygon); +*/ +void gpc_add_contour(gpc_polygon *polygon, gpc_vertex_list *contour, int hole); + +void gpc_polygon_clip(gpc_op set_operation, gpc_polygon *subject_polygon, + gpc_polygon *clip_polygon, gpc_polygon *result_polygon); + +void gpc_tristrip_clip(gpc_op set_operation, gpc_polygon *subject_polygon, + gpc_polygon *clip_polygon, + gpc_tristrip *result_tristrip); + +void gpc_polygon_to_tristrip(gpc_polygon *polygon, gpc_tristrip *tristrip); + +void gpc_free_polygon(gpc_polygon *polygon); + +void gpc_free_tristrip(gpc_tristrip *tristrip); + +} // namespace gpc + +#endif // PADDLE_FLUID_OPERATORS_DETECTION_GPC_H_ +/* vim: set expandtab ts=4 sw=4 sts=4 tw=100: */ diff --git a/paddle/fluid/operators/detection/mask_util.cc b/paddle/fluid/operators/detection/mask_util.cc new file mode 100644 index 0000000000000000000000000000000000000000..bd6fee713815345152fce73e85a45aa5cd68b1da --- /dev/null +++ b/paddle/fluid/operators/detection/mask_util.cc @@ -0,0 +1,229 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/detection/mask_util.h" +#include +#include +#include +#include +#include +#include "paddle/fluid/memory/memory.h" + +namespace paddle { +namespace operators { + +uint32_t UMax(uint32_t a, uint32_t b) { return (a > b) ? a : b; } + +static inline int Compare(const void* a, const void* b) { + uint32_t c = *(reinterpret_cast(a)); + uint32_t d = *(reinterpret_cast(b)); + return c > d ? 1 : c < d ? -1 : 0; +} + +void Decode(const uint32_t* cnts, int m, uint8_t* mask) { + uint8_t v = 0; + for (int j = 0; j < m; j++) { + for (uint32_t k = 0; k < cnts[j]; k++) { + *(mask++) = v; + } + v = !v; + } +} + +typedef uint32_t uint; +void Poly2Mask(const float* xy, int k, int h, int w, uint8_t* mask) { + int j, m = 0; + double scale = 5; + int *x, *y, *u, *v; + uint *a, *b; + platform::CPUPlace cpu; + auto xptr = memory::Alloc(cpu, sizeof(int) * (k + 1) * 2); + x = reinterpret_cast(xptr->ptr()); + y = x + (k + 1); + + for (j = 0; j < k; j++) x[j] = static_cast(scale * xy[j * 2 + 0] + .5); + x[k] = x[0]; + for (j = 0; j < k; j++) y[j] = static_cast(scale * xy[j * 2 + 1] + .5); + y[k] = y[0]; + for (j = 0; j < k; j++) { + m += UMax(abs(x[j] - x[j + 1]), abs(y[j] - y[j + 1])) + 1; + } + auto vptr = memory::Alloc(cpu, sizeof(int) * m * 2); + u = reinterpret_cast(vptr->ptr()); + v = u + m; + m = 0; + for (j = 0; j < k; j++) { + int xs = x[j], xe = x[j + 1], ys = y[j], ye = y[j + 1], dx, dy, t, d; + int flip; + double s; + dx = abs(xe - xs); + dy = abs(ys - ye); + flip = (dx >= dy && xs > xe) || (dx < dy && ys > ye); + if (flip) { + t = xs; + xs = xe; + xe = t; + t = ys; + ys = ye; + ye = t; + } + if (dx >= dy) { + s = dx == 0 ? 0 : static_cast(ye - ys) / dx; + for (d = 0; d <= dx; d++) { + t = flip ? dx - d : d; + u[m] = t + xs; + v[m] = static_cast(ys + s * t + .5); + m++; + } + } else { + s = dy == 0 ? 0 : static_cast(xe - xs) / dy; + for (d = 0; d <= dy; d++) { + t = flip ? dy - d : d; + v[m] = t + ys; + u[m] = static_cast(xs + s * t + .5); + m++; + } + } + } + /* get points along y-boundary and downsample */ + k = m; + m = 0; + double xd, yd; + auto xyptr = memory::Alloc(cpu, sizeof(int) * k * 2); + x = reinterpret_cast(xyptr->ptr()); + y = x + k; + for (j = 1; j < k; j++) { + if (u[j] != u[j - 1]) { + xd = static_cast(u[j] < u[j - 1] ? u[j] : u[j] - 1); + xd = (xd + .5) / scale - .5; + if (floor(xd) != xd || xd < 0 || xd > w - 1) continue; + yd = static_cast(v[j] < v[j - 1] ? v[j] : v[j - 1]); + yd = (yd + .5) / scale - .5; + if (yd < 0) + yd = 0; + else if (yd > h) + yd = h; + yd = ceil(yd); + x[m] = static_cast(xd); + y[m] = static_cast(yd); + m++; + } + } + /* compute rle encoding given y-boundary points */ + k = m; + auto aptr = memory::Alloc(cpu, sizeof(uint) * (k + 1)); + a = reinterpret_cast(aptr->ptr()); + for (j = 0; j < k; j++) a[j] = static_cast(x[j] * h + y[j]); + a[k++] = static_cast(h * w); + + qsort(a, k, sizeof(uint), Compare); + uint p = 0; + for (j = 0; j < k; j++) { + uint t = a[j]; + a[j] -= p; + p = t; + } + auto bptr = memory::Alloc(cpu, sizeof(uint32_t) * k); + b = reinterpret_cast(bptr->ptr()); + j = m = 0; + b[m++] = a[j++]; + while (j < k) { + if (a[j] > 0) { + b[m++] = a[j++]; + } else { + j++; + if (j < k) b[m - 1] += a[j++]; + } + } + + // convert to mask + auto mskptr = memory::Alloc(cpu, sizeof(uint8_t) * h * w); + uint8_t* msk = reinterpret_cast(mskptr->ptr()); + Decode(b, m, msk); + + for (int ii = 0; ii < h; ++ii) { + for (int jj = 0; jj < w; ++jj) { + mask[ii * w + jj] = msk[jj * h + ii]; + } + } +} + +void Poly2Boxes(const std::vector>>& polys, + float* boxes) { + // lists + for (size_t i = 0; i < polys.size(); ++i) { + float x0 = std::numeric_limits::max(); + float x1 = std::numeric_limits::min(); + float y0 = std::numeric_limits::max(); + float y1 = std::numeric_limits::min(); + // each list may have more than one polys + for (size_t j = 0; j < polys[i].size(); ++j) { + for (size_t k = 0; k < polys[i][j].size() / 2; ++k) { + x0 = std::min(x0, polys[i][j][2 * k]); + x1 = std::max(x1, polys[i][j][2 * k]); + y0 = std::min(y0, polys[i][j][2 * k + 1]); + y1 = std::max(y1, polys[i][j][2 * k + 1]); + } + } + boxes[i * 4] = x0; + boxes[i * 4 + 1] = y0; + boxes[i * 4 + 2] = x1; + boxes[i * 4 + 3] = y1; + } +} + +void Polys2MaskWrtBox(const std::vector>& polygons, + const float* box, int M, uint8_t* mask) { + float w = box[2] - box[0]; + float h = box[3] - box[1]; + w = std::max(w, static_cast(1.)); + h = std::max(h, static_cast(1.)); + + uint8_t* msk = nullptr; + if (polygons.size() == 1UL) { + msk = mask; + } else { + msk = reinterpret_cast( + malloc(M * M * polygons.size() * sizeof(uint8_t))); + } + for (size_t i = 0; i < polygons.size(); ++i) { + int k = polygons[i].size() / 2; + std::vector p; + for (int j = 0; j < k; ++j) { + float pw = (polygons[i][2 * j] - box[0]) * M / w; + float ph = (polygons[i][2 * j + 1] - box[1]) * M / h; + p.push_back(pw); + p.push_back(ph); + } + uint8_t* msk_i = msk + i * M * M; + Poly2Mask(p.data(), k, M, M, msk_i); + } + + if (polygons.size() > 1UL) { + for (size_t i = 0; i < polygons.size(); ++i) { + uint8_t* msk_i = msk + i * M * M; + for (int j = 0; j < M * M; ++j) { + if (i == 0) { + mask[j] = msk_i[j]; + } else { + mask[j] = (mask[j] + msk_i[j]) > 0 ? 1 : 0; + } + } + } + free(msk); + } +} + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/detection/mask_util.h b/paddle/fluid/operators/detection/mask_util.h new file mode 100644 index 0000000000000000000000000000000000000000..4e0ea54f6d89ff273382afc1e9a151cfd9773cc6 --- /dev/null +++ b/paddle/fluid/operators/detection/mask_util.h @@ -0,0 +1,30 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include + +namespace paddle { +namespace operators { + +void Poly2Mask(const float* ploy, int k, int h, int w, uint8_t* mask); + +void Poly2Boxes(const std::vector>>& polys, + float* boxes); + +void Polys2MaskWrtBox(const std::vector>& polygons, + const float* box, int M, uint8_t* mask); +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/detection/mask_util_test.cc b/paddle/fluid/operators/detection/mask_util_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..de904e947463977229545897b723b98b4d0708d6 --- /dev/null +++ b/paddle/fluid/operators/detection/mask_util_test.cc @@ -0,0 +1,115 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/detection/mask_util.h" +#include +#include "paddle/fluid/memory/memory.h" + +namespace paddle { +namespace operators { + +template +void Compare(const T* a, const T* b, const int n) { + for (int i = 0; i < n; i++) { + EXPECT_EQ(a[i], b[i]); + } +} + +TEST(MaskUtil, Poly2MaskTest) { + float polys[] = {1.97f, 1.88f, 5.81f, 1.88f, 1.69f, + 6.53f, 5.94f, 6.38f, 1.97f, 1.88f}; + int h = 8, w = 8; + int k = 5; // length(polys) / 2 + // clang-format off + uint8_t expect_mask[] = { + 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 1, 1, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 1, 0, 0, 0, 0, + 0, 0, 1, 1, 1, 0, 0, 0, + 0, 0, 1, 1, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0 + }; + // clang-format on + + // the groud-truth mask is computed by coco API: + // + // import pycocotools.mask as mask_util + // import numpy as np + // segm = [1.97, 1.88, 5.81, 1.88, 1.69, 6.53, 5.94, 6.38, 1.97, 1.88] + // rles = mask_util.frPyObjects([segm], im_h, im_w) + // mask = mask_util.decode(rles) + // print mask + platform::CPUPlace cpu; + auto allocation = memory::Alloc(cpu, sizeof(expect_mask)); + uint8_t* mask = reinterpret_cast(allocation->ptr()); + Poly2Mask(polys, k, h, w, mask); + Compare(expect_mask, mask, h * w); +} + +TEST(MaskUtil, Poly2BoxesTest) { + // clang-format off + std::vector>> polys = { + {{1.97f, 1.88f, 5.81f, 1.88f, 1.69f, 6.53f, 5.94f, 6.38f, 1.97f, 1.88f}}, + {{2.97f, 1.88f, 3.81f, 1.68f, 1.69f, 6.63f, 6.94f, 6.58f, 2.97f, 0.88f}} + }; + float expect_boxes[] = { + 1.69f, 1.88f, 5.94f, 6.53f, + 1.69f, 0.88f, 6.94f, 6.63f + }; + // clang-format on + + platform::CPUPlace cpu; + auto allocation = memory::Alloc(cpu, sizeof(expect_boxes)); + float* boxes = reinterpret_cast(allocation->ptr()); + Poly2Boxes(polys, boxes); + Compare(expect_boxes, boxes, 8); +} + +TEST(MaskUtil, Polys2MaskWrtBoxTest) { + // clang-format off + std::vector>> polys = {{ + {1.97f, 1.88f, 5.81f, 1.88f, 1.69f, 6.53f, 5.94f, 6.38f, 1.97f, 1.88f}, + {2.97f, 1.88f, 3.81f, 1.68f, 1.69f, 6.63f, 6.94f, 6.58f, 2.97f, 0.88f}}}; + float expect_boxes[] = { + 1.69f, 0.88f, 6.94f, 6.63f + }; + uint8_t expect_mask[] = { + 0, 0, 0, 0, 0, 0, 0, 0, + 0, 1, 1, 1, 1, 1, 0, 0, + 0, 0, 1, 1, 1, 0, 0, 0, + 0, 0, 1, 1, 1, 0, 0, 0, + 0, 0, 1, 1, 1, 0, 0, 0, + 0, 1, 1, 1, 1, 1, 0, 0, + 0, 1, 1, 1, 1, 1, 1, 0, + 1, 1, 1, 1, 1, 1, 1, 1 + }; + // clang-format on + + platform::CPUPlace cpu; + auto allocation = memory::Alloc(cpu, sizeof(expect_boxes)); + float* boxes = reinterpret_cast(allocation->ptr()); + Poly2Boxes(polys, boxes); + Compare(expect_boxes, boxes, 4); + + auto allocat_mask = memory::Alloc(cpu, sizeof(expect_mask)); + uint8_t* mask = reinterpret_cast(allocat_mask->ptr()); + int M = 8; + Polys2MaskWrtBox(polys[0], expect_boxes, M, mask); + Compare(expect_mask, mask, M * M); +} + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/detection/mine_hard_examples_op.cc b/paddle/fluid/operators/detection/mine_hard_examples_op.cc index 54a4b87ec8f13c4d474aad4cc0b8159cd5f59d1c..f70e6adb5b4aefc02dabd4425ee4d633fff82e31 100644 --- a/paddle/fluid/operators/detection/mine_hard_examples_op.cc +++ b/paddle/fluid/operators/detection/mine_hard_examples_op.cc @@ -249,8 +249,7 @@ class MineHardExamplesOp : public framework::OperatorWithKernel { framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext& ctx) const override { return framework::OpKernelType( - framework::ToDataType(ctx.Input("ClsLoss")->type()), - platform::CPUPlace()); + ctx.Input("ClsLoss")->type(), platform::CPUPlace()); } }; diff --git a/paddle/fluid/operators/detection/multiclass_nms_op.cc b/paddle/fluid/operators/detection/multiclass_nms_op.cc index 60b93efdce810f8552374449fe5a6fc79b1a92c1..f357e3ccf905309e6656f3fa87fbee45dc357c1e 100644 --- a/paddle/fluid/operators/detection/multiclass_nms_op.cc +++ b/paddle/fluid/operators/detection/multiclass_nms_op.cc @@ -9,10 +9,11 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and limitations under the License. */ +#include #include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/detection/poly_util.h" namespace paddle { namespace operators { @@ -20,9 +21,6 @@ namespace operators { using Tensor = framework::Tensor; using LoDTensor = framework::LoDTensor; -constexpr int64_t kOutputDim = 6; -constexpr int64_t kBBoxSize = 4; - class MultiClassNMSOp : public framework::OperatorWithKernel { public: using framework::OperatorWithKernel::OperatorWithKernel; @@ -37,31 +35,52 @@ class MultiClassNMSOp : public framework::OperatorWithKernel { auto box_dims = ctx->GetInputDim("BBoxes"); auto score_dims = ctx->GetInputDim("Scores"); - - PADDLE_ENFORCE_EQ(box_dims.size(), 3, - "The rank of Input(BBoxes) must be 3."); - PADDLE_ENFORCE_EQ(score_dims.size(), 3, - "The rank of Input(Scores) must be 3."); - PADDLE_ENFORCE_EQ(box_dims[2], 4, - "The 2nd dimension of Input(BBoxes) must be 4, " - "represents the layout of coordinate " - "[xmin, ymin, xmax, ymax]"); - PADDLE_ENFORCE_EQ(box_dims[1], score_dims[2], - "The 1st dimensiong of Input(BBoxes) must be equal to " - "3rd dimension of Input(Scores), which represents the " - "predicted bboxes."); - + auto score_size = score_dims.size(); + + if (ctx->IsRuntime()) { + PADDLE_ENFORCE(score_size == 2 || score_size == 3, + "The rank of Input(Scores) must be 2 or 3"); + PADDLE_ENFORCE_EQ(box_dims.size(), 3, + "The rank of Input(BBoxes) must be 3"); + if (score_size == 3) { + PADDLE_ENFORCE(box_dims[2] == 4 || box_dims[2] == 8 || + box_dims[2] == 16 || box_dims[2] == 24 || + box_dims[2] == 32, + "The last dimension of Input(BBoxes) must be 4 or 8, " + "represents the layout of coordinate " + "[xmin, ymin, xmax, ymax] or " + "4 points: [x1, y1, x2, y2, x3, y3, x4, y4] or " + "8 points: [xi, yi] i= 1,2,...,8 or " + "12 points: [xi, yi] i= 1,2,...,12 or " + "16 points: [xi, yi] i= 1,2,...,16"); + PADDLE_ENFORCE_EQ( + box_dims[1], score_dims[2], + "The 2nd dimension of Input(BBoxes) must be equal to " + "last dimension of Input(Scores), which represents the " + "predicted bboxes."); + } else { + PADDLE_ENFORCE(box_dims[2] == 4, + "The last dimension of Input(BBoxes) must be 4"); + PADDLE_ENFORCE_EQ(box_dims[1], score_dims[1], + "The 2nd dimension of Input(BBoxes)" + "must be equal to the 2nd dimension" + " of Input(Scores)"); + } + } // Here the box_dims[0] is not the real dimension of output. // It will be rewritten in the computing kernel. - ctx->SetOutputDim("Out", {box_dims[1], 6}); + if (score_size == 3) { + ctx->SetOutputDim("Out", {box_dims[1], box_dims[2] + 2}); + } else { + ctx->SetOutputDim("Out", {-1, box_dims[2] + 2}); + } } protected: framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext& ctx) const override { return framework::OpKernelType( - framework::ToDataType( - ctx.Input("Scores")->type()), + ctx.Input("Scores")->type(), platform::CPUPlace()); } }; @@ -119,8 +138,9 @@ static inline T JaccardOverlap(const T* box1, const T* box2, const T inter_ymin = std::max(box1[1], box2[1]); const T inter_xmax = std::min(box1[2], box2[2]); const T inter_ymax = std::min(box1[3], box2[3]); - const T inter_w = inter_xmax - inter_xmin; - const T inter_h = inter_ymax - inter_ymin; + T norm = normalized ? static_cast(0.) : static_cast(1.); + T inter_w = inter_xmax - inter_xmin + norm; + T inter_h = inter_ymax - inter_ymin + norm; const T inter_area = inter_w * inter_h; const T bbox1_area = BBoxArea(box1, normalized); const T bbox2_area = BBoxArea(box2, normalized); @@ -128,15 +148,55 @@ static inline T JaccardOverlap(const T* box1, const T* box2, } } +template +T PolyIoU(const T* box1, const T* box2, const size_t box_size, + const bool normalized) { + T bbox1_area = PolyArea(box1, box_size, normalized); + T bbox2_area = PolyArea(box2, box_size, normalized); + T inter_area = PolyOverlapArea(box1, box2, box_size, normalized); + if (bbox1_area == 0 || bbox2_area == 0 || inter_area == 0) { + // If coordinate values are invalid + // if area size <= 0, return 0. + return T(0.); + } else { + return inter_area / (bbox1_area + bbox2_area - inter_area); + } +} + +template +void SliceOneClass(const platform::DeviceContext& ctx, + const framework::Tensor& items, const int class_id, + framework::Tensor* one_class_item) { + T* item_data = one_class_item->mutable_data(ctx.GetPlace()); + const T* items_data = items.data(); + const int64_t num_item = items.dims()[0]; + const int class_num = items.dims()[1]; + if (items.dims().size() == 3) { + int item_size = items.dims()[2]; + for (int i = 0; i < num_item; ++i) { + std::memcpy(item_data + i * item_size, + items_data + i * class_num * item_size + class_id * item_size, + sizeof(T) * item_size); + } + } else { + for (int i = 0; i < num_item; ++i) { + item_data[i] = items_data[i * class_num + class_id]; + } + } +} + template class MultiClassNMSKernel : public framework::OpKernel { public: void NMSFast(const Tensor& bbox, const Tensor& scores, const T score_threshold, const T nms_threshold, const T eta, - const int64_t top_k, std::vector* selected_indices) const { + const int64_t top_k, std::vector* selected_indices, + const bool normalized) const { // The total boxes for each instance. int64_t num_boxes = bbox.dims()[0]; // 4: [xmin ymin xmax ymax] + // 8: [x1 y1 x2 y2 x3 y3 x4 y4] + // 16, 24, or 32: [x1 y1 x2 y2 ... xn yn], n = 8, 12 or 16 int64_t box_size = bbox.dims()[1]; std::vector scores_data(num_boxes); @@ -154,8 +214,20 @@ class MultiClassNMSKernel : public framework::OpKernel { for (size_t k = 0; k < selected_indices->size(); ++k) { if (keep) { const int kept_idx = (*selected_indices)[k]; - T overlap = JaccardOverlap(bbox_data + idx * box_size, - bbox_data + kept_idx * box_size, true); + T overlap = T(0.); + // 4: [xmin ymin xmax ymax] + if (box_size == 4) { + overlap = + JaccardOverlap(bbox_data + idx * box_size, + bbox_data + kept_idx * box_size, normalized); + } + // 8: [x1 y1 x2 y2 x3 y3 x4 y4] or 16, 24, 32 + if (box_size == 8 || box_size == 16 || box_size == 24 || + box_size == 32) { + overlap = PolyIoU(bbox_data + idx * box_size, + bbox_data + kept_idx * box_size, box_size, + normalized); + } keep = overlap <= adaptive_threshold; } else { break; @@ -173,37 +245,58 @@ class MultiClassNMSKernel : public framework::OpKernel { void MultiClassNMS(const framework::ExecutionContext& ctx, const Tensor& scores, const Tensor& bboxes, + const int scores_size, std::map>* indices, int* num_nmsed_out) const { int64_t background_label = ctx.Attr("background_label"); int64_t nms_top_k = ctx.Attr("nms_top_k"); int64_t keep_top_k = ctx.Attr("keep_top_k"); + bool normalized = ctx.Attr("normalized"); T nms_threshold = static_cast(ctx.Attr("nms_threshold")); T nms_eta = static_cast(ctx.Attr("nms_eta")); T score_threshold = static_cast(ctx.Attr("score_threshold")); + auto& dev_ctx = ctx.template device_context(); - int64_t class_num = scores.dims()[0]; - int64_t predict_dim = scores.dims()[1]; int num_det = 0; + + int64_t class_num = scores_size == 3 ? scores.dims()[0] : scores.dims()[1]; + Tensor bbox_slice, score_slice; for (int64_t c = 0; c < class_num; ++c) { if (c == background_label) continue; - Tensor score = scores.Slice(c, c + 1); - NMSFast(bboxes, score, score_threshold, nms_threshold, nms_eta, nms_top_k, - &((*indices)[c])); + if (scores_size == 3) { + score_slice = scores.Slice(c, c + 1); + bbox_slice = bboxes; + } else { + score_slice.Resize({scores.dims()[0], 1}); + bbox_slice.Resize({scores.dims()[0], 4}); + SliceOneClass(dev_ctx, scores, c, &score_slice); + SliceOneClass(dev_ctx, bboxes, c, &bbox_slice); + } + NMSFast(bbox_slice, score_slice, score_threshold, nms_threshold, nms_eta, + nms_top_k, &((*indices)[c]), normalized); + if (scores_size == 2) { + std::stable_sort((*indices)[c].begin(), (*indices)[c].end()); + } num_det += (*indices)[c].size(); } *num_nmsed_out = num_det; const T* scores_data = scores.data(); if (keep_top_k > -1 && num_det > keep_top_k) { + const T* sdata; std::vector>> score_index_pairs; for (const auto& it : *indices) { int label = it.first; - const T* sdata = scores_data + label * predict_dim; + if (scores_size == 3) { + sdata = scores_data + label * scores.dims()[1]; + } else { + score_slice.Resize({scores.dims()[0], 1}); + SliceOneClass(dev_ctx, scores, label, &score_slice); + sdata = score_slice.data(); + } const std::vector& label_indices = it.second; for (size_t j = 0; j < label_indices.size(); ++j) { int idx = label_indices[j]; - PADDLE_ENFORCE_LT(idx, predict_dim); score_index_pairs.push_back( std::make_pair(sdata[idx], std::make_pair(label, idx))); } @@ -220,82 +313,121 @@ class MultiClassNMSKernel : public framework::OpKernel { int idx = score_index_pairs[j].second.second; new_indices[label].push_back(idx); } + if (scores_size == 2) { + for (const auto& it : new_indices) { + int label = it.first; + std::stable_sort(new_indices[label].begin(), + new_indices[label].end()); + } + } new_indices.swap(*indices); *num_nmsed_out = keep_top_k; } } - void MultiClassOutput(const Tensor& scores, const Tensor& bboxes, + void MultiClassOutput(const platform::DeviceContext& ctx, + const Tensor& scores, const Tensor& bboxes, const std::map>& selected_indices, - Tensor* outs) const { - int predict_dim = scores.dims()[1]; + const int scores_size, Tensor* outs) const { + int64_t class_num = scores.dims()[1]; + int64_t predict_dim = scores.dims()[1]; + int64_t box_size = bboxes.dims()[1]; + if (scores_size == 2) { + box_size = bboxes.dims()[2]; + } + int64_t out_dim = box_size + 2; auto* scores_data = scores.data(); auto* bboxes_data = bboxes.data(); auto* odata = outs->data(); - + const T* sdata; + Tensor bbox; + bbox.Resize({scores.dims()[0], box_size}); int count = 0; for (const auto& it : selected_indices) { int label = it.first; - const T* sdata = scores_data + label * predict_dim; const std::vector& indices = it.second; + if (scores_size == 2) { + SliceOneClass(ctx, bboxes, label, &bbox); + } else { + sdata = scores_data + label * predict_dim; + } for (size_t j = 0; j < indices.size(); ++j) { int idx = indices[j]; - const T* bdata = bboxes_data + idx * kBBoxSize; - odata[count * kOutputDim] = label; // label - odata[count * kOutputDim + 1] = sdata[idx]; // score - // xmin, ymin, xmax, ymax - std::memcpy(odata + count * kOutputDim + 2, bdata, 4 * sizeof(T)); + odata[count * out_dim] = label; // label + const T* bdata; + if (scores_size == 3) { + bdata = bboxes_data + idx * box_size; + odata[count * out_dim + 1] = sdata[idx]; // score + } else { + bdata = bbox.data() + idx * box_size; + odata[count * out_dim + 1] = *(scores_data + idx * class_num + label); + } + // xmin, ymin, xmax, ymax or multi-points coordinates + std::memcpy(odata + count * out_dim + 2, bdata, box_size * sizeof(T)); count++; } } } void Compute(const framework::ExecutionContext& ctx) const override { - auto* boxes = ctx.Input("BBoxes"); - auto* scores = ctx.Input("Scores"); + auto* boxes = ctx.Input("BBoxes"); + auto* scores = ctx.Input("Scores"); auto* outs = ctx.Output("Out"); auto score_dims = scores->dims(); - - int64_t batch_size = score_dims[0]; - int64_t class_num = score_dims[1]; - int64_t predict_dim = score_dims[2]; - int64_t box_dim = boxes->dims()[2]; + auto score_size = score_dims.size(); + auto& dev_ctx = ctx.template device_context(); std::vector>> all_indices; std::vector batch_starts = {0}; - for (int64_t i = 0; i < batch_size; ++i) { - Tensor ins_score = scores->Slice(i, i + 1); - ins_score.Resize({class_num, predict_dim}); - - Tensor ins_boxes = boxes->Slice(i, i + 1); - ins_boxes.Resize({predict_dim, box_dim}); - + int64_t batch_size = score_dims[0]; + int64_t box_dim = boxes->dims()[2]; + int64_t out_dim = box_dim + 2; + int num_nmsed_out = 0; + Tensor boxes_slice, scores_slice; + int n = score_size == 3 ? batch_size : boxes->lod().back().size() - 1; + for (int i = 0; i < n; ++i) { + if (score_size == 3) { + scores_slice = scores->Slice(i, i + 1); + scores_slice.Resize({score_dims[1], score_dims[2]}); + boxes_slice = boxes->Slice(i, i + 1); + boxes_slice.Resize({score_dims[2], box_dim}); + } else { + auto boxes_lod = boxes->lod().back(); + scores_slice = scores->Slice(boxes_lod[i], boxes_lod[i + 1]); + boxes_slice = boxes->Slice(boxes_lod[i], boxes_lod[i + 1]); + } std::map> indices; - int num_nmsed_out = 0; - MultiClassNMS(ctx, ins_score, ins_boxes, &indices, &num_nmsed_out); + MultiClassNMS(ctx, scores_slice, boxes_slice, score_size, &indices, + &num_nmsed_out); all_indices.push_back(indices); batch_starts.push_back(batch_starts.back() + num_nmsed_out); } int num_kept = batch_starts.back(); if (num_kept == 0) { - T* od = outs->mutable_data({1}, ctx.GetPlace()); + T* od = outs->mutable_data({1, 1}, ctx.GetPlace()); od[0] = -1; + batch_starts = {0, 1}; } else { - outs->mutable_data({num_kept, kOutputDim}, ctx.GetPlace()); - for (int64_t i = 0; i < batch_size; ++i) { - Tensor ins_score = scores->Slice(i, i + 1); - ins_score.Resize({class_num, predict_dim}); - - Tensor ins_boxes = boxes->Slice(i, i + 1); - ins_boxes.Resize({predict_dim, box_dim}); - + outs->mutable_data({num_kept, out_dim}, ctx.GetPlace()); + for (int i = 0; i < n; ++i) { + if (score_size == 3) { + scores_slice = scores->Slice(i, i + 1); + boxes_slice = boxes->Slice(i, i + 1); + scores_slice.Resize({score_dims[1], score_dims[2]}); + boxes_slice.Resize({score_dims[2], box_dim}); + } else { + auto boxes_lod = boxes->lod().back(); + scores_slice = scores->Slice(boxes_lod[i], boxes_lod[i + 1]); + boxes_slice = boxes->Slice(boxes_lod[i], boxes_lod[i + 1]); + } int64_t s = batch_starts[i]; int64_t e = batch_starts[i + 1]; if (e > s) { Tensor out = outs->Slice(s, e); - MultiClassOutput(ins_score, ins_boxes, all_indices[i], &out); + MultiClassOutput(dev_ctx, scores_slice, boxes_slice, all_indices[i], + score_dims.size(), &out); } } } @@ -311,16 +443,24 @@ class MultiClassNMSOpMaker : public framework::OpProtoAndCheckerMaker { public: void Make() override { AddInput("BBoxes", - "(Tensor) A 3-D Tensor with shape [N, M, 4] represents the " + "Two types of bboxes are supported:" + "1. (Tensor) A 3-D Tensor with shape " + "[N, M, 4 or 8 16 24 32] represents the " "predicted locations of M bounding bboxes, N is the batch size. " "Each bounding box has four coordinate values and the layout is " - "[xmin, ymin, xmax, ymax]."); + "[xmin, ymin, xmax, ymax], when box size equals to 4." + "2. (LoDTensor) A 3-D Tensor with shape [M, C, 4]" + "M is the number of bounding boxes, C is the class number"); AddInput("Scores", - "(Tensor) A 3-D Tensor with shape [N, C, M] represents the " + "Two types of scores are supported:" + "1. (Tensor) A 3-D Tensor with shape [N, C, M] represents the " "predicted confidence predictions. N is the batch size, C is the " "class number, M is number of bounding boxes. For each category " "there are total M scores which corresponding M bounding boxes. " - " Please note, M is equal to the 1st dimension of BBoxes. "); + " Please note, M is equal to the 2nd dimension of BBoxes. " + "2. (LoDTensor) A 2-D LoDTensor with shape [M, C]. " + "M is the number of bbox, C is the class number. In this case, " + "Input BBoxes should be the second case with shape [M, C, 4]."); AddAttr( "background_label", "(int, defalut: 0) " @@ -348,35 +488,40 @@ class MultiClassNMSOpMaker : public framework::OpProtoAndCheckerMaker { "(int64_t) " "Number of total bboxes to be kept per image after NMS " "step. -1 means keeping all bboxes after NMS step."); + AddAttr("normalized", + "(bool, default true) " + "Whether detections are normalized.") + .SetDefault(true); AddOutput("Out", "(LoDTensor) A 2-D LoDTensor with shape [No, 6] represents the " "detections. Each row has 6 values: " - "[label, confidence, xmin, ymin, xmax, ymax], No is the total " - "number of detections in this mini-batch. For each instance, " + "[label, confidence, xmin, ymin, xmax, ymax] or " + "(LoDTensor) A 2-D LoDTensor with shape [No, 10] represents the " + "detections. Each row has 10 values: " + "[label, confidence, x1, y1, x2, y2, x3, y3, x4, y4]. No is the " + "total number of detections in this mini-batch." + "For each instance, " "the offsets in first dimension are called LoD, the number of " "offset is N + 1, if LoD[i + 1] - LoD[i] == 0, means there is " "no detected bbox."); AddComment(R"DOC( This operator is to do multi-class non maximum suppression (NMS) on a batched of boxes and scores. - In the NMS step, this operator greedily selects a subset of detection bounding boxes that have high scores larger than score_threshold, if providing this threshold, then selects the largest nms_top_k confidences scores if nms_top_k is larger than -1. Then this operator pruns away boxes that have high IOU (intersection over union) overlap with already selected boxes by adaptive threshold NMS based on parameters of nms_threshold and nms_eta. - Aftern NMS step, at most keep_top_k number of total bboxes are to be kept per image if keep_top_k is larger than -1. - This operator support multi-class and batched inputs. It applying NMS independently for each class. The outputs is a 2-D LoDTenosr, for each image, the offsets in first dimension of LoDTensor are called LoD, the number of offset is N + 1, where N is the batch size. If LoD[i + 1] - LoD[i] == 0, means there is no detected bbox for this image. If there is no detected boxes -for all images, all the elements in LoD are 0, and the Out only contains one -value which is -1. +for all images, all the elements in LoD are set to {1}, and the Out only +contains one value which is -1. )DOC"); } }; diff --git a/paddle/fluid/operators/detection/poly_util.cc b/paddle/fluid/operators/detection/poly_util.cc new file mode 100644 index 0000000000000000000000000000000000000000..1af2c95c6cf526d651b196b54614a21a9cddde8c --- /dev/null +++ b/paddle/fluid/operators/detection/poly_util.cc @@ -0,0 +1,132 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#ifndef POLY_UTIL_CC_ +#define POLY_UTIL_CC_ + +#include "paddle/fluid/operators/detection/poly_util.h" +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { + +using gpc::gpc_polygon_clip; +using gpc::gpc_free_polygon; + +template +void Array2PointVec(const T*& box, const size_t box_size, + std::vector>& vec) { + size_t pts_num = box_size / 2; + vec.resize(pts_num); + for (size_t i = 0; i < pts_num; i++) { + vec.at(i).x = box[2 * i]; + vec.at(i).y = box[2 * i + 1]; + } +} + +template +void Array2Poly(const T*& box, const size_t box_size, gpc::gpc_polygon& poly) { + size_t pts_num = box_size / 2; + poly.num_contours = 1; + poly.hole = (int*)malloc(sizeof(int)); + poly.hole[0] = 0; + poly.contour = (gpc::gpc_vertex_list*)malloc(sizeof(gpc::gpc_vertex_list)); + poly.contour->num_vertices = pts_num; + poly.contour->vertex = + (gpc::gpc_vertex*)malloc(sizeof(gpc::gpc_vertex) * pts_num); + for (size_t i = 0; i < pts_num; ++i) { + poly.contour->vertex[i].x = box[2 * i]; + poly.contour->vertex[i].y = box[2 * i + 1]; + } +} + +template +void PointVec2Poly(const std::vector>& vec, gpc::gpc_polygon& poly) { + int pts_num = vec.size(); + poly.num_contours = 1; + poly.hole = (int*)malloc(sizeof(int)); + poly.hole[0] = 0; + poly.contour = (gpc::gpc_vertex_list*)malloc(sizeof(gpc::gpc_vertex_list)); + poly.contour->num_vertices = pts_num; + poly.contour->vertex = + (gpc::gpc_vertex*)malloc(sizeof(gpc::gpc_vertex) * pts_num); + for (size_t i = 0; i < pts_num; ++i) { + poly.contour->vertex[i].x = vec[i].x; + poly.contour->vertex[i].y = vec[i].y; + } +} + +template +void Poly2PointVec(const gpc::gpc_vertex_list& contour, + std::vector>& vec) { + int pts_num = contour.num_vertices; + vec.resize(pts_num); + for (int i = 0; i < pts_num; i++) { + vec.at(i).x = contour.vertex[i].x; + vec.at(i).y = contour.vertex[i].y; + } +} + +template +T GetContourArea(std::vector>& vec) { + size_t pts_num = vec.size(); + if (pts_num < 3) return T(0.); + T area = T(0.); + for (size_t i = 0; i < pts_num; ++i) { + area += vec[i].x * vec[(i + 1) % pts_num].y - + vec[i].y * vec[(i + 1) % pts_num].x; + } + return std::fabs(area / 2.0); +} + +template +T PolyArea(const T* box, const size_t box_size, const bool normalized) { + // If coordinate values are is invalid + // if area size <= 0, return 0. + std::vector> vec; + Array2PointVec(box, box_size, vec); + return GetContourArea(vec); +} + +template +T PolyOverlapArea(const T* box1, const T* box2, const size_t box_size, + const bool normalized) { + gpc::gpc_polygon poly1; + gpc::gpc_polygon poly2; + Array2Poly(box1, box_size, poly1); + Array2Poly(box2, box_size, poly2); + gpc::gpc_polygon respoly; + gpc::gpc_op op = gpc::GPC_INT; + gpc::gpc_polygon_clip(op, &poly2, &poly1, &respoly); + + T inter_area = T(0.); + int contour_num = respoly.num_contours; + for (int i = 0; i < contour_num; ++i) { + std::vector> resvec; + Poly2PointVec(respoly.contour[i], resvec); + // inter_area += std::fabs(cv::contourArea(resvec)) + 0.5f * + // (cv::arcLength(resvec, true)); + inter_area += GetContourArea(resvec); + } + + gpc::gpc_free_polygon(&poly1); + gpc::gpc_free_polygon(&poly2); + gpc::gpc_free_polygon(&respoly); + return inter_area; +} + +} // namespace operators +} // namespace paddle + +#endif diff --git a/paddle/fluid/operators/detection/poly_util.h b/paddle/fluid/operators/detection/poly_util.h new file mode 100644 index 0000000000000000000000000000000000000000..f07baf72d9ff07b8fcb45dcfb2a35741fb1aeed0 --- /dev/null +++ b/paddle/fluid/operators/detection/poly_util.h @@ -0,0 +1,73 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#ifndef POLY_UTIL_H_ +#define POLY_UTIL_H_ + +#include +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/detection/gpc.h" + +namespace paddle { +namespace operators { + +template +class Point_ { + public: + // default constructor + Point_() {} + Point_(T _x, T _y) {} + Point_(const Point_& pt) {} + + Point_& operator=(const Point_& pt); + // conversion to another data type + // template operator Point_<_T>() const; + // conversion to the old-style C structures + // operator Vec() const; + + // checks whether the point is inside the specified rectangle + // bool inside(const Rect_& r) const; + T x; //!< x coordinate of the point + T y; //!< y coordinate of the point +}; + +template +void Array2PointVec(const T*& box, const size_t box_size, + std::vector>& vec); + +template +void Array2Poly(const T*& box, const size_t box_size, gpc::gpc_polygon& poly); + +template +void PointVec2Poly(const std::vector>& vec, gpc::gpc_polygon& poly); + +template +void Poly2PointVec(const gpc::gpc_vertex_list& contour, + std::vector>& vec); + +template +T GetContourArea(std::vector>& vec); + +template +T PolyArea(const T* box, const size_t box_size, const bool normalized); + +template +T PolyOverlapArea(const T* box1, const T* box2, const size_t box_size, + const bool normalized); +} // namespace operators +} // namespace paddle + +#include "paddle/fluid/operators/detection/poly_util.cc" + +#endif // POLY_UTIL_H_ diff --git a/paddle/fluid/operators/detection/polygon_box_transform_op.cc b/paddle/fluid/operators/detection/polygon_box_transform_op.cc index 568d50d457d838d5f11605710c0d3b987af01d10..4b3bc2edb58fe23393d906094c41b6ad62c71155 100644 --- a/paddle/fluid/operators/detection/polygon_box_transform_op.cc +++ b/paddle/fluid/operators/detection/polygon_box_transform_op.cc @@ -41,9 +41,9 @@ class PolygonBoxTransformCPUKernel : public framework::OpKernel { for (int id_w = 0; id_w < width; ++id_w) { id = id_n * height * width + width * id_h + id_w; if (id_n % 2 == 0) { - out_data[id] = id_w - in_data[id]; + out_data[id] = id_w * 4 - in_data[id]; } else { - out_data[id] = id_h - in_data[id]; + out_data[id] = id_h * 4 - in_data[id]; } } } diff --git a/paddle/fluid/operators/detection/polygon_box_transform_op.cu b/paddle/fluid/operators/detection/polygon_box_transform_op.cu index 6187ac6622c65d2bbc525c3fe2cb397cf74ac612..e1eaf084a3413dd1d13514e2d7b22572d21dd119 100644 --- a/paddle/fluid/operators/detection/polygon_box_transform_op.cu +++ b/paddle/fluid/operators/detection/polygon_box_transform_op.cu @@ -32,9 +32,9 @@ __global__ void PolygonBoxTransformKernel(const int n, const int h, const int w, if (id_n < n && id_h < h && id_w < w) { int id = id_n * h * w + w * id_h + id_w; if (id_n % 2 == 0) { - output[id] = id_w - input[id]; + output[id] = id_w * 4 - input[id]; } else { - output[id] = id_h - input[id]; + output[id] = id_h * 4 - input[id]; } } } diff --git a/paddle/fluid/operators/detection/prior_box_op.cc b/paddle/fluid/operators/detection/prior_box_op.cc index b5cb6a724c095eb849f3a184f13843e1a0cca92f..3e75c0394f971d0c8ab5edc88467e56c86db8815 100644 --- a/paddle/fluid/operators/detection/prior_box_op.cc +++ b/paddle/fluid/operators/detection/prior_box_op.cc @@ -72,8 +72,7 @@ class PriorBoxOp : public framework::OperatorWithKernel { framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext& ctx) const override { return framework::OpKernelType( - framework::ToDataType(ctx.Input("Input")->type()), - ctx.device_context()); + ctx.Input("Input")->type(), ctx.device_context()); } }; diff --git a/paddle/fluid/operators/detection/prior_box_op.h b/paddle/fluid/operators/detection/prior_box_op.h index 4e226abbb51c271502f0ca5419d488643b5a1a82..d3e26256b50f2d7010fee3738802d59173678b34 100644 --- a/paddle/fluid/operators/detection/prior_box_op.h +++ b/paddle/fluid/operators/detection/prior_box_op.h @@ -46,13 +46,6 @@ inline void ExpandAspectRatios(const std::vector& input_aspect_ratior, } } -template -struct ClipFunctor { - HOSTDEVICE inline T operator()(T in) const { - return std::min(std::max(in, 0.), 1.); - } -}; - template class PriorBoxOpKernel : public framework::OpKernel { public: @@ -101,31 +94,30 @@ class PriorBoxOpKernel : public framework::OpKernel { boxes->mutable_data(ctx.GetPlace()); vars->mutable_data(ctx.GetPlace()); - auto e_boxes = framework::EigenTensor::From(*boxes); + T* b_t = boxes->data(); for (int h = 0; h < feature_height; ++h) { for (int w = 0; w < feature_width; ++w) { T center_x = (w + offset) * step_width; T center_y = (h + offset) * step_height; T box_width, box_height; - int idx = 0; for (size_t s = 0; s < min_sizes.size(); ++s) { auto min_size = min_sizes[s]; if (min_max_aspect_ratios_order) { box_width = box_height = min_size / 2.; - e_boxes(h, w, idx, 0) = (center_x - box_width) / img_width; - e_boxes(h, w, idx, 1) = (center_y - box_height) / img_height; - e_boxes(h, w, idx, 2) = (center_x + box_width) / img_width; - e_boxes(h, w, idx, 3) = (center_y + box_height) / img_height; - idx++; + b_t[0] = (center_x - box_width) / img_width; + b_t[1] = (center_y - box_height) / img_height; + b_t[2] = (center_x + box_width) / img_width; + b_t[3] = (center_y + box_height) / img_height; + b_t += 4; if (max_sizes.size() > 0) { auto max_size = max_sizes[s]; // square prior with size sqrt(minSize * maxSize) box_width = box_height = sqrt(min_size * max_size) / 2.; - e_boxes(h, w, idx, 0) = (center_x - box_width) / img_width; - e_boxes(h, w, idx, 1) = (center_y - box_height) / img_height; - e_boxes(h, w, idx, 2) = (center_x + box_width) / img_width; - e_boxes(h, w, idx, 3) = (center_y + box_height) / img_height; - idx++; + b_t[0] = (center_x - box_width) / img_width; + b_t[1] = (center_y - box_height) / img_height; + b_t[2] = (center_x + box_width) / img_width; + b_t[3] = (center_y + box_height) / img_height; + b_t += 4; } // priors with different aspect ratios for (size_t r = 0; r < aspect_ratios.size(); ++r) { @@ -135,11 +127,11 @@ class PriorBoxOpKernel : public framework::OpKernel { } box_width = min_size * sqrt(ar) / 2.; box_height = min_size / sqrt(ar) / 2.; - e_boxes(h, w, idx, 0) = (center_x - box_width) / img_width; - e_boxes(h, w, idx, 1) = (center_y - box_height) / img_height; - e_boxes(h, w, idx, 2) = (center_x + box_width) / img_width; - e_boxes(h, w, idx, 3) = (center_y + box_height) / img_height; - idx++; + b_t[0] = (center_x - box_width) / img_width; + b_t[1] = (center_y - box_height) / img_height; + b_t[2] = (center_x + box_width) / img_width; + b_t[3] = (center_y + box_height) / img_height; + b_t += 4; } } else { // priors with different aspect ratios @@ -147,21 +139,21 @@ class PriorBoxOpKernel : public framework::OpKernel { float ar = aspect_ratios[r]; box_width = min_size * sqrt(ar) / 2.; box_height = min_size / sqrt(ar) / 2.; - e_boxes(h, w, idx, 0) = (center_x - box_width) / img_width; - e_boxes(h, w, idx, 1) = (center_y - box_height) / img_height; - e_boxes(h, w, idx, 2) = (center_x + box_width) / img_width; - e_boxes(h, w, idx, 3) = (center_y + box_height) / img_height; - idx++; + b_t[0] = (center_x - box_width) / img_width; + b_t[1] = (center_y - box_height) / img_height; + b_t[2] = (center_x + box_width) / img_width; + b_t[3] = (center_y + box_height) / img_height; + b_t += 4; } if (max_sizes.size() > 0) { auto max_size = max_sizes[s]; // square prior with size sqrt(minSize * maxSize) box_width = box_height = sqrt(min_size * max_size) / 2.; - e_boxes(h, w, idx, 0) = (center_x - box_width) / img_width; - e_boxes(h, w, idx, 1) = (center_y - box_height) / img_height; - e_boxes(h, w, idx, 2) = (center_x + box_width) / img_width; - e_boxes(h, w, idx, 3) = (center_y + box_height) / img_height; - idx++; + b_t[0] = (center_x - box_width) / img_width; + b_t[1] = (center_y - box_height) / img_height; + b_t[2] = (center_x + box_width) / img_width; + b_t[3] = (center_y + box_height) / img_height; + b_t += 4; } } } @@ -169,11 +161,10 @@ class PriorBoxOpKernel : public framework::OpKernel { } if (clip) { - platform::Transform trans; - ClipFunctor clip_func; - trans(ctx.template device_context(), - boxes->data(), boxes->data() + boxes->numel(), - boxes->data(), clip_func); + T* dt = boxes->data(); + std::transform(dt, dt + boxes->numel(), dt, [](T v) -> T { + return std::min(std::max(v, 0.), 1.); + }); } framework::Tensor var_t; @@ -181,6 +172,10 @@ class PriorBoxOpKernel : public framework::OpKernel { framework::make_ddim({1, static_cast(variances.size())}), ctx.GetPlace()); auto var_et = framework::EigenTensor::From(var_t); + +#ifdef PADDLE_WITH_MKLML +#pragma omp parallel for +#endif for (size_t i = 0; i < variances.size(); ++i) { var_et(0, i) = variances[i]; } @@ -190,8 +185,15 @@ class PriorBoxOpKernel : public framework::OpKernel { vars->Resize({box_num, static_cast(variances.size())}); auto e_vars = framework::EigenMatrix::From(*vars); - e_vars = var_et.broadcast(Eigen::DSizes(box_num, 1)); +#ifdef PADDLE_WITH_MKLML +#pragma omp parallel for collapse(2) +#endif + for (int i = 0; i < box_num; ++i) { + for (int j = 0; j < variances.size(); ++j) { + e_vars(i, j) = variances[j]; + } + } vars->Resize(var_dim); } }; // namespace operators diff --git a/paddle/fluid/operators/detection/roi_perspective_transform_op.cc b/paddle/fluid/operators/detection/roi_perspective_transform_op.cc index 42c720e701fbabacf1280dec2f78d3f6b99dfea2..5b84221cfa5902d01540a06c6bc61fe9eac986f0 100644 --- a/paddle/fluid/operators/detection/roi_perspective_transform_op.cc +++ b/paddle/fluid/operators/detection/roi_perspective_transform_op.cc @@ -13,6 +13,7 @@ See the License for the specific language governing permissions and limitations under the License. */ #include +#include #include #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/operators/math/math_function.h" @@ -493,14 +494,14 @@ class ROIPerspectiveTransformOp : public framework::OperatorWithKernel { auto out_dims = framework::make_ddim(out_dims_v); ctx->SetOutputDim("Out", out_dims); + ctx->ShareLoD("ROIs", /*->*/ "Out"); } protected: framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext& ctx) const override { - return framework::OpKernelType( - framework::ToDataType(ctx.Input("X")->type()), - ctx.device_context()); + return framework::OpKernelType(ctx.Input("X")->type(), + ctx.device_context()); } }; @@ -519,9 +520,8 @@ class ROIPerspectiveTransformGradOp : public framework::OperatorWithKernel { protected: framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext& ctx) const override { - return framework::OpKernelType( - framework::ToDataType(ctx.Input("X")->type()), - ctx.device_context()); + return framework::OpKernelType(ctx.Input("X")->type(), + ctx.device_context()); } }; @@ -569,13 +569,31 @@ class ROIPerspectiveTransformOpMaker } }; +class ROIPerspectiveTransformGradDescMaker + : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + protected: + std::unique_ptr Apply() const override { + std::unique_ptr op(new framework::OpDesc()); + op->SetType("roi_perspective_transform_grad"); + op->SetInput("X", Input("X")); + op->SetInput("ROIs", Input("ROIs")); + op->SetInput(framework::GradVarName("Out"), OutputGrad("Out")); + op->SetOutput(framework::GradVarName("X"), InputGrad("X")); + op->SetAttrMap(Attrs()); + return op; + } +}; + } // namespace operators } // namespace paddle namespace ops = paddle::operators; REGISTER_OPERATOR(roi_perspective_transform, ops::ROIPerspectiveTransformOp, ops::ROIPerspectiveTransformOpMaker, - paddle::framework::DefaultGradOpDescMaker); + ops::ROIPerspectiveTransformGradDescMaker); REGISTER_OPERATOR(roi_perspective_transform_grad, ops::ROIPerspectiveTransformGradOp); REGISTER_OP_CPU_KERNEL(roi_perspective_transform, diff --git a/paddle/fluid/operators/detection/roi_perspective_transform_op.cu b/paddle/fluid/operators/detection/roi_perspective_transform_op.cu index c82930cc4994c3854e60f40ae9909a90d82cbff6..862d664d42e03d2ae968ea0bdec8ae8e50bf7fb3 100644 --- a/paddle/fluid/operators/detection/roi_perspective_transform_op.cu +++ b/paddle/fluid/operators/detection/roi_perspective_transform_op.cu @@ -15,6 +15,10 @@ limitations under the License. */ #include #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/platform/cuda_primitives.h" +#include "paddle/fluid/platform/float16.h" + +using paddle::platform::PADDLE_CUDA_NUM_THREADS; +using paddle::platform::float16; namespace paddle { namespace operators { @@ -31,12 +35,12 @@ namespace operators { template __device__ bool GT_E(T a, T b) { - return (a > b) || fabs(a - b) < 1e-4; + return (a > b) || Eigen::numext::abs(a - b) < 1e-4; } template __device__ bool LT_E(T a, T b) { - return (a < b) || fabs(a - b) < 1e-4; + return (a < b) || Eigen::numext::abs(a - b) < 1e-4; } template diff --git a/paddle/fluid/operators/detection/rpn_target_assign_op.cc b/paddle/fluid/operators/detection/rpn_target_assign_op.cc index dda423efd35b96f5e1d7c55389818f46ef3d8694..0b8053e8d03c426e5a1b619e67bc8dae21c5c024 100644 --- a/paddle/fluid/operators/detection/rpn_target_assign_op.cc +++ b/paddle/fluid/operators/detection/rpn_target_assign_op.cc @@ -52,10 +52,12 @@ class RpnTargetAssignOp : public framework::OperatorWithKernel { PADDLE_ENFORCE( ctx->HasOutput("TargetBBox"), "Output(TargetBBox) of RpnTargetAssignOp should not be null"); + PADDLE_ENFORCE( + ctx->HasOutput("BBoxInsideWeight"), + "Output(BBoxInsideWeight) of RpnTargetAssignOp should not be null"); auto anchor_dims = ctx->GetInputDim("Anchor"); auto gt_boxes_dims = ctx->GetInputDim("GtBoxes"); - auto is_crowd_dims = ctx->GetInputDim("IsCrowd"); auto im_info_dims = ctx->GetInputDim("ImInfo"); PADDLE_ENFORCE_EQ(anchor_dims.size(), 2, "The rank of Input(Anchor) must be 2."); @@ -68,14 +70,14 @@ class RpnTargetAssignOp : public framework::OperatorWithKernel { ctx->SetOutputDim("ScoreIndex", {-1}); ctx->SetOutputDim("TargetLabel", {-1, 1}); ctx->SetOutputDim("TargetBBox", {-1, 4}); + ctx->SetOutputDim("BBoxInsideWeight", {-1, 4}); } protected: framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext& ctx) const override { return framework::OpKernelType( - framework::ToDataType( - ctx.Input("Anchor")->type()), + ctx.Input("Anchor")->type(), platform::CPUPlace()); } }; @@ -169,6 +171,7 @@ void ScoreAssign(const T* anchor_by_gt_overlap_data, const float rpn_positive_overlap, const float rpn_negative_overlap, std::vector* fg_inds, std::vector* bg_inds, std::vector* tgt_lbl, + std::vector* fg_fake, std::vector* bbox_inside_weight, std::minstd_rand engine, bool use_random) { float epsilon = 0.00001; int anchor_num = anchor_to_gt_max.dims()[0]; @@ -201,12 +204,12 @@ void ScoreAssign(const T* anchor_by_gt_overlap_data, // Reservoir Sampling int fg_num = static_cast(rpn_fg_fraction * rpn_batch_size_per_im); ReservoirSampling(fg_num, &fg_inds_fake, engine, use_random); - fg_num = static_cast(fg_inds_fake.size()); - for (int64_t i = 0; i < fg_num; ++i) { + int fg_fake_num = static_cast(fg_inds_fake.size()); + for (int64_t i = 0; i < fg_fake_num; ++i) { target_label[fg_inds_fake[i]] = 1; } - int bg_num = rpn_batch_size_per_im - fg_num; + int bg_num = rpn_batch_size_per_im - fg_fake_num; for (int64_t i = 0; i < anchor_num; ++i) { if (anchor_to_gt_max_data[i] < rpn_negative_overlap) { bg_inds_fake.push_back(i); @@ -214,12 +217,28 @@ void ScoreAssign(const T* anchor_by_gt_overlap_data, } ReservoirSampling(bg_num, &bg_inds_fake, engine, use_random); bg_num = static_cast(bg_inds_fake.size()); + int fake_num = 0; for (int64_t i = 0; i < bg_num; ++i) { + // fg fake found + if (target_label[bg_inds_fake[i]] == 1) { + fake_num++; + fg_fake->emplace_back(fg_inds_fake[0]); + for (int j = 0; j < 4; ++j) { + bbox_inside_weight->emplace_back(T(0.)); + } + } target_label[bg_inds_fake[i]] = 0; } + for (int64_t i = 0; i < (fg_fake_num - fake_num) * 4; ++i) { + bbox_inside_weight->emplace_back(T(1.)); + } + for (int64_t i = 0; i < anchor_num; ++i) { - if (target_label[i] == 1) fg_inds->emplace_back(i); + if (target_label[i] == 1) { + fg_inds->emplace_back(i); + fg_fake->emplace_back(i); + } if (target_label[i] == 0) bg_inds->emplace_back(i); } fg_num = fg_inds->size(); @@ -248,7 +267,8 @@ std::vector SampleRpnFgBgGt(const platform::CPUDeviceContext& ctx, std::vector bg_inds; std::vector gt_inds; std::vector tgt_lbl; - + std::vector fg_fake; + std::vector bbox_inside_weight; // Calculate the max IoU between anchors and gt boxes // Map from anchor to gt box that has highest overlap auto place = ctx.GetPlace(); @@ -275,32 +295,37 @@ std::vector SampleRpnFgBgGt(const platform::CPUDeviceContext& ctx, // Follow the Faster RCNN's implementation ScoreAssign(anchor_by_gt_overlap_data, anchor_to_gt_max, gt_to_anchor_max, rpn_batch_size_per_im, rpn_fg_fraction, rpn_positive_overlap, - rpn_negative_overlap, &fg_inds, &bg_inds, &tgt_lbl, engine, - use_random); + rpn_negative_overlap, &fg_inds, &bg_inds, &tgt_lbl, &fg_fake, + &bbox_inside_weight, engine, use_random); int fg_num = fg_inds.size(); int bg_num = bg_inds.size(); - gt_inds.reserve(fg_num); - for (int i = 0; i < fg_num; ++i) { - gt_inds.emplace_back(argmax[fg_inds[i]]); + int fg_fake_num = fg_fake.size(); + gt_inds.reserve(fg_fake_num); + for (int i = 0; i < fg_fake_num; ++i) { + gt_inds.emplace_back(argmax[fg_fake[i]]); } - - Tensor loc_index_t, score_index_t, tgt_lbl_t, gt_inds_t; - int* loc_index_data = loc_index_t.mutable_data({fg_num}, place); + Tensor loc_index_t, score_index_t, tgt_lbl_t, gt_inds_t, bbox_inside_weight_t; + int* loc_index_data = loc_index_t.mutable_data({fg_fake_num}, place); int* score_index_data = score_index_t.mutable_data({fg_num + bg_num}, place); int* tgt_lbl_data = tgt_lbl_t.mutable_data({fg_num + bg_num}, place); - int* gt_inds_data = gt_inds_t.mutable_data({fg_num}, place); - std::copy(fg_inds.begin(), fg_inds.end(), loc_index_data); + int* gt_inds_data = gt_inds_t.mutable_data({fg_fake_num}, place); + T* bbox_inside_weight_data = + bbox_inside_weight_t.mutable_data({fg_fake_num, 4}, place); + std::copy(fg_fake.begin(), fg_fake.end(), loc_index_data); std::copy(fg_inds.begin(), fg_inds.end(), score_index_data); std::copy(bg_inds.begin(), bg_inds.end(), score_index_data + fg_num); std::copy(tgt_lbl.begin(), tgt_lbl.end(), tgt_lbl_data); std::copy(gt_inds.begin(), gt_inds.end(), gt_inds_data); + std::copy(bbox_inside_weight.begin(), bbox_inside_weight.end(), + bbox_inside_weight_data); std::vector loc_score_tgtlbl_gt; loc_score_tgtlbl_gt.emplace_back(loc_index_t); loc_score_tgtlbl_gt.emplace_back(score_index_t); loc_score_tgtlbl_gt.emplace_back(tgt_lbl_t); loc_score_tgtlbl_gt.emplace_back(gt_inds_t); + loc_score_tgtlbl_gt.emplace_back(bbox_inside_weight_t); return loc_score_tgtlbl_gt; } @@ -318,6 +343,7 @@ class RpnTargetAssignKernel : public framework::OpKernel { auto* score_index = context.Output("ScoreIndex"); auto* tgt_bbox = context.Output("TargetBBox"); auto* tgt_lbl = context.Output("TargetLabel"); + auto* bbox_inside_weight = context.Output("BBoxInsideWeight"); PADDLE_ENFORCE_EQ(gt_boxes->lod().size(), 1UL, "RpnTargetAssignOp gt_boxes needs 1 level of LoD"); @@ -340,7 +366,7 @@ class RpnTargetAssignKernel : public framework::OpKernel { score_index->mutable_data({max_num}, place); tgt_bbox->mutable_data({max_num, 4}, place); tgt_lbl->mutable_data({max_num, 1}, place); - + bbox_inside_weight->mutable_data({max_num, 4}, place); auto& dev_ctx = context.device_context(); std::random_device rnd; @@ -394,6 +420,7 @@ class RpnTargetAssignKernel : public framework::OpKernel { Tensor sampled_score_index = loc_score_tgtlbl_gt[1]; Tensor sampled_tgtlbl = loc_score_tgtlbl_gt[2]; Tensor sampled_gt_index = loc_score_tgtlbl_gt[3]; + Tensor sampled_bbox_inside_weight = loc_score_tgtlbl_gt[4]; int loc_num = sampled_loc_index.dims()[0]; int score_num = sampled_score_index.dims()[0]; @@ -432,6 +459,8 @@ class RpnTargetAssignKernel : public framework::OpKernel { AppendRpns(score_index, total_score_num, &sampled_score_index_unmap); AppendRpns(tgt_bbox, total_loc_num * 4, &sampled_tgt_bbox); AppendRpns(tgt_lbl, total_score_num, &sampled_tgtlbl); + AppendRpns(bbox_inside_weight, total_loc_num * 4, + &sampled_bbox_inside_weight); total_loc_num += loc_num; total_score_num += score_num; @@ -448,10 +477,12 @@ class RpnTargetAssignKernel : public framework::OpKernel { score_index->set_lod(loc_score); tgt_bbox->set_lod(lod_loc); tgt_lbl->set_lod(loc_score); + bbox_inside_weight->set_lod(lod_loc); loc_index->Resize({total_loc_num}); score_index->Resize({total_score_num}); tgt_bbox->Resize({total_loc_num, 4}); tgt_lbl->Resize({total_score_num, 1}); + bbox_inside_weight->Resize({total_loc_num, 4}); } }; @@ -514,6 +545,9 @@ class RpnTargetAssignOpMaker : public framework::OpProtoAndCheckerMaker { "TargetLabel", "(Tensor), The target labels of each anchor with shape " "[F + B, 1], F and B are sampled foreground and backgroud number."); + AddOutput("BBoxInsideWeight", + "(Tensor), The bbox inside weight with shape " + "[F, 4], F is the sampled foreground number."); AddComment(R"DOC( This operator can be, for a given set of ground truth bboxes and the anchors, to assign classification and regression targets to each prediction. diff --git a/paddle/fluid/operators/detection/target_assign_op.cc b/paddle/fluid/operators/detection/target_assign_op.cc index 367001939251114a9cf442fd85c734958ccb2da8..c057c82ce0f5eef67c09d0ed719ddd24382f451d 100644 --- a/paddle/fluid/operators/detection/target_assign_op.cc +++ b/paddle/fluid/operators/detection/target_assign_op.cc @@ -57,9 +57,8 @@ class TargetAssignOp : public framework::OperatorWithKernel { protected: framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext& ctx) const override { - return framework::OpKernelType( - framework::ToDataType(ctx.Input("X")->type()), - ctx.device_context()); + return framework::OpKernelType(ctx.Input("X")->type(), + ctx.device_context()); } }; diff --git a/paddle/fluid/operators/detection/yolo_box_op.cc b/paddle/fluid/operators/detection/yolo_box_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..e0d7e25d944cf2321799da4c73de9f74d9fd287d --- /dev/null +++ b/paddle/fluid/operators/detection/yolo_box_op.cc @@ -0,0 +1,167 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. */ + +#include "paddle/fluid/operators/detection/yolo_box_op.h" +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { + +using framework::Tensor; + +class YoloBoxOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), + "Input(X) of YoloBoxOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput("ImgSize"), + "Input(ImgSize) of YoloBoxOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("Boxes"), + "Output(Boxes) of YoloBoxOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("Scores"), + "Output(Scores) of YoloBoxOp should not be null."); + + auto dim_x = ctx->GetInputDim("X"); + auto dim_imgsize = ctx->GetInputDim("ImgSize"); + auto anchors = ctx->Attrs().Get>("anchors"); + int anchor_num = anchors.size() / 2; + auto class_num = ctx->Attrs().Get("class_num"); + + PADDLE_ENFORCE_EQ(dim_x.size(), 4, "Input(X) should be a 4-D tensor."); + PADDLE_ENFORCE_EQ( + dim_x[1], anchor_num * (5 + class_num), + "Input(X) dim[1] should be equal to (anchor_mask_number * (5 " + "+ class_num))."); + PADDLE_ENFORCE_EQ(dim_imgsize.size(), 2, + "Input(ImgSize) should be a 2-D tensor."); + PADDLE_ENFORCE_EQ( + dim_imgsize[0], dim_x[0], + "Input(ImgSize) dim[0] and Input(X) dim[0] should be same."); + PADDLE_ENFORCE_EQ(dim_imgsize[1], 2, "Input(ImgSize) dim[1] should be 2."); + PADDLE_ENFORCE_GT(anchors.size(), 0, + "Attr(anchors) length should be greater than 0."); + PADDLE_ENFORCE_EQ(anchors.size() % 2, 0, + "Attr(anchors) length should be even integer."); + PADDLE_ENFORCE_GT(class_num, 0, + "Attr(class_num) should be an integer greater than 0."); + + int box_num = dim_x[2] * dim_x[3] * anchor_num; + std::vector dim_boxes({dim_x[0], box_num, 4}); + ctx->SetOutputDim("Boxes", framework::make_ddim(dim_boxes)); + + std::vector dim_scores({dim_x[0], box_num, class_num}); + ctx->SetOutputDim("Scores", framework::make_ddim(dim_scores)); + } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + return framework::OpKernelType(ctx.Input("X")->type(), + ctx.GetPlace()); + } +}; + +class YoloBoxOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("X", + "The input tensor of YoloBox operator is a 4-D tensor with " + "shape of [N, C, H, W]. The second dimension(C) stores " + "box locations, confidence score and classification one-hot " + "keys of each anchor box. Generally, X should be the output " + "of YOLOv3 network."); + AddInput("ImgSize", + "The image size tensor of YoloBox operator, " + "This is a 2-D tensor with shape of [N, 2]. This tensor holds " + "height and width of each input image used for resizing output " + "box in input image scale."); + AddOutput("Boxes", + "The output tensor of detection boxes of YoloBox operator, " + "This is a 3-D tensor with shape of [N, M, 4], N is the " + "batch num, M is output box number, and the 3rd dimension " + "stores [xmin, ymin, xmax, ymax] coordinates of boxes."); + AddOutput("Scores", + "The output tensor of detection boxes scores of YoloBox " + "operator, This is a 3-D tensor with shape of " + "[N, M, :attr:`class_num`], N is the batch num, M is " + "output box number."); + + AddAttr("class_num", "The number of classes to predict."); + AddAttr>("anchors", + "The anchor width and height, " + "it will be parsed pair by pair.") + .SetDefault(std::vector{}); + AddAttr("downsample_ratio", + "The downsample ratio from network input to YoloBox operator " + "input, so 32, 16, 8 should be set for the first, second, " + "and thrid YoloBox operators.") + .SetDefault(32); + AddAttr("conf_thresh", + "The confidence scores threshold of detection boxes. " + "Boxes with confidence scores under threshold should " + "be ignored.") + .SetDefault(0.01); + AddComment(R"DOC( + This operator generates YOLO detection boxes from output of YOLOv3 network. + + The output of previous network is in shape [N, C, H, W], while H and W + should be the same, H and W specify the grid size, each grid point predict + given number boxes, this given number, which following will be represented as S, + is specified by the number of anchors. In the second dimension(the channel + dimension), C should be equal to S * (5 + class_num), class_num is the object + category number of source dataset(such as 80 in coco dataset), so the + second(channel) dimension, apart from 4 box location coordinates x, y, w, h, + also includes confidence score of the box and class one-hot key of each anchor + box. + + Assume the 4 location coordinates are :math:`t_x, t_y, t_w, t_h`, the box + predictions should be as follows: + + $$ + b_x = \\sigma(t_x) + c_x + $$ + $$ + b_y = \\sigma(t_y) + c_y + $$ + $$ + b_w = p_w e^{t_w} + $$ + $$ + b_h = p_h e^{t_h} + $$ + + in the equation above, :math:`c_x, c_y` is the left top corner of current grid + and :math:`p_w, p_h` is specified by anchors. + + The logistic regression value of the 5th channel of each anchor prediction boxes + represents the confidence score of each prediction box, and the logistic + regression value of the last :attr:`class_num` channels of each anchor prediction + boxes represents the classifcation scores. Boxes with confidence scores less than + :attr:`conf_thresh` should be ignored, and box final scores is the product of + confidence scores and classification scores. + + $$ + score_{pred} = score_{conf} * score_{class} + $$ + + )DOC"); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(yolo_box, ops::YoloBoxOp, ops::YoloBoxOpMaker, + paddle::framework::EmptyGradOpMaker); +REGISTER_OP_CPU_KERNEL(yolo_box, ops::YoloBoxKernel, + ops::YoloBoxKernel); diff --git a/paddle/fluid/operators/detection/yolo_box_op.cu b/paddle/fluid/operators/detection/yolo_box_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..5a882958e66a79507e053a96b15be8cbbcc83164 --- /dev/null +++ b/paddle/fluid/operators/detection/yolo_box_op.cu @@ -0,0 +1,120 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/detection/yolo_box_op.h" +#include "paddle/fluid/operators/math/math_function.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; + +template +__global__ void KeYoloBoxFw(const T* input, const int* imgsize, T* boxes, + T* scores, const float conf_thresh, + const int* anchors, const int n, const int h, + const int w, const int an_num, const int class_num, + const int box_num, int input_size) { + int tid = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + T box[4]; + for (; tid < n * box_num; tid += stride) { + int grid_num = h * w; + int i = tid / box_num; + int j = (tid % box_num) / grid_num; + int k = (tid % grid_num) / w; + int l = tid % w; + + int an_stride = (5 + class_num) * grid_num; + int img_height = imgsize[2 * i]; + int img_width = imgsize[2 * i + 1]; + + int obj_idx = + GetEntryIndex(i, j, k * w + l, an_num, an_stride, grid_num, 4); + T conf = sigmoid(input[obj_idx]); + if (conf < conf_thresh) { + continue; + } + + int box_idx = + GetEntryIndex(i, j, k * w + l, an_num, an_stride, grid_num, 0); + GetYoloBox(box, input, anchors, l, k, j, h, input_size, box_idx, + grid_num, img_height, img_width); + box_idx = (i * box_num + j * grid_num + k * w + l) * 4; + CalcDetectionBox(boxes, box, box_idx, img_height, img_width); + + int label_idx = + GetEntryIndex(i, j, k * w + l, an_num, an_stride, grid_num, 5); + int score_idx = (i * box_num + j * grid_num + k * w + l) * class_num; + CalcLabelScore(scores, input, label_idx, score_idx, class_num, conf, + grid_num); + } +} + +template +class YoloBoxOpCUDAKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* input = ctx.Input("X"); + auto* img_size = ctx.Input("ImgSize"); + auto* boxes = ctx.Output("Boxes"); + auto* scores = ctx.Output("Scores"); + + auto anchors = ctx.Attr>("anchors"); + int class_num = ctx.Attr("class_num"); + float conf_thresh = ctx.Attr("conf_thresh"); + int downsample_ratio = ctx.Attr("downsample_ratio"); + + const int n = input->dims()[0]; + const int h = input->dims()[2]; + const int w = input->dims()[3]; + const int box_num = boxes->dims()[1]; + const int an_num = anchors.size() / 2; + int input_size = downsample_ratio * h; + + auto& dev_ctx = ctx.cuda_device_context(); + auto& allocator = + platform::DeviceTemporaryAllocator::Instance().Get(dev_ctx); + int bytes = sizeof(int) * anchors.size(); + auto anchors_ptr = allocator.Allocate(sizeof(int) * anchors.size()); + int* anchors_data = reinterpret_cast(anchors_ptr->ptr()); + const auto gplace = boost::get(ctx.GetPlace()); + const auto cplace = platform::CPUPlace(); + memory::Copy(gplace, anchors_data, cplace, anchors.data(), bytes, + dev_ctx.stream()); + + const T* input_data = input->data(); + const int* imgsize_data = img_size->data(); + T* boxes_data = boxes->mutable_data({n, box_num, 4}, ctx.GetPlace()); + T* scores_data = + scores->mutable_data({n, box_num, class_num}, ctx.GetPlace()); + math::SetConstant set_zero; + set_zero(dev_ctx, boxes, static_cast(0)); + set_zero(dev_ctx, scores, static_cast(0)); + + int grid_dim = (n * box_num + 512 - 1) / 512; + grid_dim = grid_dim > 8 ? 8 : grid_dim; + + KeYoloBoxFw<<>>( + input_data, imgsize_data, boxes_data, scores_data, conf_thresh, + anchors_data, n, h, w, an_num, class_num, box_num, input_size); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL(yolo_box, ops::YoloBoxOpCUDAKernel, + ops::YoloBoxOpCUDAKernel); diff --git a/paddle/fluid/operators/detection/yolo_box_op.h b/paddle/fluid/operators/detection/yolo_box_op.h new file mode 100644 index 0000000000000000000000000000000000000000..8b7c7df0f3cf754f59c994dbe5b1cc2ac5fb773b --- /dev/null +++ b/paddle/fluid/operators/detection/yolo_box_op.h @@ -0,0 +1,149 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. */ + +#pragma once +#include +#include +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/platform/hostdevice.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; + +template +HOSTDEVICE inline T sigmoid(T x) { + return 1.0 / (1.0 + std::exp(-x)); +} + +template +HOSTDEVICE inline void GetYoloBox(T* box, const T* x, const int* anchors, int i, + int j, int an_idx, int grid_size, + int input_size, int index, int stride, + int img_height, int img_width) { + box[0] = (i + sigmoid(x[index])) * img_width / grid_size; + box[1] = (j + sigmoid(x[index + stride])) * img_height / grid_size; + box[2] = std::exp(x[index + 2 * stride]) * anchors[2 * an_idx] * img_width / + input_size; + box[3] = std::exp(x[index + 3 * stride]) * anchors[2 * an_idx + 1] * + img_height / input_size; +} + +HOSTDEVICE inline int GetEntryIndex(int batch, int an_idx, int hw_idx, + int an_num, int an_stride, int stride, + int entry) { + return (batch * an_num + an_idx) * an_stride + entry * stride + hw_idx; +} + +template +HOSTDEVICE inline void CalcDetectionBox(T* boxes, T* box, const int box_idx, + const int img_height, + const int img_width) { + boxes[box_idx] = box[0] - box[2] / 2; + boxes[box_idx + 1] = box[1] - box[3] / 2; + boxes[box_idx + 2] = box[0] + box[2] / 2; + boxes[box_idx + 3] = box[1] + box[3] / 2; + + boxes[box_idx] = boxes[box_idx] > 0 ? boxes[box_idx] : static_cast(0); + boxes[box_idx + 1] = + boxes[box_idx + 1] > 0 ? boxes[box_idx + 1] : static_cast(0); + boxes[box_idx + 2] = boxes[box_idx + 2] < img_width - 1 + ? boxes[box_idx + 2] + : static_cast(img_width - 1); + boxes[box_idx + 3] = boxes[box_idx + 3] < img_height - 1 + ? boxes[box_idx + 3] + : static_cast(img_height - 1); +} + +template +HOSTDEVICE inline void CalcLabelScore(T* scores, const T* input, + const int label_idx, const int score_idx, + const int class_num, const T conf, + const int stride) { + for (int i = 0; i < class_num; i++) { + scores[score_idx + i] = conf * sigmoid(input[label_idx + i * stride]); + } +} + +template +class YoloBoxKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* input = ctx.Input("X"); + auto* imgsize = ctx.Input("ImgSize"); + auto* boxes = ctx.Output("Boxes"); + auto* scores = ctx.Output("Scores"); + auto anchors = ctx.Attr>("anchors"); + int class_num = ctx.Attr("class_num"); + float conf_thresh = ctx.Attr("conf_thresh"); + int downsample_ratio = ctx.Attr("downsample_ratio"); + + const int n = input->dims()[0]; + const int h = input->dims()[2]; + const int w = input->dims()[3]; + const int box_num = boxes->dims()[1]; + const int an_num = anchors.size() / 2; + int input_size = downsample_ratio * h; + + const int stride = h * w; + const int an_stride = (class_num + 5) * stride; + + Tensor anchors_; + auto anchors_data = + anchors_.mutable_data({an_num * 2}, ctx.GetPlace()); + std::copy(anchors.begin(), anchors.end(), anchors_data); + + const T* input_data = input->data(); + const int* imgsize_data = imgsize->data(); + T* boxes_data = boxes->mutable_data({n, box_num, 4}, ctx.GetPlace()); + memset(boxes_data, 0, boxes->numel() * sizeof(T)); + T* scores_data = + scores->mutable_data({n, box_num, class_num}, ctx.GetPlace()); + memset(scores_data, 0, scores->numel() * sizeof(T)); + + T box[4]; + for (int i = 0; i < n; i++) { + int img_height = imgsize_data[2 * i]; + int img_width = imgsize_data[2 * i + 1]; + + for (int j = 0; j < an_num; j++) { + for (int k = 0; k < h; k++) { + for (int l = 0; l < w; l++) { + int obj_idx = + GetEntryIndex(i, j, k * w + l, an_num, an_stride, stride, 4); + T conf = sigmoid(input_data[obj_idx]); + if (conf < conf_thresh) { + continue; + } + + int box_idx = + GetEntryIndex(i, j, k * w + l, an_num, an_stride, stride, 0); + GetYoloBox(box, input_data, anchors_data, l, k, j, h, input_size, + box_idx, stride, img_height, img_width); + box_idx = (i * box_num + j * stride + k * w + l) * 4; + CalcDetectionBox(boxes_data, box, box_idx, img_height, + img_width); + + int label_idx = + GetEntryIndex(i, j, k * w + l, an_num, an_stride, stride, 5); + int score_idx = (i * box_num + j * stride + k * w + l) * class_num; + CalcLabelScore(scores_data, input_data, label_idx, score_idx, + class_num, conf, stride); + } + } + } + } + } +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/detection/yolov3_loss_op.cc b/paddle/fluid/operators/detection/yolov3_loss_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..6c37da17f4011d38efcdc5406331f1be173dd0dd --- /dev/null +++ b/paddle/fluid/operators/detection/yolov3_loss_op.cc @@ -0,0 +1,299 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve. + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. */ + +#include "paddle/fluid/operators/detection/yolov3_loss_op.h" +#include +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { + +using framework::Tensor; + +class Yolov3LossOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), + "Input(X) of Yolov3LossOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput("GTBox"), + "Input(GTBox) of Yolov3LossOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput("GTLabel"), + "Input(GTLabel) of Yolov3LossOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("Loss"), + "Output(Loss) of Yolov3LossOp should not be null."); + PADDLE_ENFORCE( + ctx->HasOutput("ObjectnessMask"), + "Output(ObjectnessMask) of Yolov3LossOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("GTMatchMask"), + "Output(GTMatchMask) of Yolov3LossOp should not be null."); + + auto dim_x = ctx->GetInputDim("X"); + auto dim_gtbox = ctx->GetInputDim("GTBox"); + auto dim_gtlabel = ctx->GetInputDim("GTLabel"); + auto anchors = ctx->Attrs().Get>("anchors"); + int anchor_num = anchors.size() / 2; + auto anchor_mask = ctx->Attrs().Get>("anchor_mask"); + int mask_num = anchor_mask.size(); + auto class_num = ctx->Attrs().Get("class_num"); + + PADDLE_ENFORCE_EQ(dim_x.size(), 4, "Input(X) should be a 4-D tensor."); + PADDLE_ENFORCE_EQ(dim_x[2], dim_x[3], + "Input(X) dim[3] and dim[4] should be euqal."); + PADDLE_ENFORCE_EQ( + dim_x[1], mask_num * (5 + class_num), + "Input(X) dim[1] should be equal to (anchor_mask_number * (5 " + "+ class_num))."); + PADDLE_ENFORCE_EQ(dim_gtbox.size(), 3, + "Input(GTBox) should be a 3-D tensor"); + PADDLE_ENFORCE_EQ(dim_gtbox[2], 4, "Input(GTBox) dim[2] should be 5"); + PADDLE_ENFORCE_EQ(dim_gtlabel.size(), 2, + "Input(GTLabel) should be a 2-D tensor"); + PADDLE_ENFORCE_EQ(dim_gtlabel[0], dim_gtbox[0], + "Input(GTBox) and Input(GTLabel) dim[0] should be same"); + PADDLE_ENFORCE_EQ(dim_gtlabel[1], dim_gtbox[1], + "Input(GTBox) and Input(GTLabel) dim[1] should be same"); + PADDLE_ENFORCE_GT(anchors.size(), 0, + "Attr(anchors) length should be greater then 0."); + PADDLE_ENFORCE_EQ(anchors.size() % 2, 0, + "Attr(anchors) length should be even integer."); + for (size_t i = 0; i < anchor_mask.size(); i++) { + PADDLE_ENFORCE_LT( + anchor_mask[i], anchor_num, + "Attr(anchor_mask) should not crossover Attr(anchors)."); + } + PADDLE_ENFORCE_GT(class_num, 0, + "Attr(class_num) should be an integer greater then 0."); + + if (ctx->HasInput("GTScore")) { + auto dim_gtscore = ctx->GetInputDim("GTScore"); + PADDLE_ENFORCE_EQ(dim_gtscore.size(), 2, + "Input(GTScore) should be a 2-D tensor"); + PADDLE_ENFORCE_EQ( + dim_gtscore[0], dim_gtbox[0], + "Input(GTBox) and Input(GTScore) dim[0] should be same"); + PADDLE_ENFORCE_EQ( + dim_gtscore[1], dim_gtbox[1], + "Input(GTBox) and Input(GTScore) dim[1] should be same"); + } + + std::vector dim_out({dim_x[0]}); + ctx->SetOutputDim("Loss", framework::make_ddim(dim_out)); + + std::vector dim_obj_mask({dim_x[0], mask_num, dim_x[2], dim_x[3]}); + ctx->SetOutputDim("ObjectnessMask", framework::make_ddim(dim_obj_mask)); + + std::vector dim_gt_match_mask({dim_gtbox[0], dim_gtbox[1]}); + ctx->SetOutputDim("GTMatchMask", framework::make_ddim(dim_gt_match_mask)); + } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + return framework::OpKernelType(ctx.Input("X")->type(), + platform::CPUPlace()); + } +}; + +class Yolov3LossOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("X", + "The input tensor of YOLOv3 loss operator, " + "This is a 4-D tensor with shape of [N, C, H, W]." + "H and W should be same, and the second dimention(C) stores" + "box locations, confidence score and classification one-hot" + "keys of each anchor box"); + AddInput("GTBox", + "The input tensor of ground truth boxes, " + "This is a 3-D tensor with shape of [N, max_box_num, 5], " + "max_box_num is the max number of boxes in each image, " + "In the third dimention, stores x, y, w, h coordinates, " + "x, y is the center cordinate of boxes and w, h is the " + "width and height and x, y, w, h should be divided by " + "input image height to scale to [0, 1]."); + AddInput("GTLabel", + "The input tensor of ground truth label, " + "This is a 2-D tensor with shape of [N, max_box_num], " + "and each element should be an integer to indicate the " + "box class id."); + AddInput("GTScore", + "The score of GTLabel, This is a 2-D tensor in same shape " + "GTLabel, and score values should in range (0, 1). This " + "input is for GTLabel score can be not 1.0 in image mixup " + "augmentation.") + .AsDispensable(); + AddOutput("Loss", + "The output yolov3 loss tensor, " + "This is a 1-D tensor with shape of [N]"); + AddOutput("ObjectnessMask", + "This is an intermediate tensor with shape of [N, M, H, W], " + "M is the number of anchor masks. This parameter caches the " + "mask for calculate objectness loss in gradient kernel.") + .AsIntermediate(); + AddOutput("GTMatchMask", + "This is an intermediate tensor with shape of [N, B], " + "B is the max box number of GT boxes. This parameter caches " + "matched mask index of each GT boxes for gradient calculate.") + .AsIntermediate(); + + AddAttr("class_num", "The number of classes to predict."); + AddAttr>("anchors", + "The anchor width and height, " + "it will be parsed pair by pair.") + .SetDefault(std::vector{}); + AddAttr>("anchor_mask", + "The mask index of anchors used in " + "current YOLOv3 loss calculation.") + .SetDefault(std::vector{}); + AddAttr("downsample_ratio", + "The downsample ratio from network input to YOLOv3 loss " + "input, so 32, 16, 8 should be set for the first, second, " + "and thrid YOLOv3 loss operators.") + .SetDefault(32); + AddAttr("ignore_thresh", + "The ignore threshold to ignore confidence loss.") + .SetDefault(0.7); + AddAttr("use_label_smooth", + "Whether to use label smooth. Default True.") + .SetDefault(true); + AddComment(R"DOC( + This operator generates yolov3 loss based on given predict result and ground + truth boxes. + + The output of previous network is in shape [N, C, H, W], while H and W + should be the same, H and W specify the grid size, each grid point predict + given number boxes, this given number, which following will be represented as S, + is specified by the number of anchors, In the second dimension(the channel + dimension), C should be equal to S * (class_num + 5), class_num is the object + category number of source dataset(such as 80 in coco dataset), so in the + second(channel) dimension, apart from 4 box location coordinates x, y, w, h, + also includes confidence score of the box and class one-hot key of each anchor box. + + Assume the 4 location coordinates are :math:`t_x, t_y, t_w, t_h`, the box predictions + should be as follows: + + $$ + b_x = \\sigma(t_x) + c_x + $$ + $$ + b_y = \\sigma(t_y) + c_y + $$ + $$ + b_w = p_w e^{t_w} + $$ + $$ + b_h = p_h e^{t_h} + $$ + + In the equation above, :math:`c_x, c_y` is the left top corner of current grid + and :math:`p_w, p_h` is specified by anchors. + + As for confidence score, it is the logistic regression value of IoU between + anchor boxes and ground truth boxes, the score of the anchor box which has + the max IoU should be 1, and if the anchor box has IoU bigger than ignore + thresh, the confidence score loss of this anchor box will be ignored. + + Therefore, the yolov3 loss consist of three major parts, box location loss, + confidence score loss, and classification loss. The L2 loss is used for + box coordinates (w, h), and sigmoid cross entropy loss is used for box + coordinates (x, y), confidence score loss and classification loss. + + Each groud truth box find a best matching anchor box in all anchors, + prediction of this anchor box will incur all three parts of losses, and + prediction of anchor boxes with no GT box matched will only incur objectness + loss. + + In order to trade off box coordinate losses between big boxes and small + boxes, box coordinate losses will be mutiplied by scale weight, which is + calculated as follows. + + $$ + weight_{box} = 2.0 - t_w * t_h + $$ + + Final loss will be represented as follows. + + $$ + loss = (loss_{xy} + loss_{wh}) * weight_{box} + + loss_{conf} + loss_{class} + $$ + + While :attr:`use_label_smooth` is set to be :attr:`True`, the classification + target will be smoothed when calculating classification loss, target of + positive samples will be smoothed to :math:`1.0 - 1.0 / class\_num` and target of + negetive samples will be smoothed to :math:`1.0 / class\_num`. + + While :attr:`GTScore` is given, which means the mixup score of ground truth + boxes, all losses incured by a ground truth box will be multiplied by its + mixup score. + )DOC"); + } +}; + +class Yolov3LossOpGrad : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null"); + PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Loss")), + "Input(Loss@GRAD) should not be null"); + auto dim_x = ctx->GetInputDim("X"); + if (ctx->HasOutput(framework::GradVarName("X"))) { + ctx->SetOutputDim(framework::GradVarName("X"), dim_x); + } + } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + return framework::OpKernelType(ctx.Input("X")->type(), + platform::CPUPlace()); + } +}; + +class Yolov3LossGradMaker : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + protected: + std::unique_ptr Apply() const override { + auto* op = new framework::OpDesc(); + op->SetType("yolov3_loss_grad"); + op->SetInput("X", Input("X")); + op->SetInput("GTBox", Input("GTBox")); + op->SetInput("GTLabel", Input("GTLabel")); + op->SetInput("GTScore", Input("GTScore")); + op->SetInput(framework::GradVarName("Loss"), OutputGrad("Loss")); + op->SetInput("ObjectnessMask", Output("ObjectnessMask")); + op->SetInput("GTMatchMask", Output("GTMatchMask")); + + op->SetAttrMap(Attrs()); + + op->SetOutput(framework::GradVarName("X"), InputGrad("X")); + op->SetOutput(framework::GradVarName("GTBox"), {}); + op->SetOutput(framework::GradVarName("GTLabel"), {}); + op->SetOutput(framework::GradVarName("GTScore"), {}); + return std::unique_ptr(op); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(yolov3_loss, ops::Yolov3LossOp, ops::Yolov3LossOpMaker, + ops::Yolov3LossGradMaker); +REGISTER_OPERATOR(yolov3_loss_grad, ops::Yolov3LossOpGrad); +REGISTER_OP_CPU_KERNEL(yolov3_loss, ops::Yolov3LossKernel, + ops::Yolov3LossKernel); +REGISTER_OP_CPU_KERNEL(yolov3_loss_grad, ops::Yolov3LossGradKernel, + ops::Yolov3LossGradKernel); diff --git a/paddle/fluid/operators/detection/yolov3_loss_op.h b/paddle/fluid/operators/detection/yolov3_loss_op.h new file mode 100644 index 0000000000000000000000000000000000000000..a004b022b75174012d10ba38e5ec161830c62640 --- /dev/null +++ b/paddle/fluid/operators/detection/yolov3_loss_op.h @@ -0,0 +1,500 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve. + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. */ + +#pragma once +#include +#include +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/math/math_function.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; +template +using EigenTensor = framework::EigenTensor; +template +using EigenVector = framework::EigenVector; + +template +static inline bool LessEqualZero(T x) { + return x < 1e-6; +} + +template +static T SigmoidCrossEntropy(T x, T label) { + return (x > 0 ? x : 0.0) - x * label + std::log(1.0 + std::exp(-std::abs(x))); +} + +template +static T L1Loss(T x, T y) { + return std::abs(y - x); +} + +template +static T SigmoidCrossEntropyGrad(T x, T label) { + return 1.0 / (1.0 + std::exp(-x)) - label; +} + +template +static T L1LossGrad(T x, T y) { + return x > y ? 1.0 : -1.0; +} + +static int GetMaskIndex(std::vector mask, int val) { + for (size_t i = 0; i < mask.size(); i++) { + if (mask[i] == val) { + return i; + } + } + return -1; +} + +template +struct Box { + T x, y, w, h; +}; + +template +static inline T sigmoid(T x) { + return 1.0 / (1.0 + std::exp(-x)); +} + +template +static inline Box GetYoloBox(const T* x, std::vector anchors, int i, + int j, int an_idx, int grid_size, + int input_size, int index, int stride) { + Box b; + b.x = (i + sigmoid(x[index])) / grid_size; + b.y = (j + sigmoid(x[index + stride])) / grid_size; + b.w = std::exp(x[index + 2 * stride]) * anchors[2 * an_idx] / input_size; + b.h = std::exp(x[index + 3 * stride]) * anchors[2 * an_idx + 1] / input_size; + return b; +} + +template +static inline Box GetGtBox(const T* gt, int batch, int max_boxes, int idx) { + Box b; + b.x = gt[(batch * max_boxes + idx) * 4]; + b.y = gt[(batch * max_boxes + idx) * 4 + 1]; + b.w = gt[(batch * max_boxes + idx) * 4 + 2]; + b.h = gt[(batch * max_boxes + idx) * 4 + 3]; + return b; +} + +template +static inline T BoxOverlap(T c1, T w1, T c2, T w2) { + T l1 = c1 - w1 / 2.0; + T l2 = c2 - w2 / 2.0; + T left = l1 > l2 ? l1 : l2; + T r1 = c1 + w1 / 2.0; + T r2 = c2 + w2 / 2.0; + T right = r1 < r2 ? r1 : r2; + return right - left; +} + +template +static inline T CalcBoxIoU(Box b1, Box b2) { + T w = BoxOverlap(b1.x, b1.w, b2.x, b2.w); + T h = BoxOverlap(b1.y, b1.h, b2.y, b2.h); + T inter_area = (w < 0 || h < 0) ? 0.0 : w * h; + T union_area = b1.w * b1.h + b2.w * b2.h - inter_area; + return inter_area / union_area; +} + +static inline int GetEntryIndex(int batch, int an_idx, int hw_idx, int an_num, + int an_stride, int stride, int entry) { + return (batch * an_num + an_idx) * an_stride + entry * stride + hw_idx; +} + +template +static void CalcBoxLocationLoss(T* loss, const T* input, Box gt, + std::vector anchors, int an_idx, + int box_idx, int gi, int gj, int grid_size, + int input_size, int stride, T score) { + T tx = gt.x * grid_size - gi; + T ty = gt.y * grid_size - gj; + T tw = std::log(gt.w * input_size / anchors[2 * an_idx]); + T th = std::log(gt.h * input_size / anchors[2 * an_idx + 1]); + + T scale = (2.0 - gt.w * gt.h) * score; + loss[0] += SigmoidCrossEntropy(input[box_idx], tx) * scale; + loss[0] += SigmoidCrossEntropy(input[box_idx + stride], ty) * scale; + loss[0] += L1Loss(input[box_idx + 2 * stride], tw) * scale; + loss[0] += L1Loss(input[box_idx + 3 * stride], th) * scale; +} + +template +static void CalcBoxLocationLossGrad(T* input_grad, const T loss, const T* input, + Box gt, std::vector anchors, + int an_idx, int box_idx, int gi, int gj, + int grid_size, int input_size, int stride, + T score) { + T tx = gt.x * grid_size - gi; + T ty = gt.y * grid_size - gj; + T tw = std::log(gt.w * input_size / anchors[2 * an_idx]); + T th = std::log(gt.h * input_size / anchors[2 * an_idx + 1]); + + T scale = (2.0 - gt.w * gt.h) * score; + input_grad[box_idx] = + SigmoidCrossEntropyGrad(input[box_idx], tx) * scale * loss; + input_grad[box_idx + stride] = + SigmoidCrossEntropyGrad(input[box_idx + stride], ty) * scale * loss; + input_grad[box_idx + 2 * stride] = + L1LossGrad(input[box_idx + 2 * stride], tw) * scale * loss; + input_grad[box_idx + 3 * stride] = + L1LossGrad(input[box_idx + 3 * stride], th) * scale * loss; +} + +template +static inline void CalcLabelLoss(T* loss, const T* input, const int index, + const int label, const int class_num, + const int stride, const T pos, const T neg, + T score) { + for (int i = 0; i < class_num; i++) { + T pred = input[index + i * stride]; + loss[0] += SigmoidCrossEntropy(pred, (i == label) ? pos : neg) * score; + } +} + +template +static inline void CalcLabelLossGrad(T* input_grad, const T loss, + const T* input, const int index, + const int label, const int class_num, + const int stride, const T pos, const T neg, + T score) { + for (int i = 0; i < class_num; i++) { + T pred = input[index + i * stride]; + input_grad[index + i * stride] = + SigmoidCrossEntropyGrad(pred, (i == label) ? pos : neg) * score * + loss; + } +} + +template +static inline void CalcObjnessLoss(T* loss, const T* input, const T* objness, + const int n, const int an_num, const int h, + const int w, const int stride, + const int an_stride) { + for (int i = 0; i < n; i++) { + for (int j = 0; j < an_num; j++) { + for (int k = 0; k < h; k++) { + for (int l = 0; l < w; l++) { + T obj = objness[k * w + l]; + if (obj > 1e-5) { + // positive sample: obj = mixup score + loss[i] += SigmoidCrossEntropy(input[k * w + l], 1.0) * obj; + } else if (obj > -0.5) { + // negetive sample: obj = 0 + loss[i] += SigmoidCrossEntropy(input[k * w + l], 0.0); + } + } + } + objness += stride; + input += an_stride; + } + } +} + +template +static inline void CalcObjnessLossGrad(T* input_grad, const T* loss, + const T* input, const T* objness, + const int n, const int an_num, + const int h, const int w, + const int stride, const int an_stride) { + for (int i = 0; i < n; i++) { + for (int j = 0; j < an_num; j++) { + for (int k = 0; k < h; k++) { + for (int l = 0; l < w; l++) { + T obj = objness[k * w + l]; + if (obj > 1e-5) { + input_grad[k * w + l] = + SigmoidCrossEntropyGrad(input[k * w + l], 1.0) * obj * + loss[i]; + } else if (obj > -0.5) { + input_grad[k * w + l] = + SigmoidCrossEntropyGrad(input[k * w + l], 0.0) * loss[i]; + } + } + } + objness += stride; + input += an_stride; + input_grad += an_stride; + } + } +} + +template +static void inline GtValid(bool* valid, const T* gtbox, const int n, + const int b) { + for (int i = 0; i < n; i++) { + for (int j = 0; j < b; j++) { + if (LessEqualZero(gtbox[j * 4 + 2]) || LessEqualZero(gtbox[j * 4 + 3])) { + valid[j] = false; + } else { + valid[j] = true; + } + } + valid += b; + gtbox += b * 4; + } +} + +template +class Yolov3LossKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* input = ctx.Input("X"); + auto* gt_box = ctx.Input("GTBox"); + auto* gt_label = ctx.Input("GTLabel"); + auto* gt_score = ctx.Input("GTScore"); + auto* loss = ctx.Output("Loss"); + auto* objness_mask = ctx.Output("ObjectnessMask"); + auto* gt_match_mask = ctx.Output("GTMatchMask"); + auto anchors = ctx.Attr>("anchors"); + auto anchor_mask = ctx.Attr>("anchor_mask"); + int class_num = ctx.Attr("class_num"); + float ignore_thresh = ctx.Attr("ignore_thresh"); + int downsample_ratio = ctx.Attr("downsample_ratio"); + bool use_label_smooth = ctx.Attr("use_label_smooth"); + + const int n = input->dims()[0]; + const int h = input->dims()[2]; + const int w = input->dims()[3]; + const int an_num = anchors.size() / 2; + const int mask_num = anchor_mask.size(); + const int b = gt_box->dims()[1]; + int input_size = downsample_ratio * h; + + const int stride = h * w; + const int an_stride = (class_num + 5) * stride; + + T label_pos = 1.0; + T label_neg = 0.0; + if (use_label_smooth) { + label_pos = 1.0 - 1.0 / static_cast(class_num); + label_neg = 1.0 / static_cast(class_num); + } + + const T* input_data = input->data(); + const T* gt_box_data = gt_box->data(); + const int* gt_label_data = gt_label->data(); + T* loss_data = loss->mutable_data({n}, ctx.GetPlace()); + memset(loss_data, 0, loss->numel() * sizeof(T)); + T* obj_mask_data = + objness_mask->mutable_data({n, mask_num, h, w}, ctx.GetPlace()); + memset(obj_mask_data, 0, objness_mask->numel() * sizeof(T)); + int* gt_match_mask_data = + gt_match_mask->mutable_data({n, b}, ctx.GetPlace()); + + const T* gt_score_data; + if (!gt_score) { + Tensor gtscore; + gtscore.mutable_data({n, b}, ctx.GetPlace()); + math::SetConstant()( + ctx.template device_context(), >score, + static_cast(1.0)); + gt_score = >score; + gt_score_data = gtscore.data(); + } else { + gt_score_data = gt_score->data(); + } + + // calc valid gt box mask, avoid calc duplicately in following code + Tensor gt_valid_mask; + bool* gt_valid_mask_data = + gt_valid_mask.mutable_data({n, b}, ctx.GetPlace()); + GtValid(gt_valid_mask_data, gt_box_data, n, b); + + for (int i = 0; i < n; i++) { + for (int j = 0; j < mask_num; j++) { + for (int k = 0; k < h; k++) { + for (int l = 0; l < w; l++) { + // each predict box find a best match gt box, if overlap is bigger + // then ignore_thresh, ignore the objectness loss. + int box_idx = + GetEntryIndex(i, j, k * w + l, mask_num, an_stride, stride, 0); + Box pred = GetYoloBox(input_data, anchors, l, k, anchor_mask[j], + h, input_size, box_idx, stride); + T best_iou = 0; + for (int t = 0; t < b; t++) { + if (!gt_valid_mask_data[i * b + t]) { + continue; + } + Box gt = GetGtBox(gt_box_data, i, b, t); + T iou = CalcBoxIoU(pred, gt); + if (iou > best_iou) { + best_iou = iou; + } + } + + // If best IoU is bigger then ignore_thresh, + // ignore the objectness loss. + if (best_iou > ignore_thresh) { + int obj_idx = (i * mask_num + j) * stride + k * w + l; + obj_mask_data[obj_idx] = static_cast(-1); + } + // all losses should be calculated if best IoU + // is bigger then truth thresh, but currently, + // truth thresh is an unreachable value as 1.0. + } + } + } + for (int t = 0; t < b; t++) { + if (!gt_valid_mask_data[i * b + t]) { + gt_match_mask_data[i * b + t] = -1; + continue; + } + Box gt = GetGtBox(gt_box_data, i, b, t); + int gi = static_cast(gt.x * w); + int gj = static_cast(gt.y * h); + Box gt_shift = gt; + gt_shift.x = 0.0; + gt_shift.y = 0.0; + T best_iou = 0.0; + int best_n = 0; + // each gt box find a best match anchor box as positive sample, + // for positive sample, all losses should be calculated, and for + // other samples, only objectness loss is required. + for (int an_idx = 0; an_idx < an_num; an_idx++) { + Box an_box; + an_box.x = 0.0; + an_box.y = 0.0; + an_box.w = anchors[2 * an_idx] / static_cast(input_size); + an_box.h = anchors[2 * an_idx + 1] / static_cast(input_size); + float iou = CalcBoxIoU(an_box, gt_shift); + if (iou > best_iou) { + best_iou = iou; + best_n = an_idx; + } + } + + int mask_idx = GetMaskIndex(anchor_mask, best_n); + gt_match_mask_data[i * b + t] = mask_idx; + if (mask_idx >= 0) { + T score = gt_score_data[i * b + t]; + int box_idx = GetEntryIndex(i, mask_idx, gj * w + gi, mask_num, + an_stride, stride, 0); + CalcBoxLocationLoss(loss_data + i, input_data, gt, anchors, best_n, + box_idx, gi, gj, h, input_size, stride, score); + + int obj_idx = (i * mask_num + mask_idx) * stride + gj * w + gi; + obj_mask_data[obj_idx] = score; + + int label = gt_label_data[i * b + t]; + int label_idx = GetEntryIndex(i, mask_idx, gj * w + gi, mask_num, + an_stride, stride, 5); + CalcLabelLoss(loss_data + i, input_data, label_idx, label, + class_num, stride, label_pos, label_neg, score); + } + } + } + + CalcObjnessLoss(loss_data, input_data + 4 * stride, obj_mask_data, n, + mask_num, h, w, stride, an_stride); + } +}; + +template +class Yolov3LossGradKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* input = ctx.Input("X"); + auto* gt_box = ctx.Input("GTBox"); + auto* gt_label = ctx.Input("GTLabel"); + auto* gt_score = ctx.Input("GTScore"); + auto* input_grad = ctx.Output(framework::GradVarName("X")); + auto* loss_grad = ctx.Input(framework::GradVarName("Loss")); + auto* objness_mask = ctx.Input("ObjectnessMask"); + auto* gt_match_mask = ctx.Input("GTMatchMask"); + auto anchors = ctx.Attr>("anchors"); + auto anchor_mask = ctx.Attr>("anchor_mask"); + int class_num = ctx.Attr("class_num"); + int downsample_ratio = ctx.Attr("downsample_ratio"); + bool use_label_smooth = ctx.Attr("use_label_smooth"); + + const int n = input_grad->dims()[0]; + const int c = input_grad->dims()[1]; + const int h = input_grad->dims()[2]; + const int w = input_grad->dims()[3]; + const int mask_num = anchor_mask.size(); + const int b = gt_match_mask->dims()[1]; + int input_size = downsample_ratio * h; + + const int stride = h * w; + const int an_stride = (class_num + 5) * stride; + + T label_pos = 1.0; + T label_neg = 0.0; + if (use_label_smooth) { + label_pos = 1.0 - 1.0 / static_cast(class_num); + label_neg = 1.0 / static_cast(class_num); + } + + const T* input_data = input->data(); + const T* gt_box_data = gt_box->data(); + const int* gt_label_data = gt_label->data(); + const T* loss_grad_data = loss_grad->data(); + const T* obj_mask_data = objness_mask->data(); + const int* gt_match_mask_data = gt_match_mask->data(); + T* input_grad_data = + input_grad->mutable_data({n, c, h, w}, ctx.GetPlace()); + memset(input_grad_data, 0, input_grad->numel() * sizeof(T)); + + const T* gt_score_data; + if (!gt_score) { + Tensor gtscore; + gtscore.mutable_data({n, b}, ctx.GetPlace()); + math::SetConstant()( + ctx.template device_context(), >score, + static_cast(1.0)); + gt_score = >score; + gt_score_data = gtscore.data(); + } else { + gt_score_data = gt_score->data(); + } + + for (int i = 0; i < n; i++) { + for (int t = 0; t < b; t++) { + int mask_idx = gt_match_mask_data[i * b + t]; + if (mask_idx >= 0) { + T score = gt_score_data[i * b + t]; + Box gt = GetGtBox(gt_box_data, i, b, t); + int gi = static_cast(gt.x * w); + int gj = static_cast(gt.y * h); + + int box_idx = GetEntryIndex(i, mask_idx, gj * w + gi, mask_num, + an_stride, stride, 0); + CalcBoxLocationLossGrad(input_grad_data, loss_grad_data[i], + input_data, gt, anchors, + anchor_mask[mask_idx], box_idx, gi, gj, h, + input_size, stride, score); + + int label = gt_label_data[i * b + t]; + int label_idx = GetEntryIndex(i, mask_idx, gj * w + gi, mask_num, + an_stride, stride, 5); + CalcLabelLossGrad(input_grad_data, loss_grad_data[i], input_data, + label_idx, label, class_num, stride, label_pos, + label_neg, score); + } + } + } + + CalcObjnessLossGrad(input_grad_data + 4 * stride, loss_grad_data, + input_data + 4 * stride, obj_mask_data, n, mask_num, + h, w, stride, an_stride); + } +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/detection_map_op.cc b/paddle/fluid/operators/detection_map_op.cc index d7f49a9590e4ef4ca4d2ad5a92572c70e6bfb6ac..e1d113f8542da8827b9e36e44fc1bac6c07c9257 100644 --- a/paddle/fluid/operators/detection_map_op.cc +++ b/paddle/fluid/operators/detection_map_op.cc @@ -71,8 +71,7 @@ class DetectionMAPOp : public framework::OperatorWithKernel { framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext& ctx) const override { return framework::OpKernelType( - framework::ToDataType( - ctx.Input("DetectRes")->type()), + ctx.Input("DetectRes")->type(), platform::CPUPlace()); } }; diff --git a/paddle/fluid/operators/dgc_clip_by_norm_op.cc b/paddle/fluid/operators/dgc_clip_by_norm_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..6ebad4de3c8ebc57823709c04498a1f4311942a5 --- /dev/null +++ b/paddle/fluid/operators/dgc_clip_by_norm_op.cc @@ -0,0 +1,67 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include + +#include "paddle/fluid/operators/dgc_clip_by_norm_op.h" + +namespace paddle { +namespace operators { + +class DGCClipByNormOp : public ClipByNormOp { + public: + using ClipByNormOp::ClipByNormOp; + + protected: + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("current_step"), + "current_step should be set."); + + return ClipByNormOp::InferShape(ctx); + } + + framework::OpKernelType GetKernelTypeForVar( + const std::string& var_name, const framework::Tensor& tensor, + const framework::OpKernelType& expected_kernel_type) const override { + if (var_name == "current_step") { + VLOG(10) << "var_name:" << var_name << " need not to transform"; + return expected_kernel_type; + } + + return framework::OperatorWithKernel::GetKernelTypeForVar( + var_name, tensor, expected_kernel_type); + } +}; + +class DGCClipByNormOpMaker : public ClipByNormOpMaker { + public: + void Make() override { + AddInput("current_step", "(Tensor) Current step."); + AddAttr("rampup_begin_step", + "(float, -1.0)" + "The period when begin k_select.") + .SetDefault(-1.0); + + return ClipByNormOpMaker::Make(); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OP_WITHOUT_GRADIENT(dgc_clip_by_norm, ops::DGCClipByNormOp, + ops::DGCClipByNormOpMaker); + +REGISTER_OP_CPU_KERNEL( + dgc_clip_by_norm, + ops::DGCClipByNormKernel); diff --git a/paddle/fluid/operators/dgc_clip_by_norm_op.cu b/paddle/fluid/operators/dgc_clip_by_norm_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..e7f564b7ab4d1c11810dc096faec7f5a375b8563 --- /dev/null +++ b/paddle/fluid/operators/dgc_clip_by_norm_op.cu @@ -0,0 +1,20 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/dgc_clip_by_norm_op.h" + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL( + dgc_clip_by_norm, + ops::DGCClipByNormKernel); diff --git a/paddle/fluid/operators/dgc_clip_by_norm_op.h b/paddle/fluid/operators/dgc_clip_by_norm_op.h new file mode 100644 index 0000000000000000000000000000000000000000..197bf59b2a470e1f6e4e31c6706d1e3f8e73fbbc --- /dev/null +++ b/paddle/fluid/operators/dgc_clip_by_norm_op.h @@ -0,0 +1,49 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "paddle/fluid/operators/clip_by_norm_op.h" + +namespace paddle { +namespace operators { + +template +class DGCClipByNormKernel : public ClipByNormKernel { + public: + void Compute(const framework::ExecutionContext& context) const override { + auto rampup_begin_step = context.Attr("rampup_begin_step"); + if (static_cast(rampup_begin_step) < 0) { + return; + } + + auto current_step_tensor = context.Input("current_step"); + auto* current_step = current_step_tensor->data(); + + VLOG(10) << "current_step:" << *current_step + << ", rampup_begin_step:" << rampup_begin_step; + + if (static_cast(*current_step) < static_cast(rampup_begin_step)) { + VLOG(10) << "current_step:" << *current_step + << " < rampup_begin_step:" << rampup_begin_step + << " so does't use dgc_clip_by_norm"; + return; + } + + return ClipByNormKernel::Compute(context); + }; +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/dgc_op.cc b/paddle/fluid/operators/dgc_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..ccdeea2d0a96342a57ca56ae2b686f81b32fd866 --- /dev/null +++ b/paddle/fluid/operators/dgc_op.cc @@ -0,0 +1,138 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/dgc_op.h" +#include +#include +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { + +class DGCOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("U"), "Input(U) of DGCop should not be null."); + PADDLE_ENFORCE(ctx->HasInput("V"), "Input(V) of DGCop should not be null."); + PADDLE_ENFORCE(ctx->HasInput("Grad"), + "Input(Grad) of DGCop should not be null."); + PADDLE_ENFORCE(ctx->HasInput("current_step"), + "Input(current_step) of DGCop should not be null."); + + PADDLE_ENFORCE(ctx->HasOutput("U_out"), + "Output(U_out) of DGCop should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("V_out"), + "Output(V_out) of DGCop should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("k"), + "Output(k) of DGCop should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("EncodeGrad"), + "Output(EncodeGrad) of DGCop should not be null."); + } + + protected: + framework::OpKernelType GetKernelTypeForVar( + const std::string& var_name, const framework::Tensor& tensor, + const framework::OpKernelType& expected_kernel_type) const override { + if (var_name == "current_step" || var_name == "rampup_step" || + var_name == "k") { + VLOG(10) << "var_name:" << var_name << " need not to transform"; + return expected_kernel_type; + } + + return framework::OperatorWithKernel::GetKernelTypeForVar( + var_name, tensor, expected_kernel_type); + } +}; + +class DGCOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("U", "(Tensor) Middle tensor of DGC"); + AddInput("V", "(Tensor) Middle tensor of DGC"); + AddInput("Grad", "(Tensor) Input gradient"); + AddInput("current_step", "(Tensor) Current step."); + + AddOutput("U_out", + "(Tensor) " + "Output encoded gradient"); + AddOutput("V_out", + "(Tensor) " + "Output encoded gradient"); + AddOutput("EncodeGrad", + "(Tensor) " + "Output encoded gradient"); + AddOutput("Grad_out", + "(Tensor) " + "Output grad gradient"); + AddOutput("k", + "(Tensor) " + "Output top-k value"); + + AddAttr("m", + "(float, 0.9) " + "The momentum of learning rate.") + .SetDefault(0.9); + + AddAttr("use_nesterov", + "(bool, true)" + "The momentum of learning rate.") + .SetDefault(true); + + AddAttr>("sparsity", + "(vecotr, float)" + "The period sparsity of k_select."); + + AddAttr("rampup_begin_step", + "(float, 0.0)" + "The period when begin k_select.") + .SetDefault(0.0); + + AddAttr("rampup_step", + "(float, 0.0)" + "The period when begin k_select."); + + AddComment(R"DOC( + Original paper is https://arxiv.org/abs/1712.01887 + + DGC reduce the communication bandwidth by sending only the important gradients (sparse update):\ + only gradients larger than a threshold are transmitted. + + To avoid losing information, DGC accumulate the rest of the gradients locally. + + Eventually, these gradients become large enough to be transmitted. + + Thus, DGC send the large gradients immediately but eventually send all of the gradients over time. + + To ensure no loss of accuracy, DGC employs momentum correc-tionandlocal gradient clipping on top of the gradient sparsification to maintain model performance. + + DGC also uses momentum factor masking and warmup training to overcome the staleness problem caused by reduced communication. + + This optimizer will do two things: + + 1. Compress the gradient by get TopK import value from tensor \ + and use it for allreduce to reduce network bandwidth. + + 2. Call momentum to optimize on the cost. + +)DOC"); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OP_WITHOUT_GRADIENT(dgc, ops::DGCOp, ops::DGCOpMaker); diff --git a/paddle/fluid/operators/dgc_op.cu b/paddle/fluid/operators/dgc_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..0f0bf441a70bef9cb69362a9cf333aeb51e835b6 --- /dev/null +++ b/paddle/fluid/operators/dgc_op.cu @@ -0,0 +1,20 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/dgc_op.h" + +namespace ops = paddle::operators; + +REGISTER_OP_CUDA_KERNEL( + dgc, ops::DGCOpKernel); diff --git a/paddle/fluid/operators/dgc_op.h b/paddle/fluid/operators/dgc_op.h new file mode 100644 index 0000000000000000000000000000000000000000..8d1683bdb2d521971ffbfa8d60b138a67d7eb52c --- /dev/null +++ b/paddle/fluid/operators/dgc_op.h @@ -0,0 +1,132 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include "dgc/dgc.h" +#include "paddle/fluid/framework/eigen.h" +#include "paddle/fluid/operators/elementwise/elementwise_add_op.h" + +namespace paddle { +namespace operators { + +inline float get_period_sparcity(const std::vector& sparsity, + float cur_step, float rampup_steps) { + PADDLE_ENFORCE(static_cast(cur_step) >= 0); + + size_t idx = static_cast(cur_step * sparsity.size() / rampup_steps); + if (idx >= sparsity.size()) { + return 0.999; + } + + PADDLE_ENFORCE(idx < sparsity.size()); + return sparsity[idx]; +} + +template +class DGCOpKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto u = ctx.Input("U"); + auto v = ctx.Input("V"); + auto g = ctx.Input("Grad"); + + // attrs + float m = ctx.Attr("m"); + bool use_nesterov = ctx.Attr("use_nesterov"); + auto sparsity = ctx.Attr>("sparsity"); + auto rampup_begin_step = ctx.Attr("rampup_begin_step"); + auto rampup_step = ctx.Attr("rampup_step"); + + // current step + auto current_step_tensor = ctx.Input("current_step"); + const float* current_step = current_step_tensor->data(); + + if (static_cast(*current_step) < static_cast(rampup_begin_step)) { + VLOG(10) << "current_step:" << *current_step + << " < rampup_begin_step:" << rampup_begin_step + << " so does't use dgc"; + return; + } + + float ratio = + 1 - get_period_sparcity(sparsity, static_cast(*current_step), + rampup_step); + PADDLE_ENFORCE(ratio > 0.0 && ratio < 1.0); + int k = static_cast(g->numel() * ratio); + + VLOG(10) << "m:" << m << ", use_nesterov:" << use_nesterov + << ", rampup_begin_step:" << rampup_begin_step + << ", rampup_step:" << rampup_step + << ", current_step:" << *current_step << ", ratio:" << ratio + << ", k:" << k; + + auto k_out = ctx.Output("k"); + T* k_out_data = k_out->data(); + *k_out_data = k; + + auto u_out = ctx.Output("U_out"); + auto v_out = ctx.Output("V_out"); + auto encode_grad_out = ctx.Output("EncodeGrad"); + + // FIXME(gongwb): use cublas. + auto u_out_e = framework::EigenVector::Flatten(*u_out); + auto u_e = framework::EigenVector::Flatten(*u); + auto g_e = framework::EigenVector::Flatten(*g); + auto& dev_ctx = ctx.template device_context(); + auto& eigen_ctx = *dev_ctx.eigen_device(); + if (use_nesterov) { + // u = m * (u + g) + u_out_e.device(eigen_ctx) = m * (u_e + g_e); + + // v = u + v + g + ElementwiseComputeEx, DeviceContext, T>( + ctx, u, v, 0, AddFunctor(), v_out); + + ElementwiseComputeEx, DeviceContext, T>( + ctx, g, v, 0, AddFunctor(), v_out); + } else { + // u = m * u + g + u_out_e.device(eigen_ctx) = m * u_e + g_e; + + // v = u + v + ElementwiseComputeEx, DeviceContext, T>( + ctx, u, v, 0, AddFunctor(), v_out); + } + + T* v_out_data = v_out->mutable_data(ctx.GetPlace()); + T* u_out_data = u_out->mutable_data(ctx.GetPlace()); + T* encode_grad_out_data = encode_grad_out->mutable_data( + framework::DDim{2 * k}, ctx.GetPlace()); + + int buf_size = paddle::communication::dgc::get_buffer_size(k); + auto& allocator = platform::DeviceTemporaryAllocator::Instance().Get( + ctx.GetPlace(), dev_ctx.stream()); + auto tmp_ious_data = allocator.Allocate(buf_size); + void* buf = reinterpret_cast(tmp_ious_data->ptr()); + + if (!paddle::communication::dgc::k_select( + static_cast(encode_grad_out_data), k, v_out_data, + static_cast(v_out->numel()), buf, dev_ctx.stream(), + u_out_data)) { + LOG(FATAL) << "v_out numel:" << v_out->numel(); + } + + auto grad_out = ctx.Output("Grad_out"); + math::SetConstant tset; + tset(dev_ctx, grad_out, static_cast(0)); + } +}; +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/distributed/CMakeLists.txt b/paddle/fluid/operators/distributed/CMakeLists.txt index 21db93958a4a586c74a1e060f1f04b5af1dcd889..972b4f67a8388ce68952fa90aaa224cd45c6d226 100644 --- a/paddle/fluid/operators/distributed/CMakeLists.txt +++ b/paddle/fluid/operators/distributed/CMakeLists.txt @@ -9,36 +9,55 @@ else() endif() configure_file(send_recv.proto.in ${CMAKE_CURRENT_SOURCE_DIR}/send_recv.proto @ONLY) +# FIXME(typhoonzero): use add_subdirectory once we clean the dependency of these files +set(DISTRIBUTE_COMPILE_FLAGS "-Wno-non-virtual-dtor -Wno-error=non-virtual-dtor -Wno-error=delete-non-virtual-dtor") if(WITH_GRPC) - grpc_library(sendrecvop_grpc SRCS grpc_bytebuffer_stream.cc sendrecvop_utils.cc grpc_client.cc - request_handler_impl.cc rpc_client.cc rpc_server.cc grpc_server.cc variable_response.cc grpc_variable_response.cc grpc_serde.cc + set(GRPC_DEPS grpc++_unsecure grpc_unsecure gpr cares zlib protobuf) + set(GRPC_SRCS grpc/grpc_client.cc grpc/grpc_server.cc grpc/grpc_serde.cc grpc/grpc_bytebuffer_stream.cc grpc/grpc_variable_response.cc) + grpc_library(sendrecvop_rpc SRCS sendrecvop_utils.cc + request_handler_impl.cc rpc_client.cc rpc_server.cc + variable_response.cc + collective_client.cc collective_server.cc + ${GRPC_SRCS} PROTO send_recv.proto - DEPS lod_tensor selected_rows memory) - set(DISTRIBUTE_COMPILE_FLAGS "-Wno-non-virtual-dtor -Wno-error=non-virtual-dtor -Wno-error=delete-non-virtual-dtor") - set_source_files_properties(grpc_serde_test.cc rpc_server_test.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) - cc_test(grpc_serde_test SRCS grpc_serde_test.cc - DEPS grpc++_unsecure grpc_unsecure gpr cares zlib protobuf sendrecvop_grpc scope profiler math_function SERIAL) - cc_test(rpc_server_test SRCS rpc_server_test.cc - DEPS sendrecvop_grpc grpc++_unsecure grpc_unsecure gpr cares zlib protobuf executor proto_desc lookup_sparse_table_op SERIAL) - cc_test(varhandle_test SRCS varhandle_test.cc DEPS profiler) - return() -endif() - - -set(DISTRIBUTE_COMPILE_FLAGS "-Wno-non-virtual-dtor -Wno-error=non-virtual-dtor -Wno-error=delete-non-virtual-dtor") + DEPS lod_tensor selected_rows_functor memory scope ${GRPC_DEPS}) -set_source_files_properties(brpc_server.cc brpc_client.cc rpc_server_test.cc brpc_serde_test.cc - brpc_variable_response.cc brpc_sendrecvop_utils.cc brpc_rdma_pool.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) + set_source_files_properties(grpc_serde_test.cc rpc_server_test.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) + set(RPC_DEPS sendrecvop_rpc ${GRPC_DEPS}) -brpc_library(sendrecvop_brpc SRCS brpc_client.cc brpc_server.cc rpc_server.cc rpc_client.cc request_handler_impl.cc brpc_sendrecvop_utils.cc - brpc_variable_response.cc variable_response.cc sendrecvop_utils.cc brpc_rdma_pool.cc - PROTO send_recv.proto - DEPS lod_tensor selected_rows memory) + cc_test(grpc_serde_test SRCS grpc/grpc_serde_test.cc + DEPS ${RPC_DEPS} scope profiler math_function SERIAL) -set(brpc_test_depends sendrecvop_brpc brpc ssl crypto protobuf leveldb gflags glog executor proto_desc lookup_table_op snappystream snappy) +else() + set(BRPC_SRCS brpc/brpc_client.cc brpc/brpc_server.cc brpc/brpc_sendrecvop_utils.cc brpc/brpc_variable_response.cc brpc/brpc_rdma_pool.cc) + set_source_files_properties(${BRPC_SRCS} parameter_prefetch.cc parameter_send.cc parameter_recv.cc communicator.cc rpc_server_test.cc brpc/brpc_serde_test.cc collective_server.cc collective_server_test.cc collective_client.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) + + set(BRPC_DEPS brpc ssl crypto protobuf leveldb snappystream snappy zlib) + + brpc_library(sendrecvop_rpc SRCS sendrecvop_utils.cc + request_handler_impl.cc rpc_client.cc rpc_server.cc + variable_response.cc + collective_client.cc collective_server.cc + ${BRPC_SRCS} + PROTO send_recv.proto + DEPS lod_tensor selected_rows memory scope ${BRPC_DEPS}) + + set(RPC_DEPS sendrecvop_rpc ${BRPC_DEPS}) + cc_test(brpc_serde_test SRCS brpc/brpc_serde_test.cc + DEPS ${RPC_DEPS} gflags glog executor proto_desc lookup_sparse_table_op SERIAL) +endif() -cc_test(brpc_server_test SRCS rpc_server_test.cc - DEPS ${brpc_test_depends} SERIAL) -cc_test(brpc_serde_test SRCS brpc_serde_test.cc - DEPS ${brpc_test_depends} SERIAL) +cc_test(rpc_server_test SRCS rpc_server_test.cc + DEPS ${RPC_DEPS} executor proto_desc lookup_sparse_table_op SERIAL) +cc_test(varhandle_test SRCS varhandle_test.cc DEPS profiler scope) +cc_library(parameter_prefetch SRCS parameter_prefetch.cc DEPS sendrecvop_rpc memory) +cc_library(parameter_send SRCS parameter_send.cc DEPS sendrecvop_rpc memory) +cc_library(parameter_recv SRCS parameter_recv.cc DEPS sendrecvop_rpc memory) +cc_library(communicator SRCS communicator.cc DEPS scope selected_rows tensor variable_helper selected_rows_functor simple_threadpool parameter_send parameter_recv) +cc_test(communicator_test SRCS communicator_test.cc DEPS communicator) +if(WITH_GPU) + cc_test(collective_server_test SRCS collective_server_test.cc + DEPS sendrecvop_rpc executor ${RPC_DEPS} + selected_rows_functor scope math_function SERIAL) +endif() diff --git a/paddle/fluid/operators/distributed/brpc/brpc_client.cc b/paddle/fluid/operators/distributed/brpc/brpc_client.cc new file mode 100644 index 0000000000000000000000000000000000000000..a1a3443348129b5cdf057592fced8fdff238ac09 --- /dev/null +++ b/paddle/fluid/operators/distributed/brpc/brpc_client.cc @@ -0,0 +1,455 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/operators/distributed/brpc/brpc_client.h" +#include "paddle/fluid/framework/threadpool.h" +#include "paddle/fluid/operators/distributed/brpc/brpc_sendrecvop_utils.h" +#include "paddle/fluid/platform/profiler.h" + +namespace paddle { +namespace operators { +namespace distributed { + +DEFINE_int32(timeout_ms, 30000, "RPC timeout in milliseconds"); +DEFINE_int32(max_retry, 3, "Max retries(not including the first RPC)"); + +BRPCClient::~BRPCClient() { Wait(); } + +void HandleSendResponse(brpc::Controller* cntl, sendrecv::VoidMessage* response, + VarHandlePtr var_h, ChannelQueuePtr ch_ptr, + ChannelContextPtr ch_ctx, BRPCClient* cls) { + // std::unique_ptr makes sure cntl/response will be deleted before returning. + std::unique_ptr cntl_guard(cntl); + std::unique_ptr response_guard(response); + + // this channel can be used by other now. + ch_ptr->Push(ch_ctx); + + if (cntl->Failed()) { + LOG(FATAL) << "Fail to send SendVar: " << var_h->name() + << ", error text: " << cntl->ErrorText(); + var_h->Finish(false); + cls->DecreaseReqCount(); + return; + } + var_h->Finish(true); + cls->DecreaseReqCount(); + + VLOG(4) << "HandleSendResponse from: " << cntl->remote_side() + << ", varname: " << var_h->name() + << ", latency: " << cntl->latency_us() << "us"; + VLOG(4) << "Finish HandleSendResponse"; +} + +VarHandlePtr BRPCClient::AsyncSendVar(const std::string& ep, + const platform::DeviceContext& ctx, + const framework::Scope& scope, + const std::string& var_name, + int64_t time_out) { + const platform::DeviceContext* p_ctx = &ctx; + const std::string ep_val = ep; + const std::string var_name_val = var_name; + const framework::Scope* p_scope = &scope; + const auto ch_ptr = GetChannel(ep_val); + const std::string method = kSendRPC; + VarHandlePtr var_h(new VarHandle(ep, method, var_name_val, p_ctx, p_scope)); + + framework::AsyncIO([=] { + auto ch_ctx = ch_ptr->Pop(); + brpc::Controller* cntl = new brpc::Controller(); + sendrecv::VoidMessage* response = new sendrecv::VoidMessage(); + cntl->set_timeout_ms(time_out); + + auto* var = p_scope->FindVar(var_name_val); + sendrecv::VariableMessage request; + distributed::SerializeToIOBuf(var_name_val, var, *p_ctx, &request, + &cntl->request_attachment(), "", false, + trainer_id_); + + google::protobuf::Closure* done = brpc::NewCallback( + &HandleSendResponse, cntl, response, var_h, ch_ptr, ch_ctx, this); + + platform::RecordRPCEvent record_event(method); + + ch_ctx->stub->SendVariable(cntl, &request, response, done); + + if (UNLIKELY(platform::IsProfileEnabled())) { + var_h->Wait(); + } + }); + req_count_++; + + return var_h; +} +void HandleFetchBarrierResponse(brpc::Controller* cntl, + sendrecv::VariableMessage* response, + VarHandlePtr var_h, ChannelQueuePtr ch_ptr, + ChannelContextPtr ch_ctx, BRPCClient* cls) { + // std::unique_ptr makes sure cntl/response will be deleted before returning. + std::unique_ptr cntl_guard(cntl); + std::unique_ptr response_guard(response); + + // this channel can be used other now. + ch_ptr->Push(ch_ctx); + + if (cntl->Failed()) { + LOG(FATAL) << "Fail to get HandleFetchBarrierResponse: " << var_h->name() + << ", error text: " << cntl->ErrorText(); + var_h->Finish(false); + cls->DecreaseReqCount(); + return; + } + + var_h->Finish(true); + cls->DecreaseReqCount(); + + VLOG(4) << "HandleFetchBarrierResponse from: " << cntl->remote_side() + << ", varname: " << var_h->name() + << ", latency: " << cntl->latency_us() << "us"; + VLOG(4) << "Finish HandleFetchBarrierResponse"; +} +void HandleGetResponse(brpc::Controller* cntl, + sendrecv::VariableMessage* response, VarHandlePtr var_h, + ChannelQueuePtr ch_ptr, ChannelContextPtr ch_ctx, + BRPCClient* cls) { + // std::unique_ptr makes sure cntl/response will be deleted before returning. + std::unique_ptr cntl_guard(cntl); + std::unique_ptr response_guard(response); + + // this channel can be used other now. + ch_ptr->Push(ch_ctx); + + if (cntl->Failed()) { + LOG(FATAL) << "Fail to GetVar: " << var_h->name() + << ", error text: " << cntl->ErrorText(); + cls->DecreaseReqCount(); + var_h->Finish(false); + return; + } + + VLOG(4) << "HandleGetResponse from: " << cntl->remote_side() + << ", varname: " << var_h->name() + << ", latency: " << cntl->latency_us() << "us"; + + framework::Variable* outvar = nullptr; + int trainer_id; + distributed::DeserializeFromIOBuf(*response, cntl->response_attachment(), + *var_h->ctx(), var_h->scope(), &outvar, + &trainer_id); + VLOG(4) << "Finish HandleGetResponse"; + cls->DecreaseReqCount(); + var_h->Finish(true); +} + +VarHandlePtr BRPCClient::_AsyncGetVar(const std::string& ep, + const platform::DeviceContext& ctx, + const framework::Scope& scope, + const std::string& var_name, + const std::string& out_var_name, + const std::string& method_name, + int64_t time_out) { + const platform::DeviceContext* p_ctx = &ctx; + const std::string ep_val = ep; + const std::string var_name_val = var_name; + const std::string out_varname_val = out_var_name; + const framework::Scope* p_scope = &scope; + const auto ch_ptr = GetChannel(ep_val); + const std::string method = kGetRPC; + VarHandlePtr var_h( + new VarHandle(ep, method, out_varname_val, p_ctx, p_scope)); + + framework::AsyncIO([=] { + auto ch_ctx = ch_ptr->Pop(); + + brpc::Controller* cntl = new brpc::Controller(); + sendrecv::VariableMessage* response = new sendrecv::VariableMessage(); + cntl->set_timeout_ms(time_out); + + sendrecv::VariableMessage req; + req.set_varname(var_name_val); + req.set_out_varname(out_varname_val); + req.set_trainer_id(trainer_id_); + + google::protobuf::Closure* done = brpc::NewCallback( + &HandleGetResponse, cntl, response, var_h, ch_ptr, ch_ctx, this); + + platform::RecordRPCEvent record_event(method); + + if (method_name == kGetMonomerRPC) { + ch_ctx->stub->GetMonomerVariable(cntl, &req, response, done); + } else if (method_name == kGetNoBarrierRPC) { + ch_ctx->stub->GetVariableNoBarrier(cntl, &req, response, done); + } else { + ch_ctx->stub->GetVariable(cntl, &req, response, done); + } + + if (UNLIKELY(platform::IsProfileEnabled())) { + var_h->Wait(); + } + }); + + req_count_++; + + return var_h; +} + +VarHandlePtr BRPCClient::AsyncGetVarNoBarrier( + const std::string& ep, const platform::DeviceContext& ctx, + const framework::Scope& scope, const std::string& var_name, + const std::string& out_var_name, int64_t time_out) { + std::string var_name_no_barrier = + string::Sprintf("%s%s", var_name, WITHOUT_BARRIER_MESSAGE); + + return _AsyncGetVar(ep, ctx, scope, var_name_no_barrier, out_var_name, + kGetNoBarrierRPC, time_out); +} + +VarHandlePtr BRPCClient::AsyncGetMonomerVariable( + const std::string& ep, const platform::DeviceContext& ctx, + const framework::Scope& scope, const std::string& var_name, + int64_t time_out) { + return _AsyncGetVar(ep, ctx, scope, var_name, var_name, kGetMonomerRPC, + time_out); +} + +VarHandlePtr BRPCClient::AsyncGetMonomerBarrier(const std::string& ep, + const std::string& var_name, + int64_t time_out) { + return AsyncSendMessage(ep, kSendMonomerFetchBarrierRPC, var_name, time_out); +} + +VarHandlePtr BRPCClient::AsyncGetVar(const std::string& ep, + const platform::DeviceContext& ctx, + const framework::Scope& scope, + const std::string& var_name, + const std::string& out_var_name, + int64_t time_out) { + return _AsyncGetVar(ep, ctx, scope, var_name, out_var_name, kGetRPC, + time_out); +} + +VarHandlePtr BRPCClient::AsyncPrefetchVar(const std::string& ep, + const platform::DeviceContext& ctx, + const framework::Scope& scope, + const std::string& in_var_name, + const std::string& out_var_name, + const std::string& table_name, + int64_t time_out) { + const platform::DeviceContext* p_ctx = &ctx; + const std::string ep_val = ep; + const std::string in_var_name_val = in_var_name; + const std::string out_var_name_val = out_var_name; + const std::string table_name_val = table_name; + const framework::Scope* p_scope = &scope; + const auto ch_ptr = GetChannel(ep_val); + + const std::string method = kPrefetchRPC; + + VarHandlePtr var_h( + new VarHandle(ep, method, out_var_name_val, p_ctx, p_scope)); + + framework::AsyncIO([=] { + auto ch_ctx = ch_ptr->Pop(); + + brpc::Controller* cntl = new brpc::Controller(); + sendrecv::VariableMessage* response = new sendrecv::VariableMessage(); + cntl->set_timeout_ms(time_out); + + auto* var = p_scope->FindVar(in_var_name_val); + sendrecv::VariableMessage req; + distributed::SerializeToIOBuf(in_var_name_val, var, *p_ctx, &req, + &cntl->request_attachment(), out_var_name_val, + false, 0, table_name_val); + + platform::RecordRPCEvent record_event(method); + + google::protobuf::Closure* done = brpc::NewCallback( + &HandleGetResponse, cntl, response, var_h, ch_ptr, ch_ctx, this); + + ch_ctx->stub->PrefetchVariable(cntl, &req, response, done); + + if (UNLIKELY(platform::IsProfileEnabled())) { + var_h->Wait(); + } + }); + + req_count_++; + return var_h; +} + +VarHandlePtr BRPCClient::AsyncSendBatchBarrier(const std::string& ep, + int64_t time_out) { + return AsyncSendMessage(ep, kBatchBarrierRPC, BATCH_BARRIER_MESSAGE, + time_out); +} + +VarHandlePtr BRPCClient::AsyncSendFetchBarrier(const std::string& ep, + int64_t time_out) { + auto ch_ptr = GetChannel(ep); + auto ch_ctx = ch_ptr->Pop(); + + brpc::Controller* cntl = new brpc::Controller(); + sendrecv::VariableMessage* response = new sendrecv::VariableMessage(); + cntl->set_timeout_ms(time_out); + + sendrecv::VariableMessage req; + req.set_varname(FETCH_BARRIER_MESSAGE); + + const std::string method = kFetchBarrierRPC; + // var handle + VarHandlePtr var_h( + new VarHandle(ep, method, FETCH_BARRIER_MESSAGE, nullptr, nullptr)); + + platform::RecordRPCEvent record_event(method); + + google::protobuf::Closure* done = brpc::NewCallback( + &HandleFetchBarrierResponse, cntl, response, var_h, ch_ptr, ch_ctx, this); + + ch_ctx->stub->GetVariable(cntl, &req, response, done); + + req_count_++; + + if (UNLIKELY(platform::IsProfileEnabled())) { + var_h->Wait(); + } + + return var_h; +} + +bool BRPCClient::Wait() { + VLOG(9) << "begin to brpcclient wait"; + { + std::unique_lock lk(sync_mutex_); + sync_cond_.wait(lk, [this] { return req_count_ == 0; }); + } + VLOG(9) << "end to brpcclient wait"; + return true; +} + +ChannelQueuePtr BRPCClient::GetChannel(const std::string& ep) { + VLOG(4) << "begin to GetChannel:" << ep; + { + std::lock_guard guard(chan_mutex_); + auto it = channels_.find(ep); + if (it != channels_.end()) { + VLOG(4) << "end to GetChannel:" << ep; + return it->second; + } + } + + ChannelQueuePtr q(new framework::BlockingQueue()); + + brpc::ChannelOptions options; +#ifdef PADDLE_WITH_BRPC_RDMA + options.use_rdma = true; +#endif + options.protocol = "baidu_std"; + // don't use pooled type. the server can't afford that. + options.connection_type = "single"; + options.connect_timeout_ms = 1000; + options.timeout_ms = FLAGS_timeout_ms /*milliseconds*/; + options.max_retry = FLAGS_max_retry; + + VLOG(1) << "create " << brpc_channel_num_per_server_ + << " brpc channels to pserver:" << ep; + + for (int i = 0; i < brpc_channel_num_per_server_; ++i) { + std::shared_ptr c(new ChannelContext()); + if (c->channel.Init(ep.c_str(), &options) != 0) { + LOG(FATAL) << "Fail to initialize channel"; + return nullptr; + } + + c->stub.reset(new sendrecv::SendRecvService_Stub( + static_cast(&c->channel))); + q->Push(c); + } + + { + std::lock_guard guard(chan_mutex_); + channels_[ep] = q; + } + + VLOG(4) << "end to GetChannel:" << ep; + return q; +} + +VarHandlePtr BRPCClient::AsyncSendComplete(const std::string& ep, + int64_t time_out) { + return AsyncSendMessage(ep, kSendCompleteRPC, COMPLETE_MESSAGE, time_out); +} + +void BRPCClient::SendComplete() { + for (auto& kv : channels_) { + AsyncSendComplete(kv.first); + } +} + +VarHandlePtr BRPCClient::AsyncSendVarMessage( + const std::string& ep, const std::string& method_name, + const sendrecv::VariableMessage& req, int64_t time_out) { + auto ch_ptr = GetChannel(ep); + auto ch_ctx = ch_ptr->Pop(); + + brpc::Controller* cntl = new brpc::Controller(); + sendrecv::VoidMessage* response = new sendrecv::VoidMessage(); + cntl->set_timeout_ms(time_out); + + platform::RecordRPCEvent record_event(method_name); + + VarHandlePtr var_h( + new VarHandle(ep, method_name, req.varname(), nullptr, nullptr)); + + google::protobuf::Closure* done = brpc::NewCallback( + &HandleSendResponse, cntl, response, var_h, ch_ptr, ch_ctx, this); + + if (method_name == kCheckPointNotifyRPC) { + ch_ctx->stub->CheckpointNotify(cntl, &req, response, done); + } else if (method_name == kSendMonomerFetchBarrierRPC) { + ch_ctx->stub->GetMonomerBarrier(cntl, &req, response, done); + } else { + ch_ctx->stub->SendVariable(cntl, &req, response, done); + } + req_count_++; + + if (UNLIKELY(platform::IsProfileEnabled())) { + var_h->Wait(); + } + + return var_h; +} + +VarHandlePtr BRPCClient::AsyncSendMessage(const std::string& ep, + const std::string& method_name, + const std::string& message, + int64_t time_out) { + sendrecv::VariableMessage req; + req.set_varname(message); + + return AsyncSendVarMessage(ep, method_name, req, time_out); +} + +VarHandlePtr BRPCClient::AsyncCheckpointNotify(const std::string& ep, + const std::string& dir, + int64_t time_out) { + sendrecv::VariableMessage req; + req.set_varname(CHECKPOINT_SAVE_MESSAGE); + req.set_out_varname(dir); + + return AsyncSendVarMessage(ep, "CheckPointNotifyRPC", req, time_out); +} + +} // namespace distributed +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/distributed/brpc/brpc_client.h b/paddle/fluid/operators/distributed/brpc/brpc_client.h new file mode 100644 index 0000000000000000000000000000000000000000..501a593b11d35c160348e42ee47216a85647aac4 --- /dev/null +++ b/paddle/fluid/operators/distributed/brpc/brpc_client.h @@ -0,0 +1,172 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include + +#include // NOLINT +#include +#include +#include +#include +#include // NOLINT +#include +#include + +#include "brpc/channel.h" +#include "paddle/fluid/framework/blocking_queue.h" +#include "paddle/fluid/framework/data_type.h" +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/framework/selected_rows.h" +#include "paddle/fluid/operators/distributed/brpc/brpc_sendrecvop_utils.h" +#include "paddle/fluid/operators/distributed/distributed_pb.h" +#include "paddle/fluid/operators/distributed/request_handler.h" +#include "paddle/fluid/operators/distributed/rpc_client.h" +#include "paddle/fluid/platform/macros.h" // for DISABLE_COPY_AND_ASSIGN + +namespace paddle { +namespace operators { +namespace distributed { + +struct ChannelContext { + brpc::Channel channel; + std::shared_ptr stub; +}; + +typedef std::shared_ptr ChannelContextPtr; +typedef std::shared_ptr> + ChannelQueuePtr; + +class BRPCClient : public RPCClient { + public: + BRPCClient() {} + virtual ~BRPCClient(); + + VarHandlePtr AsyncSendVar(const std::string& ep, + const platform::DeviceContext& ctx, + const framework::Scope& scope, + const std::string& var_name, + int64_t time_out = FLAGS_rpc_deadline) override; + + VarHandlePtr AsyncGetVar(const std::string& ep, + const platform::DeviceContext& ctx, + const framework::Scope& scope, + const std::string& var_name, + const std::string& out_var_name, + int64_t time_out = FLAGS_rpc_deadline) override; + + VarHandlePtr AsyncGetMonomerBarrier( + const std::string& ep, const std::string& var_name, + int64_t time_out = FLAGS_rpc_deadline) override; + + VarHandlePtr AsyncGetMonomerVariable( + const std::string& ep, const platform::DeviceContext& ctx, + const framework::Scope& scope, const std::string& var_name, + int64_t time_out = FLAGS_rpc_deadline) override; + + VarHandlePtr AsyncGetVarNoBarrier(const std::string& ep, + const platform::DeviceContext& ctx, + const framework::Scope& scope, + const std::string& var_name, + const std::string& out_varname, + int64_t time_out = FLAGS_rpc_deadline); + + VarHandlePtr AsyncPrefetchVar(const std::string& ep, + const platform::DeviceContext& ctx, + const framework::Scope& scope, + const std::string& in_var_name, + const std::string& out_var_name, + const std::string& table_name = "", + int64_t time_out = FLAGS_rpc_deadline) override; + + VarHandlePtr AsyncSendBatchBarrier( + const std::string& ep, int64_t time_out = FLAGS_rpc_deadline) override; + + VarHandlePtr AsyncSendFetchBarrier( + const std::string& ep, int64_t time_out = FLAGS_rpc_deadline) override; + + VarHandlePtr AsyncCheckpointNotify( + const std::string& ep, const std::string& dir, + int64_t time_out = FLAGS_rpc_deadline) override; + + bool Wait() override; + + void SendComplete() override; + + private: + VarHandlePtr _AsyncGetVar(const std::string& ep, + const platform::DeviceContext& ctx, + const framework::Scope& scope, + const std::string& var_name, + const std::string& out_var_name, + const std::string& method_name, + int64_t time_out = FLAGS_rpc_deadline); + + void Proceed(); + ChannelQueuePtr GetChannel(const std::string& ep); + + VarHandlePtr AsyncSendComplete(const std::string& ep, + int64_t time_out = FLAGS_rpc_deadline); + + VarHandlePtr AsyncSendMessage(const std::string& ep, + const std::string& method_name, + const std::string& message, int64_t time_out); + + VarHandlePtr AsyncSendVarMessage(const std::string& ep, + const std::string& method_name, + const sendrecv::VariableMessage& req, + int64_t time_out); + + friend void HandleSendResponse(brpc::Controller* cntl, + sendrecv::VoidMessage* response, + VarHandlePtr var_h, ChannelQueuePtr ch_ptr, + ChannelContextPtr ch_ctx, BRPCClient* cls); + + friend void HandleGetResponse(brpc::Controller* cntl, + sendrecv::VariableMessage* response, + VarHandlePtr var_h, ChannelQueuePtr ch_ptr, + ChannelContextPtr ch_ctx, BRPCClient* cls); + + friend void HandleFetchBarrierResponse(brpc::Controller* cntl, + sendrecv::VariableMessage* response, + VarHandlePtr var_h, + ChannelQueuePtr ch_ptr, + ChannelContextPtr ch_ctx, + BRPCClient* cls); + void DecreaseReqCount() { + if (--req_count_ <= 0) { + sync_cond_.notify_all(); + } + } + + private: + std::unordered_map channels_; + + // mutex for Wait client sync + std::mutex sync_mutex_; + std::condition_variable sync_cond_; + std::atomic req_count_{0}; + + static constexpr int brpc_channel_num_per_server_ = 4; + + // mutex for GetChannel thread safety + std::mutex chan_mutex_; + DISABLE_COPY_AND_ASSIGN(BRPCClient); +}; + +} // namespace distributed +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/distributed/brpc/brpc_rdma_pool.cc b/paddle/fluid/operators/distributed/brpc/brpc_rdma_pool.cc new file mode 100644 index 0000000000000000000000000000000000000000..d5c614001e0b2ff24812d5326318883de938fbb8 --- /dev/null +++ b/paddle/fluid/operators/distributed/brpc/brpc_rdma_pool.cc @@ -0,0 +1,84 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#ifdef PADDLE_WITH_BRPC_RDMA + +#include "paddle/fluid/operators/distributed/brpc/brpc_rdma_pool.h" +#include "brpc/channel.h" +#include "brpc/rdma/rdma_helper.h" +#include "paddle/fluid/platform/enforce.h" + +namespace paddle { +namespace operators { +namespace distributed { + +RdmaMemPool& RdmaMemPool::Instance() { + static RdmaMemPool* g_rdma_mem_pool = new RdmaMemPool(); + return *g_rdma_mem_pool; +} + +void* RdmaMemPool::Find(const std::string& varname, int64_t size) { + pthread_rwlock_rdlock(&access_); + auto it = pool_.find(varname); + if (it == pool_.end()) { + pthread_rwlock_unlock(&access_); + return nullptr; + } + + auto info = it->second; + if (info.data_size != size) { + pthread_rwlock_unlock(&access_); + PADDLE_ENFORCE(false, "var:%s size:%ld != %ld", varname, size, + info.data_size); + return nullptr; + } + + pthread_rwlock_unlock(&access_); + return info.data; +} + +void RdmaMemPool::Register(const std::string& varname, void* data, + int64_t data_size) { + void* old = Find(varname, data_size); + if (old != nullptr) { + if (data != old) { + PADDLE_ENFORCE(false, "var:%s data:%ld != %ld", varname, data, old); + } + VLOG(7) << "Find on rdma:" << varname << " data:" << data + << " data_size:" << data_size; + return; + } + + VarInfo info; + info.data = data; + info.data_size = data_size; + + pthread_rwlock_wrlock(&access_); + pool_[varname] = info; + pthread_rwlock_unlock(&access_); + + if (brpc::rdma::RegisterMemoryForRdma(data, data_size)) { + LOG(FATAL) << "register " << varname << " data:" << data + << " data_size:" << data_size << " error"; + } + + VLOG(4) << "register on rdma:" << varname << " data:" << data + << " data_size:" << data_size; +} + +} // namespace distributed +} // namespace operators +} // namespace paddle + +#endif diff --git a/paddle/fluid/operators/distributed/brpc/brpc_rdma_pool.h b/paddle/fluid/operators/distributed/brpc/brpc_rdma_pool.h new file mode 100644 index 0000000000000000000000000000000000000000..156a93ec5784715c0a68c1af2e31d640dfc60277 --- /dev/null +++ b/paddle/fluid/operators/distributed/brpc/brpc_rdma_pool.h @@ -0,0 +1,56 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#ifdef PADDLE_WITH_BRPC_RDMA + +#include // NOLINT +#include +#include + +namespace paddle { +namespace operators { +namespace distributed { + +/* + * This class is used to avoid duplicated registion of brpc::rdma. + */ +class RdmaMemPool { + public: + static RdmaMemPool& Instance(); + RdmaMemPool() : access_(PTHREAD_RWLOCK_INITIALIZER) {} + + virtual ~RdmaMemPool() { pthread_rwlock_destroy(&access_); } + + void Register(const std::string& varname, void* data, int64_t size); + void* Find(const std::string& varname, int64_t size); + + private: + struct VarInfo { + void* data; + int64_t data_size; + + VarInfo() : data(nullptr), data_size(0) {} + }; + + private: + std::unordered_map pool_; + pthread_rwlock_t access_; +}; + +} // namespace distributed +} // namespace operators +} // namespace paddle + +#endif diff --git a/paddle/fluid/operators/distributed/brpc/brpc_sendrecvop_utils.cc b/paddle/fluid/operators/distributed/brpc/brpc_sendrecvop_utils.cc new file mode 100644 index 0000000000000000000000000000000000000000..49e048f07a2396824a51db5c6012206bd8848e82 --- /dev/null +++ b/paddle/fluid/operators/distributed/brpc/brpc_sendrecvop_utils.cc @@ -0,0 +1,207 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#ifdef PADDLE_WITH_CUDA +#include +#endif +#include +#include +#include // NOLINT + +#include "paddle/fluid/framework/data_type.h" +#include "paddle/fluid/operators/distributed/brpc/brpc_rdma_pool.h" +#include "paddle/fluid/operators/distributed/brpc/brpc_sendrecvop_utils.h" +#include "paddle/fluid/operators/distributed/brpc/brpc_variable_response.h" +#include "paddle/fluid/operators/distributed/distributed_pb.h" +#include "paddle/fluid/platform/profiler.h" + +namespace paddle { +namespace operators { +namespace distributed { + +class IOBufWriter { + public: + static void Append(const std::string& varname, butil::IOBuf* iobuf, int k, + const char* v, int64_t vlen) { + if (vlen >= std::numeric_limits::max() || vlen < 0) { + LOG(FATAL) << "AppendZeroCopy varname:" << varname << ", vlen:" << vlen; + } + + iobuf->append(reinterpret_cast(&k), 4); + iobuf->append(reinterpret_cast(&vlen), 8); + iobuf->append(v, vlen); + } + + static void AppendTCPZeroCopy(butil::IOBuf* iobuf, int k, const char* v, + int64_t vlen, bool in_cuda_pinned, + void (*destroy)(void*), void* user_data) { + VLOG(7) << "AppendTCPZeroCopy " + << " k:" << k + << " data:" << static_cast(const_cast(v)) + << " data_size:" << vlen << " in_cuda_pinned:" << in_cuda_pinned; + + iobuf->append(reinterpret_cast(&k), 4); + iobuf->append(reinterpret_cast(&vlen), 8); + + // FIXME(gongwb): use append_zerocopy + /* + if (in_cuda_pinned) { + iobuf->append_zerocopy(v, vlen, IOBufWriter::FreeMemory); + } else { + iobuf->append_zerocopy(v, vlen, nullptr); + } + */ + iobuf->append(v, vlen); + destroy(user_data); + } + +#ifdef PADDLE_WITH_BRPC_RDMA + static void AppendRdmaZeroCopy(const std::string varname, butil::IOBuf* iobuf, + int k, const char* v, int64_t vlen, + bool in_cuda_pinned, void (*destroy)(void*), + void* user_data) { + VLOG(7) << "AppendRdmaZeroCopy varname:" << varname << " k:" << k + << " data:" << static_cast(const_cast(v)) + << " data_size:" << vlen << " in_cuda_pinned:" << in_cuda_pinned; + + iobuf->append(reinterpret_cast(&k), 4); + iobuf->append(reinterpret_cast(&vlen), 8); + + RdmaMemPool::Instance().Register( + varname, static_cast(const_cast(v)), vlen); + + // FIXME(gongwb): use append_zerocopy + // iobuf->append_zerocopy(v, vlen, nullptr); + iobuf->append(v, vlen); + destroy(user_data); + return; + } +#endif + + static void AppendZeroCopy(const std::string varname, butil::IOBuf* iobuf, + int k, const char* v, int64_t vlen, + bool in_cuda_pinned, void (*destroy)(void*), + void* user_data) { + if (vlen >= std::numeric_limits::max() || vlen < 0) { + LOG(FATAL) << "AppendZeroCopy varname:" << varname << ", vlen:" << vlen; + } + +#ifdef PADDLE_WITH_BRPC_RDMA + IOBufWriter::AppendRdmaZeroCopy(varname, iobuf, k, v, vlen, in_cuda_pinned, + destroy, user_data); +#else + IOBufWriter::AppendTCPZeroCopy(iobuf, k, v, vlen, in_cuda_pinned, destroy, + user_data); +#endif + } +}; + +void SerializeToIOBuf(const std::string& name, framework::Variable* var, + const platform::DeviceContext& ctx, VarMsg* request, + butil::IOBuf* iobuf, const std::string& out_varname, + bool var_is_not_stable, int trainer_id, + const std::string& table_name) { + std::unique_ptr payload; + + request->set_varname(name); + request->set_trainer_id(trainer_id); + // Note: normally the profiler is enabled in 1 trainer, hence only + // 1 trainer returns true for ShouldSendProfileState(). It tells PS + // servers the trainer's profiling state so that PS can follow the + // trainer. + if (platform::ShouldSendProfileState()) { + if (platform::IsProfileEnabled()) { + request->set_profile(platform::kEnableProfiler); + } else { + request->set_profile(platform::kDisableProfiler); + } + } + if (!out_varname.empty()) { + request->set_out_varname(out_varname); + } + if (!table_name.empty()) { + request->set_table_name(table_name); + } + if (var->IsType()) { + request->set_type(::sendrecv::LOD_TENSOR); + payload.reset(new TensorPayload(GetTensorPayload(var, ctx, request))); + } else if (var->IsType()) { + request->set_type(::sendrecv::SELECTED_ROWS); + payload.reset(new TensorPayload(GetSelectedRowsPayload(var, ctx, request))); +#ifdef PADDLE_WITH_CUDA + } else if (var->IsType()) { + request->set_type(::sendrecv::NCCL_ID); + const ncclUniqueId& uid = var->Get(); + // TODO(gongwb): use append_zero to avoid data copy. + IOBufWriter::Append(name, iobuf, + sendrecv::VariableMessage::kSerializedFieldNumber, + uid.internal, NCCL_UNIQUE_ID_BYTES); + return; +#endif + } else { + PADDLE_THROW("Serialize does not support type: %s", + typeid(var->Type()).name()); + } + + PADDLE_ENFORCE_NOT_NULL(payload); + + // FIXME(gongwb): it seems that can use zero copy. + if (var_is_not_stable) { + IOBufWriter::Append( + name, iobuf, ::sendrecv::VariableMessage::kSerializedFieldNumber, + static_cast(payload->ptr()), payload->memory_size()); + } else { + if (platform::is_gpu_place(ctx.GetPlace())) { +#ifdef PADDLE_WITH_CUDA + IOBufWriter::AppendZeroCopy( + name, iobuf, ::sendrecv::VariableMessage::kSerializedFieldNumber, + static_cast(payload->ptr()), payload->memory_size(), + true, SerializeDestroyCallback, static_cast(payload.get())); + payload.release(); +#endif + } else { + IOBufWriter::AppendZeroCopy( + name, iobuf, ::sendrecv::VariableMessage::kSerializedFieldNumber, + static_cast(payload->ptr()), payload->memory_size(), + false, SerializeDestroyCallback, static_cast(payload.get())); + payload.release(); + } + } + + if (var->IsType()) { + auto* slr = var->GetMutable(); + PADDLE_ENFORCE(VectorElemName(slr->rows()) == typeid(int64_t).name()); + size_t rows_memory_size = slr->rows().size() * sizeof(int64_t); + + IOBufWriter::Append(name, iobuf, + ::sendrecv::VariableMessage::kRowsFieldNumber, + reinterpret_cast(slr->rows().data()), + static_cast(rows_memory_size)); + } +} + +void DeserializeFromIOBuf(const ::sendrecv::VariableMessage& meta, + const butil::IOBuf& iobuf, + const platform::DeviceContext& ctx, + const framework::Scope* scope, + framework::Variable** var, int* trainer_id) { + operators::distributed::BRPCVariableResponse resp(scope, &ctx); + PADDLE_ENFORCE(resp.Parse(iobuf, meta) == 0, "parse iobuf to tensor error!"); + *var = resp.GetVar(); + *trainer_id = resp.GetTrainerId(); +} + +} // namespace distributed +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/distributed/brpc/brpc_sendrecvop_utils.h b/paddle/fluid/operators/distributed/brpc/brpc_sendrecvop_utils.h new file mode 100644 index 0000000000000000000000000000000000000000..a5bdc331eb29c7c0fe00d7f346025426b51e1cb3 --- /dev/null +++ b/paddle/fluid/operators/distributed/brpc/brpc_sendrecvop_utils.h @@ -0,0 +1,49 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include +#include + +#include "brpc/channel.h" +#include "paddle/fluid/framework/data_type.h" +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/framework/selected_rows.h" +#include "paddle/fluid/framework/tensor_util.h" +#include "paddle/fluid/framework/var_type.h" +#include "paddle/fluid/operators/distributed/distributed_pb.h" +#include "paddle/fluid/operators/distributed/sendrecvop_utils.h" + +namespace paddle { +namespace operators { +namespace distributed { + +void SerializeToIOBuf(const std::string& name, framework::Variable* var, + const platform::DeviceContext& ctx, VarMsg* request, + butil::IOBuf* iobuf, const std::string& out_varname, + bool var_is_not_stable, const int trainer_id = 0, + const std::string& table_name = std::string()); + +void DeserializeFromIOBuf(const VarMsg& meta, const butil::IOBuf& iobuf, + const platform::DeviceContext& ctx, + const framework::Scope* scope, + framework::Variable** var, int* trainer_id); + +} // namespace distributed +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/distributed/brpc/brpc_serde_test.cc b/paddle/fluid/operators/distributed/brpc/brpc_serde_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..b902d3db487789a417ed0e5ffc032e7e06ba43fb --- /dev/null +++ b/paddle/fluid/operators/distributed/brpc/brpc_serde_test.cc @@ -0,0 +1,175 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include // NOLINT + +#include "brpc/channel.h" +#include "google/protobuf/text_format.h" +#include "gtest/gtest.h" +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/tensor_util.h" +#include "paddle/fluid/framework/variable.h" +#include "paddle/fluid/operators/distributed/brpc/brpc_sendrecvop_utils.h" +#include "paddle/fluid/operators/distributed/brpc/brpc_variable_response.h" +#include "paddle/fluid/operators/distributed/sendrecvop_utils.h" +#include "paddle/fluid/operators/distributed/variable_response.h" +#include "paddle/fluid/operators/math/math_function.h" +#include "paddle/fluid/platform/place.h" +#include "paddle/fluid/string/printf.h" + +namespace framework = paddle::framework; +namespace platform = paddle::platform; +namespace operators = paddle::operators; +namespace math = paddle::operators::math; +namespace memory = paddle::memory; + +void RunSerdeTestSelectedRows(platform::Place place) { + platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance(); + auto& ctx = *pool.Get(place); + + butil::IOBuf iobuf; + sendrecv::VariableMessage msg; + int tensor_numel = 564 * 128; + + // serialize var to IOBuf + { + framework::Variable var; + auto* slr = var.GetMutable(); + slr->set_height(1000); + auto* tensor = slr->mutable_value(); + auto* rows = slr->mutable_rows(); + tensor->Resize(framework::make_ddim({564, 128})); + tensor->mutable_data(place); + math::set_constant(ctx, tensor, 32.7); + for (int i = 0; i < 564; ++i) rows->push_back(i); + + operators::distributed::SerializeToIOBuf("myvar", &var, ctx, &msg, &iobuf, + "", false); + } + + // desrialize + { + framework::Scope scope; + scope.Var("myvar"); + operators::distributed::BRPCVariableResponse resp(&scope, &ctx); + EXPECT_EQ(resp.Parse(iobuf, msg), 0); + + framework::Variable* var2 = resp.GetVar(); + + auto* slr2 = var2->GetMutable(); + auto* tensor2 = slr2->mutable_value(); + auto* rows2 = slr2->mutable_rows(); + float* tensor_data2 = nullptr; + framework::Tensor tmp_tensor; + + if (platform::is_gpu_place(ctx.GetPlace())) { + platform::CPUPlace cpu; + framework::TensorCopy(*tensor2, cpu, &tmp_tensor); + tensor_data2 = tmp_tensor.data(); + } else { + tensor_data2 = const_cast(tensor2->data()); + } + const int64_t* rows_data2 = rows2->data(); + + for (int i = 0; i < tensor_numel; ++i) { + EXPECT_FLOAT_EQ(tensor_data2[i], 32.7); + } + for (size_t i = 0; i < rows2->size(); ++i) { + EXPECT_EQ(rows_data2[i], static_cast(i)); + } + EXPECT_EQ(slr2->height(), 1000); + } +} + +void RunTestLodTensor(platform::Place place) { + platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance(); + auto& ctx = *pool.Get(place); + + // serialize var to ByteBuffer + butil::IOBuf iobuf; + sendrecv::VariableMessage msg; + int tensor_numel = 512 * 8 * 4 * 2; + { + framework::Variable var; + auto* tensor = var.GetMutable(); + tensor->Resize(framework::make_ddim({512, 8, 4, 2})); + framework::LoD lod; + lod.push_back(framework::Vector({1, 3, 8})); + tensor->set_lod(lod); + tensor->mutable_data(place); + math::set_constant(ctx, tensor, 31.9); + + operators::distributed::SerializeToIOBuf("myvar", &var, ctx, &msg, &iobuf, + "", false); + } + + // check sendrecv::VariableMessage meta data + { + EXPECT_EQ(msg.varname(), "myvar"); + EXPECT_EQ(msg.type(), 0); + EXPECT_EQ(msg.dims()[0], 512); + EXPECT_EQ(msg.dims()[1], 8); + EXPECT_EQ(msg.dims()[2], 4); + EXPECT_EQ(msg.dims()[3], 2); + EXPECT_EQ(msg.lod_level(), 1); + EXPECT_EQ(msg.lod(0).lod_data(0), 1); + EXPECT_EQ(msg.lod(0).lod_data(1), 3); + EXPECT_EQ(msg.lod(0).lod_data(2), 8); + } + + // deserialize + { + framework::Scope scope; + scope.Var("myvar"); + operators::distributed::BRPCVariableResponse resp(&scope, &ctx); + EXPECT_EQ(resp.Parse(iobuf, msg), 0); + + framework::Variable* var2 = resp.GetVar(); + + auto tensor2 = var2->Get(); + float* tensor_data2 = nullptr; + framework::Tensor tmp_tensor; + + if (platform::is_gpu_place(ctx.GetPlace())) { + platform::CPUPlace cpu; + framework::TensorCopy(tensor2, cpu, &tmp_tensor); + tensor_data2 = tmp_tensor.data(); + } else { + tensor_data2 = const_cast(tensor2.data()); + } + + for (int i = 0; i < tensor_numel; ++i) + EXPECT_FLOAT_EQ(tensor_data2[i], 31.9); + } +} + +TEST(LodTensor, Run) { + platform::CPUPlace place; + RunTestLodTensor(place); +#ifdef PADDLE_WITH_CUDA + platform::CUDAPlace gpu(0); + RunTestLodTensor(gpu); +#endif +} + +TEST(SelectedRows, Run) { + platform::CPUPlace place; + RunSerdeTestSelectedRows(place); +#ifdef PADDLE_WITH_CUDA + platform::CUDAPlace gpu; + RunSerdeTestSelectedRows(gpu); +#endif +} diff --git a/paddle/fluid/operators/distributed/brpc/brpc_server.cc b/paddle/fluid/operators/distributed/brpc/brpc_server.cc new file mode 100644 index 0000000000000000000000000000000000000000..fea9b09414638b607ca7f7d558ce14a2d5bfa03d --- /dev/null +++ b/paddle/fluid/operators/distributed/brpc/brpc_server.cc @@ -0,0 +1,403 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/operators/distributed/brpc/brpc_server.h" +#include "paddle/fluid/framework/threadpool.h" +#include "paddle/fluid/operators/distributed/brpc/brpc_sendrecvop_utils.h" +#include "paddle/fluid/operators/distributed/brpc/brpc_variable_response.h" +#include "paddle/fluid/operators/distributed/request_handler.h" + +namespace sendrecv { + +namespace distributed = paddle::operators::distributed; + +typedef std::unordered_map + HandlerMap; + +class BRPCServiceImpl : public SendRecvService { + public: + explicit BRPCServiceImpl(const HandlerMap& rpc_call_map, + distributed::RPCServer* rpc_server) + : rpc_server_(rpc_server) { + VLOG(3) << "BRPCServiceImpl size: " << rpc_call_map.size(); + auto it = rpc_call_map.find(distributed::kRequestSend); + if (it != rpc_call_map.end()) { + request_send_h_ = it->second; + send_threads_.reset(new paddle::framework::ThreadPool( + rpc_server_->GetThreadNum(distributed::kRequestSend))); + } + + it = rpc_call_map.find(distributed::kRequestGet); + if (it != rpc_call_map.end()) { + request_get_h_ = it->second; + get_threads_.reset(new paddle::framework::ThreadPool( + rpc_server_->GetThreadNum(distributed::kRequestGet))); + } + + it = rpc_call_map.find(distributed::kRequestGetNoBarrier); + if (it != rpc_call_map.end()) { + request_getnobarrier_h_ = it->second; + getnobarrier_threads_.reset(new paddle::framework::ThreadPool( + rpc_server_->GetThreadNum(distributed::kRequestGetNoBarrier))); + } + + it = rpc_call_map.find(distributed::kRequestPrefetch); + if (it != rpc_call_map.end()) { + request_prefetch_h_ = it->second; + prefetch_threads_.reset(new paddle::framework::ThreadPool( + rpc_server_->GetThreadNum(distributed::kRequestPrefetch))); + } + + it = rpc_call_map.find(distributed::kRequestCheckpoint); + if (it != rpc_call_map.end()) { + request_checkpoint_h_ = it->second; + checkpoint_notify_threads_.reset(new paddle::framework::ThreadPool( + rpc_server_->GetThreadNum(distributed::kRequestPrefetch))); + } + + it = rpc_call_map.find(distributed::kRequestGetMonomerVariable); + if (it != rpc_call_map.end()) { + request_get_monomer_handler_h_ = it->second; + } + + it = rpc_call_map.find(distributed::kRequestGetMonomerBarrier); + if (it != rpc_call_map.end()) { + request_get_monomer_barrier_handler_h_ = it->second; + } + } + + virtual ~BRPCServiceImpl() {} + void SendVariable(google::protobuf::RpcController* cntl_butil, + const VariableMessage* request, VoidMessage* response, + google::protobuf::Closure* done) override { + send_threads_->Run( + [=] { _SendVariable(cntl_butil, request, response, done); }); + } + + void _SendVariable(google::protobuf::RpcController* cntl_butil, + const VariableMessage* request, VoidMessage* response, + google::protobuf::Closure* done) { + PADDLE_ENFORCE(request_send_h_ != nullptr, + "RequestSend handler should be registed first!"); + brpc::ClosureGuard done_guard(done); + brpc::Controller* cntl = static_cast(cntl_butil); + + std::string varname = request->varname(); + VLOG(3) << "RequestSend var_name:" << varname + << ", trainer_id:" << request->trainer_id() + << ", from:" << cntl->remote_side(); + + distributed::BRPCVariableResponse resp(request_send_h_->scope(), + request_send_h_->dev_ctx(), + !request_send_h_->sync_mode()); + PADDLE_ENFORCE(resp.Parse(cntl->request_attachment(), *request) == 0, + "parse iobuf to tensor error!"); + + auto scope = resp.GetMutableLocalScope(); + auto invar = resp.GetVar(); + int trainer_id = request->trainer_id(); + paddle::framework::Variable* outvar = nullptr; + + request_send_h_->Handle(varname, scope, invar, &outvar, trainer_id); + } + + void GetVariable(google::protobuf::RpcController* cntl_butil, + const VariableMessage* request, VariableMessage* response, + google::protobuf::Closure* done) override { + get_threads_->Run( + [=] { _GetVariable(cntl_butil, request, response, done); }); + } + + void GetVariableNoBarrier(google::protobuf::RpcController* cntl_butil, + const VariableMessage* request, + VariableMessage* response, + google::protobuf::Closure* done) override { + getnobarrier_threads_->Run( + [=] { _GetVariableNoBarrier(cntl_butil, request, response, done); }); + } + + void _GetVariable(google::protobuf::RpcController* cntl_butil, + const VariableMessage* request, VariableMessage* response, + google::protobuf::Closure* done) { + PADDLE_ENFORCE(request_get_h_ != nullptr, + "RequestGet handler should be registed first!"); + + brpc::ClosureGuard done_guard(done); + brpc::Controller* cntl = static_cast(cntl_butil); + + std::string varname = request->varname(); + std::string out_varname = request->out_varname(); + VLOG(3) << "RequestGet varname:" << varname + << ", out_varname:" << out_varname + << ", trainer_id:" << request->trainer_id() + << ", from:" << cntl->remote_side(); + + auto scope = request_get_h_->scope(); + paddle::framework::Variable* invar = nullptr; + int trainer_id = request->trainer_id(); + paddle::framework::Variable* outvar = nullptr; + + request_get_h_->Handle(varname, scope, invar, &outvar, trainer_id, + out_varname); + + if (outvar) { + distributed::SerializeToIOBuf(out_varname, outvar, + *request_get_h_->dev_ctx(), response, + &cntl->response_attachment(), "", false); + } + } + + void _GetVariableNoBarrier(google::protobuf::RpcController* cntl_butil, + const VariableMessage* request, + VariableMessage* response, + google::protobuf::Closure* done) { + PADDLE_ENFORCE(request_getnobarrier_h_ != nullptr, + "RequestGetNoBarrier handler should be registed first!"); + + brpc::ClosureGuard done_guard(done); + brpc::Controller* cntl = static_cast(cntl_butil); + + std::string varname = request->varname(); + std::string out_varname = request->out_varname(); + int trainer_id = request->trainer_id(); + + VLOG(3) << "RequestGetNoBarrier varname:" << varname + << ", out_varname:" << out_varname << ", trainer_id:" << trainer_id + << ", from:" << cntl->remote_side(); + + auto scope = request_getnobarrier_h_->scope(); + paddle::framework::Variable* invar = nullptr; + paddle::framework::Variable* outvar = nullptr; + + request_getnobarrier_h_->Handle(varname, scope, invar, &outvar, trainer_id, + out_varname); + + if (outvar) { + distributed::SerializeToIOBuf( + out_varname, outvar, *request_getnobarrier_h_->dev_ctx(), response, + &cntl->response_attachment(), "", false); + } + } + + void PrefetchVariable(google::protobuf::RpcController* cntl_butil, + const VariableMessage* request, + VariableMessage* response, + google::protobuf::Closure* done) override { + prefetch_threads_->Run( + [=] { _PrefetchVariable(cntl_butil, request, response, done); }); + } + + void _PrefetchVariable(google::protobuf::RpcController* cntl_butil, + const VariableMessage* request, + VariableMessage* response, + google::protobuf::Closure* done) { + PADDLE_ENFORCE(request_prefetch_h_ != nullptr, + "kRequestPrefetch handler should be registed first!"); + + brpc::ClosureGuard done_guard(done); + brpc::Controller* cntl = static_cast(cntl_butil); + + // prefetch process... + std::string in_var_name = request->varname(); + std::string out_var_name = request->out_varname(); + VLOG(3) << "RequestPrefetch, in_var_name: " << in_var_name + << ", out_var_name: " << out_var_name + << ", trainer_id:" << request->trainer_id() + << ", from:" << cntl->remote_side(); + + distributed::BRPCVariableResponse resp( + request_prefetch_h_->scope(), request_prefetch_h_->dev_ctx(), true); + + PADDLE_ENFORCE(resp.Parse(cntl->request_attachment(), *request) == 0, + "parse iobuf to tensor error!"); + + auto scope = resp.GetMutableLocalScope(); + auto invar = scope->FindVar(in_var_name); + std::string table_name = request->table_name(); + int trainer_id = request->trainer_id(); + paddle::framework::Variable* outvar = scope->Var(out_var_name); + + request_prefetch_h_->Handle(in_var_name, scope, invar, &outvar, trainer_id, + out_var_name, table_name); + + distributed::SerializeToIOBuf(out_var_name, outvar, + *request_prefetch_h_->dev_ctx(), response, + &cntl->response_attachment(), "", true); + } + + void CheckpointNotify(google::protobuf::RpcController* cntl_butil, + const VariableMessage* request, VoidMessage* response, + google::protobuf::Closure* done) override { + checkpoint_notify_threads_->Run( + [=] { _CheckpointNotify(cntl_butil, request, response, done); }); + } + + void _CheckpointNotify(google::protobuf::RpcController* cntl_butil, + const VariableMessage* request, VoidMessage* response, + google::protobuf::Closure* done) { + PADDLE_ENFORCE( + request_checkpoint_h_ != nullptr, + "kRequestCheckpointNotify handler should be registed first!"); + + brpc::ClosureGuard done_guard(done); + brpc::Controller* cntl = static_cast(cntl_butil); + + distributed::BRPCVariableResponse resp(request_checkpoint_h_->scope(), + request_checkpoint_h_->dev_ctx()); + + auto scope = resp.GetMutableLocalScope(); + + std::string checkpoint_notify = request->varname(); + std::string checkpoint_dir = request->out_varname(); + int trainer_id = request->trainer_id(); + + VLOG(4) << "RequestCheckpointNotify notify: " << checkpoint_notify + << ", dir: " << checkpoint_dir + << ", trainer_id:" << request->trainer_id() + << ", from:" << cntl->remote_side(); + + request_checkpoint_h_->Handle(checkpoint_notify, scope, nullptr, nullptr, + trainer_id, checkpoint_dir); + } + + void GetMonomerVariable(google::protobuf::RpcController* cntl_butil, + const VariableMessage* request, + VariableMessage* response, + google::protobuf::Closure* done) override { + PADDLE_ENFORCE( + request_get_monomer_handler_h_ != nullptr, + "kRequestGetMonomerVariable handler should be registed first!"); + + brpc::ClosureGuard done_guard(done); + brpc::Controller* cntl = static_cast(cntl_butil); + + // proc request. + std::string varname = request->varname(); + VLOG(3) << "GetMonomerVariable " << varname + << ", trainer_id:" << request->trainer_id() + << ", from:" << cntl->remote_side(); + + rpc_server_->WaitVarCond(varname); + distributed::MonomerHandle h = rpc_server_->GetMonomer(varname); + + auto scope = h.scope_; + auto invar = scope->FindVar(varname); + paddle::framework::Variable* outvar = nullptr; + + request_get_monomer_handler_h_->Handle(varname, scope, invar, &outvar, + request->trainer_id()); + + if (outvar) { + distributed::SerializeToIOBuf(varname, outvar, *h.dev_ctx_, response, + &cntl->response_attachment(), "", false); + } + } + + void GetMonomerBarrier(google::protobuf::RpcController* cntl_butil, + const VariableMessage* request, VoidMessage* response, + google::protobuf::Closure* done) override { + PADDLE_ENFORCE( + request_get_monomer_barrier_handler_h_ != nullptr, + "RequestGetMonomerBarrier handler should be registed first!"); + + brpc::ClosureGuard done_guard(done); + brpc::Controller* cntl = static_cast(cntl_butil); + + std::string varname = request->varname(); + VLOG(3) << "RequestGetMonomerBarrier var_name:" << varname + << ", trainer_id:" << request->trainer_id() + << ", from:" << cntl->remote_side(); + + rpc_server_->WaitVarCond(varname); + distributed::MonomerHandle h = rpc_server_->GetMonomer(varname); + + paddle::framework::Scope* scope = nullptr; + paddle::framework::Variable* invar = nullptr; + paddle::framework::Variable* outvar = nullptr; + + request_get_monomer_barrier_handler_h_->Handle( + varname, scope, invar, &outvar, request->trainer_id()); + } + + private: + distributed::RequestHandler* request_send_h_{nullptr}; + distributed::RequestHandler* request_get_h_{nullptr}; + distributed::RequestHandler* request_getnobarrier_h_{nullptr}; + distributed::RequestHandler* request_prefetch_h_{nullptr}; + distributed::RequestHandler* request_checkpoint_h_{nullptr}; + distributed::RequestHandler* request_get_monomer_handler_h_{nullptr}; + distributed::RequestHandler* request_get_monomer_barrier_handler_h_{nullptr}; + + distributed::RPCServer* rpc_server_{nullptr}; + + // FIXME(gongwb): brpc should support process one rpc use one threadpool. + std::unique_ptr send_threads_; + std::unique_ptr get_threads_; + std::unique_ptr getnobarrier_threads_; + std::unique_ptr prefetch_threads_; + std::unique_ptr checkpoint_notify_threads_; +}; +} // namespace sendrecv + +namespace paddle { +namespace operators { +namespace distributed { + +void AsyncBRPCServer::StartServer() { + // Instance of your service. + sendrecv::BRPCServiceImpl service_impl(rpc_call_map_, this); + + // Add the service into server. Notice the second parameter, because the + // service is put on stack, we don't want server to delete it, otherwise + // use brpc::SERVER_OWNS_SERVICE. + if (server_.AddService(&service_impl, brpc::SERVER_DOESNT_OWN_SERVICE) != 0) { + LOG(FATAL) << "Fail to add service"; + return; + } + + brpc::ServerOptions options; +#ifdef PADDLE_WITH_BRPC_RDMA + options.use_rdma = true; +#endif + options.idle_timeout_sec = idle_timeout_s_; + options.max_concurrency = max_concurrency_; + if (server_.Start(bind_address_.c_str(), &options) != 0) { + LOG(FATAL) << "Fail to start EchoServer" << bind_address_; + return; + } + + butil::EndPoint ep = server_.listen_address(); + selected_port_ = ep.port; + + { + std::lock_guard lock(this->mutex_ready_); + ready_ = 1; + } + condition_ready_.notify_all(); + + server_.Join(); +} + +void AsyncBRPCServer::ShutDownImpl() { server_.Stop(1000); } + +void AsyncBRPCServer::WaitServerReady() { + VLOG(3) << "AsyncGRPCServer is wait server ready"; + std::unique_lock lock(this->mutex_ready_); + condition_ready_.wait(lock, [=] { return this->ready_ == 1; }); + VLOG(3) << "AsyncGRPCServer WaitSeverReady"; +} + +}; // namespace distributed +}; // namespace operators +}; // namespace paddle diff --git a/paddle/fluid/operators/distributed/brpc/brpc_server.h b/paddle/fluid/operators/distributed/brpc/brpc_server.h new file mode 100644 index 0000000000000000000000000000000000000000..78bbe5adc0813d7cf29963c78947d52bcaea9643 --- /dev/null +++ b/paddle/fluid/operators/distributed/brpc/brpc_server.h @@ -0,0 +1,53 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include // NOLINT +#include // NOLINT +#include + +#include "brpc/server.h" +#include "paddle/fluid/operators/distributed/distributed_pb.h" +#include "paddle/fluid/operators/distributed/rpc_server.h" + +namespace paddle { +namespace operators { +namespace distributed { + +class AsyncBRPCServer final : public RPCServer { + public: + explicit AsyncBRPCServer(const std::string& address, int client_num) + : RPCServer(address, client_num), ready_(0) {} + + virtual ~AsyncBRPCServer() {} + void StartServer() override; + void WaitServerReady() override; + + private: + void ShutDownImpl() override; + + brpc::Server server_; + + static constexpr int idle_timeout_s_ = -1; + static constexpr int max_concurrency_ = 0; + + std::mutex mutex_ready_; + std::condition_variable condition_ready_; + int ready_; +}; + +}; // namespace distributed +}; // namespace operators +}; // namespace paddle diff --git a/paddle/fluid/operators/distributed/brpc/brpc_variable_response.cc b/paddle/fluid/operators/distributed/brpc/brpc_variable_response.cc new file mode 100644 index 0000000000000000000000000000000000000000..eb78917ad2d8b49f1b2d1f8dfb2cbca8a9a9610d --- /dev/null +++ b/paddle/fluid/operators/distributed/brpc/brpc_variable_response.cc @@ -0,0 +1,73 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. +// + +#include "paddle/fluid/operators/distributed/brpc/brpc_variable_response.h" +#include "paddle/fluid/operators/distributed/send_recv.pb.h" + +namespace paddle { +namespace operators { +namespace distributed { + +namespace pb = ::google::protobuf; +using vr = ::sendrecv::VariableMessage; + +int BRPCVariableResponse::Parse(Source* source) { + pb::io::ZeroCopyInputStream* input_stream = source->contents(); + pb::io::CodedInputStream input(input_stream); + input.SetTotalBytesLimit(INT_MAX, INT_MAX); + + while (1) { + unsigned int tag = 0; + if (!input.ReadLittleEndian32(&tag)) { + break; + } + + uint64_t num_bytes = 0; + if (!input.ReadLittleEndian64(&num_bytes)) { + break; + } + + int field = static_cast(tag); + int ret = field == 0 ? -1 : field; + switch (field) { + case vr::kSerializedFieldNumber: { + if (!ProcSerializedField(field, &input, num_bytes)) { + return ret; + } + break; + } + case vr::kRowsFieldNumber: { + PADDLE_ENFORCE((meta_.type() == sendrecv::SELECTED_ROWS || + meta_.type() == sendrecv::LOD_TENSOR) && + meta_.varname() != "", + "meta info should be got first!"); + + if (!CopySelectRowsData(&input, *dev_ctx_, num_bytes)) { + return ret; + } + break; + } + default: { + PADDLE_ENFORCE(false, "not surpported %u fieldnumber", field); + return ret; + } + } + } + + return 0; +} +} // namespace distributed +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/distributed/brpc/brpc_variable_response.h b/paddle/fluid/operators/distributed/brpc/brpc_variable_response.h new file mode 100644 index 0000000000000000000000000000000000000000..6282f08a725367f74dbcf1fa6a2ad49469d64725 --- /dev/null +++ b/paddle/fluid/operators/distributed/brpc/brpc_variable_response.h @@ -0,0 +1,67 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include + +#include "brpc/channel.h" +#include "paddle/fluid/framework/data_type.h" +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/framework/selected_rows.h" +#include "paddle/fluid/framework/var_type.h" + +#include "paddle/fluid/operators/distributed/distributed_pb.h" + +#include "google/protobuf/io/coded_stream.h" +#include "google/protobuf/io/zero_copy_stream.h" +#include "paddle/fluid/framework/tensor.h" +#include "paddle/fluid/operators/distributed/variable_response.h" + +namespace paddle { +namespace operators { +namespace distributed { + +class BRPCSourceWrapper : public Source { + public: + explicit BRPCSourceWrapper(const butil::IOBuf& iobuf) : source_(iobuf) {} + ::google::protobuf::io::ZeroCopyInputStream* contents() override { + return &source_; + } + + private: + butil::IOBufAsZeroCopyInputStream source_; +}; + +class BRPCVariableResponse : public VariableResponse { + public: + BRPCVariableResponse(const framework::Scope* scope, + const platform::DeviceContext* dev_ctx, + bool create_scope = false) + : VariableResponse(scope, dev_ctx, create_scope) {} + + virtual ~BRPCVariableResponse() {} + + // parse attachment from iobuf + int Parse(Source* source) override; + int Parse(const butil::IOBuf& iobuf, const sendrecv::VariableMessage& meta) { + BRPCSourceWrapper wrapper(iobuf); + return VariableResponse::Parse(&wrapper, meta); + } +}; + +}; // namespace distributed +}; // namespace operators +}; // namespace paddle diff --git a/paddle/fluid/operators/distributed/brpc_client.cc b/paddle/fluid/operators/distributed/brpc_client.cc deleted file mode 100644 index b394c678fb6503eb73a1e11e6feb814251e9e940..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/distributed/brpc_client.cc +++ /dev/null @@ -1,180 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include "paddle/fluid/operators/distributed/brpc_client.h" -#include "paddle/fluid/framework/threadpool.h" - -namespace paddle { -namespace operators { -namespace distributed { - -DEFINE_int32(brpc_channel_num, 24, - "Number of channels to send requests connected to one server"); -DEFINE_int32(timeout_ms, 30000, "RPC timeout in milliseconds"); -DEFINE_int32(max_retry, 3, "Max retries(not including the first RPC)"); - -BRPCClient::~BRPCClient() { Wait(); } - -void HandleSendResponse(brpc::Controller* cntl, - sendrecv::VoidMessage* response) { - // std::unique_ptr makes sure cntl/response will be deleted before returning. - std::unique_ptr cntl_guard(cntl); - std::unique_ptr response_guard(response); - - if (cntl->Failed()) { - LOG(WARNING) << "Fail to send EchoRequest, " << cntl->ErrorText(); - return; - } - LOG(INFO) << "Received response from " << cntl->remote_side() - << " latency=" << cntl->latency_us() << "us"; -} - -bool BRPCClient::AsyncSendVar(const std::string& ep, - const platform::DeviceContext& ctx, - const framework::Scope& scope, - const std::string& var_name, int64_t time_out) { - const platform::DeviceContext* p_ctx = &ctx; - const std::string ep_val = ep; - const std::string var_name_val = var_name; - const framework::Scope* p_scope = &scope; - const auto ch_ptr = GetChannel(ep_val); - - framework::AsyncIO( - [var_name_val, p_ctx, ep_val, p_scope, time_out, ch_ptr, this] { - auto ch_ctx = ch_ptr->Pop(); - brpc::Controller* cntl = new brpc::Controller(); - sendrecv::VoidMessage* response = new sendrecv::VoidMessage(); - cntl->set_timeout_ms(time_out); - - google::protobuf::Closure* done = - brpc::NewCallback(&HandleSendResponse, cntl, response); - - sendrecv::VariableMessage request; - ch_ctx->stub->SendVariable(cntl, &request, response, done); - }); - req_count_++; - - return true; -} - -void HandleGetResponse(brpc::Controller* cntl, - sendrecv::VariableMessage* response) { - // std::unique_ptr makes sure cntl/response will be deleted before returning. - std::unique_ptr cntl_guard(cntl); - std::unique_ptr response_guard(response); - - if (cntl->Failed()) { - LOG(WARNING) << "Fail to send EchoRequest, " << cntl->ErrorText(); - return; - } - LOG(INFO) << "Received response from " << cntl->remote_side() - << " latency=" << cntl->latency_us() << "us"; - - // framework::Variable* outvar = nullptr; - // DeserializeFromByteBuffer(ret_msg, *var_h.ctx, var_h.scope, &outvar); -} - -bool BRPCClient::AsyncGetVar(const std::string& ep, - const platform::DeviceContext& ctx, - const framework::Scope& scope, - const std::string& var_name, int64_t time_out) { - const platform::DeviceContext* p_ctx = &ctx; - const std::string ep_val = ep; - const std::string var_name_val = var_name; - const framework::Scope* p_scope = &scope; - const auto ch = GetChannel(ep_val); - - framework::AsyncIO( - [var_name_val, ep_val, p_scope, p_ctx, time_out, ch, this] {}); - - req_count_++; - - return true; -} - -bool BRPCClient::AsyncPrefetchVar(const std::string& ep, - const platform::DeviceContext& ctx, - const framework::Scope& scope, - const std::string& in_var_name, - const std::string& out_var_name, - int64_t time_out) { - const platform::DeviceContext* p_ctx = &ctx; - const std::string ep_val = ep; - const std::string in_var_name_val = in_var_name; - const std::string out_var_name_val = out_var_name; - const framework::Scope* p_scope = &scope; - const auto ch = GetChannel(ep_val); - - framework::AsyncIO([in_var_name_val, out_var_name_val, ep_val, p_scope, p_ctx, - time_out, ch, this] {}); - - req_count_++; - return true; -} - -void BRPCClient::AsyncSendBatchBarrier(const std::string& ep, - int64_t time_out) { - req_count_++; -} - -void BRPCClient::AsyncSendFetchBarrier(const std::string& ep, - int64_t time_out) { - req_count_++; -} - -void BRPCClient::Wait() { - std::unique_lock lk(sync_mutex_); - sync_cond_.wait(lk, [this] { return req_count_ == 0; }); -} - -ChannelQueuePtr BRPCClient::GetChannel(const std::string& ep) { - { - std::lock_guard guard(chan_mutex_); - auto it = channels_.find(ep); - if (it != channels_.end()) { - return it->second; - } - } - - ChannelQueuePtr q(new framework::BlockingQueue()); - - brpc::ChannelOptions options; - options.protocol = "baidu_std"; - options.connection_type = "pooled"; - options.connect_timeout_ms = 100; - options.timeout_ms = FLAGS_timeout_ms /*milliseconds*/; - options.max_retry = FLAGS_max_retry; - for (int i = 0; i < FLAGS_brpc_channel_num; ++i) { - std::shared_ptr c(new ChannelContext()); - if (c->channel.Init(ep.c_str(), &options) != 0) { - LOG(ERROR) << "Fail to initialize channel"; - return nullptr; - } - - c->stub.reset(new sendrecv::SendRecvService_Stub( - static_cast(&c->channel))); - q->Push(c); - } - - { - std::lock_guard guard(chan_mutex_); - channels_[ep] = q; - } - - return q; -} - -} // namespace distributed -} // namespace operators -} // namespace paddle diff --git a/paddle/fluid/operators/distributed/brpc_client.h b/paddle/fluid/operators/distributed/brpc_client.h deleted file mode 100644 index 8ff1f0a6076b3574c42065edcbac50eb75b3b483..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/distributed/brpc_client.h +++ /dev/null @@ -1,98 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#pragma once - -#include - -#include // NOLINT -#include -#include -#include -#include -#include // NOLINT -#include -#include - -#include "brpc/channel.h" -#include "paddle/fluid/framework/blocking_queue.h" -#include "paddle/fluid/framework/data_type.h" -#include "paddle/fluid/framework/lod_tensor.h" -#include "paddle/fluid/framework/scope.h" -#include "paddle/fluid/framework/selected_rows.h" -#include "paddle/fluid/operators/distributed/rpc_client.h" -#include "paddle/fluid/operators/distributed/send_recv.pb.h" -#include "paddle/fluid/platform/macros.h" // for DISABLE_COPY_AND_ASSIGN - -namespace paddle { -namespace operators { -namespace distributed { - -struct ChannelContext { - brpc::Channel channel; - std::shared_ptr stub; -}; - -typedef std::shared_ptr ChannelContextPtr; -typedef std::shared_ptr> - ChannelQueuePtr; - -class BRPCClient : public RPCClient { - public: - BRPCClient() {} - virtual ~BRPCClient(); - - bool AsyncSendVar(const std::string& ep, const platform::DeviceContext& ctx, - const framework::Scope& scope, const std::string& var_name, - int64_t time_out = FLAGS_rpc_deadline) override; - - bool AsyncGetVar(const std::string& ep, const platform::DeviceContext& ctx, - const framework::Scope& scope, const std::string& var_name, - int64_t time_out = FLAGS_rpc_deadline) override; - - bool AsyncPrefetchVar(const std::string& ep, - const platform::DeviceContext& ctx, - const framework::Scope& scope, - const std::string& in_var_name, - const std::string& out_var_name, - int64_t time_out = FLAGS_rpc_deadline) override; - - void AsyncSendBatchBarrier(const std::string& ep, - int64_t time_out = FLAGS_rpc_deadline) override; - - void AsyncSendFetchBarrier(const std::string& ep, - int64_t time_out = FLAGS_rpc_deadline) override; - - void Wait() override; - - private: - void Proceed(); - ChannelQueuePtr GetChannel(const std::string& ep); - - private: - std::unordered_map channels_; - - // mutex for Wait client sync - std::mutex sync_mutex_; - std::condition_variable sync_cond_; - std::atomic req_count_{0}; - - // mutex for GetChannel thread safety - std::mutex chan_mutex_; - DISABLE_COPY_AND_ASSIGN(BRPCClient); -}; - -} // namespace distributed -} // namespace operators -} // namespace paddle diff --git a/paddle/fluid/operators/distributed/brpc_server.cc b/paddle/fluid/operators/distributed/brpc_server.cc deleted file mode 100644 index 862167f02084cfe81db1c0936bbfb0415fa85721..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/distributed/brpc_server.cc +++ /dev/null @@ -1,144 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include "paddle/fluid/operators/distributed/brpc_server.h" -#include "paddle/fluid/operators/distributed/request_handler.h" - -namespace sendrecv { - -typedef std::unordered_map - HandlerMap; - -class BRPCServiceImpl : public SendRecvService { - public: - explicit BRPCServiceImpl(const HandlerMap& rpc_call_map) - : request_send_h_(nullptr), - request_get_h_(nullptr), - request_prefetch_h_(nullptr) { - auto it = rpc_call_map.find(paddle::operators::distributed::kRequestSend); - if (it != rpc_call_map.end()) { - request_send_h_ = it->second; - } - - it = rpc_call_map.find(paddle::operators::distributed::kRequestSend); - if (it != rpc_call_map.end()) { - request_get_h_ = it->second; - } - - it = rpc_call_map.find(paddle::operators::distributed::kRequestPrefetch); - if (it != rpc_call_map.end()) { - request_prefetch_h_ = it->second; - } - } - - virtual ~BRPCServiceImpl() {} - - void SendVariable(google::protobuf::RpcController* cntl_butil, - const VariableMessage* request, VoidMessage* response, - google::protobuf::Closure* done) override { - PADDLE_ENFORCE(request_send_h_ != nullptr, - "RequestSend handler should be registed first!"); - brpc::ClosureGuard done_guard(done); - - paddle::framework::Scope* local_scope = request_send_h_->scope(); - paddle::framework::Variable* outvar = nullptr; - paddle::framework::Variable* invar = nullptr; - - std::string varname = request->varname(); - - if (!request_send_h_->sync_mode()) { - local_scope = &request_send_h_->scope()->NewScope(); - invar = local_scope->Var(varname); - } else { - invar = local_scope->FindVar(varname); - } - - request_send_h_->Handle(varname, local_scope, invar, &outvar); - - if (!request_send_h_->sync_mode()) { - request_send_h_->scope()->DeleteScope(local_scope); - } - } - - void GetVariable(google::protobuf::RpcController* cntl_butil, - const VariableMessage* request, VariableMessage* response, - google::protobuf::Closure* done) override { - PADDLE_ENFORCE(request_get_h_ != nullptr, - "RequestGet handler should be registed first!"); - } - - void PrefetchVariable(google::protobuf::RpcController* cntl_butil, - const VariableMessage* request, - VariableMessage* response, - google::protobuf::Closure* done) override { - PADDLE_ENFORCE(request_prefetch_h_ != nullptr, - "kRequestPrefetch handler should be registed first!"); - } - - private: - paddle::operators::distributed::RequestHandler* request_send_h_; - paddle::operators::distributed::RequestHandler* request_get_h_; - paddle::operators::distributed::RequestHandler* request_prefetch_h_; -}; -} // namespace sendrecv - -namespace paddle { -namespace operators { -namespace distributed { - -void AsyncBRPCServer::StartServer() { - // Instance of your service. - sendrecv::BRPCServiceImpl service_impl(rpc_call_map_); - - // Add the service into server. Notice the second parameter, because the - // service is put on stack, we don't want server to delete it, otherwise - // use brpc::SERVER_OWNS_SERVICE. - if (server_.AddService(&service_impl, brpc::SERVER_DOESNT_OWN_SERVICE) != 0) { - LOG(FATAL) << "Fail to add service"; - return; - } - - brpc::ServerOptions options; - options.idle_timeout_sec = idle_timeout_s_; - options.max_concurrency = max_concurrency_; - if (server_.Start(bind_address_.c_str(), &options) != 0) { - LOG(FATAL) << "Fail to start EchoServer" << bind_address_; - return; - } - - butil::EndPoint ep = server_.listen_address(); - selected_port_ = ep.port; - - { - std::lock_guard lock(this->mutex_ready_); - ready_ = 1; - } - condition_ready_.notify_all(); - - server_.Join(); -} - -void AsyncBRPCServer::ShutDownImpl() { server_.Stop(1000); } - -void AsyncBRPCServer::WaitServerReady() { - VLOG(3) << "AsyncGRPCServer is wait server ready"; - std::unique_lock lock(this->mutex_ready_); - condition_ready_.wait(lock, [=] { return this->ready_ == 1; }); - VLOG(3) << "AsyncGRPCServer WaitSeverReady"; -} - -}; // namespace distributed -}; // namespace operators -}; // namespace paddle diff --git a/paddle/fluid/operators/distributed/brpc_server.h b/paddle/fluid/operators/distributed/brpc_server.h deleted file mode 100644 index 85a7ad0dfe843dad483d43631b69a79d75211ce9..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/distributed/brpc_server.h +++ /dev/null @@ -1,53 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#pragma once - -#include // NOLINT -#include // NOLINT -#include - -#include "brpc/server.h" -#include "paddle/fluid/operators/distributed/rpc_server.h" -#include "paddle/fluid/operators/distributed/send_recv.pb.h" - -namespace paddle { -namespace operators { -namespace distributed { - -class AsyncBRPCServer final : public RPCServer { - public: - explicit AsyncBRPCServer(const std::string& address, int client_num) - : RPCServer(address, client_num), ready_(0) {} - - virtual ~AsyncBRPCServer() {} - void StartServer() override; - void WaitServerReady() override; - - private: - void ShutDownImpl() override; - - brpc::Server server_; - - static constexpr int idle_timeout_s_ = -1; - static constexpr int max_concurrency_ = 0; - - std::mutex mutex_ready_; - std::condition_variable condition_ready_; - int ready_; -}; - -}; // namespace distributed -}; // namespace operators -}; // namespace paddle diff --git a/paddle/fluid/operators/distributed/collective_client.cc b/paddle/fluid/operators/distributed/collective_client.cc new file mode 100644 index 0000000000000000000000000000000000000000..6d3f53431113621fc859eda8e7448383772d20a3 --- /dev/null +++ b/paddle/fluid/operators/distributed/collective_client.cc @@ -0,0 +1,59 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include // NOLINT +#include +#include "gflags/gflags.h" + +#include "paddle/fluid/operators/distributed/collective_client.h" + +DECLARE_int32(rpc_deadline); + +namespace paddle { +namespace operators { +namespace distributed { +std::once_flag CollectiveClient::init_flag_; +std::unique_ptr CollectiveClient::client_(nullptr); + +bool CollectiveClient::Gather(const std::vector& remote_vars, + std::vector* dst, + const platform::DeviceContext& ctx, + framework::Scope* scope, int64_t time_out) { + for (auto r : remote_vars) { + VLOG(50) << "begin gather from ep:" << r.String(); + scope->Var(r.var_name_)->GetMutable(); + VarHandlePtr ptr = rpc_client_->AsyncGetMonomerVariable( + r.ep_, ctx, *scope, r.var_name_, time_out); + } + + rpc_client_->Wait(); + + for (auto r : remote_vars) { + auto select_rows = + scope->FindVar(r.var_name_)->GetMutable(); + dst->push_back(select_rows); + + VLOG(4) << "gather from ep:" << r.String() + << ", select_rows:" << GetSelectedRowsInfo(*select_rows); + + rpc_client_->AsyncGetMonomerBarrier(r.ep_, r.var_name_); + } + + rpc_client_->Wait(); + return true; +} + +} // namespace distributed +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/distributed/collective_client.h b/paddle/fluid/operators/distributed/collective_client.h new file mode 100644 index 0000000000000000000000000000000000000000..6a3a450a1fd2e52c341f824f4816ca13784bda85 --- /dev/null +++ b/paddle/fluid/operators/distributed/collective_client.h @@ -0,0 +1,93 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include // NOLINT +#include +#include +#include "gflags/gflags.h" + +#include "paddle/fluid/framework/data_type.h" +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/operators/distributed/distributed.h" +#include "paddle/fluid/operators/distributed/request_handler.h" + +DECLARE_int32(rpc_deadline); + +namespace paddle { +namespace operators { +namespace distributed { + +inline std::string GetSelectedRowsInfo(const framework::SelectedRows& slr) { + std::stringstream ss; + ss << ", height:" << slr.height() << ", rows:["; + for (unsigned int i = 0; i < slr.rows().size(); i++) { + if (i != slr.rows().size() - 1) { + ss << slr.rows()[i] << ","; + } else { + ss << slr.rows()[i]; + } + } + ss << "], dims:" << slr.value().dims(); + return ss.str(); +} + +struct RemoteVar { + std::string ep_; + std::string var_name_; + int trainer_id_{0}; + + std::string String() { + std::stringstream ss; + ss << "ep:" << ep_ << ", var_name:" << var_name_ + << ", trainer_id:" << trainer_id_; + + return ss.str(); + } +}; + +class CollectiveClient { + public: + CollectiveClient() { + rpc_client_.reset(new RPCCLIENT_T()); + rpc_client_->InitImpl(); + } + virtual ~CollectiveClient() {} + + // note this function will retain the rank order. + bool Gather(const std::vector& remote_vars, + std::vector* dst, + const platform::DeviceContext& ctx, framework::Scope* scope, + int64_t time_out = FLAGS_rpc_deadline); + + static CollectiveClient* GetInstance() { + std::call_once(init_flag_, [&]() { + if (client_.get() == nullptr) { + client_.reset(new CollectiveClient()); + } + }); + return client_.get(); + } + + private: + std::unique_ptr rpc_client_; + + static std::once_flag init_flag_; + static std::unique_ptr client_; +}; +} // namespace distributed +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/distributed/collective_server.cc b/paddle/fluid/operators/distributed/collective_server.cc new file mode 100644 index 0000000000000000000000000000000000000000..c95652400c27acd406ca3f70a0dfa8d329e94358 --- /dev/null +++ b/paddle/fluid/operators/distributed/collective_server.cc @@ -0,0 +1,74 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include // for removing the port file +#include +#include +#include +#include // NOLINT +#include + +#include "paddle/fluid/operators/distributed/collective_server.h" + +DEFINE_int32(collective_get_thread_num, 5, "number of threads for rpc get"); + +namespace paddle { +namespace operators { +namespace distributed { + +std::once_flag CollectiveServer::init_flag_; +std::shared_ptr CollectiveServer::collective_server_(nullptr); + +CollectiveServer::CollectiveServer(const std::string& end_point, int fan_in) { + VLOG(1) << "Create colllective server:" << end_point << ", fan_in:" << fan_in; + rpc_server_.reset(new RPCSERVER_T(end_point, fan_in)); +} + +void CollectiveServer::Stop() { + rpc_server_->ShutDown(); + server_thread_->join(); + loop_thread_->join(); +} + +void CollectiveServer::StartServer() { + get_monomer_handler_.reset(new GetMonomerHandler()); + get_monomer_handler_->SetRPCServer(rpc_server_.get()); + + get_barrier_handler_.reset(new GetMonomerBarrierHandler()); + get_barrier_handler_->SetRPCServer(rpc_server_.get()); + + rpc_server_->RegisterRPC(distributed::kRequestGetMonomerVariable, + get_monomer_handler_.get(), + FLAGS_collective_get_thread_num); + rpc_server_->RegisterRPC(distributed::kRequestGetMonomerBarrier, + get_barrier_handler_.get(), 1); + + server_thread_.reset(new std::thread([&]() { rpc_server_->StartServer(); })); + rpc_server_->WaitServerReady(); + + loop_thread_.reset(new std::thread([&]() { + while (true) { + if (rpc_server_->IsExit()) { + LOG(WARNING) << "get exit!rpc_processor break!"; + break; + } + sleep(1); + } + VLOG(1) << "CollectiveServer loop_thread end"; + })); +} + +}; // namespace distributed +}; // namespace operators +}; // namespace paddle diff --git a/paddle/fluid/operators/distributed/collective_server.h b/paddle/fluid/operators/distributed/collective_server.h new file mode 100644 index 0000000000000000000000000000000000000000..03c688a78e1cbaba4afe4585e619956188a767a1 --- /dev/null +++ b/paddle/fluid/operators/distributed/collective_server.h @@ -0,0 +1,110 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include +#include // NOLINT +#include +#include + +#include "gflags/gflags.h" + +#include "paddle/fluid/operators/distributed/distributed.h" +#include "paddle/fluid/operators/distributed/request_handler.h" +#include "paddle/fluid/operators/distributed/request_handler_impl.h" +#include "paddle/fluid/operators/distributed/rpc_server.h" + +namespace paddle { +namespace operators { +namespace distributed { + +class CollectiveServer; + +class GetMonomerHandler final : public RequestHandler { + public: + GetMonomerHandler() : RequestHandler(true) {} + virtual ~GetMonomerHandler() {} + bool Handle(const std::string& var_name, framework::Scope* scope, + framework::Variable* var, framework::Variable** outvar, + const int trainer_id, const std::string& out_var_name = "", + const std::string& table_name = "") override { + VLOG(50) << "GetMonomerHandler recv " << var_name; + + *outvar = scope->FindVar(var_name); + PADDLE_ENFORCE(outvar != nullptr, "%s not found", var_name); + + return true; + } +}; + +class GetMonomerBarrierHandler final : public RequestHandler { + public: + GetMonomerBarrierHandler() : RequestHandler(true) {} + virtual ~GetMonomerBarrierHandler() {} + bool Handle(const std::string& var_name, framework::Scope* scope, + framework::Variable* var, framework::Variable** outvar, + const int trainer_id, const std::string& out_var_name = "", + const std::string& table_name = "") override { + VLOG(50) << "GetMonomerHandler recv " << var_name; + + rpc_server_->IncreaseVarBarrier(var_name); + + return true; + } +}; + +class CollectiveServer final { + public: + explicit CollectiveServer(const std::string& end_point, int fan_in); + + virtual ~CollectiveServer() {} + + void StartServer(); + + static CollectiveServer* GetInstance(const std::string& end_point, + int fan_in) { + std::call_once(init_flag_, [&]() { + if (collective_server_.get() == nullptr) { + collective_server_.reset(new CollectiveServer(end_point, fan_in)); + collective_server_->StartServer(); + } + }); + + return collective_server_.get(); + } + + std::shared_ptr GetRPCServer() { return rpc_server_; } + + void Stop(); + + private: + std::unique_ptr get_monomer_handler_; + std::unique_ptr get_barrier_handler_; + + std::shared_ptr rpc_server_; + std::shared_ptr server_thread_; + std::shared_ptr loop_thread_; + + bool ready_{false}; + + static std::once_flag init_flag_; + static std::shared_ptr collective_server_; +}; + +}; // namespace distributed +}; // namespace operators +}; // namespace paddle diff --git a/paddle/fluid/operators/distributed/collective_server_test.cc b/paddle/fluid/operators/distributed/collective_server_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..90f2f9fd65bf1b8c1edda6a2ebe0ce5288ddcb5d --- /dev/null +++ b/paddle/fluid/operators/distributed/collective_server_test.cc @@ -0,0 +1,130 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include // NOLINT + +#include "gtest/gtest.h" +#include "paddle/fluid/framework/block_desc.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/framework/tensor_util.h" + +#include "paddle/fluid/operators/distributed/collective_client.h" +#include "paddle/fluid/operators/distributed/collective_server.h" +#include "paddle/fluid/operators/distributed/distributed.h" +#include "paddle/fluid/operators/distributed/request_handler_impl.h" +#include "paddle/fluid/operators/math/math_function.h" + +namespace framework = paddle::framework; +namespace platform = paddle::platform; +namespace distributed = paddle::operators::distributed; + +std::unique_ptr StartServer( + const std::string& ep, int fan_in, framework::Scope* scope, + platform::DeviceContext* dev_ctx) { + distributed::CollectiveServer* server = + distributed::CollectiveServer::GetInstance(ep, fan_in); + + auto rpc_server = server->GetRPCServer(); + rpc_server->RegisterVar("var1", distributed::kRequestGetMonomerVariable, + scope, dev_ctx); + + std::cout << "StartServer return" << std::endl; + return std::unique_ptr(server); +} + +std::unique_ptr GenerateVars(platform::Place place) { + platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance(); + auto& ctx = *pool.Get(place); + + framework::Scope* scope = new framework::Scope(); + framework::Variable* var = scope->Var("var1"); + auto* slr = var->GetMutable(); + slr->set_height(20000); + + auto* tensor = slr->mutable_value(); + auto* rows = slr->mutable_rows(); + + tensor->Resize(framework::make_ddim({3, 1024})); + tensor->mutable_data(place); + + paddle::operators::math::set_constant(ctx, tensor, 32.7); + for (int i = 0; i < 3; ++i) rows->push_back(i); + + std::cout << "src:" << distributed::GetSelectedRowsInfo(*slr); + + return std::unique_ptr(scope); +} + +void Gather(const std::vector& vars, + platform::DeviceContext* dev_ctx) { + distributed::CollectiveClient* client = + distributed::CollectiveClient::GetInstance(); + + framework::Scope* scope = new framework::Scope(); + framework::Variable* var = scope->Var("var1"); + var->GetMutable(); + + std::vector dst; + client->Gather(vars, &dst, *dev_ctx, scope); + std::cout << "dst:" << distributed::GetSelectedRowsInfo(*dst[0]); + dev_ctx->Wait(); + + ASSERT_EQ(dst[0]->value().dims(), framework::make_ddim({3, 1024})); + ASSERT_EQ(dst[0]->height(), 20000); + ASSERT_EQ(dst[0]->rows().size(), static_cast(3)); + for (int i = 0; i < 3; i++) { + ASSERT_EQ(dst[0]->rows()[i], i); + } + + std::vector vec; + TensorToVector(dst[0]->value(), *dev_ctx, &vec); + for (size_t i = 0; i < 3 * 1024; i++) { + ASSERT_FLOAT_EQ(vec[i], 32.7); + } +} + +TEST(CollectiveServer, GPU) { + platform::CUDAPlace place; + platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance(); + auto& ctx = *pool.Get(place); + + std::string ep = "127.0.0.1:7164"; + auto scope = GenerateVars(place); + + auto* v1 = scope->FindVar("var1"); + std::cout << "var1:" << v1 << std::endl; + + auto server = StartServer(ep, 2, scope.get(), &ctx); + auto rpc_server = server->GetRPCServer(); + + distributed::RemoteVar var; + var.ep_ = ep; + var.var_name_ = "var1"; + var.trainer_id_ = 0; + + std::vector vars{var}; + Gather(vars, &ctx); + Gather(vars, &ctx); + + std::cout << "begin WaitVarBarrier" << std::endl; + rpc_server->WaitVarBarrier("var1"); + rpc_server->ClearRegisteredVars(); + server->Stop(); + + scope.release(); + server.release(); +} diff --git a/paddle/fluid/operators/distributed/communicator.cc b/paddle/fluid/operators/distributed/communicator.cc new file mode 100644 index 0000000000000000000000000000000000000000..eba18c67771fa26eed855b0f19591e06101f424d --- /dev/null +++ b/paddle/fluid/operators/distributed/communicator.cc @@ -0,0 +1,213 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/distributed/communicator.h" + +#include +#include // NOLINT +#include // NOLINT + +#include "paddle/fluid/framework/eigen.h" +#include "paddle/fluid/framework/selected_rows.h" +#include "paddle/fluid/framework/tensor_util.h" +#include "paddle/fluid/framework/variable_helper.h" +#include "paddle/fluid/operators/distributed/parameter_recv.h" +#include "paddle/fluid/operators/distributed/parameter_send.h" + +DEFINE_bool(communicator_independent_recv_thread, true, + "use an independent to recv vars from parameter server"); +DEFINE_int32(communicator_send_queue_size, 20, + "queue size to recv gradient before send"); +DEFINE_int32(communicator_max_send_grad_num_before_recv, 20, + "max grad num to send before recv parameters"); +DEFINE_int32(communicator_thread_pool_size, 5, "thread num to do send or recv"); +DEFINE_int32(communicator_max_merge_var_num, 20, + "max var num to merge and send"); +DEFINE_bool(communicator_fake_rpc, false, + "fake mode does not really send any thing"); + +namespace paddle { +namespace operators { +namespace distributed { + +inline double GetCurrentUS() { + struct timeval time; + gettimeofday(&time, NULL); + return 1e+6 * time.tv_sec + time.tv_usec; +} + +std::unique_ptr Communicator::communicator_(nullptr); +std::once_flag Communicator::init_flag_; + +Communicator::Communicator(const RpcCtxMap &send_varname_to_ctx, + const RpcCtxMap &recv_varname_to_ctx, + Scope *recv_scope) + : send_varname_to_ctx_(send_varname_to_ctx), + recv_varname_to_ctx_(recv_varname_to_ctx), + recv_scope_(recv_scope) { + // get all send information from graph, build vars_to_send + VLOG(0) << "communicator_independent_recv_thread: " + << FLAGS_communicator_independent_recv_thread; + VLOG(0) << "communicator_send_queue_size: " + << FLAGS_communicator_send_queue_size; + VLOG(0) << "communicator_max_send_grad_num_before_recv: " + << FLAGS_communicator_max_send_grad_num_before_recv; + VLOG(0) << "communicator_thread_pool_size: " + << FLAGS_communicator_thread_pool_size; + VLOG(0) << "communicator_max_merge_var_num: " + << FLAGS_communicator_max_merge_var_num; + VLOG(0) << "communicator_fake_rpc: " << FLAGS_communicator_fake_rpc; + send_scope_.reset(new Scope()); + for (auto &iter : send_varname_to_ctx_) { + send_varname_to_queue_[iter.first] = + std::make_shared>>( + FLAGS_communicator_send_queue_size); + } + send_threadpool_.reset(new ::ThreadPool(FLAGS_communicator_thread_pool_size)); + recv_threadpool_.reset(new ::ThreadPool(FLAGS_communicator_thread_pool_size)); +} + +Communicator::~Communicator() { + VLOG(3) << "~Communicator"; + running_ = false; + if (send_thread_) send_thread_->join(); + if (recv_thread_) recv_thread_->join(); + VLOG(3) << "~Communicator done"; +} + +void Communicator::SendThread() { + VLOG(3) << "SendThread start!"; + while (running_) { + std::vector> task_futures; + task_futures.reserve(send_varname_to_ctx_.size()); + VLOG(3) << "run send graph"; + auto before_run_send_graph = GetCurrentUS(); + for (auto &iter : send_varname_to_queue_) { + auto &var_name = iter.first; + auto &var_queue = iter.second; + if (var_queue->Size() > 0) { + auto send_task = [this, &var_name, &var_queue] { + VLOG(3) << var_name << " merge and send"; + std::vector> vars; + size_t merged_var_num = 0; + while (var_queue->Size() > 0 && + merged_var_num < FLAGS_communicator_max_merge_var_num) { + vars.push_back(var_queue->Pop()); + // only count the send number of the first var + if (var_name == send_varname_to_queue_.begin()->first) { + grad_num_.fetch_add(1, std::memory_order_relaxed); + } + merged_var_num++; + } + auto before_merge = GetCurrentUS(); + MergeVars(var_name, vars, send_scope_.get()); + auto after_merge = GetCurrentUS(); + VLOG(3) << "merge " << var_name << " use time " + << after_merge - before_merge; + auto send_functor = distributed::ParameterSend(); + auto &ctx = send_varname_to_ctx_.at(var_name); + if (!FLAGS_communicator_fake_rpc) { + send_functor(ctx, *send_scope_, true); + } + auto after_send = GetCurrentUS(); + VLOG(3) << "send " << var_name << " use time " + << after_send - after_merge; + }; + task_futures.emplace_back( + send_threadpool_->enqueue(std::move(send_task))); + } else { + VLOG(3) << var_name << " queue empty"; + } + } + for (auto &task_f : task_futures) { + task_f.wait(); + } + auto after_run_send_graph = GetCurrentUS(); + auto send_graph_use_time = after_run_send_graph - before_run_send_graph; + if (send_graph_use_time > 100) { + VLOG(1) << "run send graph use time " + << after_run_send_graph - before_run_send_graph; + } + if (!FLAGS_communicator_independent_recv_thread) { + RecvAll(); + } + } +} + +void Communicator::RecvAll() { + VLOG(3) << "parallel run recv graph"; + auto before_send = GetCurrentUS(); + std::vector> task_futures; + task_futures.reserve(recv_varname_to_ctx_.size()); + for (auto &iter : recv_varname_to_ctx_) { + auto recv_task = [this, &iter] { + auto &var_name = iter.first; + VLOG(3) << "recv var " << var_name; + auto recv_functor = distributed::ParameterRecv(); + if (!FLAGS_communicator_fake_rpc) { + recv_functor(iter.second, *recv_scope_); + } + }; + task_futures.emplace_back(recv_threadpool_->enqueue(std::move(recv_task))); + } + for (auto &task : task_futures) { + task.wait(); + } + auto after_recv = GetCurrentUS(); + VLOG(1) << "run recv graph use time " << after_recv - before_send; +} + +void Communicator::RecvThread() { + VLOG(3) << "RecvThread start!"; + while (running_) { + auto grad_num = grad_num_.load(); + if (grad_num > FLAGS_communicator_max_send_grad_num_before_recv) { + VLOG(1) << "current grad num " << grad_num; + RecvAll(); + grad_num_.store(0); + } else { + std::this_thread::sleep_for(std::chrono::milliseconds(10)); + } + } +} + +void Communicator::Send(const std::string &var_name, + const framework::Scope &scope) { + VLOG(3) << "communicator send " << var_name; + // push var into send queue by var_name + auto *grad_var = scope.FindVar(var_name); + PADDLE_ENFORCE(grad_var->IsInitialized(), "grad var should be inited"); + auto tmp_grad_var = std::make_shared(); + framework::CopyVariable(*grad_var, tmp_grad_var.get()); + auto &queue = send_varname_to_queue_.at(var_name); + VLOG(3) << "send " << var_name << " queue size " << queue->Size(); + queue->Push(tmp_grad_var); +} + +Communicator *Communicator::GetInstance() { return communicator_.get(); } + +void Communicator::Start() { + running_ = true; + // start send and recv thread + send_thread_.reset( + new std::thread(std::bind(&Communicator::SendThread, this))); + if (FLAGS_communicator_independent_recv_thread) { + recv_thread_.reset( + new std::thread(std::bind(&Communicator::RecvThread, this))); + } +} + +} // namespace distributed +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/distributed/communicator.h b/paddle/fluid/operators/distributed/communicator.h new file mode 100644 index 0000000000000000000000000000000000000000..41155bfc31bb31520fdcf5bd50b203f2e1f2c516 --- /dev/null +++ b/paddle/fluid/operators/distributed/communicator.h @@ -0,0 +1,219 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/framework/variable.h" +#include "paddle/fluid/operators/distributed/rpc_common.h" +#include "paddle/fluid/operators/math/math_function.h" +#include "paddle/fluid/operators/math/selected_rows_functor.h" +#include "paddle/fluid/platform/device_context.h" +#include "paddle/fluid/platform/enforce.h" +#include "paddle/fluid/platform/place.h" + +namespace paddle { +namespace operators { +namespace distributed { + +using Scope = framework::Scope; +using Variable = framework::Variable; + +template +class BlockingQueue { + public: + explicit BlockingQueue(size_t capacity) : capacity_(capacity) { + PADDLE_ENFORCE_GT(capacity_, 0, "The capacity must be greater than 0."); + } + + bool Push(const T& elem) { + { + std::unique_lock lock(mutex_); + cv_.wait(lock, [&] { return queue_.size() < capacity_; }); + PADDLE_ENFORCE_LT(queue_.size(), capacity_); + queue_.push_back(elem); + } + cv_.notify_one(); + return true; + } + + bool Push(T&& elem) { + { + std::unique_lock lock(mutex_); + cv_.wait(lock, [&] { return queue_.size() < capacity_; }); + PADDLE_ENFORCE_LT(queue_.size(), capacity_); + queue_.emplace_back(std::move(elem)); + } + cv_.notify_one(); + return true; + } + + T Pop() { + std::unique_lock lock(mutex_); + cv_.wait(lock, [=] { return !queue_.empty(); }); + T rc(std::move(queue_.front())); + queue_.pop_front(); + cv_.notify_one(); + return rc; + } + + size_t Cap() const { + std::lock_guard lock(mutex_); + return capacity_; + } + + size_t Size() const { + std::lock_guard lock(mutex_); + return queue_.size(); + } + + private: + const size_t capacity_; + std::deque queue_; + + mutable std::mutex mutex_; + std::condition_variable cv_; +}; + +template +using EigenVector = framework::EigenVector; + +inline void MergeVars(const std::string& var_name, + const std::vector>& vars, + Scope* scope) { + PADDLE_ENFORCE(!vars.empty(), "should have value to merge!"); + auto cpu_place = platform::CPUPlace(); + auto& var0 = vars[0]; + auto* out_var = scope->Var(var_name); + if (var0->IsType()) { + auto dims = var0->Get().dims(); + VLOG(3) << "merge " << var_name << " LoDTensor " << dims; + + // init output tensor + auto* out_t = out_var->GetMutable(); + out_t->mutable_data(dims, cpu_place); + + // check the input dims + for (auto& var : vars) { + auto& var_t = var->Get(); + PADDLE_ENFORCE_EQ(var_t.dims(), dims, "should have the same dims"); + } + + // set output tensor to 0. + auto cpu_ctx = paddle::platform::CPUDeviceContext(); + math::SetConstant + constant_functor; + constant_functor(cpu_ctx, out_t, static_cast(0)); + + // sum all vars to out + auto result = EigenVector::Flatten(*out_t); + for (auto& var : vars) { + auto& in_t = var->Get(); + auto in = EigenVector::Flatten(in_t); + result.device(*cpu_ctx.eigen_device()) = result + in; + } + } else if (var0->IsType()) { + auto& slr0 = var0->Get(); + auto* out_slr = out_var->GetMutable(); + out_slr->mutable_rows()->clear(); + out_slr->mutable_value()->mutable_data({{}}, cpu_place); + std::vector inputs; + inputs.reserve(vars.size()); + for (auto& var : vars) { + inputs.push_back(&var->Get()); + } + math::scatter::MergeAdd + merge_add; + auto dev_ctx = paddle::platform::CPUDeviceContext(); + merge_add(dev_ctx, inputs, out_slr, false); + VLOG(3) << "merge " << var_name << " SelectedRows height: " << slr0.height() + << " dims: " << slr0.value().dims(); + } else { + PADDLE_THROW("unsupported var type!"); + } +} + +using RpcCtxMap = std::unordered_map; + +class Communicator { + public: + Communicator(const RpcCtxMap& send_varname_to_ctx, + const RpcCtxMap& recv_varname_to_ctx, Scope* recv_scope); + + ~Communicator(); + + void Start(); + + // send grad + void Send(const std::string& var_name, const framework::Scope& scope); + + private: + // recv all parameter + void RecvAll(); + void SendThread(); + void RecvThread(); + + bool running_ = false; + std::unordered_map>>> + send_varname_to_queue_; + RpcCtxMap send_varname_to_ctx_; + RpcCtxMap recv_varname_to_ctx_; + std::unique_ptr send_thread_; + std::unique_ptr recv_thread_; + Scope* recv_scope_; // should be global scope + std::unique_ptr send_scope_; // an independent scope + std::unique_ptr<::ThreadPool> send_threadpool_{nullptr}; + std::unique_ptr<::ThreadPool> recv_threadpool_{nullptr}; + std::atomic_uint grad_num_{0}; // the num of gradient sent since last recv + + // the following code is for initialize the commnunicator + public: + static void Init(const RpcCtxMap& send_varname_to_ctx, + const RpcCtxMap& recv_varname_to_ctx, Scope* recv_scope) { + InitImpl(send_varname_to_ctx, recv_varname_to_ctx, recv_scope); + } + + static Communicator* GetInstance(); + + private: + // Init is called by GetInstance. + static void InitImpl(const RpcCtxMap& send_varname_to_ctx, + const RpcCtxMap& recv_varname_to_ctx, + Scope* recv_scope) { + if (communicator_ == nullptr) { + communicator_.reset(new Communicator(send_varname_to_ctx, + recv_varname_to_ctx, recv_scope)); + } + } + + private: + static std::once_flag init_flag_; + static std::unique_ptr communicator_; +}; + +} // namespace distributed +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/distributed/communicator_test.cc b/paddle/fluid/operators/distributed/communicator_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..5294ac33d15611a003eeb7971891e8ca85ec6a73 --- /dev/null +++ b/paddle/fluid/operators/distributed/communicator_test.cc @@ -0,0 +1,110 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include +#include +#include +#include + +#include "paddle/fluid/operators/distributed/communicator.h" + +namespace paddle { +namespace operators { +namespace distributed { + +using LoDTensor = framework::LoDTensor; +using SelectedRows = framework::SelectedRows; + +TEST(communicator, merge_lod_tensors) { + auto cpu_place = platform::CPUPlace(); + auto dims = framework::make_ddim({2, 3}); + std::vector> in_vars; + float out_value = 0; + for (auto i = 0; i < 10; ++i) { + auto var = std::make_shared(); + in_vars.emplace_back(var); + auto *tensor = var->GetMutable(); + auto *data = tensor->mutable_data(dims, cpu_place); + for (auto j = 0; j < tensor->numel(); ++j) { + data[j] = static_cast(i); + } + out_value += static_cast(i); + } + const std::string out_name = "Out"; + std::unique_ptr scope; + scope.reset(new framework::Scope()); + scope->Var(out_name); + for (auto i = 0; i < 10; ++i) { + MergeVars(out_name, in_vars, scope.get()); + } + auto &out_tensor = scope->FindVar(out_name)->Get(); + auto *out_data = out_tensor.data(); + ASSERT_EQ(out_tensor.dims(), dims); + for (auto i = 0; i < out_tensor.numel(); ++i) { + ASSERT_EQ(out_data[i], out_value); + } +} + +TEST(communicator, merge_selected_rows) { + auto cpu_place = platform::CPUPlace(); + int64_t width = 10; + std::vector> in_vars; + const int64_t height = 100; + for (auto i = 0; i < 10; ++i) { + std::vector rows; + for (auto k = 0; k <= i; ++k) { + rows.push_back(k); + } + auto var = std::make_shared(); + in_vars.emplace_back(var); + auto *slr = var->GetMutable(); + slr->set_height(height); + slr->set_rows(rows); + auto dims = + framework::make_ddim({static_cast(rows.size()), width}); + auto *data = slr->mutable_value()->mutable_data(dims, cpu_place); + for (auto i = 0; i < rows.size(); ++i) { + for (auto j = 0; j < width; ++j) { + data[i * width + j] = static_cast(rows[i]); + } + } + } + const std::string out_name = "Out"; + std::unique_ptr scope; + scope.reset(new framework::Scope()); + scope->Var(out_name); + for (auto i = 0; i < 10; ++i) { + MergeVars(out_name, in_vars, scope.get()); + } + auto &out_slr = scope->FindVar(out_name)->Get(); + auto &out_t = out_slr.value(); + auto *out_data = out_t.data(); + ASSERT_EQ(out_t.dims(), framework::make_ddim({10, width})); + std::vector out_values; + out_values.reserve(10); + for (auto i = 0; i < 10; ++i) { + out_values.push_back(static_cast(i * (10 - i))); + } + for (auto i = 0; i < out_slr.rows().size(); ++i) { + ASSERT_EQ(out_slr.rows()[i], i); + for (auto j = 0; j < width; ++j) { + ASSERT_EQ(out_data[i * width + j], out_values[i]); + } + } +} + +} // namespace distributed +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/distributed/distributed.h b/paddle/fluid/operators/distributed/distributed.h new file mode 100644 index 0000000000000000000000000000000000000000..3a9f92259875749ab2ddf26c18cd230c58a61c44 --- /dev/null +++ b/paddle/fluid/operators/distributed/distributed.h @@ -0,0 +1,35 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#ifdef PADDLE_WITH_DISTRIBUTE + +#ifdef PADDLE_WITH_GRPC + +#include "paddle/fluid/operators/distributed/grpc/grpc_client.h" +#include "paddle/fluid/operators/distributed/grpc/grpc_server.h" +#define RPCSERVER_T paddle::operators::distributed::AsyncGRPCServer +#define RPCCLIENT_T paddle::operators::distributed::GRPCClient + +#else // PADDLE_WITH_GRPC + +#include "paddle/fluid/operators/distributed/brpc/brpc_client.h" +#include "paddle/fluid/operators/distributed/brpc/brpc_server.h" +#define RPCSERVER_T paddle::operators::distributed::AsyncBRPCServer +#define RPCCLIENT_T paddle::operators::distributed::BRPCClient + +#endif // PADDLE_WITH_GRPC + +#endif // PADDLE_WITH_DISTRIBUTE diff --git a/paddle/fluid/operators/distributed/distributed_pb.h b/paddle/fluid/operators/distributed/distributed_pb.h new file mode 100644 index 0000000000000000000000000000000000000000..f1c662be9af67b418e17987e4eb1ff0a2809c3e3 --- /dev/null +++ b/paddle/fluid/operators/distributed/distributed_pb.h @@ -0,0 +1,30 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#ifdef PADDLE_WITH_DISTRIBUTE + +#ifdef PADDLE_WITH_GRPC + +#include "paddle/fluid/operators/distributed/send_recv.grpc.pb.h" +#include "paddle/fluid/operators/distributed/send_recv.pb.h" + +#else // PADDLE_WITH_GRPC + +#include "paddle/fluid/operators/distributed/send_recv.pb.h" + +#endif // PADDLE_WITH_GRPC + +#endif // PADDLE_WITH_DISTRIBUTE diff --git a/paddle/fluid/operators/distributed/grpc/grpc_bytebuffer_stream.cc b/paddle/fluid/operators/distributed/grpc/grpc_bytebuffer_stream.cc new file mode 100644 index 0000000000000000000000000000000000000000..c2cb0d7f04eb67275030e841740f0cdb291d9f87 --- /dev/null +++ b/paddle/fluid/operators/distributed/grpc/grpc_bytebuffer_stream.cc @@ -0,0 +1,88 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +// NOTE: This file was originally created by tensorflow +// (https://github.com/tensorflow/tensorflow/) we borrow this +// file and did some modifications so that we can send gRPC +// requests without too much copying of the tensor data. + +#include "paddle/fluid/operators/distributed/grpc/grpc_bytebuffer_stream.h" + +namespace paddle { +namespace operators { +namespace distributed { + +GrpcByteBufferSource::GrpcByteBufferSource() {} + +bool GrpcByteBufferSource::Init(const grpc::ByteBuffer& src) { + cur_ = -1; + left_ = 0; + ptr_ = nullptr; + byte_count_ = 0; + bool ok = src.Dump(&slices_).ok(); + if (!ok) { + slices_.clear(); + } + return ok; +} + +bool GrpcByteBufferSource::Next(const void** data, int* size) { + // Use loop instead of if in case buffer contained empty slices. + while (left_ == 0) { + // Advance to next slice. + cur_++; + if (cur_ >= slices_.size()) { + return false; + } + const ::grpc::Slice& s = slices_[cur_]; + left_ = s.size(); + ptr_ = reinterpret_cast(s.begin()); + } + + *data = ptr_; + *size = left_; + byte_count_ += left_; + ptr_ += left_; + left_ = 0; + return true; +} + +void GrpcByteBufferSource::BackUp(int count) { + ptr_ -= count; + left_ += count; + byte_count_ -= count; +} + +bool GrpcByteBufferSource::Skip(int count) { + const void* data; + int size; + while (Next(&data, &size)) { + if (size >= count) { + BackUp(size - count); + return true; + } + // size < count; + count -= size; + } + // error or we have too large count; + return false; +} + +google::protobuf::int64 GrpcByteBufferSource::ByteCount() const { + return byte_count_; +} + +} // namespace distributed +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/distributed/grpc_bytebuffer_stream.h b/paddle/fluid/operators/distributed/grpc/grpc_bytebuffer_stream.h similarity index 100% rename from paddle/fluid/operators/distributed/grpc_bytebuffer_stream.h rename to paddle/fluid/operators/distributed/grpc/grpc_bytebuffer_stream.h diff --git a/paddle/fluid/operators/distributed/grpc/grpc_client.cc b/paddle/fluid/operators/distributed/grpc/grpc_client.cc new file mode 100644 index 0000000000000000000000000000000000000000..61e94dae3c7a107e10fa5e5518651014cec078bc --- /dev/null +++ b/paddle/fluid/operators/distributed/grpc/grpc_client.cc @@ -0,0 +1,475 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include + +#include "glog/logging.h" // For VLOG +#include "paddle/fluid/framework/threadpool.h" +#include "paddle/fluid/operators/distributed/grpc/grpc_client.h" +#include "paddle/fluid/operators/distributed/grpc/grpc_serde.h" +#include "paddle/fluid/operators/distributed/request_handler.h" +#include "paddle/fluid/platform/port.h" +#include "paddle/fluid/platform/profiler.h" + +DECLARE_bool(rpc_disable_reuse_port); + +namespace paddle { +namespace operators { +namespace distributed { + +void GRPCClient::InitImpl() { + // start the client process thread + // TODO(wuyi): can make this in a threadpool + PADDLE_ENFORCE(client_thread_ == nullptr, + "please not re init proceed thread"); + client_thread_.reset(new std::thread(std::bind(&GRPCClient::Proceed, this))); +} + +void GRPCClient::SendComplete() { + std::unique_lock lk(completed_mutex_); + if (!completed_) { + for (auto& it : channels_) { + VLOG(3) << "send complete message to " << it.first; + this->AsyncSendComplete(it.first); + } + PADDLE_ENFORCE(this->Wait(), "internal grpc error"); + completed_ = true; + } +} + +GRPCClient::~GRPCClient() { + stopped_ = true; + Wait(); + cq_.Shutdown(); + { + std::lock_guard guard(chan_mutex_); + for (auto& it : channels_) { + it.second.reset(); + } + channels_.clear(); + } + client_thread_->join(); +} + +VarHandlePtr GRPCClient::AsyncSendVar(const std::string& ep, + const platform::DeviceContext& ctx, + const framework::Scope& scope, + const std::string& var_name, + int64_t time_out) { + const platform::DeviceContext* p_ctx = &ctx; + const std::string ep_val = ep; + const std::string var_name_val = var_name; + const framework::Scope* p_scope = &scope; + const auto ch = GetChannel(ep_val); + SendProcessor* s = new SendProcessor(ch); + const std::string method = kSendRPC; + VarHandlePtr h(new VarHandle(ep, method, var_name_val, p_ctx, p_scope)); + s->Prepare(h, time_out); + + framework::AsyncIO([var_name_val, p_scope, p_ctx, s, method, h, this] { + auto* var = p_scope->FindVar(var_name_val); + + ::grpc::ByteBuffer req; + SerializeToByteBuffer(var_name_val, var, *p_ctx, &req, "", trainer_id_); + + VLOG(3) << s->GetVarHandlePtr()->String() << " begin"; + + // stub context + s->response_call_back_ = nullptr; + + platform::RecordRPCEvent record_event(method); + + auto call = s->stub_g_.PrepareUnaryCall( + s->context_.get(), "/sendrecv.SendRecvService/SendVariable", req, &cq_); + call->StartCall(); + call->Finish(&s->reply_, &s->status_, reinterpret_cast(s)); + + if (UNLIKELY(platform::IsProfileEnabled())) { + h->Wait(); + } + }); + req_count_++; + + return h; +} + +void ProcGetResponse(const VarHandle& var_h, + const ::grpc::ByteBuffer& ret_msg) { + VLOG(4) << "ProcGetResponse"; + framework::Variable* outvar = nullptr; + // get response's trainer_id is not used + int trainer_id; + DeserializeFromByteBuffer(ret_msg, *var_h.ctx(), var_h.scope(), &outvar, + &trainer_id); +} + +template +void RequestToByteBuffer(const T& proto, ::grpc::ByteBuffer* result) { + ::grpc::Slice slice(proto.ByteSizeLong()); + proto.SerializeWithCachedSizesToArray(const_cast(slice.begin())); + ::grpc::ByteBuffer tmp(&slice, 1); + result->Swap(&tmp); +} + +VarHandlePtr GRPCClient::AsyncGetVar(const std::string& ep, + const platform::DeviceContext& ctx, + const framework::Scope& scope, + const std::string& var_name, + const std::string& out_varname, + int64_t time_out) { + return _AsyncGetVar(ep, ctx, scope, kGetRPC, var_name, out_varname, + "/sendrecv.SendRecvService/GetVariable", time_out); +} + +VarHandlePtr GRPCClient::AsyncGetVarNoBarrier( + const std::string& ep, const platform::DeviceContext& ctx, + const framework::Scope& scope, const std::string& var_name, + const std::string& out_varname, int64_t time_out) { + std::string var_name_no_barrier = + string::Sprintf("%s%s", var_name, WITHOUT_BARRIER_MESSAGE); + + return _AsyncGetVar( + ep, ctx, scope, kGetNoBarrierRPC, var_name_no_barrier, out_varname, + "/sendrecv.SendRecvService/GetVariableNoBarrier", time_out); +} + +VarHandlePtr GRPCClient::AsyncGetMonomerVariable( + const std::string& ep, const platform::DeviceContext& ctx, + const framework::Scope& scope, const std::string& var_name, + int64_t time_out) { + return _AsyncGetVar(ep, ctx, scope, kGetMonomerRPC, var_name, var_name, + "/sendrecv.SendRecvService/GetMonomerVariable", time_out); +} + +VarHandlePtr GRPCClient::_AsyncGetVar( + const std::string& ep, const platform::DeviceContext& ctx, + const framework::Scope& scope, const std::string& method, + const std::string& var_name, const std::string& out_varname, + const std::string& rpc_path, int64_t time_out) { + const platform::DeviceContext* p_ctx = &ctx; + const std::string ep_val = ep; + const std::string var_name_val = var_name; + const std::string out_varname_val = out_varname; + const framework::Scope* p_scope = &scope; + const auto ch = GetChannel(ep_val); + GetProcessor* s = new GetProcessor(ch); + + VarHandlePtr h(new VarHandle(ep, method, out_varname_val, p_ctx, p_scope)); + s->Prepare(h, time_out); + + framework::AsyncIO( + [var_name_val, out_varname_val, s, method, p_ctx, h, rpc_path, this] { + // prepare input + sendrecv::VariableMessage req; + req.set_varname(var_name_val); + req.set_out_varname(out_varname_val); + req.set_trainer_id(trainer_id_); + ::grpc::ByteBuffer buf; + RequestToByteBuffer(req, &buf); + + VLOG(3) << s->GetVarHandlePtr()->String() << " begin"; + + // stub context + s->response_call_back_ = ProcGetResponse; + + platform::RecordRPCEvent record_event(method); + + auto call = + s->stub_g_.PrepareUnaryCall(s->context_.get(), rpc_path, buf, &cq_); + call->StartCall(); + call->Finish(&s->reply_, &s->status_, reinterpret_cast(s)); + + if (UNLIKELY(platform::IsProfileEnabled())) { + h->Wait(); + } + }); + + req_count_++; + + return h; +} + +VarHandlePtr GRPCClient::AsyncPrefetchVar(const std::string& ep, + const platform::DeviceContext& ctx, + const framework::Scope& scope, + const std::string& in_var_name, + const std::string& out_var_name, + const std::string& table_name, + int64_t time_out) { + const platform::DeviceContext* p_ctx = &ctx; + const std::string ep_val = ep; + const std::string in_var_name_val = in_var_name; + const std::string out_var_name_val = out_var_name; + const std::string table_name_val = table_name; + const framework::Scope* p_scope = &scope; + const auto ch = GetChannel(ep_val); + GetProcessor* s = new GetProcessor(ch); + + const std::string method = kPrefetchRPC; + + VarHandlePtr h(new VarHandle(ep, method, out_var_name_val, p_ctx, p_scope)); + s->Prepare(h, time_out); + + framework::AsyncIO([in_var_name_val, out_var_name_val, ep_val, p_scope, p_ctx, + s, method, h, table_name_val, this] { + auto* var = p_scope->FindVar(in_var_name_val); + + ::grpc::ByteBuffer req; + SerializeToByteBuffer(in_var_name_val, var, *p_ctx, &req, out_var_name_val, + 0, table_name_val); + + VLOG(3) << s->GetVarHandlePtr()->String() << " begin"; + + // stub context + s->response_call_back_ = ProcGetResponse; + + platform::RecordRPCEvent record_event(method); + + auto call = s->stub_g_.PrepareUnaryCall( + s->context_.get(), "/sendrecv.SendRecvService/PrefetchVariable", req, + &cq_); + call->StartCall(); + call->Finish(&s->reply_, &s->status_, static_cast(s)); + + if (UNLIKELY(platform::IsProfileEnabled())) { + h->Wait(); + } + }); + + req_count_++; + return h; +} + +VarHandlePtr GRPCClient::AsyncSendBatchBarrier(const std::string& ep, + int64_t time_out) { + const auto ch = GetChannel(ep); + + BatchBarrierProcessor* s = new BatchBarrierProcessor(ch); + const std::string method = kBatchBarrierRPC; + VarHandlePtr h( + new VarHandle(ep, method, BATCH_BARRIER_MESSAGE, nullptr, nullptr)); + s->Prepare(h, time_out); + + sendrecv::VariableMessage req; + req.set_varname(BATCH_BARRIER_MESSAGE); + + platform::RecordRPCEvent record_event(method); + + auto rpc = s->stub_->AsyncSendVariable(s->context_.get(), req, &cq_); + rpc->Finish(&s->reply_, &s->status_, reinterpret_cast(s)); + req_count_++; + + if (UNLIKELY(platform::IsProfileEnabled())) { + h->Wait(); + } + + return h; +} + +VarHandlePtr GRPCClient::AsyncSendFetchBarrier(const std::string& ep, + int64_t time_out) { + const auto ch = GetChannel(ep); + FetchBarrierProcessor* s = new FetchBarrierProcessor(ch); + const std::string method = kFetchBarrierRPC; + VarHandlePtr h( + new VarHandle(ep, method, FETCH_BARRIER_MESSAGE, nullptr, nullptr)); + s->Prepare(h, time_out); + + sendrecv::VariableMessage req; + req.set_varname(FETCH_BARRIER_MESSAGE); + + platform::RecordRPCEvent record_event(method); + + auto rpc = s->stub_->AsyncGetVariable(s->context_.get(), req, &cq_); + rpc->Finish(&s->reply_, &s->status_, reinterpret_cast(s)); + req_count_++; + + if (UNLIKELY(platform::IsProfileEnabled())) { + h->Wait(); + } + + return h; +} + +VarHandlePtr GRPCClient::AsyncGetMonomerBarrier(const std::string& ep, + const std::string& var_name, + int64_t time_out) { + const auto ch = GetChannel(ep); + BatchBarrierProcessor* s = new BatchBarrierProcessor(ch); + const std::string method = kSendMonomerFetchBarrierRPC; + VarHandlePtr h(new VarHandle(ep, method, var_name, nullptr, nullptr)); + s->Prepare(h, time_out); + + VLOG(30) << s->GetVarHandlePtr()->String() << " begin"; + + sendrecv::VariableMessage req; + req.set_varname(var_name); + + platform::RecordRPCEvent record_event(method); + + auto rpc = s->stub_->AsyncGetMonomerBarrier(s->context_.get(), req, &cq_); + rpc->Finish(&s->reply_, &s->status_, reinterpret_cast(s)); + req_count_++; + + if (UNLIKELY(platform::IsProfileEnabled())) { + h->Wait(); + } + + return h; +} + +VarHandlePtr GRPCClient::AsyncSendComplete(const std::string& ep, + int64_t time_out) { + const auto ch = GetChannel(ep); + + BatchBarrierProcessor* s = new BatchBarrierProcessor(ch); + const std::string method = kSendCompleteRPC; + VarHandlePtr h(new VarHandle(ep, method, COMPLETE_MESSAGE, nullptr, nullptr)); + s->Prepare(h, time_out); + + sendrecv::VariableMessage req; + req.set_varname(COMPLETE_MESSAGE); + + platform::RecordRPCEvent record_event(method); + + auto rpc = s->stub_->AsyncSendVariable(s->context_.get(), req, &cq_); + rpc->Finish(&s->reply_, &s->status_, reinterpret_cast(s)); + req_count_++; + + if (UNLIKELY(platform::IsProfileEnabled())) { + h->Wait(); + } + + return h; +} + +VarHandlePtr GRPCClient::AsyncCheckpointNotify(const std::string& ep, + const std::string& dir, + int64_t time_out) { + const auto ch = GetChannel(ep); + + CheckpointNotifyProcessor* s = new CheckpointNotifyProcessor(ch); + + const std::string method = kCheckPointNotifyRPC; + + VarHandlePtr h( + new VarHandle(ep, method, CHECKPOINT_SAVE_MESSAGE, nullptr, nullptr)); + s->Prepare(h, time_out); + + sendrecv::VariableMessage req; + req.set_varname(CHECKPOINT_SAVE_MESSAGE); + req.set_out_varname(dir); + + platform::RecordRPCEvent record_event(method); + + auto rpc = s->stub_->AsyncCheckpointNotify(s->context_.get(), req, &cq_); + rpc->Finish(&s->reply_, &s->status_, reinterpret_cast(s)); + req_count_++; + + if (UNLIKELY(platform::IsProfileEnabled())) { + h->Wait(); + } + + return h; +} + +bool GRPCClient::Wait() { + std::unique_lock lk(sync_mutex_); + sync_cond_.wait(lk, [this] { return (req_count_ == 0 || ok_ == false); }); + return ok_; +} + +void GRPCClient::Proceed() { + void* tag = nullptr; + bool ok = false; + + VLOG(3) << "GRPCClient Proceed begin"; + while (!stopped_ && cq_.Next(&tag, &ok)) { + BaseProcessor* c = static_cast(tag); + GPR_ASSERT(ok); + PADDLE_ENFORCE(c); + + if (c->status_.ok()) { + VLOG(3) << c->GetVarHandlePtr()->String() << " process"; + c->Process(); + } else if (c->status_.error_code() == grpc::StatusCode::DEADLINE_EXCEEDED) { + LOG(FATAL) << c->GetVarHandlePtr()->String() + << " meets grpc error, error_code:" << c->status_.error_code() + << " error_message:" << c->status_.error_message() + << " error_details:" << c->status_.error_details(); + { + std::lock_guard lk(sync_mutex_); + ok_ = false; + } + c->Finish(false); + } else { + LOG(FATAL) << c->GetVarHandlePtr()->String() + << " meets grpc error, error_code:" << c->status_.error_code() + << " error_message:" << c->status_.error_message() + << " error_details:" << c->status_.error_details(); + + c->Finish(false); + } + + bool notify = false; + { + std::lock_guard lk(sync_mutex_); + req_count_--; + notify = (req_count_ <= 0 || !c->status_.ok()); + } + + delete c; + + if (notify) { + sync_cond_.notify_all(); + } + } + + // Last log message + // Avoid using VLOG() and LOG(): in the destructor of google::LogMessage() a + // static Mutex log_mutex is used for synchronization, which might have been + // destructed at this moment. + if (FLAGS_v >= 3) { + std::string msg("GRPCClient Proceed end"); + fwrite(msg.c_str(), msg.length(), 1, stdout); + } +} + +std::shared_ptr GRPCClient::GetChannel(const std::string& ep) { + std::lock_guard guard(chan_mutex_); + auto it = channels_.find(ep); + if (it != channels_.end()) { + return it->second; + } + + // Channel configurations: + grpc::ChannelArguments args; + args.SetInt(GRPC_ARG_MAX_RECONNECT_BACKOFF_MS, 2000); + if (FLAGS_rpc_disable_reuse_port) { + args.SetInt(GRPC_ARG_ALLOW_REUSEPORT, 0); + } + args.SetCompressionAlgorithm(GRPC_COMPRESS_NONE); + args.SetMaxSendMessageSize(std::numeric_limits::max()); + args.SetMaxReceiveMessageSize(std::numeric_limits::max()); + + auto ch = + grpc::CreateCustomChannel(ep, grpc::InsecureChannelCredentials(), args); + channels_[ep] = ch; + return ch; +} + +} // namespace distributed +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/distributed/grpc/grpc_client.h b/paddle/fluid/operators/distributed/grpc/grpc_client.h new file mode 100644 index 0000000000000000000000000000000000000000..ce0d2152aa27c62b6e12881aaf2ae458597e67e6 --- /dev/null +++ b/paddle/fluid/operators/distributed/grpc/grpc_client.h @@ -0,0 +1,268 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include + +#include // NOLINT +#include // NOLINT +#include +#include +#include +#include +#include // NOLINT +#include +#include // NOLINT +#include + +#include "grpc++/channel.h" +#include "grpc++/generic/generic_stub.h" +#include "grpc++/grpc++.h" +#include "grpc++/support/byte_buffer.h" +#include "grpc++/support/slice.h" +#include "grpc/support/log.h" +#include "paddle/fluid/framework/blocking_queue.h" +#include "paddle/fluid/framework/data_type.h" +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/framework/selected_rows.h" +#include "paddle/fluid/operators/distributed/distributed_pb.h" +#include "paddle/fluid/operators/distributed/request_handler.h" +#include "paddle/fluid/operators/distributed/rpc_client.h" +#include "paddle/fluid/operators/distributed/sendrecvop_utils.h" +#include "paddle/fluid/platform/macros.h" // for DISABLE_COPY_AND_ASSIGN + +namespace paddle { +namespace operators { +namespace distributed { + +void ProcGetResponse(const VarHandle& var_h, const grpc::ByteBuffer& msg); + +class BaseProcessor { + public: + BaseProcessor() { context_ = nullptr; } + + virtual ~BaseProcessor() {} + + virtual void Prepare(VarHandlePtr h, int64_t time_out) { + var_h_ = h; + + context_.reset(new grpc::ClientContext()); + context_->set_wait_for_ready(true); + if (time_out) { + std::chrono::system_clock::time_point deadline = + std::chrono::system_clock::now() + + std::chrono::milliseconds(time_out); + context_->set_deadline(deadline); + } + } + + void Process() { + ProcessImpl(); + var_h_->Finish(true); + } + + VarHandlePtr GetVarHandlePtr() { return var_h_; } + bool Wait() { return var_h_->Wait(); } + void Finish(bool ok) { return var_h_->Finish(ok); } + virtual void ProcessImpl() = 0; + + std::unique_ptr context_; + grpc::Status status_; + + protected: + VarHandlePtr var_h_; +}; + +typedef std::function + RequestSendCallBack; + +class SendProcessor : public BaseProcessor { + public: + explicit SendProcessor(std::shared_ptr ch) + : BaseProcessor(), stub_g_(ch) {} + + virtual ~SendProcessor() {} + + void ProcessImpl() override { + if (response_call_back_) { + response_call_back_(*var_h_.get(), reply_); + } + } + + ::grpc::GenericStub stub_g_; + ::grpc::ByteBuffer reply_; + RequestSendCallBack response_call_back_ = nullptr; +}; + +typedef std::function + RequestGetCallBack; + +class GetProcessor : public BaseProcessor { + public: + explicit GetProcessor(std::shared_ptr ch) + : BaseProcessor(), stub_g_(ch) {} + + virtual ~GetProcessor() {} + + void ProcessImpl() override { + if (response_call_back_) { + response_call_back_(*var_h_.get(), reply_); + } + } + + ::grpc::ByteBuffer reply_; + ::grpc::GenericStub stub_g_; + RequestGetCallBack response_call_back_ = ProcGetResponse; +}; + +class BatchBarrierProcessor : public BaseProcessor { + public: + explicit BatchBarrierProcessor(std::shared_ptr ch) + : BaseProcessor() { + stub_ = sendrecv::SendRecvService::NewStub(ch); + } + + virtual ~BatchBarrierProcessor() {} + + void ProcessImpl() override {} + sendrecv::VoidMessage reply_; + std::unique_ptr stub_; +}; + +class FetchBarrierProcessor : public BaseProcessor { + public: + explicit FetchBarrierProcessor(std::shared_ptr ch) + : BaseProcessor() { + stub_ = sendrecv::SendRecvService::NewStub(ch); + } + + virtual ~FetchBarrierProcessor() {} + + void ProcessImpl() override {} + sendrecv::VariableMessage reply_; + std::unique_ptr stub_; +}; + +class CheckpointNotifyProcessor : public BaseProcessor { + public: + explicit CheckpointNotifyProcessor(std::shared_ptr ch) + : BaseProcessor() { + stub_ = sendrecv::SendRecvService::NewStub(ch); + } + + virtual ~CheckpointNotifyProcessor() {} + + void ProcessImpl() override {} + sendrecv::VoidMessage reply_; + std::unique_ptr stub_; +}; + +class GRPCClient : public RPCClient { + public: + GRPCClient() : ok_(true), completed_(false), stopped_(false) {} + virtual ~GRPCClient(); + + VarHandlePtr AsyncSendVar(const std::string& ep, + const platform::DeviceContext& ctx, + const framework::Scope& scope, + const std::string& var_name, + int64_t time_out = FLAGS_rpc_deadline) override; + + VarHandlePtr AsyncGetVar(const std::string& ep, + const platform::DeviceContext& ctx, + const framework::Scope& scope, + const std::string& var_name, + const std::string& out_varname, + int64_t time_out = FLAGS_rpc_deadline) override; + + VarHandlePtr AsyncGetVarNoBarrier( + const std::string& ep, const platform::DeviceContext& ctx, + const framework::Scope& scope, const std::string& var_name, + const std::string& out_varname, + int64_t time_out = FLAGS_rpc_deadline) override; + + VarHandlePtr AsyncGetMonomerVariable( + const std::string& ep, const platform::DeviceContext& ctx, + const framework::Scope& scope, const std::string& var_name, + int64_t time_out = FLAGS_rpc_deadline) override; + + VarHandlePtr AsyncPrefetchVar(const std::string& ep, + const platform::DeviceContext& ctx, + const framework::Scope& scope, + const std::string& in_var_name, + const std::string& out_var_name, + const std::string& table_name = "", + int64_t time_out = FLAGS_rpc_deadline) override; + + VarHandlePtr AsyncSendBatchBarrier( + const std::string& ep, int64_t time_out = FLAGS_rpc_deadline) override; + + VarHandlePtr AsyncSendFetchBarrier(const std::string& ep, + int64_t time_out) override; + + VarHandlePtr AsyncGetMonomerBarrier( + const std::string& ep, const std::string& var_name, + int64_t time_out = FLAGS_rpc_deadline) override; + + VarHandlePtr AsyncCheckpointNotify( + const std::string& ep, const std::string& dir, + int64_t time_out = FLAGS_rpc_deadline) override; + + VarHandlePtr AsyncSendComplete( + const std::string& ep, int64_t time_out = FLAGS_rpc_deadline) override; + + bool Wait() override; + + void SendComplete() override; + + void InitImpl() override; + + private: + void Proceed(); + + std::shared_ptr GetChannel(const std::string& ep); + VarHandlePtr _AsyncGetVar( + const std::string& ep, const platform::DeviceContext& ctx, + const framework::Scope& scope, const std::string& method, + const std::string& var_name, const std::string& out_varname, + const std::string& rpc_path, int64_t time_out = FLAGS_rpc_deadline); + + private: + grpc::CompletionQueue cq_; + std::unordered_map> channels_; + std::unique_ptr client_thread_{nullptr}; + + // mutex for Wait client sync + std::mutex sync_mutex_; + std::condition_variable sync_cond_; + std::atomic req_count_{0}; + bool ok_; + + // mutex for GetChannel thread safety + std::mutex chan_mutex_; + DISABLE_COPY_AND_ASSIGN(GRPCClient); + + // mutex for sending complete message only once + std::mutex completed_mutex_; + bool completed_; + + volatile bool stopped_; +}; + +} // namespace distributed +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/distributed/grpc/grpc_serde.cc b/paddle/fluid/operators/distributed/grpc/grpc_serde.cc new file mode 100644 index 0000000000000000000000000000000000000000..91c398d0c84db1fc67740cd2368d178610ef0841 --- /dev/null +++ b/paddle/fluid/operators/distributed/grpc/grpc_serde.cc @@ -0,0 +1,162 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#ifdef PADDLE_WITH_CUDA +#include +#endif +#include +#include +#include // NOLINT + +#include "google/protobuf/io/coded_stream.h" +#include "google/protobuf/io/zero_copy_stream.h" +#include "paddle/fluid/framework/data_type.h" +#include "paddle/fluid/operators/distributed/grpc/grpc_bytebuffer_stream.h" +#include "paddle/fluid/operators/distributed/grpc/grpc_serde.h" +#include "paddle/fluid/operators/distributed/grpc/grpc_variable_response.h" +#include "paddle/fluid/operators/distributed/proto_encoder_helper.h" +#include "paddle/fluid/operators/distributed/sendrecvop_utils.h" +#include "paddle/fluid/platform/port.h" +#include "paddle/fluid/platform/profiler.h" + +namespace paddle { +namespace operators { +namespace distributed { + +void SerializeToByteBuffer(const std::string& name, framework::Variable* var, + const platform::DeviceContext& ctx, + ::grpc::ByteBuffer* msg, const std::string& out_name, + const int trainer_id, + const std::string& table_name) { + platform::RecordRPCEvent record_event("serial"); + VarMsg request; + TensorPayload* payload = nullptr; + + request.set_varname(name); + request.set_trainer_id(trainer_id); + // Note: normally the profiler is enabled in 1 trainer, hence only + // 1 trainer returns true for ShouldSendProfileState(). It tells PS + // servers the trainer's profiling state so that PS can follow the + // trainer. + if (platform::ShouldSendProfileState()) { + if (platform::IsProfileEnabled()) { + request.set_profile(platform::kEnableProfiler); + } else { + request.set_profile(platform::kDisableProfiler); + } + } + if (!out_name.empty()) { + request.set_out_varname(out_name); + } + if (!table_name.empty()) { + request.set_table_name(table_name); + } + if (var->IsType()) { + request.set_type(::sendrecv::LOD_TENSOR); + payload = new TensorPayload(GetTensorPayload(var, ctx, &request)); + } else if (var->IsType()) { + request.set_type(::sendrecv::SELECTED_ROWS); + payload = new TensorPayload(GetSelectedRowsPayload(var, ctx, &request)); +#ifdef PADDLE_WITH_CUDA + } else if (var->IsType()) { + request.set_type(::sendrecv::NCCL_ID); +#endif + } else { + PADDLE_THROW("Serialize does not support type: %s", + typeid(var->Type()).name()); + } + + std::string header; + request.AppendToString(&header); + auto buffer = std::unique_ptr(new char[1024]); + void* buf = buffer.get(); + ProtoEncodeHelper e(static_cast(buf), 1024); + e.WriteRawBytes(std::string(header.data(), header.size())); +// NCCLID is copied directly to the message, return bytebuffer +// with only one slice if serializing NCCLID. +#ifdef PADDLE_WITH_CUDA + if (var->IsType()) { + e.WriteVarlengthBeginning(VarMsg::kSerializedFieldNumber, + NCCL_UNIQUE_ID_BYTES); + const ncclUniqueId& uid = var->Get(); + e.WriteRawBytes(std::string(uid.internal, NCCL_UNIQUE_ID_BYTES)); + + // for serialize NCCL_ID + ::grpc::Slice slices(e.size()); + memcpy(const_cast(slices.begin()), e.data(), e.size()); + ::grpc::ByteBuffer tmp(&slices, 1); + msg->Swap(&tmp); + return; + } +#endif + PADDLE_ENFORCE_NOT_NULL(payload); + + e.WriteVarlengthBeginning(VarMsg::kSerializedFieldNumber, + payload->memory_size()); + if (payload->memory_size() >= std::numeric_limits::max()) { + LOG(FATAL) << "FATAL error: varname:" << name + << ", vlen:" << payload->memory_size() + << " >= std::numeric_limits::max():" + << std::numeric_limits::max() << ", so exit!"; + } + // steal reference of tensor data + ::grpc::Slice slices[4]; // metadata, tensor, rows meta, rows + int num_slices = 2; // only SelectedRows have rows buffer + slices[0] = ::grpc::Slice(e.size()); + memcpy(const_cast(slices[0].begin()), e.data(), e.size()); + slices[1] = ::grpc::Slice( + grpc_slice_new_with_user_data(payload->ptr(), payload->memory_size(), + SerializeDestroyCallback, payload), + ::grpc::Slice::STEAL_REF); + + if (var->IsType()) { + auto* slr = var->GetMutable(); + ProtoEncodeHelper e2(static_cast(buf), 128); + + PADDLE_ENFORCE(VectorElemName(slr->rows()) == typeid(int64_t).name()); + size_t rows_memory_size = slr->rows().size() * sizeof(int64_t); + + e2.WriteVarlengthBeginning(VarMsg::kRowsFieldNumber, rows_memory_size); + slices[2] = ::grpc::Slice(e2.size()); + memcpy(const_cast(slices[2].begin()), e2.data(), e2.size()); + + slices[3] = ::grpc::Slice( + grpc_slice_new_with_user_data( + const_cast( + reinterpret_cast(slr->rows().data())), + rows_memory_size, [](void* backing) {}, + const_cast( + reinterpret_cast(slr->rows().data()))), + ::grpc::Slice::STEAL_REF); + num_slices = 4; + } + + ::grpc::ByteBuffer tmp(&slices[0], num_slices); + msg->Swap(&tmp); +} + +void DeserializeFromByteBuffer(const ::grpc::ByteBuffer& msg, + const platform::DeviceContext& ctx, + const framework::Scope* scope, + framework::Variable** var, int* trainer_id) { + platform::RecordRPCEvent record_event("deserial"); + operators::distributed::GRPCVariableResponse resp(scope, &ctx); + PADDLE_ENFORCE(resp.Parse(msg) == 0, "parse bytebuffer to tensor error!"); + *var = resp.GetVar(); + *trainer_id = resp.GetTrainerId(); +} + +} // namespace distributed +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/distributed/grpc/grpc_serde.h b/paddle/fluid/operators/distributed/grpc/grpc_serde.h new file mode 100644 index 0000000000000000000000000000000000000000..c9a57beb3a6a7a7cc9973ff0e5325a3daa6d98a9 --- /dev/null +++ b/paddle/fluid/operators/distributed/grpc/grpc_serde.h @@ -0,0 +1,52 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include + +#include "paddle/fluid/framework/data_type.h" +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/framework/selected_rows.h" +#include "paddle/fluid/framework/tensor_util.h" +#include "paddle/fluid/framework/var_type.h" +#include "paddle/fluid/operators/distributed/sendrecvop_utils.h" +#include "paddle/fluid/platform/port.h" + +#include "paddle/fluid/operators/distributed/distributed_pb.h" + +namespace paddle { +namespace operators { +namespace distributed { + +typedef void (*DestroyCallback)(void*); + +void SerializeToByteBuffer(const std::string& name, framework::Variable* var, + const platform::DeviceContext& ctx, + ::grpc::ByteBuffer* msg, + const std::string& out_varname = std::string(), + const int trainer_id = 0, + const std::string& table_name = std::string()); + +void DeserializeFromByteBuffer(const ::grpc::ByteBuffer& msg, + const platform::DeviceContext& ctx, + const framework::Scope* scope, + framework::Variable** var, int* trainer_id); + +} // namespace distributed +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/distributed/grpc/grpc_serde_test.cc b/paddle/fluid/operators/distributed/grpc/grpc_serde_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..749c1bf39a48608876c77a74aa98be51947cf3b3 --- /dev/null +++ b/paddle/fluid/operators/distributed/grpc/grpc_serde_test.cc @@ -0,0 +1,224 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include // NOLINT + +#include "google/protobuf/text_format.h" +#include "gtest/gtest.h" +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/tensor_util.h" +#include "paddle/fluid/framework/variable.h" +#include "paddle/fluid/operators/distributed/distributed.h" +#include "paddle/fluid/operators/distributed/grpc/grpc_serde.h" +#include "paddle/fluid/operators/distributed/grpc/grpc_variable_response.h" +#include "paddle/fluid/operators/distributed/sendrecvop_utils.h" +#include "paddle/fluid/operators/math/math_function.h" +#include "paddle/fluid/platform/place.h" +#include "paddle/fluid/string/printf.h" + +namespace framework = paddle::framework; +namespace platform = paddle::platform; +namespace operators = paddle::operators; +namespace math = paddle::operators::math; +namespace memory = paddle::memory; + +void RunSerdeTestSelectedRows(platform::Place place) { + platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance(); + auto& ctx = *pool.Get(place); + + // serialize var to ByteBuffer + framework::Variable var; + auto* slr = var.GetMutable(); + slr->set_height(1000); + auto* tensor = slr->mutable_value(); + auto* rows = slr->mutable_rows(); + tensor->Resize(framework::make_ddim({564, 128})); + tensor->mutable_data(place); + int tensor_numel = 564 * 128; + math::set_constant(ctx, tensor, 32.7); + for (int i = 0; i < 564; ++i) rows->push_back(i); + + ::grpc::ByteBuffer msg; + operators::distributed::SerializeToByteBuffer("myvar", &var, ctx, &msg); + EXPECT_GT(msg.Length(), static_cast(0)); + + // deserialize + std::vector<::grpc::Slice> slices; + (void)msg.Dump(&slices); + std::string tmp; + for (const auto& s : slices) { + tmp.append(reinterpret_cast(s.begin()), s.size()); + } + + sendrecv::VariableMessage varmsg; + EXPECT_TRUE(varmsg.ParseFromString(tmp)); + + // deserialize bytebuffer + EXPECT_EQ(varmsg.varname(), "myvar"); + EXPECT_EQ(varmsg.type(), 1); + + const float* tensor_data = + reinterpret_cast(varmsg.serialized().data()); + const int64_t* rows_data = + reinterpret_cast(varmsg.rows().data()); + for (int i = 0; i < tensor_numel; ++i) { + EXPECT_FLOAT_EQ(tensor_data[i], 32.7); + } + for (int i = 0; i < 564; ++i) { + EXPECT_EQ(rows_data[i], i); + } + + // deserialize zero-copy + // framework::Variable var2; + // operators::distributed::DeserializeFromByteBuffer(msg, ctx, &var2); + framework::Scope scope; + scope.Var("myvar"); + operators::distributed::GRPCVariableResponse resp(&scope, &ctx); + EXPECT_EQ(resp.Parse(msg), 0); + + framework::Variable* var2 = resp.GetVar(); + + auto* slr2 = var2->GetMutable(); + auto* tensor2 = slr2->mutable_value(); + auto* rows2 = slr2->mutable_rows(); + float* tensor_data2 = nullptr; + framework::Tensor tmp_tensor; + + if (platform::is_gpu_place(ctx.GetPlace())) { + platform::CPUPlace cpu; + framework::TensorCopy(*tensor2, cpu, &tmp_tensor); + tensor_data2 = tmp_tensor.data(); + } else { + tensor_data2 = const_cast(tensor2->data()); + } + const int64_t* rows_data2 = rows2->data(); + + for (int i = 0; i < tensor_numel; ++i) { + EXPECT_FLOAT_EQ(tensor_data2[i], 32.7); + } + for (size_t i = 0; i < rows2->size(); ++i) { + EXPECT_EQ(rows_data2[i], static_cast(i)); + } + EXPECT_EQ(slr2->height(), 1000); +} + +void RunTestLodTensor(platform::Place place, int from_type = 0) { + // serialize var to ByteBuffer + framework::Variable var; + auto* tensor = var.GetMutable(); + tensor->Resize(framework::make_ddim({512, 8, 4, 2})); + framework::LoD lod; + lod.push_back(framework::Vector({1, 3, 8})); + tensor->set_lod(lod); + int tensor_numel = 512 * 8 * 4 * 2; + platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance(); + auto& ctx = *pool.Get(place); + tensor->mutable_data(place); + math::set_constant(ctx, tensor, 31.9); + + ::grpc::ByteBuffer msg; + operators::distributed::SerializeToByteBuffer("myvar", &var, ctx, &msg, + "outvar", 0, "table_name"); + EXPECT_GT(msg.Length(), static_cast(0)); + + // deserialize + std::vector<::grpc::Slice> slices; + (void)msg.Dump(&slices); + std::string tmp; + for (const auto& s : slices) { + tmp.append(reinterpret_cast(s.begin()), s.size()); + } + sendrecv::VariableMessage varmsg; + EXPECT_TRUE(varmsg.ParseFromString(tmp)); + EXPECT_EQ(varmsg.varname(), "myvar"); + EXPECT_EQ(varmsg.type(), 0); + EXPECT_EQ(varmsg.dims()[0], 512); + EXPECT_EQ(varmsg.dims()[1], 8); + EXPECT_EQ(varmsg.dims()[2], 4); + EXPECT_EQ(varmsg.dims()[3], 2); + EXPECT_EQ(varmsg.lod_level(), 1); + EXPECT_EQ(varmsg.lod(0).lod_data(0), 1); + EXPECT_EQ(varmsg.lod(0).lod_data(1), 3); + EXPECT_EQ(varmsg.lod(0).lod_data(2), 8); + + const float* tensor_data = + reinterpret_cast(varmsg.serialized().data()); + for (int i = 0; i < tensor_numel; ++i) { + EXPECT_FLOAT_EQ(tensor_data[i], 31.9); + } + + // message binary + std::string str; + varmsg.SerializeToString(&str); + + // message bytebuffer + ::grpc::Slice slices_2[1]; + int num_slices = 1; + slices_2[0] = ::grpc::Slice(str.length()); + memcpy(const_cast(slices_2[0].begin()), str.c_str(), str.length()); + ::grpc::ByteBuffer bytebuffer2(&slices_2[0], num_slices); + + // deserialize zero-copy + framework::Scope scope; + scope.Var("myvar"); + operators::distributed::GRPCVariableResponse resp(&scope, &ctx); + if (from_type == 0) { + EXPECT_EQ(resp.Parse(msg), 0); + } else { + EXPECT_EQ(resp.Parse(bytebuffer2), 0); + } + + framework::Variable* var2 = resp.GetVar(); + + auto tensor2 = var2->Get(); + float* tensor_data2 = nullptr; + framework::Tensor tmp_tensor; + + if (platform::is_gpu_place(ctx.GetPlace())) { + platform::CPUPlace cpu; + framework::TensorCopy(tensor2, cpu, &tmp_tensor); + tensor_data2 = tmp_tensor.data(); + } else { + tensor_data2 = const_cast(tensor2.data()); + } + + EXPECT_EQ(varmsg.lod_level(), 1); + EXPECT_EQ(varmsg.lod(0).lod_data(0), 1); + EXPECT_EQ(varmsg.lod(0).lod_data(1), 3); + EXPECT_EQ(varmsg.lod(0).lod_data(2), 8); + for (int i = 0; i < tensor_numel; ++i) EXPECT_FLOAT_EQ(tensor_data2[i], 31.9); +} + +TEST(LodTensor, Run) { + platform::CPUPlace place; + RunTestLodTensor(place); + RunTestLodTensor(place, 1); +#ifdef PADDLE_WITH_CUDA + platform::CUDAPlace gpu(0); + RunTestLodTensor(gpu); + RunTestLodTensor(gpu, 1); +#endif +} + +TEST(SelectedRows, Run) { + platform::CPUPlace place; + RunSerdeTestSelectedRows(place); + +#ifdef PADDLE_WITH_CUDA + platform::CUDAPlace gpu; + RunSerdeTestSelectedRows(gpu); +#endif +} diff --git a/paddle/fluid/operators/distributed/grpc/grpc_server.cc b/paddle/fluid/operators/distributed/grpc/grpc_server.cc new file mode 100644 index 0000000000000000000000000000000000000000..0eb313f75dfa64f8722faa365128f3111f72bd0b --- /dev/null +++ b/paddle/fluid/operators/distributed/grpc/grpc_server.cc @@ -0,0 +1,589 @@ +/*Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include + +#include "paddle/fluid/operators/distributed/grpc/grpc_serde.h" +#include "paddle/fluid/operators/distributed/grpc/grpc_server.h" + +using ::grpc::ServerAsyncResponseWriter; + +DECLARE_bool(rpc_disable_reuse_port); + +namespace paddle { +namespace operators { +namespace distributed { +enum CallStatus { PROCESS = 0, FINISH }; + +// reference: +// https://stackoverflow.com/questions/41732884/grpc-multiple-services-in-cpp-async-server +class RequestBase { + public: + explicit RequestBase(GrpcService::AsyncService* service, + ::grpc::ServerCompletionQueue* cq, + RequestHandler* request_handler, int req_id) + : service_(service), + cq_(cq), + status_(PROCESS), + request_handler_(request_handler), + req_id_(req_id) { + PADDLE_ENFORCE(cq_); + } + virtual ~RequestBase() {} + virtual void Process() = 0; + + std::string Status2String(const std::string& method) { + std::string status = "Process"; + if (status_ == FINISH) { + status = "Finish"; + } + + std::ostringstream s; + s << method << " name:[" << GetReqName() << "]" + << ", ep:[" << ctx_.peer() << "]" + << " " << status << " using req_id:" << req_id_; + return s.str(); + } + + CallStatus Status() const { + std::lock_guard l(status_mu_); + return status_; + } + + template + void Finish(const T& reply, ServerAsyncResponseWriter* responder) { + std::lock_guard l(status_mu_); + status_ = FINISH; + responder->Finish(reply, ::grpc::Status::OK, + reinterpret_cast(static_cast(req_id_))); + } + virtual std::string GetReqName() = 0; + + protected: + mutable std::mutex status_mu_; + ::grpc::ServerContext ctx_; + GrpcService::AsyncService* service_; + ::grpc::ServerCompletionQueue* cq_; + CallStatus status_; + RequestHandler* request_handler_; + int req_id_; +}; + +class RequestSend final : public RequestBase { + public: + explicit RequestSend(GrpcService::AsyncService* service, + ::grpc::ServerCompletionQueue* cq, + RequestHandler* request_handler, int req_id) + : RequestBase(service, cq, request_handler, req_id), responder_(&ctx_) { + request_.reset(new GRPCVariableResponse(request_handler->scope(), + request_handler->dev_ctx(), + !request_handler->sync_mode())); + int method_id = static_cast(distributed::GrpcMethod::kSendVariable); + service_->RequestAsyncUnary( + method_id, &ctx_, request_.get(), &responder_, cq_, cq_, + reinterpret_cast(static_cast(req_id))); + } + virtual ~RequestSend() {} + std::string GetReqName() override { return request_->Varname(); } + + void Process() override { + std::string varname = GetReqName(); + VLOG(4) << "RequestSend var_name:" << varname; + + auto scope = request_->GetMutableLocalScope(); + auto invar = request_->GetVar(); + int trainer_id = request_->GetTrainerId(); + framework::Variable* outvar = nullptr; + request_handler_->Handle(varname, scope, invar, &outvar, trainer_id); + Finish(reply_, &responder_); + } + + protected: + sendrecv::VoidMessage reply_; + std::shared_ptr request_; + ServerAsyncResponseWriter responder_; +}; + +class RequestGet final : public RequestBase { + public: + explicit RequestGet(GrpcService::AsyncService* service, + ::grpc::ServerCompletionQueue* cq, + RequestHandler* request_handler, int req_id) + : RequestBase(service, cq, request_handler, req_id), responder_(&ctx_) { + auto method_id = static_cast(distributed::GrpcMethod::kGetVariable); + service_->RequestAsyncUnary( + method_id, &ctx_, &request_, &responder_, cq_, cq_, + reinterpret_cast(static_cast(req_id))); + } + + virtual ~RequestGet() {} + + std::string GetReqName() override { return request_.varname(); } + + void Process() override { + // proc request. + std::string varname = request_.varname(); + std::string out_varname = request_.out_varname(); + int trainer_id = request_.trainer_id(); + + VLOG(4) << "RequestGet " << out_varname << " from " << varname; + + auto scope = request_handler_->scope(); + framework::Variable* invar = nullptr; + framework::Variable* outvar = nullptr; + + request_handler_->Handle(varname, scope, invar, &outvar, trainer_id, + out_varname); + + if (outvar) { + SerializeToByteBuffer(out_varname, outvar, *request_handler_->dev_ctx(), + &reply_); + } + Finish(reply_, &responder_); + } + + protected: + sendrecv::VariableMessage request_; + ::grpc::ByteBuffer reply_; + ServerAsyncResponseWriter<::grpc::ByteBuffer> responder_; +}; + +class RequestGetNoBarrier final : public RequestBase { + public: + explicit RequestGetNoBarrier(GrpcService::AsyncService* service, + ::grpc::ServerCompletionQueue* cq, + RequestHandler* request_handler, int req_id) + : RequestBase(service, cq, request_handler, req_id), responder_(&ctx_) { + auto method_id = + static_cast(distributed::GrpcMethod::kGetVariableNoBarrier); + service_->RequestAsyncUnary( + method_id, &ctx_, &request_, &responder_, cq_, cq_, + reinterpret_cast(static_cast(req_id))); + } + + virtual ~RequestGetNoBarrier() {} + + std::string GetReqName() override { return request_.varname(); } + + void Process() override { + // proc request. + std::string varname = request_.varname(); + std::string out_varname = request_.out_varname(); + int trainer_id = request_.trainer_id(); + + VLOG(4) << "RequestGetNoBarrier " << out_varname << " from " << varname; + + auto scope = request_handler_->scope(); + framework::Variable* invar = nullptr; + framework::Variable* outvar = nullptr; + + request_handler_->Handle(varname, scope, invar, &outvar, trainer_id, + out_varname); + + if (outvar) { + SerializeToByteBuffer(out_varname, outvar, *request_handler_->dev_ctx(), + &reply_); + } + Finish(reply_, &responder_); + } + + protected: + sendrecv::VariableMessage request_; + ::grpc::ByteBuffer reply_; + ServerAsyncResponseWriter<::grpc::ByteBuffer> responder_; +}; + +class RequestGetMonomerVariable final : public RequestBase { + public: + explicit RequestGetMonomerVariable(GrpcService::AsyncService* service, + ::grpc::ServerCompletionQueue* cq, + RequestHandler* request_handler, + int req_id, RPCServer* rpc_server) + : RequestBase(service, cq, request_handler, req_id), + responder_(&ctx_), + rpc_server_(rpc_server) { + auto method_id = + static_cast(distributed::GrpcMethod::kGetMonomerVariable); + service_->RequestAsyncUnary( + method_id, &ctx_, &request_, &responder_, cq_, cq_, + reinterpret_cast(static_cast(req_id))); + } + + virtual ~RequestGetMonomerVariable() {} + + std::string GetReqName() override { return request_.varname(); } + + void Process() override { + // proc request. + std::string varname = request_.varname(); + + rpc_server_->WaitVarCond(varname); + MonomerHandle h = rpc_server_->GetMonomer(varname); + + auto scope = h.scope_; + auto invar = scope->FindVar(varname); + framework::Variable* outvar = nullptr; + + request_handler_->Handle(varname, scope, invar, &outvar, + request_.trainer_id()); + + if (outvar) { + SerializeToByteBuffer(varname, outvar, *h.dev_ctx_, &reply_); + } + Finish(reply_, &responder_); + } + + protected: + sendrecv::VariableMessage request_; + ::grpc::ByteBuffer reply_; + ServerAsyncResponseWriter<::grpc::ByteBuffer> responder_; + RPCServer* rpc_server_{nullptr}; +}; + +class RequestGetMonomerBarrier final : public RequestBase { + public: + explicit RequestGetMonomerBarrier(GrpcService::AsyncService* service, + ::grpc::ServerCompletionQueue* cq, + RequestHandler* request_handler, int req_id, + RPCServer* rpc_server) + : RequestBase(service, cq, request_handler, req_id), + responder_(&ctx_), + rpc_server_(rpc_server) { + auto method_id = + static_cast(distributed::GrpcMethod::kGetMonomerBarrier); + service_->RequestAsyncUnary( + method_id, &ctx_, &request_, &responder_, cq_, cq_, + reinterpret_cast(static_cast(req_id))); + } + + virtual ~RequestGetMonomerBarrier() {} + + std::string GetReqName() override { return request_.varname(); } + + void Process() override { + // proc request. + std::string varname = request_.varname(); + VLOG(4) << "RequestGetMonomerBarrier " << varname; + + rpc_server_->WaitVarCond(varname); + MonomerHandle h = rpc_server_->GetMonomer(varname); + + framework::Scope* scope = nullptr; + framework::Variable* invar = nullptr; + framework::Variable* outvar = nullptr; + + request_handler_->Handle(varname, scope, invar, &outvar, + request_.trainer_id()); + + Finish(reply_, &responder_); + } + + protected: + sendrecv::VariableMessage request_; + sendrecv::VoidMessage reply_; + ServerAsyncResponseWriter responder_; + RPCServer* rpc_server_{nullptr}; +}; + +class RequestPrefetch final : public RequestBase { + public: + explicit RequestPrefetch(GrpcService::AsyncService* service, + ::grpc::ServerCompletionQueue* cq, + RequestHandler* request_handler, int req_id) + : RequestBase(service, cq, request_handler, req_id), + responder_(&ctx_), + local_scope_(nullptr) { + request_.reset(new GRPCVariableResponse(request_handler->scope(), + request_handler->dev_ctx(), true)); + int method_id = + static_cast(distributed::GrpcMethod::kPrefetchVariable); + service_->RequestAsyncUnary( + method_id, &ctx_, request_.get(), &responder_, cq_, cq_, + reinterpret_cast(static_cast(req_id))); + } + + virtual ~RequestPrefetch() {} + + std::string GetReqName() override { return request_->Varname(); } + + void Process() override { + // prefetch process... + std::string in_var_name = request_->Varname(); + std::string out_var_name = request_->OutVarname(); + std::string table_name = request_->TableName(); + int trainer_id = request_->GetTrainerId(); + VLOG(4) << "RequestPrefetch, in_var_name: " << in_var_name + << " out_var_name: " << out_var_name; + + auto scope = request_->GetMutableLocalScope(); + auto invar = scope->FindVar(in_var_name); + // out var must be created in local scope! + framework::Variable* outvar = scope->Var(out_var_name); + + request_handler_->Handle(in_var_name, scope, invar, &outvar, trainer_id, + out_var_name, table_name); + + SerializeToByteBuffer(out_var_name, outvar, *request_handler_->dev_ctx(), + &reply_); + Finish(reply_, &responder_); + } + + protected: + std::shared_ptr request_; + ::grpc::ByteBuffer reply_; + ServerAsyncResponseWriter<::grpc::ByteBuffer> responder_; + framework::Scope* local_scope_; +}; + +class RequestCheckpointNotify final : public RequestBase { + public: + explicit RequestCheckpointNotify(GrpcService::AsyncService* service, + ::grpc::ServerCompletionQueue* cq, + RequestHandler* request_handler, int req_id) + : RequestBase(service, cq, request_handler, req_id), responder_(&ctx_) { + request_.reset(new GRPCVariableResponse(request_handler->scope(), + request_handler->dev_ctx())); + int method_id = + static_cast(distributed::GrpcMethod::kCheckpointNotify); + service_->RequestAsyncUnary( + method_id, &ctx_, request_.get(), &responder_, cq_, cq_, + reinterpret_cast(static_cast(req_id))); + } + + virtual ~RequestCheckpointNotify() {} + + std::string GetReqName() override { return request_->Varname(); } + + void Process() override { + auto scope = request_->GetMutableLocalScope(); + + std::string checkpoint_notify = request_->Varname(); + std::string checkpoint_dir = request_->OutVarname(); + int trainer_id = request_->GetTrainerId(); + + VLOG(4) << "RequestCheckpointNotify notify: " << checkpoint_notify + << ", dir: " << checkpoint_dir; + + request_handler_->Handle(checkpoint_notify, scope, nullptr, nullptr, + trainer_id, checkpoint_dir); + Finish(reply_, &responder_); + } + + protected: + std::shared_ptr request_; + sendrecv::VoidMessage reply_; + ServerAsyncResponseWriter responder_; +}; + +void AsyncGRPCServer::WaitServerReady() { + VLOG(4) << "AsyncGRPCServer is waiting server ready"; + std::unique_lock lock(this->mutex_ready_); + condition_ready_.wait(lock, [=] { return this->ready_ == 1; }); + VLOG(4) << "AsyncGRPCServer WaitSeverReady"; +} + +// Define an option subclass in order to disable SO_REUSEPORT for the +// server socket. +// Come from: +// https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc +class NoReusePortOption : public ::grpc::ServerBuilderOption { + public: + void UpdateArguments(::grpc::ChannelArguments* args) override { + args->SetInt(GRPC_ARG_ALLOW_REUSEPORT, 0); + } + + void UpdatePlugins(std::vector>* + plugins) override {} +}; + +void AsyncGRPCServer::StartServer() { + ::grpc::ServerBuilder builder; + builder.AddListeningPort(bind_address_, ::grpc::InsecureServerCredentials(), + &selected_port_); + + builder.SetMaxSendMessageSize(std::numeric_limits::max()); + builder.SetMaxReceiveMessageSize(std::numeric_limits::max()); + if (FLAGS_rpc_disable_reuse_port) { + builder.SetOption( + std::unique_ptr<::grpc::ServerBuilderOption>(new NoReusePortOption)); + } + builder.RegisterService(&service_); + + for (auto t : rpc_call_map_) { + rpc_cq_[t.first].reset(builder.AddCompletionQueue().release()); + } + + server_ = builder.BuildAndStart(); + LOG(INFO) << "Server listening on " << bind_address_ + << " selected port: " << selected_port_; + + std::function f = + std::bind(&AsyncGRPCServer::TryToRegisterNewOne, this, + std::placeholders::_1, std::placeholders::_2); + + for (auto& t : rpc_call_map_) { + auto& rpc_name = t.first; + auto& cq = rpc_cq_[rpc_name]; + auto threadnum = rpc_thread_num_[rpc_name]; + auto& reqs = rpc_reqs_[rpc_name]; + + reqs.reserve(kRequestBufSize); + + for (int i = 0; i < kRequestBufSize; i++) { + VLOG(6) << "TryToRegisterNewOne on RPC NAME: " << rpc_name << " I: " << i; + TryToRegisterNewOne(rpc_name, i); + } + + for (int i = 0; i < threadnum; i++) { + rpc_threads_[rpc_name].emplace_back(new std::thread(std::bind( + &AsyncGRPCServer::HandleRequest, this, cq.get(), rpc_name, f))); + VLOG(4) << t.first << " creates threads!"; + } + } + + { + std::lock_guard lock(this->mutex_ready_); + ready_ = 1; + } + condition_ready_.notify_all(); + + // wait server + server_->Wait(); + + for (auto& t : rpc_threads_) { + auto& threads = t.second; + for (size_t i = 0; i < threads.size(); ++i) { + threads[i]->join(); + VLOG(4) << t.first << " threads ends!"; + } + } +} + +void AsyncGRPCServer::ShutdownQueue() { + for (auto& t : rpc_cq_) { + t.second->Shutdown(); + VLOG(4) << t.first << " queue shutdown!"; + } +} + +void AsyncGRPCServer::ShutDownImpl() { + std::unique_lock lock(cq_mutex_); + is_shut_down_ = true; + ShutdownQueue(); + + VLOG(4) << "server_ shutdown!"; + server_->Shutdown(); +} + +void AsyncGRPCServer::TryToRegisterNewOne(const std::string& rpc_name, + int req_id) { + std::unique_lock lock(cq_mutex_); + if (is_shut_down_) { + VLOG(4) << "shutdown, do not TryToRegisterNewSendOne"; + return; + } + + VLOG(4) << "TryToRegisterNewOne on RPC NAME: " << rpc_name + << " REQ ID: " << req_id; + + auto& reqs = rpc_reqs_[rpc_name]; + auto& handler = rpc_call_map_[rpc_name]; + auto& cq = rpc_cq_[rpc_name]; + + RequestBase* b = nullptr; + if (rpc_name == kRequestSend) { + b = new RequestSend(&service_, cq.get(), handler, req_id); + } else if (rpc_name == kRequestGet) { + b = new RequestGet(&service_, cq.get(), handler, req_id); + + } else if (rpc_name == kRequestGetNoBarrier) { + b = new RequestGetNoBarrier(&service_, cq.get(), handler, req_id); + } else if (rpc_name == kRequestGetMonomerVariable) { + b = new RequestGetMonomerVariable(&service_, cq.get(), handler, req_id, + this); + } else if (rpc_name == kRequestGetMonomerBarrier) { + b = new RequestGetMonomerBarrier(&service_, cq.get(), handler, req_id, + this); + } else if (rpc_name == kRequestPrefetch) { + b = new RequestPrefetch(&service_, cq.get(), handler, req_id); + } else if (rpc_name == kRequestCheckpoint) { + b = new RequestCheckpointNotify(&service_, cq.get(), handler, req_id); + } else { + PADDLE_ENFORCE(false, "not supported rpc"); + } + + reqs[req_id] = b; + + VLOG(4) << "TryToRegisterNewOne status:" << b->Status(); +} + +void AsyncGRPCServer::HandleRequest( + ::grpc::ServerCompletionQueue* cq, const std::string& rpc_name, + std::function TryToRegisterNewOne) { + void* tag = NULL; + bool ok = false; + + while (true) { + VLOG(4) << "HandleRequest " << rpc_name << " wait next"; + if (!cq->Next(&tag, &ok)) { + LOG(WARNING) << "CompletionQueue " << rpc_name << " shutdown!"; + break; + } + + int req_id = static_cast(reinterpret_cast(tag)); + VLOG(4) << "HandleRequest " << rpc_name << ", req_id:" << req_id + << " get next"; + + auto& reqs = rpc_reqs_[rpc_name]; + RequestBase* base = nullptr; + { + PADDLE_ENFORCE(req_id >= 0 && req_id < kRequestBufSize); + std::unique_lock lock(cq_mutex_); + base = reqs[req_id]; + } + + VLOG(3) << base->Status2String(rpc_name); + + // reference: + // https://github.com/tensorflow/tensorflow/issues/5596 + // https://groups.google.com/forum/#!topic/grpc-io/xftlRy-IQwM + // https://groups.google.com/forum/#!topic/grpc-io/ywATt88Ef_I + if (!ok) { + VLOG(4) << "completion queue:" << rpc_name << " recv no regular event" + << " context:" << base->Status2String(rpc_name); + TryToRegisterNewOne(rpc_name, req_id); + delete base; + continue; + } + + switch (base->Status()) { + case PROCESS: { + base->Process(); + break; + } + case FINISH: { + TryToRegisterNewOne(rpc_name, req_id); + delete base; + break; + } + default: { assert(false); } + } + } +} + +} // namespace distributed +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/distributed/grpc/grpc_server.h b/paddle/fluid/operators/distributed/grpc/grpc_server.h new file mode 100644 index 0000000000000000000000000000000000000000..2fd3a7a74073b52770158cf47b1c86cedae78291 --- /dev/null +++ b/paddle/fluid/operators/distributed/grpc/grpc_server.h @@ -0,0 +1,88 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include +#include // NOLINT +#include +#include + +#include "grpc++/grpc++.h" +#include "paddle/fluid/framework/blocking_queue.h" +#include "paddle/fluid/framework/executor.h" +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/program_desc.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/framework/selected_rows.h" +#include "paddle/fluid/framework/var_type.h" +#include "paddle/fluid/operators/distributed/distributed_pb.h" +#include "paddle/fluid/operators/distributed/grpc/grpc_service.h" +#include "paddle/fluid/operators/distributed/request_handler.h" +#include "paddle/fluid/operators/distributed/rpc_server.h" +#include "paddle/fluid/operators/distributed/sendrecvop_utils.h" +#include "paddle/fluid/platform/profiler.h" + +namespace paddle { +namespace operators { +namespace distributed { + +class RequestBase; + +class AsyncGRPCServer final : public RPCServer { + public: + explicit AsyncGRPCServer(const std::string& address, int client_num) + : RPCServer(address, client_num), ready_(0) {} + + virtual ~AsyncGRPCServer() {} + void WaitServerReady() override; + void StartServer() override; + + private: + // HandleRequest needs to be thread-safe. + void HandleRequest( + ::grpc::ServerCompletionQueue* cq, const std::string& rpc_name, + std::function TryToRegisterNewOne); + + void TryToRegisterNewOne(const std::string& rpc_name, int req_id); + void ShutdownQueue(); + void ShutDownImpl() override; + + private: + static const int kRequestBufSize = 100; + + std::mutex cq_mutex_; + volatile bool is_shut_down_ = false; + + GrpcService::AsyncService service_; + std::unique_ptr<::grpc::Server> server_; + + // condition of the sub program + std::condition_variable barrier_condition_; + + std::mutex mutex_ready_; + std::condition_variable condition_ready_; + + int ready_; + + std::map> rpc_cq_; + std::map>> rpc_threads_; + std::map> rpc_reqs_; +}; + +}; // namespace distributed +}; // namespace operators +}; // namespace paddle diff --git a/paddle/fluid/operators/distributed/grpc/grpc_service.h b/paddle/fluid/operators/distributed/grpc/grpc_service.h new file mode 100644 index 0000000000000000000000000000000000000000..2965fe4490bedd0253682f0aef44e096232fc2fc --- /dev/null +++ b/paddle/fluid/operators/distributed/grpc/grpc_service.h @@ -0,0 +1,136 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "paddle/fluid/operators/distributed/grpc/grpc_variable_response.h" +#include "paddle/fluid/platform/profiler.h" + +// NOTE: This method was originally created by tensorflow +// (https://github.com/tensorflow/tensorflow/) we borrow this +// method and did some modifications so that we can parse gRPC +// requests without too much copying of the tensor data. + +namespace grpc { +class CompletionQueue; +class Channel; +class RpcService; +class ServerCompletionQueue; +class ServerContext; + +// Support parsing/unparsing of tensorflow::VariableResponse. +// Wire-format is identical to RecvVariableResponse. +template <> +class SerializationTraits< + paddle::operators::distributed::GRPCVariableResponse> { + public: + static Status Serialize( + const paddle::operators::distributed::GRPCVariableResponse& msg, + grpc_byte_buffer** bp, bool* own_buffer) { + PADDLE_ENFORCE(false, "SerializationTraits::Serialize not implemented!"); + return Status(); + } + static Status Deserialize( + grpc_byte_buffer* buffer, + paddle::operators::distributed::GRPCVariableResponse* msg, + int max_message_size = INT_MAX) { + if (buffer == nullptr) { + return Status(StatusCode::INTERNAL, "No payload"); + } + + Status result = g_core_codegen_interface->ok(); + if (result.ok()) { + paddle::operators::distributed::GrpcByteSource source(buffer); + int ret = msg->Parse(&source); + if (ret != 0) { + result = Status(StatusCode::INTERNAL, "VariableResponse parse error"); + } + } + g_core_codegen_interface->grpc_byte_buffer_destroy(buffer); + return result; + } +}; +} // namespace grpc + +namespace paddle { +namespace operators { +namespace distributed { + +enum class GrpcMethod { + kSendVariable, + kGetVariable, + kPrefetchVariable, + kCheckpointNotify, + kGetVariableNoBarrier, + kGetMonomerVariable, + kGetMonomerBarrier, +}; + +static const int kGrpcNumMethods = + static_cast(GrpcMethod::kGetMonomerBarrier) + 1; + +inline const char* GrpcMethodName(GrpcMethod id) { + switch (id) { + case GrpcMethod::kSendVariable: + return "/sendrecv.SendRecvService/SendVariable"; + case GrpcMethod::kGetVariable: + return "/sendrecv.SendRecvService/GetVariable"; + case GrpcMethod::kGetVariableNoBarrier: + return "/sendrecv.SendRecvService/GetVariableNoBarrier"; + case GrpcMethod::kGetMonomerVariable: + return "/sendrecv.SendRecvService/GetMonomerVariable"; + case GrpcMethod::kGetMonomerBarrier: + return "/sendrecv.SendRecvService/GetMonomerBarrier"; + case GrpcMethod::kPrefetchVariable: + return "/sendrecv.SendRecvService/PrefetchVariable"; + case GrpcMethod::kCheckpointNotify: + return "/sendrecv.SendRecvService/CheckpointNotify"; + } + + // Shouldn't be reached. + PADDLE_ENFORCE(false, "Invalid id: not found valid method name"); + return nullptr; +} + +class GrpcService final { + public: + class AsyncService : public ::grpc::Service { + public: + AsyncService() { + for (int i = 0; i < kGrpcNumMethods; ++i) { + AddMethod(new ::grpc::internal::RpcServiceMethod( + GrpcMethodName(static_cast(i)), + ::grpc::internal::RpcMethod::NORMAL_RPC, nullptr)); + ::grpc::Service::MarkMethodAsync(i); + } + } + virtual ~AsyncService() {} + + // Make RequestAsyncUnary public for grpc_call.h + using ::grpc::Service::RequestAsyncUnary; + }; +}; + +} // namespace distributed +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/distributed/grpc/grpc_variable_response.cc b/paddle/fluid/operators/distributed/grpc/grpc_variable_response.cc new file mode 100644 index 0000000000000000000000000000000000000000..87e83ca53bf13ac4a015d56572ba073e51722c3e --- /dev/null +++ b/paddle/fluid/operators/distributed/grpc/grpc_variable_response.cc @@ -0,0 +1,330 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include +#include +#ifdef PADDLE_WITH_CUDA +#include +#endif + +#include "paddle/fluid/operators/distributed/grpc/grpc_variable_response.h" +#include "paddle/fluid/platform/profiler.h" + +namespace paddle { +namespace operators { +namespace distributed { + +enum WireType { + WIRETYPE_VARINT = 0, + WIRETYPE_LENGTH_DELIMITED = 2, +}; + +inline int GetTagFieldNumber(uint32_t tag) { return tag >> 3; } + +inline WireType GetTagWireType(uint32_t tag) { + return static_cast(tag & 0x7); +} + +bool ReadVarintSizeAsInt(::google::protobuf::io::CodedInputStream* input, + int* result) { + uint64_t v; + if (input->ReadVarint64(&v) && v <= static_cast(INT_MAX)) { + *result = static_cast(v); + return true; + } else { + return false; + } +} + +int GRPCVariableResponse::Parse(const ::grpc::ByteBuffer& byte_buffer) { + GrpcByteBufferSource source; + source.Init(byte_buffer); + GrpcByteBufferSourceWrapper r(&source); + + return Parse(&r); +} + +bool ParseLodData(::google::protobuf::io::CodedInputStream* input, + std::vector* lod) { + while (true) { + auto p = input->ReadTagWithCutoff(127); + int tag = GetTagFieldNumber(p.first); + WireType wt = GetTagWireType(p.first); + + if (!p.second) { + return (tag == 0); + } + + switch (tag) { + case sendrecv::VariableMessage_LodData::kLodDataFieldNumber: { + uint64_t v; + if (wt == WIRETYPE_VARINT) { + if (!input->ReadVarint64(&v)) { + return false; + } + lod->push_back(v); + break; + } + + if (wt == WIRETYPE_LENGTH_DELIMITED) { + int num_bytes = 0; + if (!input->ReadVarintSizeAsInt(&num_bytes)) { + return tag; + } + int start_pos = input->CurrentPosition(); + while (input->CurrentPosition() - start_pos < num_bytes) { + uint64_t v; + if (!input->ReadVarint64(&v)) { + return tag; + } + lod->push_back(v); + } + break; + } + + return false; + } + default: { return false; } + } + } + + return true; +} + +int GRPCVariableResponse::Parse(Source* source) { + ::google::protobuf::io::ZeroCopyInputStream* input_stream = + source->contents(); + ::google::protobuf::io::CodedInputStream input(input_stream); + input.SetTotalBytesLimit(INT_MAX, INT_MAX); + + while (true) { + auto p = input.ReadTagWithCutoff(127); + int tag = GetTagFieldNumber(p.first); + WireType wt = GetTagWireType(p.first); + if (!p.second) { + if (tag != 0) { + return -1; + } + return 0; + } + + switch (tag) { + case sendrecv::VariableMessage::kVarnameFieldNumber: { + uint32_t length; + if ((wt != WIRETYPE_LENGTH_DELIMITED) || !input.ReadVarint32(&length)) { + return tag; + } + + std::string temp; + if (!input.ReadString(&temp, length)) { + return tag; + } + + meta_.set_varname(temp); + break; + } + case sendrecv::VariableMessage::kTypeFieldNumber: { + uint32_t v; + if ((wt != WIRETYPE_VARINT) || !input.ReadVarint32(&v)) { + return tag; + } + + meta_.set_type(static_cast<::sendrecv::VarType>(v)); + break; + } + case sendrecv::VariableMessage::kDataTypeFieldNumber: { + uint32_t v = 0; + if ((wt != WIRETYPE_VARINT) || !input.ReadVarint32(&v)) { + return tag; + } + + meta_.set_data_type(static_cast<::sendrecv::VariableMessage_Type>(v)); + break; + } + case sendrecv::VariableMessage::kDimsFieldNumber: { + // not packed + if (wt == WIRETYPE_VARINT) { + uint64_t v; + if (!input.ReadVarint64(&v)) { + return tag; + } + meta_.add_dims(v); + break; + } + + // packed + if (wt == WIRETYPE_LENGTH_DELIMITED) { + int num_bytes = 0; + if (!input.ReadVarintSizeAsInt(&num_bytes)) { + return tag; + } + int start_pos = input.CurrentPosition(); + while (input.CurrentPosition() - start_pos < num_bytes) { + uint64_t v; + if (!input.ReadVarint64(&v)) { + return tag; + } + meta_.add_dims(v); + } + break; + } + return tag; + } + case sendrecv::VariableMessage::kLodLevelFieldNumber: { + uint64_t v = 0; + if ((wt != WIRETYPE_VARINT) || !input.ReadVarint64(&v)) { + return tag; + } + meta_.set_lod_level(static_cast(v)); + break; + } + case sendrecv::VariableMessage::kLodFieldNumber: { + int length = 0; + if (wt != WIRETYPE_LENGTH_DELIMITED || + !ReadVarintSizeAsInt(&input, &length)) { + return tag; + } + + std::pair<::google::protobuf::io::CodedInputStream::Limit, int> p = + input.IncrementRecursionDepthAndPushLimit(length); + + std::vector lod_data; + if (p.second < 0 || !ParseLodData(&input, &lod_data)) { + return tag; + } + + if (!input.DecrementRecursionDepthAndPopLimit(p.first)) { + return tag; + } + + if (lod_data.size() == 0) { + break; + } + + auto lod = meta_.add_lod(); + for (uint32_t i = 0; i < lod_data.size(); i++) { + lod->add_lod_data(lod_data[i]); + } + break; + } + case sendrecv::VariableMessage::kSlrHeightFieldNumber: { + uint64_t v = 0; + if ((wt != WIRETYPE_VARINT) || !input.ReadVarint64(&v)) { + return tag; + } + meta_.set_slr_height(static_cast(v)); + break; + } + case sendrecv::VariableMessage::kSerializedFieldNumber: { + int num_bytes = 0; + if (wt != WIRETYPE_LENGTH_DELIMITED || + !ReadVarintSizeAsInt(&input, &num_bytes)) { + return tag; + } + + if (!ProcSerializedField(tag, &input, num_bytes)) { + return tag; + } + + break; + } + case sendrecv::VariableMessage::kRowsFieldNumber: { + PADDLE_ENFORCE((meta_.type() == sendrecv::SELECTED_ROWS || + meta_.type() == sendrecv::LOD_TENSOR) && + meta_.varname() != "", + "meta info should be got first!"); + + int num_bytes = 0; + if (wt != WIRETYPE_LENGTH_DELIMITED || + !ReadVarintSizeAsInt(&input, &num_bytes)) { + return tag; + } + + if (!CopySelectRowsData(&input, *dev_ctx_, num_bytes)) { + return tag; + } + break; + } + case sendrecv::VariableMessage::kOutVarnameFieldNumber: { + uint32_t length; + if ((wt != WIRETYPE_LENGTH_DELIMITED) || !input.ReadVarint32(&length)) { + return tag; + } + + std::string temp; + if (!input.ReadString(&temp, length)) { + return tag; + } + + meta_.set_out_varname(temp); + break; + } + case sendrecv::VariableMessage::kProfileFieldNumber: { + uint64_t profiling = 0; + if (!input.ReadVarint64(&profiling)) { + return tag; + } + meta_.set_profile(profiling); + int64_t listener_id = platform::ListenerId(); + if (listener_id <= 0) { + break; + } + if (profiling == platform::kEnableProfiler && + !platform::IsProfileEnabled()) { + platform::EnableProfiler(platform::ProfilerState::kCPU); + } else if (profiling == platform::kDisableProfiler && + platform::IsProfileEnabled()) { + platform::DisableProfiler( + platform::EventSortingKey::kDefault, + string::Sprintf("%s_%lld", FLAGS_rpc_server_profile_path, + listener_id)); + } + break; + } + case sendrecv::VariableMessage::kTrainerIdFieldNumber: { + uint64_t trainer_id = 0; + if (!input.ReadVarint64(&trainer_id)) { + return tag; + } + meta_.set_trainer_id(trainer_id); + break; + } + case sendrecv::VariableMessage::kTableNameFieldNumber: { + uint32_t length; + if ((wt != WIRETYPE_LENGTH_DELIMITED) || !input.ReadVarint32(&length)) { + return tag; + } + + std::string temp; + if (!input.ReadString(&temp, length)) { + return tag; + } + + meta_.set_table_name(temp); + break; + } + default: { + // Unknown tag, return unknown error. + return -1; + } + } + } + + return 0; +} + +}; // namespace distributed +}; // namespace operators +}; // namespace paddle diff --git a/paddle/fluid/operators/distributed/grpc/grpc_variable_response.h b/paddle/fluid/operators/distributed/grpc/grpc_variable_response.h new file mode 100644 index 0000000000000000000000000000000000000000..3ca1d89f750313791c833a9f1f58760406e690c2 --- /dev/null +++ b/paddle/fluid/operators/distributed/grpc/grpc_variable_response.h @@ -0,0 +1,56 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include + +#include "paddle/fluid/framework/data_type.h" +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/framework/selected_rows.h" +#include "paddle/fluid/framework/var_type.h" + +#include "google/protobuf/io/coded_stream.h" +#include "google/protobuf/io/zero_copy_stream.h" +#include "paddle/fluid/framework/tensor.h" +#include "paddle/fluid/operators/distributed/distributed_pb.h" +#include "paddle/fluid/operators/distributed/grpc/grpc_bytebuffer_stream.h" +#include "paddle/fluid/operators/distributed/variable_response.h" + +namespace paddle { +namespace operators { +namespace distributed { + +class GRPCVariableResponse : public VariableResponse { + public: + GRPCVariableResponse(const framework::Scope* scope, + const platform::DeviceContext* dev_ctx, + bool create_scope = false) + : VariableResponse(scope, dev_ctx, create_scope) {} + + virtual ~GRPCVariableResponse() {} + + int Parse(Source* source) override; + + // return: + // 0:ok. + // -1: unkown error. + // other: number of error field. + int Parse(const ::grpc::ByteBuffer& byte_buffer); +}; + +}; // namespace distributed +}; // namespace operators +}; // namespace paddle diff --git a/paddle/fluid/operators/distributed/grpc_bytebuffer_stream.cc b/paddle/fluid/operators/distributed/grpc_bytebuffer_stream.cc deleted file mode 100644 index d192f54ee0c924b772045d9b6a01701f640e07c7..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/distributed/grpc_bytebuffer_stream.cc +++ /dev/null @@ -1,88 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -// NOTE: This file was originally created by tensorflow -// (https://github.com/tensorflow/tensorflow/) we borrow this -// file and did some modifications so that we can send gRPC -// requests without too much copying of the tensor data. - -#include "paddle/fluid/operators/distributed/grpc_bytebuffer_stream.h" - -namespace paddle { -namespace operators { -namespace distributed { - -GrpcByteBufferSource::GrpcByteBufferSource() {} - -bool GrpcByteBufferSource::Init(const grpc::ByteBuffer& src) { - cur_ = -1; - left_ = 0; - ptr_ = nullptr; - byte_count_ = 0; - bool ok = src.Dump(&slices_).ok(); - if (!ok) { - slices_.clear(); - } - return ok; -} - -bool GrpcByteBufferSource::Next(const void** data, int* size) { - // Use loop instead of if in case buffer contained empty slices. - while (left_ == 0) { - // Advance to next slice. - cur_++; - if (cur_ >= slices_.size()) { - return false; - } - const ::grpc::Slice& s = slices_[cur_]; - left_ = s.size(); - ptr_ = reinterpret_cast(s.begin()); - } - - *data = ptr_; - *size = left_; - byte_count_ += left_; - ptr_ += left_; - left_ = 0; - return true; -} - -void GrpcByteBufferSource::BackUp(int count) { - ptr_ -= count; - left_ += count; - byte_count_ -= count; -} - -bool GrpcByteBufferSource::Skip(int count) { - const void* data; - int size; - while (Next(&data, &size)) { - if (size >= count) { - BackUp(size - count); - return true; - } - // size < count; - count -= size; - } - // error or we have too large count; - return false; -} - -google::protobuf::int64 GrpcByteBufferSource::ByteCount() const { - return byte_count_; -} - -} // namespace distributed -} // namespace operators -} // namespace paddle diff --git a/paddle/fluid/operators/distributed/grpc_client.cc b/paddle/fluid/operators/distributed/grpc_client.cc deleted file mode 100644 index 076ecc1f01d89913081892eb6aa828b095b09656..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/distributed/grpc_client.cc +++ /dev/null @@ -1,394 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include -#include - -#include "glog/logging.h" // For VLOG -#include "paddle/fluid/framework/threadpool.h" -#include "paddle/fluid/operators/distributed/grpc_client.h" -#include "paddle/fluid/operators/distributed/grpc_serde.h" -#include "paddle/fluid/operators/distributed/request_handler.h" -#include "paddle/fluid/platform/profiler.h" - -namespace paddle { -namespace operators { -namespace distributed { - -void GRPCClient::InitImpl() { InitEventLoop(); } - -void GRPCClient::InitEventLoop() { - // start the client process thread - // TODO(wuyi): can make this in a threadpool - client_thread_.reset(new std::thread(std::bind(&GRPCClient::Proceed, this))); -} - -void GRPCClient::SendComplete() { - std::unique_lock lk(completed_mutex_); - if (!completed_) { - for (auto& it : channels_) { - VLOG(3) << "send complete message to " << it.first; - this->AsyncSendComplete(it.first); - } - PADDLE_ENFORCE(this->Wait(), "internal grpc error"); - completed_ = true; - } -} - -GRPCClient::~GRPCClient() { - stopped_ = true; - Wait(); - cq_.Shutdown(); - { - std::lock_guard guard(chan_mutex_); - for (auto& it : channels_) { - it.second.reset(); - } - channels_.clear(); - } - client_thread_->join(); -} - -VarHandlePtr GRPCClient::AsyncSendVar(const std::string& ep, - const platform::DeviceContext& ctx, - const framework::Scope& scope, - const std::string& var_name, - int64_t time_out) { - const platform::DeviceContext* p_ctx = &ctx; - const std::string ep_val = ep; - const std::string var_name_val = var_name; - const framework::Scope* p_scope = &scope; - const auto ch = GetChannel(ep_val); - SendProcessor* s = new SendProcessor(ch); - const std::string method = "SendRPC"; - VarHandlePtr h(new VarHandle(ep, method, var_name_val, p_ctx, p_scope)); - s->Prepare(h, time_out); - - framework::AsyncIO([var_name_val, p_scope, p_ctx, s, method, h, this] { - auto* var = p_scope->FindVar(var_name_val); - - ::grpc::ByteBuffer req; - SerializeToByteBuffer(var_name_val, var, *p_ctx, &req); - - VLOG(3) << s->GetVarHandlePtr()->String() << " begin"; - - // stub context - s->response_call_back_ = nullptr; - - platform::RecordEvent record_event(method, p_ctx); - - auto call = s->stub_g_.PrepareUnaryCall( - s->context_.get(), "/sendrecv.SendRecvService/SendVariable", req, &cq_); - call->StartCall(); - call->Finish(&s->reply_, &s->status_, reinterpret_cast(s)); - - if (UNLIKELY(platform::IsProfileEnabled())) { - h->Wait(); - } - }); - req_count_++; - - return h; -} - -void ProcGetResponse(const VarHandle& var_h, - const ::grpc::ByteBuffer& ret_msg) { - framework::Variable* outvar = nullptr; - DeserializeFromByteBuffer(ret_msg, *var_h.ctx(), var_h.scope(), &outvar); -} - -template -void RequestToByteBuffer(const T& proto, ::grpc::ByteBuffer* result) { - ::grpc::Slice slice(proto.ByteSizeLong()); - proto.SerializeWithCachedSizesToArray(const_cast(slice.begin())); - ::grpc::ByteBuffer tmp(&slice, 1); - result->Swap(&tmp); -} - -VarHandlePtr GRPCClient::AsyncGetVar(const std::string& ep, - const platform::DeviceContext& ctx, - const framework::Scope& scope, - const std::string& var_name, - int64_t time_out) { - const platform::DeviceContext* p_ctx = &ctx; - const std::string ep_val = ep; - const std::string var_name_val = var_name; - const framework::Scope* p_scope = &scope; - const auto ch = GetChannel(ep_val); - GetProcessor* s = new GetProcessor(ch); - const std::string method = "GetRPC"; - VarHandlePtr h(new VarHandle(ep, method, var_name_val, p_ctx, p_scope)); - s->Prepare(h, time_out); - - framework::AsyncIO([var_name_val, s, method, p_ctx, h, this] { - // prepare input - sendrecv::VariableMessage req; - req.set_varname(var_name_val); - ::grpc::ByteBuffer buf; - RequestToByteBuffer(req, &buf); - - VLOG(3) << s->GetVarHandlePtr()->String() << " begin"; - - // stub context - s->response_call_back_ = ProcGetResponse; - - platform::RecordEvent record_event(method, p_ctx); - - auto call = s->stub_g_.PrepareUnaryCall( - s->context_.get(), "/sendrecv.SendRecvService/GetVariable", buf, &cq_); - call->StartCall(); - call->Finish(&s->reply_, &s->status_, reinterpret_cast(s)); - - if (UNLIKELY(platform::IsProfileEnabled())) { - h->Wait(); - } - }); - - req_count_++; - - return h; -} - -VarHandlePtr GRPCClient::AsyncPrefetchVar(const std::string& ep, - const platform::DeviceContext& ctx, - const framework::Scope& scope, - const std::string& in_var_name, - const std::string& out_var_name, - int64_t time_out) { - const platform::DeviceContext* p_ctx = &ctx; - const std::string ep_val = ep; - const std::string in_var_name_val = in_var_name; - const std::string out_var_name_val = out_var_name; - const framework::Scope* p_scope = &scope; - const auto ch = GetChannel(ep_val); - GetProcessor* s = new GetProcessor(ch); - - const std::string method = "PrefetchRPC"; - - VarHandlePtr h(new VarHandle(ep, method, out_var_name_val, p_ctx, p_scope)); - s->Prepare(h, time_out); - - framework::AsyncIO([in_var_name_val, out_var_name_val, ep_val, p_scope, p_ctx, - s, method, h, this] { - auto* var = p_scope->FindVar(in_var_name_val); - - ::grpc::ByteBuffer req; - SerializeToByteBuffer(in_var_name_val, var, *p_ctx, &req, out_var_name_val); - - VLOG(3) << s->GetVarHandlePtr()->String() << " begin"; - - // stub context - s->response_call_back_ = ProcGetResponse; - - platform::RecordEvent record_event(method, p_ctx); - - auto call = s->stub_g_.PrepareUnaryCall( - s->context_.get(), "/sendrecv.SendRecvService/PrefetchVariable", req, - &cq_); - call->StartCall(); - call->Finish(&s->reply_, &s->status_, static_cast(s)); - - if (UNLIKELY(platform::IsProfileEnabled())) { - h->Wait(); - } - }); - - req_count_++; - return h; -} - -VarHandlePtr GRPCClient::AsyncSendBatchBarrier(const std::string& ep, - int64_t time_out) { - const auto ch = GetChannel(ep); - - BatchBarrierProcessor* s = new BatchBarrierProcessor(ch); - const std::string method = "BatchBarrierRPC"; - VarHandlePtr h( - new VarHandle(ep, method, BATCH_BARRIER_MESSAGE, nullptr, nullptr)); - s->Prepare(h, time_out); - - sendrecv::VariableMessage req; - req.set_varname(BATCH_BARRIER_MESSAGE); - - platform::RecordEvent record_event(method, nullptr); - - auto rpc = s->stub_->AsyncSendVariable(s->context_.get(), req, &cq_); - rpc->Finish(&s->reply_, &s->status_, reinterpret_cast(s)); - req_count_++; - - if (UNLIKELY(platform::IsProfileEnabled())) { - h->Wait(); - } - - return h; -} - -VarHandlePtr GRPCClient::AsyncSendFetchBarrier(const std::string& ep, - int64_t time_out) { - const auto ch = GetChannel(ep); - FetchBarrierProcessor* s = new FetchBarrierProcessor(ch); - const std::string method = "FetchBarrierRPC"; - VarHandlePtr h( - new VarHandle(ep, method, FETCH_BARRIER_MESSAGE, nullptr, nullptr)); - s->Prepare(h, time_out); - - sendrecv::VariableMessage req; - req.set_varname(FETCH_BARRIER_MESSAGE); - - platform::RecordEvent record_event(method, nullptr); - - auto rpc = s->stub_->AsyncGetVariable(s->context_.get(), req, &cq_); - rpc->Finish(&s->reply_, &s->status_, reinterpret_cast(s)); - req_count_++; - - if (UNLIKELY(platform::IsProfileEnabled())) { - h->Wait(); - } - - return h; -} - -VarHandlePtr GRPCClient::AsyncSendComplete(const std::string& ep, - int64_t time_out) { - const auto ch = GetChannel(ep); - - BatchBarrierProcessor* s = new BatchBarrierProcessor(ch); - const std::string method = "SendCompleteRPC"; - VarHandlePtr h(new VarHandle(ep, method, COMPLETE_MESSAGE, nullptr, nullptr)); - s->Prepare(h, time_out); - - sendrecv::VariableMessage req; - req.set_varname(COMPLETE_MESSAGE); - - platform::RecordEvent record_event(method, nullptr); - - auto rpc = s->stub_->AsyncSendVariable(s->context_.get(), req, &cq_); - rpc->Finish(&s->reply_, &s->status_, reinterpret_cast(s)); - req_count_++; - - if (UNLIKELY(platform::IsProfileEnabled())) { - h->Wait(); - } - - return h; -} - -VarHandlePtr GRPCClient::AsyncCheckpointNotify(const std::string& ep, - const std::string& dir, - int64_t time_out) { - const auto ch = GetChannel(ep); - - CheckpointNotifyProcessor* s = new CheckpointNotifyProcessor(ch); - - const std::string method = "CheckPointNotifyRPC"; - - VarHandlePtr h( - new VarHandle(ep, method, CHECKPOINT_SAVE_MESSAGE, nullptr, nullptr)); - s->Prepare(h, time_out); - - sendrecv::VariableMessage req; - req.set_varname(CHECKPOINT_SAVE_MESSAGE); - req.set_out_varname(dir); - - platform::RecordEvent record_event(method, nullptr); - - auto rpc = s->stub_->AsyncCheckpointNotify(s->context_.get(), req, &cq_); - rpc->Finish(&s->reply_, &s->status_, reinterpret_cast(s)); - req_count_++; - - if (UNLIKELY(platform::IsProfileEnabled())) { - h->Wait(); - } - - return h; -} - -bool GRPCClient::Wait() { - std::unique_lock lk(sync_mutex_); - sync_cond_.wait(lk, [this] { return (req_count_ == 0 || ok_ == false); }); - return ok_; -} - -void GRPCClient::Proceed() { - void* tag = nullptr; - bool ok = false; - - VLOG(3) << "GRPCClient Proceed begin"; - while (!stopped_ && cq_.Next(&tag, &ok)) { - BaseProcessor* c = static_cast(tag); - GPR_ASSERT(ok); - PADDLE_ENFORCE(c); - - if (c->status_.ok()) { - VLOG(3) << c->GetVarHandlePtr()->String() << " process"; - c->Process(); - } else if (c->status_.error_code() == grpc::StatusCode::DEADLINE_EXCEEDED) { - // FIXME(gongwb): parse error_details? - LOG(ERROR) << c->GetVarHandlePtr()->String() - << " meets grpc error, error_code:" << c->status_.error_code() - << " error_message:" << c->status_.error_message() - << " error_details:" << c->status_.error_details(); - { - std::lock_guard lk(sync_mutex_); - ok_ = false; - } - c->Finish(false); - } else { - LOG(FATAL) << c->GetVarHandlePtr()->String() - << " meets grpc error, error_code:" << c->status_.error_code() - << " error_message:" << c->status_.error_message() - << " error_details:" << c->status_.error_details(); - - c->Finish(false); - } - - bool notify = false; - { - std::lock_guard lk(sync_mutex_); - req_count_--; - notify = (req_count_ <= 0 || !c->status_.ok()); - } - - delete c; - - if (notify) { - sync_cond_.notify_all(); - } - } - VLOG(3) << "GRPCClient Proceed end"; -} - -std::shared_ptr GRPCClient::GetChannel(const std::string& ep) { - std::lock_guard guard(chan_mutex_); - auto it = channels_.find(ep); - if (it != channels_.end()) { - return it->second; - } - - // Channel configurations: - grpc::ChannelArguments args; - args.SetInt(GRPC_ARG_MAX_RECONNECT_BACKOFF_MS, 2000); - args.SetCompressionAlgorithm(GRPC_COMPRESS_NONE); - args.SetMaxSendMessageSize(std::numeric_limits::max()); - args.SetMaxReceiveMessageSize(std::numeric_limits::max()); - - auto ch = - grpc::CreateCustomChannel(ep, grpc::InsecureChannelCredentials(), args); - channels_[ep] = ch; - return ch; -} - -} // namespace distributed -} // namespace operators -} // namespace paddle diff --git a/paddle/fluid/operators/distributed/grpc_client.h b/paddle/fluid/operators/distributed/grpc_client.h deleted file mode 100644 index d8e9cee85bd734c2ed4b1cae03ecee04e304b651..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/distributed/grpc_client.h +++ /dev/null @@ -1,251 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#pragma once - -#include -#include - -#include // NOLINT -#include // NOLINT -#include -#include -#include -#include -#include // NOLINT -#include -#include // NOLINT -#include - -#include "grpc++/channel.h" -#include "grpc++/generic/generic_stub.h" -#include "grpc++/grpc++.h" -#include "grpc++/support/byte_buffer.h" -#include "grpc++/support/slice.h" -#include "grpc/support/log.h" -#include "paddle/fluid/framework/blocking_queue.h" -#include "paddle/fluid/framework/data_type.h" -#include "paddle/fluid/framework/lod_tensor.h" -#include "paddle/fluid/framework/scope.h" -#include "paddle/fluid/framework/selected_rows.h" -#include "paddle/fluid/operators/distributed/request_handler.h" -#include "paddle/fluid/operators/distributed/rpc_client.h" -#include "paddle/fluid/operators/distributed/send_recv.grpc.pb.h" -#include "paddle/fluid/operators/distributed/send_recv.pb.h" -#include "paddle/fluid/operators/distributed/sendrecvop_utils.h" -#include "paddle/fluid/platform/macros.h" // for DISABLE_COPY_AND_ASSIGN - -namespace paddle { -namespace operators { -namespace distributed { - -void ProcGetResponse(const VarHandle& var_h, const grpc::ByteBuffer& msg); - -class BaseProcessor { - public: - BaseProcessor() { context_ = nullptr; } - - virtual ~BaseProcessor() {} - - virtual void Prepare(VarHandlePtr h, int64_t time_out) { - var_h_ = h; - - context_.reset(new grpc::ClientContext()); - context_->set_wait_for_ready(true); - if (time_out) { - std::chrono::system_clock::time_point deadline = - std::chrono::system_clock::now() + - std::chrono::milliseconds(time_out); - context_->set_deadline(deadline); - } - } - - void Process() { - ProcessImpl(); - var_h_->Finish(true); - } - - VarHandlePtr GetVarHandlePtr() { return var_h_; } - bool Wait() { return var_h_->Wait(); } - void Finish(bool ok) { return var_h_->Finish(ok); } - virtual void ProcessImpl() = 0; - - std::unique_ptr context_; - grpc::Status status_; - - protected: - VarHandlePtr var_h_; -}; - -typedef std::function - RequestSendCallBack; - -class SendProcessor : public BaseProcessor { - public: - explicit SendProcessor(std::shared_ptr ch) - : BaseProcessor(), stub_g_(ch) {} - - virtual ~SendProcessor() {} - - void ProcessImpl() override { - if (response_call_back_) { - response_call_back_(*var_h_.get(), reply_); - } - } - - ::grpc::GenericStub stub_g_; - ::grpc::ByteBuffer reply_; - RequestSendCallBack response_call_back_ = nullptr; -}; - -typedef std::function - RequestGetCallBack; - -class GetProcessor : public BaseProcessor { - public: - explicit GetProcessor(std::shared_ptr ch) - : BaseProcessor(), stub_g_(ch) {} - - virtual ~GetProcessor() {} - - void ProcessImpl() override { - if (response_call_back_) { - response_call_back_(*var_h_.get(), reply_); - } - } - - ::grpc::ByteBuffer reply_; - ::grpc::GenericStub stub_g_; - RequestGetCallBack response_call_back_ = ProcGetResponse; -}; - -class BatchBarrierProcessor : public BaseProcessor { - public: - explicit BatchBarrierProcessor(std::shared_ptr ch) - : BaseProcessor() { - stub_ = sendrecv::SendRecvService::NewStub(ch); - } - - virtual ~BatchBarrierProcessor() {} - - void ProcessImpl() override {} - sendrecv::VoidMessage reply_; - std::unique_ptr stub_; -}; - -class FetchBarrierProcessor : public BaseProcessor { - public: - explicit FetchBarrierProcessor(std::shared_ptr ch) - : BaseProcessor() { - stub_ = sendrecv::SendRecvService::NewStub(ch); - } - - virtual ~FetchBarrierProcessor() {} - - void ProcessImpl() override {} - sendrecv::VariableMessage reply_; - std::unique_ptr stub_; -}; - -class CheckpointNotifyProcessor : public BaseProcessor { - public: - explicit CheckpointNotifyProcessor(std::shared_ptr ch) - : BaseProcessor() { - stub_ = sendrecv::SendRecvService::NewStub(ch); - } - - virtual ~CheckpointNotifyProcessor() {} - - void ProcessImpl() override {} - sendrecv::VoidMessage reply_; - std::unique_ptr stub_; -}; - -class GRPCClient : public RPCClient { - public: - GRPCClient() : ok_(true), completed_(false), stopped_(false) {} - virtual ~GRPCClient(); - - VarHandlePtr AsyncSendVar(const std::string& ep, - const platform::DeviceContext& ctx, - const framework::Scope& scope, - const std::string& var_name, - int64_t time_out = FLAGS_rpc_deadline) override; - - VarHandlePtr AsyncGetVar(const std::string& ep, - const platform::DeviceContext& ctx, - const framework::Scope& scope, - const std::string& var_name, - int64_t time_out = FLAGS_rpc_deadline) override; - - VarHandlePtr AsyncPrefetchVar(const std::string& ep, - const platform::DeviceContext& ctx, - const framework::Scope& scope, - const std::string& in_var_name, - const std::string& out_var_name, - int64_t time_out = FLAGS_rpc_deadline) override; - - VarHandlePtr AsyncSendBatchBarrier( - const std::string& ep, int64_t time_out = FLAGS_rpc_deadline) override; - - VarHandlePtr AsyncSendFetchBarrier( - const std::string& ep, int64_t time_out = FLAGS_rpc_deadline) override; - - VarHandlePtr AsyncCheckpointNotify( - const std::string& ep, const std::string& dir, - int64_t time_out = FLAGS_rpc_deadline) override; - - VarHandlePtr AsyncSendComplete( - const std::string& ep, int64_t time_out = FLAGS_rpc_deadline) override; - - bool Wait() override; - - void SendComplete() override; - - protected: - void InitImpl() override; - - private: - // InitEventLoop should only be called by Init() - void InitEventLoop(); - - void Proceed(); - - std::shared_ptr GetChannel(const std::string& ep); - - private: - grpc::CompletionQueue cq_; - std::unordered_map> channels_; - std::unique_ptr client_thread_; - - // mutex for Wait client sync - std::mutex sync_mutex_; - std::condition_variable sync_cond_; - std::atomic req_count_{0}; - bool ok_; - - // mutex for GetChannel thread safety - std::mutex chan_mutex_; - DISABLE_COPY_AND_ASSIGN(GRPCClient); - - // mutex for sending complete message only once - std::mutex completed_mutex_; - bool completed_; - - volatile bool stopped_; -}; - -} // namespace distributed -} // namespace operators -} // namespace paddle diff --git a/paddle/fluid/operators/distributed/grpc_serde.cc b/paddle/fluid/operators/distributed/grpc_serde.cc deleted file mode 100644 index ffe8f082db34b2ffd6b277080030463080feeb1d..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/distributed/grpc_serde.cc +++ /dev/null @@ -1,159 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#ifdef PADDLE_WITH_CUDA -#include -#endif -#include -#include // NOLINT - -#include "google/protobuf/io/coded_stream.h" -#include "google/protobuf/io/zero_copy_stream.h" -#include "paddle/fluid/framework/data_type.h" -#include "paddle/fluid/operators/distributed/grpc_bytebuffer_stream.h" -#include "paddle/fluid/operators/distributed/grpc_serde.h" -#include "paddle/fluid/operators/distributed/grpc_variable_response.h" -#include "paddle/fluid/operators/distributed/proto_encoder_helper.h" -#include "paddle/fluid/operators/distributed/sendrecvop_utils.h" -#include "paddle/fluid/platform/profiler.h" - -namespace paddle { -namespace operators { -namespace distributed { - -void SerializeToByteBuffer(const std::string& name, framework::Variable* var, - const platform::DeviceContext& ctx, - ::grpc::ByteBuffer* msg, - const std::string& out_name) { - platform::RecordEvent record_event("serial", &ctx); - // Default DestroyCallback does nothing, When using GPU - // the CPU buffer need to be freed. - DestroyCallback destroy_callback = [](void* backing) {}; - VarMsg request; - void* payload = nullptr; - size_t payload_size; - - request.set_varname(name); - // Note: normally the profiler is enabled in 1 trainer, hence only - // 1 trainer returns true for ShouldSendProfileState(). It tells PS - // servers the trainer's profiling state so that PS can follow the - // trainer. - if (platform::ShouldSendProfileState()) { - if (platform::IsProfileEnabled()) { - request.set_profile(platform::kEnableProfiler); - } else { - request.set_profile(platform::kDisableProfiler); - } - } - if (!out_name.empty()) { - request.set_out_varname(out_name); - } - if (var->IsType()) { - request.set_type(::sendrecv::LOD_TENSOR); - GetTensorPayload(var, ctx, &request, &payload, &payload_size); - } else if (var->IsType()) { - request.set_type(::sendrecv::SELECTED_ROWS); - GetSelectedRowsPayload(var, ctx, &request, &payload, &payload_size); -#ifdef PADDLE_WITH_CUDA - } else if (var->IsType()) { - request.set_type(::sendrecv::NCCL_ID); -#endif - } else { - PADDLE_THROW("Serialize does not support type: %s", - typeid(var->Type()).name()); - } - - if (platform::is_gpu_place(ctx.GetPlace())) { -#ifdef PADDLE_WITH_CUDA - // GPU data is copied to CPU buffer when sending, - // free the buffer when possible. - destroy_callback = [](void* backing) { - platform::CUDAPinnedPlace cuda_pinned; - memory::Free(cuda_pinned, backing); - }; -#endif - } - - std::string header; - request.AppendToString(&header); - auto buffer = std::unique_ptr(new char[1024]); - void* buf = buffer.get(); - ProtoEncodeHelper e(static_cast(buf), 1024); - e.WriteRawBytes(std::string(header.data(), header.size())); -// NCCLID is copied directly to the message, return bytebuffer -// with only one slice if serializing NCCLID. -#ifdef PADDLE_WITH_CUDA - if (var->IsType()) { - e.WriteVarlengthBeginning(VarMsg::kSerializedFieldNumber, - NCCL_UNIQUE_ID_BYTES); - const ncclUniqueId& uid = var->Get(); - e.WriteRawBytes(std::string(uid.internal, NCCL_UNIQUE_ID_BYTES)); - - // for serialize NCCL_ID - ::grpc::Slice slices(e.size()); - memcpy(const_cast(slices.begin()), e.data(), e.size()); - ::grpc::ByteBuffer tmp(&slices, 1); - msg->Swap(&tmp); - return; - } -#endif - - e.WriteVarlengthBeginning(VarMsg::kSerializedFieldNumber, payload_size); - // steal reference of tensor data - ::grpc::Slice slices[4]; // metadata, tensor, rows meta, rows - int num_slices = 2; // only SelectedRows have rows buffer - slices[0] = ::grpc::Slice(e.size()); - memcpy(const_cast(slices[0].begin()), e.data(), e.size()); - slices[1] = ::grpc::Slice( - grpc_slice_new_with_user_data(payload, payload_size, destroy_callback, - static_cast(payload)), - ::grpc::Slice::STEAL_REF); - - if (var->IsType()) { - auto* slr = var->GetMutable(); - ProtoEncodeHelper e2(static_cast(buf), 128); - size_t rows_memory_size = - slr->rows().size() * framework::SizeOfType(typeid(int64_t)); - e2.WriteVarlengthBeginning(VarMsg::kRowsFieldNumber, rows_memory_size); - slices[2] = ::grpc::Slice(e2.size()); - memcpy(const_cast(slices[2].begin()), e2.data(), e2.size()); - - slices[3] = ::grpc::Slice( - grpc_slice_new_with_user_data( - const_cast( - reinterpret_cast(slr->rows().data())), - rows_memory_size, [](void* backing) {}, - const_cast( - reinterpret_cast(slr->rows().data()))), - ::grpc::Slice::STEAL_REF); - num_slices = 4; - } - - ::grpc::ByteBuffer tmp(&slices[0], num_slices); - msg->Swap(&tmp); -} - -void DeserializeFromByteBuffer(const ::grpc::ByteBuffer& msg, - const platform::DeviceContext& ctx, - const framework::Scope* scope, - framework::Variable** var) { - platform::RecordEvent record_event("deserial", &ctx); - operators::distributed::GRPCVariableResponse resp(scope, &ctx); - PADDLE_ENFORCE(resp.Parse(msg) == 0, "parse bytebuffer to tensor error!"); - *var = resp.GetVar(); -} - -} // namespace distributed -} // namespace operators -} // namespace paddle diff --git a/paddle/fluid/operators/distributed/grpc_serde.h b/paddle/fluid/operators/distributed/grpc_serde.h deleted file mode 100644 index 450c41dcd6b1bf9a33d3bbef3a1c94a2f83ff322..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/distributed/grpc_serde.h +++ /dev/null @@ -1,50 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#pragma once -#include -#include -#include -#include - -#include "paddle/fluid/framework/data_type.h" -#include "paddle/fluid/framework/lod_tensor.h" -#include "paddle/fluid/framework/scope.h" -#include "paddle/fluid/framework/selected_rows.h" -#include "paddle/fluid/framework/tensor_util.h" -#include "paddle/fluid/framework/var_type.h" -#include "paddle/fluid/operators/distributed/sendrecvop_utils.h" - -#include "paddle/fluid/operators/distributed/send_recv.grpc.pb.h" -#include "paddle/fluid/operators/distributed/send_recv.pb.h" - -namespace paddle { -namespace operators { -namespace distributed { - -typedef void (*DestroyCallback)(void*); - -void SerializeToByteBuffer(const std::string& name, framework::Variable* var, - const platform::DeviceContext& ctx, - ::grpc::ByteBuffer* msg, - const std::string& out_varname = std::string()); - -void DeserializeFromByteBuffer(const ::grpc::ByteBuffer& msg, - const platform::DeviceContext& ctx, - const framework::Scope* scope, - framework::Variable** var); - -} // namespace distributed -} // namespace operators -} // namespace paddle diff --git a/paddle/fluid/operators/distributed/grpc_serde_test.cc b/paddle/fluid/operators/distributed/grpc_serde_test.cc deleted file mode 100644 index 96ea05e74ed76768248a27ab435dc801b7d1b995..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/distributed/grpc_serde_test.cc +++ /dev/null @@ -1,223 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include -#include -#include // NOLINT - -#include "google/protobuf/text_format.h" -#include "gtest/gtest.h" -#include "paddle/fluid/framework/lod_tensor.h" -#include "paddle/fluid/framework/tensor_util.h" -#include "paddle/fluid/framework/variable.h" -#include "paddle/fluid/operators/detail/macros.h" -#include "paddle/fluid/operators/distributed/grpc_serde.h" -#include "paddle/fluid/operators/distributed/grpc_variable_response.h" -#include "paddle/fluid/operators/distributed/sendrecvop_utils.h" -#include "paddle/fluid/operators/math/math_function.h" -#include "paddle/fluid/platform/place.h" -#include "paddle/fluid/string/printf.h" - -namespace framework = paddle::framework; -namespace platform = paddle::platform; -namespace operators = paddle::operators; -namespace math = paddle::operators::math; -namespace memory = paddle::memory; - -void RunSerdeTestSelectedRows(platform::Place place) { - platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance(); - auto& ctx = *pool.Get(place); - - // serialize var to ByteBuffer - framework::Variable var; - auto* slr = var.GetMutable(); - slr->set_height(1000); - auto* tensor = slr->mutable_value(); - auto* rows = slr->mutable_rows(); - tensor->Resize(framework::make_ddim({564, 128})); - tensor->mutable_data(place); - int tensor_numel = 564 * 128; - math::set_constant(ctx, tensor, 32.7); - for (int i = 0; i < 564; ++i) rows->push_back(i); - - ::grpc::ByteBuffer msg; - operators::distributed::SerializeToByteBuffer("myvar", &var, ctx, &msg); - EXPECT_GT(msg.Length(), static_cast(0)); - - // deserialize - std::vector<::grpc::Slice> slices; - (void)msg.Dump(&slices); - std::string tmp; - for (const auto& s : slices) { - tmp.append(reinterpret_cast(s.begin()), s.size()); - } - - sendrecv::VariableMessage varmsg; - EXPECT_TRUE(varmsg.ParseFromString(tmp)); - - // deserialize bytebuffer - EXPECT_EQ(varmsg.varname(), "myvar"); - EXPECT_EQ(varmsg.type(), 1); - - const float* tensor_data = - reinterpret_cast(varmsg.serialized().data()); - const int64_t* rows_data = - reinterpret_cast(varmsg.rows().data()); - for (int i = 0; i < tensor_numel; ++i) { - EXPECT_FLOAT_EQ(tensor_data[i], 32.7); - } - for (int i = 0; i < 564; ++i) { - EXPECT_EQ(rows_data[i], i); - } - - // deserialize zero-copy - // framework::Variable var2; - // operators::distributed::DeserializeFromByteBuffer(msg, ctx, &var2); - framework::Scope scope; - scope.Var("myvar"); - operators::distributed::GRPCVariableResponse resp(&scope, &ctx); - EXPECT_EQ(resp.Parse(msg), 0); - - framework::Variable* var2 = resp.GetVar(); - - auto* slr2 = var2->GetMutable(); - auto* tensor2 = slr2->mutable_value(); - auto* rows2 = slr2->mutable_rows(); - float* tensor_data2 = nullptr; - framework::Tensor tmp_tensor; - - if (platform::is_gpu_place(ctx.GetPlace())) { - platform::CPUPlace cpu; - framework::TensorCopy(*tensor2, cpu, &tmp_tensor); - tensor_data2 = tmp_tensor.data(); - } else { - tensor_data2 = const_cast(tensor2->data()); - } - const int64_t* rows_data2 = rows2->data(); - - for (int i = 0; i < tensor_numel; ++i) { - EXPECT_FLOAT_EQ(tensor_data2[i], 32.7); - } - for (size_t i = 0; i < rows2->size(); ++i) { - EXPECT_EQ(rows_data2[i], static_cast(i)); - } - EXPECT_EQ(slr2->height(), 1000); -} - -void RunTestLodTensor(platform::Place place, int from_type = 0) { - // serialize var to ByteBuffer - framework::Variable var; - auto* tensor = var.GetMutable(); - tensor->Resize(framework::make_ddim({512, 8, 4, 2})); - framework::LoD lod; - lod.push_back(framework::Vector({1, 3, 8})); - tensor->set_lod(lod); - int tensor_numel = 512 * 8 * 4 * 2; - platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance(); - auto& ctx = *pool.Get(place); - tensor->mutable_data(place); - math::set_constant(ctx, tensor, 31.9); - - ::grpc::ByteBuffer msg; - operators::distributed::SerializeToByteBuffer("myvar", &var, ctx, &msg); - EXPECT_GT(msg.Length(), static_cast(0)); - - // deserialize - std::vector<::grpc::Slice> slices; - (void)msg.Dump(&slices); - std::string tmp; - for (const auto& s : slices) { - tmp.append(reinterpret_cast(s.begin()), s.size()); - } - sendrecv::VariableMessage varmsg; - EXPECT_TRUE(varmsg.ParseFromString(tmp)); - EXPECT_EQ(varmsg.varname(), "myvar"); - EXPECT_EQ(varmsg.type(), 0); - EXPECT_EQ(varmsg.dims()[0], 512); - EXPECT_EQ(varmsg.dims()[1], 8); - EXPECT_EQ(varmsg.dims()[2], 4); - EXPECT_EQ(varmsg.dims()[3], 2); - EXPECT_EQ(varmsg.lod_level(), 1); - EXPECT_EQ(varmsg.lod(0).lod_data(0), 1); - EXPECT_EQ(varmsg.lod(0).lod_data(1), 3); - EXPECT_EQ(varmsg.lod(0).lod_data(2), 8); - - const float* tensor_data = - reinterpret_cast(varmsg.serialized().data()); - for (int i = 0; i < tensor_numel; ++i) { - EXPECT_FLOAT_EQ(tensor_data[i], 31.9); - } - - // message binary - std::string str; - varmsg.SerializeToString(&str); - - // message bytebuffer - ::grpc::Slice slices_2[1]; - int num_slices = 1; - slices_2[0] = ::grpc::Slice(str.length()); - memcpy(const_cast(slices_2[0].begin()), str.c_str(), str.length()); - ::grpc::ByteBuffer bytebuffer2(&slices_2[0], num_slices); - - // deserialize zero-copy - framework::Scope scope; - scope.Var("myvar"); - operators::distributed::GRPCVariableResponse resp(&scope, &ctx); - if (from_type == 0) { - EXPECT_EQ(resp.Parse(msg), 0); - } else { - EXPECT_EQ(resp.Parse(bytebuffer2), 0); - } - - framework::Variable* var2 = resp.GetVar(); - - auto tensor2 = var2->Get(); - float* tensor_data2 = nullptr; - framework::Tensor tmp_tensor; - - if (platform::is_gpu_place(ctx.GetPlace())) { - platform::CPUPlace cpu; - framework::TensorCopy(tensor2, cpu, &tmp_tensor); - tensor_data2 = tmp_tensor.data(); - } else { - tensor_data2 = const_cast(tensor2.data()); - } - - EXPECT_EQ(varmsg.lod_level(), 1); - EXPECT_EQ(varmsg.lod(0).lod_data(0), 1); - EXPECT_EQ(varmsg.lod(0).lod_data(1), 3); - EXPECT_EQ(varmsg.lod(0).lod_data(2), 8); - for (int i = 0; i < tensor_numel; ++i) EXPECT_FLOAT_EQ(tensor_data2[i], 31.9); -} - -TEST(LodTensor, Run) { - platform::CPUPlace place; - RunTestLodTensor(place); - RunTestLodTensor(place, 1); -#ifdef PADDLE_WITH_CUDA - platform::CUDAPlace gpu(0); - RunTestLodTensor(gpu); - RunTestLodTensor(gpu, 1); -#endif -} - -TEST(SelectedRows, Run) { - platform::CPUPlace place; - RunSerdeTestSelectedRows(place); - -#ifdef PADDLE_WITH_CUDA - platform::CUDAPlace gpu; - RunSerdeTestSelectedRows(gpu); -#endif -} diff --git a/paddle/fluid/operators/distributed/grpc_server.cc b/paddle/fluid/operators/distributed/grpc_server.cc deleted file mode 100644 index 8edb00276df3ade1b320fbf2873e8b54ff3e1464..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/distributed/grpc_server.cc +++ /dev/null @@ -1,415 +0,0 @@ -/*Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include -#include - -#include "paddle/fluid/operators/distributed/grpc_serde.h" -#include "paddle/fluid/operators/distributed/grpc_server.h" - -using ::grpc::ServerAsyncResponseWriter; - -namespace paddle { -namespace operators { -namespace distributed { -enum CallStatus { PROCESS = 0, FINISH }; - -// reference: -// https://stackoverflow.com/questions/41732884/grpc-multiple-services-in-cpp-async-server -class RequestBase { - public: - explicit RequestBase(GrpcService::AsyncService* service, - ::grpc::ServerCompletionQueue* cq, - RequestHandler* request_handler, int req_id) - : service_(service), - cq_(cq), - status_(PROCESS), - request_handler_(request_handler), - req_id_(req_id) { - PADDLE_ENFORCE(cq_); - } - virtual ~RequestBase() {} - virtual void Process() = 0; - - std::string Status2String(const std::string& method) { - std::string status = "Process"; - if (status_ == FINISH) { - status = "Finish"; - } - - std::ostringstream s; - s << method << " name:[" << GetReqName() << "]" - << ", ep:[" << ctx_.peer() << "]" - << " " << status << " using req_id:" << req_id_; - return s.str(); - } - - CallStatus Status() const { - std::lock_guard l(status_mu_); - return status_; - } - - template - void Finish(const T& reply, ServerAsyncResponseWriter* responder) { - std::lock_guard l(status_mu_); - status_ = FINISH; - responder->Finish(reply, ::grpc::Status::OK, - reinterpret_cast(static_cast(req_id_))); - } - virtual std::string GetReqName() = 0; - - protected: - mutable std::mutex status_mu_; - ::grpc::ServerContext ctx_; - GrpcService::AsyncService* service_; - ::grpc::ServerCompletionQueue* cq_; - CallStatus status_; - RequestHandler* request_handler_; - int req_id_; -}; - -class RequestSend final : public RequestBase { - public: - explicit RequestSend(GrpcService::AsyncService* service, - ::grpc::ServerCompletionQueue* cq, - RequestHandler* request_handler, int req_id) - : RequestBase(service, cq, request_handler, req_id), responder_(&ctx_) { - request_.reset(new GRPCVariableResponse(request_handler->scope(), - request_handler->dev_ctx(), - !request_handler->sync_mode())); - int method_id = static_cast(distributed::GrpcMethod::kSendVariable); - service_->RequestAsyncUnary( - method_id, &ctx_, request_.get(), &responder_, cq_, cq_, - reinterpret_cast(static_cast(req_id))); - } - virtual ~RequestSend() {} - std::string GetReqName() override { return request_->Varname(); } - - void Process() override { - std::string varname = GetReqName(); - VLOG(4) << "RequestSend var_name:" << varname; - - auto scope = request_->GetMutableLocalScope(); - auto invar = request_->GetVar(); - framework::Variable* outvar = nullptr; - - request_handler_->Handle(varname, scope, invar, &outvar); - Finish(reply_, &responder_); - } - - protected: - sendrecv::VoidMessage reply_; - std::shared_ptr request_; - ServerAsyncResponseWriter responder_; -}; - -class RequestGet final : public RequestBase { - public: - explicit RequestGet(GrpcService::AsyncService* service, - ::grpc::ServerCompletionQueue* cq, - RequestHandler* request_handler, int req_id) - : RequestBase(service, cq, request_handler, req_id), responder_(&ctx_) { - auto method_id = static_cast(distributed::GrpcMethod::kGetVariable); - service_->RequestAsyncUnary( - method_id, &ctx_, &request_, &responder_, cq_, cq_, - reinterpret_cast(static_cast(req_id))); - } - - virtual ~RequestGet() {} - - std::string GetReqName() override { return request_.varname(); } - - void Process() override { - // proc request. - std::string varname = request_.varname(); - VLOG(4) << "RequestGet " << varname; - - auto scope = request_handler_->scope(); - auto invar = scope->FindVar(varname); - framework::Variable* outvar = nullptr; - - request_handler_->Handle(varname, scope, invar, &outvar); - - if (outvar) { - SerializeToByteBuffer(varname, outvar, *request_handler_->dev_ctx(), - &reply_); - } - Finish(reply_, &responder_); - } - - protected: - sendrecv::VariableMessage request_; - ::grpc::ByteBuffer reply_; - ServerAsyncResponseWriter<::grpc::ByteBuffer> responder_; -}; - -class RequestPrefetch final : public RequestBase { - public: - explicit RequestPrefetch(GrpcService::AsyncService* service, - ::grpc::ServerCompletionQueue* cq, - RequestHandler* request_handler, int req_id) - : RequestBase(service, cq, request_handler, req_id), - responder_(&ctx_), - local_scope_(nullptr) { - request_.reset(new GRPCVariableResponse(request_handler->scope(), - request_handler->dev_ctx(), true)); - int method_id = - static_cast(distributed::GrpcMethod::kPrefetchVariable); - service_->RequestAsyncUnary( - method_id, &ctx_, request_.get(), &responder_, cq_, cq_, - reinterpret_cast(static_cast(req_id))); - } - - virtual ~RequestPrefetch() {} - - std::string GetReqName() override { return request_->Varname(); } - - void Process() override { - // prefetch process... - std::string in_var_name = request_->Varname(); - std::string out_var_name = request_->OutVarname(); - VLOG(4) << "RequestPrefetch, in_var_name: " << in_var_name - << " out_var_name: " << out_var_name; - - auto scope = request_->GetMutableLocalScope(); - auto invar = scope->FindVar(in_var_name); - // out var must be created in local scope! - framework::Variable* outvar = scope->Var(out_var_name); - - request_handler_->Handle(in_var_name, scope, invar, &outvar, out_var_name); - - SerializeToByteBuffer(out_var_name, outvar, *request_handler_->dev_ctx(), - &reply_); - Finish(reply_, &responder_); - } - - protected: - std::shared_ptr request_; - ::grpc::ByteBuffer reply_; - ServerAsyncResponseWriter<::grpc::ByteBuffer> responder_; - framework::Scope* local_scope_; -}; - -class RequestCheckpointNotify final : public RequestBase { - public: - explicit RequestCheckpointNotify(GrpcService::AsyncService* service, - ::grpc::ServerCompletionQueue* cq, - RequestHandler* request_handler, int req_id) - : RequestBase(service, cq, request_handler, req_id), responder_(&ctx_) { - request_.reset(new GRPCVariableResponse(request_handler->scope(), - request_handler->dev_ctx())); - int method_id = - static_cast(distributed::GrpcMethod::kCheckpointNotify); - service_->RequestAsyncUnary( - method_id, &ctx_, request_.get(), &responder_, cq_, cq_, - reinterpret_cast(static_cast(req_id))); - } - - virtual ~RequestCheckpointNotify() {} - - std::string GetReqName() override { return request_->Varname(); } - - void Process() override { - auto scope = request_->GetMutableLocalScope(); - - std::string checkpoint_notify = request_->Varname(); - std::string checkpoint_dir = request_->OutVarname(); - - VLOG(4) << "RequestCheckpointNotify notify: " << checkpoint_notify - << ", dir: " << checkpoint_dir; - - request_handler_->Handle(checkpoint_notify, scope, nullptr, nullptr, - checkpoint_dir); - Finish(reply_, &responder_); - } - - protected: - std::shared_ptr request_; - sendrecv::VoidMessage reply_; - ServerAsyncResponseWriter responder_; -}; - -void AsyncGRPCServer::WaitServerReady() { - VLOG(4) << "AsyncGRPCServer is wait server ready"; - std::unique_lock lock(this->mutex_ready_); - condition_ready_.wait(lock, [=] { return this->ready_ == 1; }); - VLOG(4) << "AsyncGRPCServer WaitSeverReady"; -} - -void AsyncGRPCServer::StartServer() { - ::grpc::ServerBuilder builder; - builder.AddListeningPort(bind_address_, ::grpc::InsecureServerCredentials(), - &selected_port_); - - builder.SetMaxSendMessageSize(std::numeric_limits::max()); - builder.SetMaxReceiveMessageSize(std::numeric_limits::max()); - builder.RegisterService(&service_); - - for (auto t : rpc_call_map_) { - rpc_cq_[t.first].reset(builder.AddCompletionQueue().release()); - } - - server_ = builder.BuildAndStart(); - LOG(INFO) << "Server listening on " << bind_address_ - << " selected port: " << selected_port_; - - std::function f = - std::bind(&AsyncGRPCServer::TryToRegisterNewOne, this, - std::placeholders::_1, std::placeholders::_2); - - for (auto& t : rpc_call_map_) { - auto& rpc_name = t.first; - auto& cq = rpc_cq_[rpc_name]; - auto threadnum = rpc_thread_num_[rpc_name]; - auto& reqs = rpc_reqs_[rpc_name]; - - reqs.reserve(kRequestBufSize); - - for (int i = 0; i < kRequestBufSize; i++) { - VLOG(6) << "TryToRegisterNewOne on RPC NAME: " << rpc_name << " I: " << i; - TryToRegisterNewOne(rpc_name, i); - } - - for (int i = 0; i < threadnum; i++) { - rpc_threads_[rpc_name].emplace_back(new std::thread(std::bind( - &AsyncGRPCServer::HandleRequest, this, cq.get(), rpc_name, f))); - VLOG(4) << t.first << " creates threads!"; - } - } - - { - std::lock_guard lock(this->mutex_ready_); - ready_ = 1; - } - condition_ready_.notify_all(); - - // wait server - server_->Wait(); - - for (auto& t : rpc_threads_) { - auto& threads = t.second; - for (size_t i = 0; i < threads.size(); ++i) { - threads[i]->join(); - VLOG(4) << t.first << " threads ends!"; - } - } -} - -void AsyncGRPCServer::ShutdownQueue() { - for (auto& t : rpc_cq_) { - t.second->Shutdown(); - VLOG(4) << t.first << " queue shutdown!"; - } -} - -void AsyncGRPCServer::ShutDownImpl() { - std::unique_lock lock(cq_mutex_); - is_shut_down_ = true; - ShutdownQueue(); - - VLOG(4) << "server_ shutdown!"; - server_->Shutdown(); -} - -void AsyncGRPCServer::TryToRegisterNewOne(const std::string& rpc_name, - int req_id) { - std::unique_lock lock(cq_mutex_); - if (is_shut_down_) { - VLOG(4) << "shutdown, do not TryToRegisterNewSendOne"; - return; - } - - VLOG(4) << "TryToRegisterNewOne on RPC NAME: " << rpc_name - << " REQ ID: " << req_id; - - auto& reqs = rpc_reqs_[rpc_name]; - auto& handler = rpc_call_map_[rpc_name]; - auto& cq = rpc_cq_[rpc_name]; - - RequestBase* b = nullptr; - if (rpc_name == kRequestSend) { - b = new RequestSend(&service_, cq.get(), handler, req_id); - } else if (rpc_name == kRequestGet) { - b = new RequestGet(&service_, cq.get(), handler, req_id); - } else if (rpc_name == kRequestPrefetch) { - b = new RequestPrefetch(&service_, cq.get(), handler, req_id); - } else if (rpc_name == kRequestCheckpoint) { - b = new RequestCheckpointNotify(&service_, cq.get(), handler, req_id); - } else { - PADDLE_ENFORCE(false, "not supported rpc"); - } - - reqs[req_id] = b; - - VLOG(4) << "Create RequestSend status:" << b->Status(); -} - -void AsyncGRPCServer::HandleRequest( - ::grpc::ServerCompletionQueue* cq, const std::string& rpc_name, - std::function TryToRegisterNewOne) { - void* tag = NULL; - bool ok = false; - - while (true) { - VLOG(4) << "HandleRequest " << rpc_name << " wait next"; - if (!cq->Next(&tag, &ok)) { - VLOG(3) << "CompletionQueue " << rpc_name << " shutdown!"; - break; - } - - int req_id = static_cast(reinterpret_cast(tag)); - VLOG(4) << "HandleRequest " << rpc_name << ", req_id:" << req_id - << " get next"; - - auto& reqs = rpc_reqs_[rpc_name]; - RequestBase* base = nullptr; - { - PADDLE_ENFORCE(req_id >= 0 && req_id < kRequestBufSize); - std::unique_lock lock(cq_mutex_); - base = reqs[req_id]; - } - - VLOG(3) << base->Status2String(rpc_name); - - // reference: - // https://github.com/tensorflow/tensorflow/issues/5596 - // https://groups.google.com/forum/#!topic/grpc-io/xftlRy-IQwM - // https://groups.google.com/forum/#!topic/grpc-io/ywATt88Ef_I - if (!ok) { - LOG(WARNING) << "completion queue:" << rpc_name - << " recv no regular event" - << " context:" << base->Status2String(rpc_name); - TryToRegisterNewOne(rpc_name, req_id); - delete base; - continue; - } - - switch (base->Status()) { - case PROCESS: { - base->Process(); - break; - } - case FINISH: { - TryToRegisterNewOne(rpc_name, req_id); - delete base; - break; - } - default: { assert(false); } - } - } -} - -} // namespace distributed -} // namespace operators -} // namespace paddle diff --git a/paddle/fluid/operators/distributed/grpc_server.h b/paddle/fluid/operators/distributed/grpc_server.h deleted file mode 100644 index d2524f5e65db6dedab78f45e17380359b58a3d11..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/distributed/grpc_server.h +++ /dev/null @@ -1,89 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#pragma once - -#include -#include -#include -#include // NOLINT -#include -#include - -#include "grpc++/grpc++.h" -#include "paddle/fluid/framework/blocking_queue.h" -#include "paddle/fluid/framework/executor.h" -#include "paddle/fluid/framework/lod_tensor.h" -#include "paddle/fluid/framework/program_desc.h" -#include "paddle/fluid/framework/scope.h" -#include "paddle/fluid/framework/selected_rows.h" -#include "paddle/fluid/framework/var_type.h" -#include "paddle/fluid/operators/distributed/grpc_service.h" -#include "paddle/fluid/operators/distributed/request_handler.h" -#include "paddle/fluid/operators/distributed/rpc_server.h" -#include "paddle/fluid/operators/distributed/send_recv.grpc.pb.h" -#include "paddle/fluid/operators/distributed/send_recv.pb.h" -#include "paddle/fluid/operators/distributed/sendrecvop_utils.h" -#include "paddle/fluid/platform/profiler.h" - -namespace paddle { -namespace operators { -namespace distributed { - -class RequestBase; - -class AsyncGRPCServer final : public RPCServer { - public: - explicit AsyncGRPCServer(const std::string& address, int client_num) - : RPCServer(address, client_num), ready_(0) {} - - virtual ~AsyncGRPCServer() {} - void WaitServerReady() override; - void StartServer() override; - - private: - // HandleRequest needs to be thread-safe. - void HandleRequest( - ::grpc::ServerCompletionQueue* cq, const std::string& rpc_name, - std::function TryToRegisterNewOne); - - void TryToRegisterNewOne(const std::string& rpc_name, int req_id); - void ShutdownQueue(); - void ShutDownImpl() override; - - private: - static const int kRequestBufSize = 100; - - std::mutex cq_mutex_; - volatile bool is_shut_down_ = false; - - GrpcService::AsyncService service_; - std::unique_ptr<::grpc::Server> server_; - - // condition of the sub program - std::condition_variable barrier_condition_; - - std::mutex mutex_ready_; - std::condition_variable condition_ready_; - - int ready_; - - std::map> rpc_cq_; - std::map>> rpc_threads_; - std::map> rpc_reqs_; -}; - -}; // namespace distributed -}; // namespace operators -}; // namespace paddle diff --git a/paddle/fluid/operators/distributed/grpc_service.h b/paddle/fluid/operators/distributed/grpc_service.h deleted file mode 100644 index 9ae9a31a003cbb1f808fd1127a5dd78511aa3e99..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/distributed/grpc_service.h +++ /dev/null @@ -1,127 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#pragma once - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include "paddle/fluid/operators/distributed/grpc_variable_response.h" -#include "paddle/fluid/platform/profiler.h" - -// NOTE: This method was originally created by tensorflow -// (https://github.com/tensorflow/tensorflow/) we borrow this -// method and did some modifications so that we can parse gRPC -// requests without too much copying of the tensor data. - -namespace grpc { -class CompletionQueue; -class Channel; -class RpcService; -class ServerCompletionQueue; -class ServerContext; - -// Support parsing/unparsing of tensorflow::VariableResponse. -// Wire-format is identical to RecvVariableResponse. -template <> -class SerializationTraits< - paddle::operators::distributed::GRPCVariableResponse> { - public: - static Status Serialize( - const paddle::operators::distributed::GRPCVariableResponse& msg, - grpc_byte_buffer** bp, bool* own_buffer) { - PADDLE_ENFORCE(false, "SerializationTraits::Serialize not implemented!"); - return Status(); - } - static Status Deserialize( - grpc_byte_buffer* buffer, - paddle::operators::distributed::GRPCVariableResponse* msg, - int max_message_size = INT_MAX) { - if (buffer == nullptr) { - return Status(StatusCode::INTERNAL, "No payload"); - } - - Status result = g_core_codegen_interface->ok(); - if (result.ok()) { - paddle::operators::distributed::GrpcByteSource source(buffer); - int ret = msg->Parse(&source); - if (ret != 0) { - result = Status(StatusCode::INTERNAL, "VariableResponse parse error"); - } - } - g_core_codegen_interface->grpc_byte_buffer_destroy(buffer); - return result; - } -}; -} // namespace grpc - -namespace paddle { -namespace operators { -namespace distributed { - -enum class GrpcMethod { - kSendVariable, - kGetVariable, - kPrefetchVariable, - kCheckpointNotify, -}; - -static const int kGrpcNumMethods = - static_cast(GrpcMethod::kCheckpointNotify) + 1; - -inline const char* GrpcMethodName(GrpcMethod id) { - switch (id) { - case GrpcMethod::kSendVariable: - return "/sendrecv.SendRecvService/SendVariable"; - case GrpcMethod::kGetVariable: - return "/sendrecv.SendRecvService/GetVariable"; - case GrpcMethod::kPrefetchVariable: - return "/sendrecv.SendRecvService/PrefetchVariable"; - case GrpcMethod::kCheckpointNotify: - return "/sendrecv.SendRecvService/CheckpointNotify"; - } - - // Shouldn't be reached. - PADDLE_ENFORCE(false, "Invalid id: not found valid method name"); - return nullptr; -} - -class GrpcService final { - public: - class AsyncService : public ::grpc::Service { - public: - AsyncService() { - for (int i = 0; i < kGrpcNumMethods; ++i) { - AddMethod(new ::grpc::internal::RpcServiceMethod( - GrpcMethodName(static_cast(i)), - ::grpc::internal::RpcMethod::NORMAL_RPC, nullptr)); - ::grpc::Service::MarkMethodAsync(i); - } - } - virtual ~AsyncService() {} - - // Make RequestAsyncUnary public for grpc_call.h - using ::grpc::Service::RequestAsyncUnary; - }; -}; - -} // namespace distributed -} // namespace operators -} // namespace paddle diff --git a/paddle/fluid/operators/distributed/grpc_variable_response.cc b/paddle/fluid/operators/distributed/grpc_variable_response.cc deleted file mode 100644 index 34d47f3ec0f3025109447b66078b724607d2953a..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/distributed/grpc_variable_response.cc +++ /dev/null @@ -1,308 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include -#include -#include -#ifdef PADDLE_WITH_CUDA -#include -#endif - -#include "paddle/fluid/operators/distributed/grpc_variable_response.h" -#include "paddle/fluid/platform/profiler.h" - -namespace paddle { -namespace operators { -namespace distributed { - -enum WireType { - WIRETYPE_VARINT = 0, - WIRETYPE_LENGTH_DELIMITED = 2, -}; - -inline int GetTagFieldNumber(uint32_t tag) { return tag >> 3; } - -inline WireType GetTagWireType(uint32_t tag) { - return static_cast(tag & 0x7); -} - -bool ReadVarintSizeAsInt(::google::protobuf::io::CodedInputStream* input, - int* result) { - uint64_t v; - if (input->ReadVarint64(&v) && v <= static_cast(INT_MAX)) { - *result = static_cast(v); - return true; - } else { - return false; - } -} - -int GRPCVariableResponse::Parse(const ::grpc::ByteBuffer& byte_buffer) { - GrpcByteBufferSource source; - source.Init(byte_buffer); - GrpcByteBufferSourceWrapper r(&source); - - return Parse(&r); -} - -bool ParseLodData(::google::protobuf::io::CodedInputStream* input, - std::vector* lod) { - while (true) { - auto p = input->ReadTagWithCutoff(127); - int tag = GetTagFieldNumber(p.first); - WireType wt = GetTagWireType(p.first); - - if (!p.second) { - return (tag == 0); - } - - switch (tag) { - case sendrecv::VariableMessage_LodData::kLodDataFieldNumber: { - uint64_t v; - if (wt == WIRETYPE_VARINT) { - if (!input->ReadVarint64(&v)) { - return false; - } - lod->push_back(v); - break; - } - - if (wt == WIRETYPE_LENGTH_DELIMITED) { - int num_bytes = 0; - if (!input->ReadVarintSizeAsInt(&num_bytes)) { - return tag; - } - int start_pos = input->CurrentPosition(); - while (input->CurrentPosition() - start_pos < num_bytes) { - uint64_t v; - if (!input->ReadVarint64(&v)) { - return tag; - } - lod->push_back(v); - } - break; - } - - return false; - } - default: { return false; } - } - } - - return true; -} - -int GRPCVariableResponse::Parse(Source* source) { - ::google::protobuf::io::ZeroCopyInputStream* input_stream = - source->contents(); - ::google::protobuf::io::CodedInputStream input(input_stream); - input.SetTotalBytesLimit(INT_MAX, INT_MAX); - - while (true) { - auto p = input.ReadTagWithCutoff(127); - int tag = GetTagFieldNumber(p.first); - WireType wt = GetTagWireType(p.first); - if (!p.second) { - if (tag != 0) { - return -1; - } - return 0; - } - - switch (tag) { - case sendrecv::VariableMessage::kVarnameFieldNumber: { - uint32_t length; - if ((wt != WIRETYPE_LENGTH_DELIMITED) || !input.ReadVarint32(&length)) { - return tag; - } - - std::string temp; - if (!input.ReadString(&temp, length)) { - return tag; - } - - meta_.set_varname(temp); - break; - } - case sendrecv::VariableMessage::kTypeFieldNumber: { - uint32_t v; - if ((wt != WIRETYPE_VARINT) || !input.ReadVarint32(&v)) { - return tag; - } - - meta_.set_type(static_cast<::sendrecv::VarType>(v)); - break; - } - case sendrecv::VariableMessage::kDataTypeFieldNumber: { - uint32_t v = 0; - if ((wt != WIRETYPE_VARINT) || !input.ReadVarint32(&v)) { - return tag; - } - - meta_.set_data_type(static_cast<::sendrecv::VariableMessage_Type>(v)); - break; - } - case sendrecv::VariableMessage::kDimsFieldNumber: { - // not packed - if (wt == WIRETYPE_VARINT) { - uint64_t v; - if (!input.ReadVarint64(&v)) { - return tag; - } - meta_.add_dims(v); - break; - } - - // packed - if (wt == WIRETYPE_LENGTH_DELIMITED) { - int num_bytes = 0; - if (!input.ReadVarintSizeAsInt(&num_bytes)) { - return tag; - } - int start_pos = input.CurrentPosition(); - while (input.CurrentPosition() - start_pos < num_bytes) { - uint64_t v; - if (!input.ReadVarint64(&v)) { - return tag; - } - meta_.add_dims(v); - } - break; - } - return tag; - } - case sendrecv::VariableMessage::kLodLevelFieldNumber: { - uint64_t v = 0; - if ((wt != WIRETYPE_VARINT) || !input.ReadVarint64(&v)) { - return tag; - } - meta_.set_lod_level(static_cast(v)); - break; - } - case sendrecv::VariableMessage::kLodFieldNumber: { - int length = 0; - if (wt != WIRETYPE_LENGTH_DELIMITED || - !ReadVarintSizeAsInt(&input, &length)) { - return tag; - } - - std::pair<::google::protobuf::io::CodedInputStream::Limit, int> p = - input.IncrementRecursionDepthAndPushLimit(length); - - std::vector lod_data; - if (p.second < 0 || !ParseLodData(&input, &lod_data)) { - return tag; - } - - if (!input.DecrementRecursionDepthAndPopLimit(p.first)) { - return tag; - } - - if (lod_data.size() == 0) { - break; - } - - auto lod = meta_.add_lod(); - for (uint32_t i = 0; i < lod_data.size(); i++) { - lod->add_lod_data(lod_data[i]); - } - break; - } - case sendrecv::VariableMessage::kSlrHeightFieldNumber: { - uint64_t v = 0; - if ((wt != WIRETYPE_VARINT) || !input.ReadVarint64(&v)) { - return tag; - } - meta_.set_slr_height(static_cast(v)); - break; - } - case sendrecv::VariableMessage::kSerializedFieldNumber: { - int num_bytes = 0; - if (wt != WIRETYPE_LENGTH_DELIMITED || - !ReadVarintSizeAsInt(&input, &num_bytes)) { - return tag; - } - - if (!ProcSerializedField(tag, &input, num_bytes)) { - return tag; - } - - break; - } - case sendrecv::VariableMessage::kRowsFieldNumber: { - PADDLE_ENFORCE((meta_.type() == sendrecv::SELECTED_ROWS || - meta_.type() == sendrecv::LOD_TENSOR) && - meta_.varname() != "", - "meta info should be got first!"); - - int num_bytes = 0; - if (wt != WIRETYPE_LENGTH_DELIMITED || - !ReadVarintSizeAsInt(&input, &num_bytes)) { - return tag; - } - - if (!CopySelectRowsData(&input, *dev_ctx_, num_bytes)) { - return tag; - } - break; - } - case sendrecv::VariableMessage::kOutVarnameFieldNumber: { - uint32_t length; - if ((wt != WIRETYPE_LENGTH_DELIMITED) || !input.ReadVarint32(&length)) { - return tag; - } - - std::string temp; - if (!input.ReadString(&temp, length)) { - return tag; - } - - meta_.set_out_varname(temp); - break; - } - case sendrecv::VariableMessage::kProfileFieldNumber: { - uint64_t profiling = 0; - if (!input.ReadVarint64(&profiling)) { - return tag; - } - meta_.set_profile(profiling); - int64_t listener_id = platform::ListenerId(); - if (listener_id <= 0) { - break; - } - if (profiling == platform::kEnableProfiler && - !platform::IsProfileEnabled()) { - platform::EnableProfiler(platform::ProfilerState::kCPU); - } else if (profiling == platform::kDisableProfiler && - platform::IsProfileEnabled()) { - // TODO(panyx0718): Should we allow to customize file dir. - platform::DisableProfiler( - platform::EventSortingKey::kDefault, - string::Sprintf("/tmp/profile_ps_%lld", listener_id)); - } - break; - } - default: { - // Unknown tag, return unknown error. - return -1; - } - } - } - - return 0; -} - -}; // namespace distributed -}; // namespace operators -}; // namespace paddle diff --git a/paddle/fluid/operators/distributed/grpc_variable_response.h b/paddle/fluid/operators/distributed/grpc_variable_response.h deleted file mode 100644 index 89df07c92cd33bcb76c8539b5566d74fa21bba5e..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/distributed/grpc_variable_response.h +++ /dev/null @@ -1,58 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#pragma once - -#include - -#include "paddle/fluid/framework/data_type.h" -#include "paddle/fluid/framework/lod_tensor.h" -#include "paddle/fluid/framework/scope.h" -#include "paddle/fluid/framework/selected_rows.h" -#include "paddle/fluid/framework/var_type.h" - -#include "paddle/fluid/operators/distributed/send_recv.grpc.pb.h" -#include "paddle/fluid/operators/distributed/send_recv.pb.h" - -#include "google/protobuf/io/coded_stream.h" -#include "google/protobuf/io/zero_copy_stream.h" -#include "paddle/fluid/framework/tensor.h" -#include "paddle/fluid/operators/distributed/grpc_bytebuffer_stream.h" -#include "paddle/fluid/operators/distributed/variable_response.h" - -namespace paddle { -namespace operators { -namespace distributed { - -class GRPCVariableResponse : public VariableResponse { - public: - GRPCVariableResponse(const framework::Scope* scope, - const platform::DeviceContext* dev_ctx, - bool create_scope = false) - : VariableResponse(scope, dev_ctx, create_scope) {} - - virtual ~GRPCVariableResponse() {} - - int Parse(Source* source) override; - - // return: - // 0:ok. - // -1: unkown error. - // other: number of error field. - int Parse(const ::grpc::ByteBuffer& byte_buffer); -}; - -}; // namespace distributed -}; // namespace operators -}; // namespace paddle diff --git a/paddle/fluid/operators/distributed/parameter_prefetch.cc b/paddle/fluid/operators/distributed/parameter_prefetch.cc new file mode 100644 index 0000000000000000000000000000000000000000..0e8d877e08cf6186cef79cd550035cb8699271d2 --- /dev/null +++ b/paddle/fluid/operators/distributed/parameter_prefetch.cc @@ -0,0 +1,241 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include +#include +#include +#include + +#include "paddle/fluid/operators/distributed/parameter_prefetch.h" + +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/framework/selected_rows.h" +#include "paddle/fluid/framework/tensor.h" + +#include "paddle/fluid/operators/distributed/distributed.h" +#include "paddle/fluid/operators/distributed/rpc_client.h" +#include "paddle/fluid/operators/distributed/variable_response.h" +#include "paddle/fluid/operators/distributed_ops/send_recv_util.h" + +namespace paddle { +namespace operators { +namespace distributed { + +using LoDTensor = framework::LoDTensor; +using LoDTensor = framework::LoDTensor; +using SelectedRows = framework::SelectedRows; +using DDim = framework::DDim; + +static std::vector> SplitIds( + const std::vector& ids_vector, + const std::vector& height_section) { + std::set all_ids; + for (auto id : ids_vector) { + all_ids.insert(id); + } + + auto abs_sections = ToAbsoluteSection(height_section); + std::vector> splited_ids; + splited_ids.resize(height_section.size() + 1); + for (auto& id : all_ids) { + auto section_index = GetSectionIndex(id, abs_sections); + splited_ids[section_index].push_back(id - abs_sections[section_index]); + } + return splited_ids; +} + +static void SplitIdsIntoMultipleVarsBySection( + const std::vector& in_var_names, + const std::vector& height_section, + const std::vector>& splited_ids, + framework::Scope* scope) { + PADDLE_ENFORCE_EQ(in_var_names.size(), height_section.size(), ""); + + auto place = platform::CPUPlace(); + + for (size_t i = 0; i < in_var_names.size(); ++i) { + auto* id_tensor = + scope->Var(in_var_names[i])->GetMutable(); + auto& ids = splited_ids[i]; + if (!ids.empty()) { + auto* id_tensor_data = id_tensor->mutable_data( + framework::make_ddim({static_cast(ids.size()), 1}), place); + memcpy(id_tensor_data, ids.data(), sizeof(int64_t) * ids.size()); + } + } +} + +static void MergeMultipleVarsIntoOneBySection( + const std::string& id_name, const std::vector& ids_vector, + const std::string& out_name, const std::vector& out_var_names, + const std::vector& height_section, + const std::vector>& splited_ids, + const framework::ExecutionContext& context, framework::Scope* scope, + platform::DeviceContext* actual_ctx) { + PADDLE_ENFORCE_EQ(out_var_names.size(), height_section.size(), ""); + + auto cpu_place = platform::CPUPlace(); + + auto abs_sections = ToAbsoluteSection(height_section); + std::unordered_map> id_to_offset; + for (size_t i = 0; i < ids_vector.size(); ++i) { + id_to_offset[ids_vector[i]].push_back(i); + } + + auto& id_tensor = scope->FindVar(id_name)->Get(); + auto* out_tensor = + scope->FindVar(out_name)->GetMutable(); + + PADDLE_ENFORCE_GT( + out_tensor->numel(), 0, + "When calling this method, the LoDTensor's numel must larger than zero. " + "Please check LoDTensor::Resize has been called first."); + + auto* out_tensor_data = out_tensor->mutable_data(id_tensor.place()); + + bool is_on_cpu_place = true; + if (!platform::is_cpu_place(id_tensor.place())) { + is_on_cpu_place = false; + } + + for (size_t section_idx = 0; section_idx < out_var_names.size(); + ++section_idx) { + auto& ids_in_this_section = splited_ids[section_idx]; + if (!ids_in_this_section.empty()) { + auto& prefetch_out_var = + scope->Var(out_var_names[section_idx])->Get(); + const auto* out_var_data = prefetch_out_var.data(); + auto& dims = prefetch_out_var.dims(); + + PADDLE_ENFORCE_EQ(dims.size(), 2, ""); + PADDLE_ENFORCE_EQ(ids_in_this_section.size(), dims[0]); + + auto row_numel = dims[1]; + + for (int64_t i = 0; i < dims[0]; ++i) { + auto id = ids_in_this_section[i]; + auto origin_id = id + abs_sections[section_idx]; + auto& offsets = id_to_offset[origin_id]; + for (auto& offset : offsets) { + // should support GPU tensor + if (is_on_cpu_place) { + memory::Copy(cpu_place, out_tensor_data + offset * row_numel, + cpu_place, out_var_data + i * row_numel, + sizeof(float) * row_numel); + } else { +#ifndef PADDLE_WITH_CUDA + PADDLE_THROW("paddle is not compiled with CUDA!"); +#else + auto stream = + static_cast(actual_ctx)->stream(); + memory::Copy(boost::get(id_tensor.place()), + out_tensor_data + offset * row_numel, cpu_place, + out_var_data + i * row_numel, + sizeof(float) * row_numel, stream); +#endif + } + } + } + } else { + VLOG(3) << "ids in this section is empty"; + } + } +} + +void prefetch(const std::string& id_name, const std::string& out_name, + const std::vector& table_names, + const std::vector& epmap, + const std::vector& height_sections, + const framework::ExecutionContext& context, + const framework::Scope& scope) { + std::unique_ptr local_scope = scope.NewTmpScope(); + + platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance(); + auto& cpu_ctx = *pool.Get(platform::CPUPlace()); + auto& actual_ctx = *pool.Get(context.GetPlace()); + + distributed::RPCClient* rpc_client = + distributed::RPCClient::GetInstance( + context.Attr("trainer_id")); + + std::vector in_var_names; + std::vector out_var_names; + for (size_t i = 0; i < epmap.size(); ++i) { + in_var_names.push_back(id_name + "@" + epmap[i]); + out_var_names.push_back(out_name + "@" + epmap[i]); + } + + auto& id_tensor = scope.FindVar(id_name)->Get(); + std::vector ids_vector; + if (platform::is_cpu_place(id_tensor.place())) { + auto* id_data = id_tensor.data(); + for (int64_t i = 0; i < id_tensor.numel(); ++i) { + ids_vector.push_back(id_data[i]); + } + } else { +#ifndef PADDLE_WITH_CUDA + PADDLE_THROW("paddle is not compiled with CUDA!"); +#else + auto cpu_place = platform::CPUPlace(); + framework::LoDTensor cpu_tensor; + auto* cpu_tensor_data = + cpu_tensor.mutable_data(id_tensor.dims(), cpu_place); + auto stream = + static_cast(&actual_ctx)->stream(); + memory::Copy(cpu_place, cpu_tensor_data, + boost::get(id_tensor.place()), + id_tensor.data(), sizeof(int64_t) * id_tensor.numel(), + stream); + for (int64_t i = 0; i < cpu_tensor.numel(); ++i) { + ids_vector.push_back(cpu_tensor_data[i]); + } +#endif + } + + auto splited_ids = SplitIds(ids_vector, height_sections); + SplitIdsIntoMultipleVarsBySection(in_var_names, height_sections, splited_ids, + local_scope.get()); + + // create output var in local scope + for (auto& name : out_var_names) { + local_scope->Var(name)->GetMutable(); + } + + std::vector rets; + for (size_t i = 0; i < in_var_names.size(); i++) { + if (NeedSend(*local_scope.get(), in_var_names[i])) { + VLOG(3) << "sending " << in_var_names[i] << " to " << epmap[i] + << " to get " << out_var_names[i] << " back"; + rets.push_back(rpc_client->AsyncPrefetchVar( + epmap[i], cpu_ctx, *local_scope.get(), in_var_names[i], + out_var_names[i], table_names[i])); + } else { + VLOG(3) << "don't send no-initialied variable: " << out_var_names[i]; + } + } + + for (size_t i = 0; i < rets.size(); i++) { + PADDLE_ENFORCE(rets[i]->Wait(), "internal error in RPCClient"); + } + + MergeMultipleVarsIntoOneBySection(id_name, ids_vector, out_name, + out_var_names, height_sections, splited_ids, + context, local_scope.get(), &actual_ctx); +} + +}; // namespace distributed +}; // namespace operators +}; // namespace paddle diff --git a/paddle/fluid/operators/distributed/parameter_prefetch.h b/paddle/fluid/operators/distributed/parameter_prefetch.h new file mode 100644 index 0000000000000000000000000000000000000000..0429ec4415dca19ff620cd7af5a8c0a935e17e2f --- /dev/null +++ b/paddle/fluid/operators/distributed/parameter_prefetch.h @@ -0,0 +1,83 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include + +#include "paddle/fluid/framework/operator.h" + +namespace paddle { +namespace operators { +namespace distributed { + +void prefetch(const std::string& id_name, const std::string& out_name, + const std::vector& table_names, + const std::vector& epmap, + const std::vector& height_sections, + const framework::ExecutionContext& context, + const framework::Scope& scope); + +template +void prefetch_with_reconstruct(const std::string& id_name, + const std::string& out_name, + const std::vector& table_names, + const std::vector& epmap, + const std::vector& height_sections, + const framework::ExecutionContext& context, + const framework::Scope& scope, + framework::LoDTensor* original) { + prefetch(id_name, out_name, table_names, epmap, height_sections, context, + scope); + auto& out = scope.FindVar(out_name)->Get(); + auto& ids = scope.FindVar(id_name)->Get(); + auto* original_value = original->data(); + auto* out_value = out.data(); + size_t original_width = original->numel() / original->dims()[0]; + + bool is_on_cpu_place = true; + if (!platform::is_cpu_place(ids.place())) { + is_on_cpu_place = false; + } + if (is_on_cpu_place) { + for (int64_t i = 0; i < ids.numel(); i++) { + const T* out_rows = out_value + original_width * i; + T* original_row = + original_value + original_width * ids.data()[i]; + std::memcpy(original_row, out_rows, original_width * sizeof(T)); + } + } else { +#ifndef PADDLE_WITH_CUDA + PADDLE_THROW("paddle is not compiled with CUDA!"); +#else + platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance(); + auto& actual_ctx = *pool.Get(context.GetPlace()); + for (int64_t i = 0; i < ids.numel(); i++) { + const T* out_rows = out_value + original_width * i; + T* original_row = + original_value + original_width * ids.data()[i]; + auto stream = + static_cast(&actual_ctx)->stream(); + memory::Copy(boost::get(ids.place()), original_row, + platform::CPUPlace(), out_rows, original_width * sizeof(T), + stream); + } +#endif + } +} + +}; // namespace distributed +}; // namespace operators +}; // namespace paddle diff --git a/paddle/fluid/operators/distributed/parameter_recv.cc b/paddle/fluid/operators/distributed/parameter_recv.cc new file mode 100644 index 0000000000000000000000000000000000000000..e7d4c262aa9fad10a23adc61b94ba0c38577c0e8 --- /dev/null +++ b/paddle/fluid/operators/distributed/parameter_recv.cc @@ -0,0 +1,104 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include +#include +#include + +#include "paddle/fluid/operators/distributed/parameter_recv.h" + +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/framework/selected_rows.h" +#include "paddle/fluid/framework/tensor.h" + +#include "paddle/fluid/operators/distributed/distributed.h" +#include "paddle/fluid/operators/distributed/rpc_client.h" +#include "paddle/fluid/operators/distributed/variable_response.h" +#include "paddle/fluid/operators/distributed_ops/send_recv_util.h" +#include "paddle/fluid/operators/strided_memcpy.h" + +namespace paddle { +namespace operators { +namespace distributed { + +using LoDTensor = framework::LoDTensor; +using LoDTensor = framework::LoDTensor; +using SelectedRows = framework::SelectedRows; +using DDim = framework::DDim; + +template +void ParameterRecv::operator()(const RpcContext &rpc_ctx, + const framework::Scope &scope) { + VLOG(3) << "ParameterRecv in"; + std::unique_ptr local_scope = scope.NewTmpScope(); + + platform::DeviceContextPool &pool = platform::DeviceContextPool::Instance(); + auto &cpu_ctx = *pool.Get(platform::CPUPlace()); + + distributed::RPCClient *rpc_client = + distributed::RPCClient::GetInstance(0); + + auto *recv_var = scope.FindVar(rpc_ctx.var_name); + + std::vector recved_tensors; + + // recv all vars to local scope + if (recv_var->IsType()) { + std::vector rets; + for (size_t i = 0; i < rpc_ctx.splited_var_names.size(); i++) { + auto &recv_var_name = rpc_ctx.splited_var_names[i]; + framework::Tensor *t = + local_scope->Var(recv_var_name)->GetMutable(); + recved_tensors.push_back(t); + VLOG(3) << "recv " << recv_var_name << " from " << rpc_ctx.epmap[i]; + rets.push_back(rpc_client->AsyncGetVar(rpc_ctx.epmap[i], cpu_ctx, + *local_scope.get(), recv_var_name, + recv_var_name)); + } + for (size_t i = 0; i < rets.size(); i++) { + PADDLE_ENFORCE(rets[i]->Wait(), "internal error in RPCClient"); + } + } else { + PADDLE_THROW("unsupported var type to recv!"); + } + + // concat recved tensor into one var + { + size_t output_offset = 0; + framework::Tensor *recv_tensor = + recv_var->GetMutable(); + auto dev_ctx = paddle::platform::CPUDeviceContext(); + int64_t recv_numel = 0; + for (auto *in : recved_tensors) { + recv_numel += in->numel(); + auto in_stride = framework::stride_numel(in->dims()); + auto out_stride = framework::stride_numel(recv_tensor->dims()); + StridedNumelCopyWithAxis( + dev_ctx, 0, recv_tensor->data() + output_offset, out_stride, + in->data(), in_stride, in_stride[0]); + output_offset += in_stride[0]; + } + PADDLE_ENFORCE_EQ(recv_numel, recv_tensor->numel()); + } + + VLOG(3) << "ParameterRecv out"; +} + +template struct ParameterRecv; + +}; // namespace distributed +}; // namespace operators +}; // namespace paddle diff --git a/paddle/fluid/operators/distributed/parameter_recv.h b/paddle/fluid/operators/distributed/parameter_recv.h new file mode 100644 index 0000000000000000000000000000000000000000..e955fca7250ecc88f3b1a08611f380da50df788d --- /dev/null +++ b/paddle/fluid/operators/distributed/parameter_recv.h @@ -0,0 +1,34 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include + +#include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/operators/distributed/rpc_common.h" + +namespace paddle { +namespace operators { +namespace distributed { + +template +struct ParameterRecv { + void operator()(const RpcContext &rpc_ctx, const framework::Scope &scope); +}; + +}; // namespace distributed +}; // namespace operators +}; // namespace paddle diff --git a/paddle/fluid/operators/distributed/parameter_send.cc b/paddle/fluid/operators/distributed/parameter_send.cc new file mode 100644 index 0000000000000000000000000000000000000000..9ce424445229cde0a7e775c95f4af8839f4d4d68 --- /dev/null +++ b/paddle/fluid/operators/distributed/parameter_send.cc @@ -0,0 +1,175 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include +#include +#include + +#include "paddle/fluid/operators/distributed/parameter_send.h" + +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/framework/selected_rows.h" +#include "paddle/fluid/framework/tensor.h" + +#include "paddle/fluid/operators/distributed/distributed.h" +#include "paddle/fluid/operators/distributed/rpc_client.h" +#include "paddle/fluid/operators/distributed/variable_response.h" +#include "paddle/fluid/operators/distributed_ops/send_recv_util.h" + +namespace paddle { +namespace operators { +namespace distributed { + +using LoDTensor = framework::LoDTensor; +using LoDTensor = framework::LoDTensor; +using SelectedRows = framework::SelectedRows; +using DDim = framework::DDim; + +template +void ParameterSend::operator()(const RpcContext &rpc_ctx, + const framework::Scope &scope, bool sync) { + std::unique_ptr local_scope = scope.NewTmpScope(); + + platform::DeviceContextPool &pool = platform::DeviceContextPool::Instance(); + auto &cpu_ctx = *pool.Get(platform::CPUPlace()); + + distributed::RPCClient *rpc_client = + distributed::RPCClient::GetInstance(0); + + auto *send_var = scope.FindVar(rpc_ctx.var_name); + size_t out_num = rpc_ctx.splited_var_names.size(); + if (send_var->IsType()) { + if (out_num > 1) { + auto &send_tensor = send_var->Get(); + auto &send_tensor_dims = send_tensor.dims(); + std::vector outs_dims; + outs_dims.reserve(out_num); + + // infer output shape + PADDLE_ENFORCE_EQ(rpc_ctx.height_sections.size(), out_num, + "tensor split sections size" + "should be equal to output size."); + for (size_t i = 0; i < out_num; ++i) { + auto dim = send_tensor_dims; + dim[0] = rpc_ctx.height_sections[i]; + outs_dims.push_back(dim); + } + + // create output var in local scope + size_t row_offset = 0; + for (auto i = 0; i < out_num; ++i) { + framework::Tensor *out = local_scope->Var(rpc_ctx.splited_var_names[i]) + ->GetMutable(); + *out = send_tensor.Slice(row_offset, row_offset + outs_dims[i][0]); + row_offset += outs_dims[i][0]; + } + } + } else if (send_var->IsType()) { + auto &send_slr = send_var->Get(); + auto abs_sections = ToAbsoluteSection(rpc_ctx.height_sections); + + auto &send_rows = send_slr.rows(); + std::vector> outs_rows_idx; + std::vector> outs_dense_idx; + + outs_rows_idx.resize(out_num); + outs_dense_idx.resize(out_num); + + auto row_numel = send_slr.value().numel() / send_slr.value().dims()[0]; + auto *src = send_slr.value().data(); + + // create output var in local scope + std::vector outs; + for (auto &name : rpc_ctx.splited_var_names) { + auto *out = local_scope->Var(name)->GetMutable(); + outs.push_back(out); + } + + // split rows index into output sparse vars + for (size_t i = 0; i < send_rows.size(); ++i) { + size_t out_idx = GetSectionIndex(send_rows[i], abs_sections); + outs_rows_idx[out_idx].push_back(send_rows[i]); + outs_dense_idx[out_idx].push_back(i); + } + auto place = platform::CPUPlace(); + + for (size_t i = 0; i < outs_rows_idx.size(); ++i) { + auto rows_idx = outs_rows_idx[i]; + outs[i]->set_height(rpc_ctx.height_sections[i]); + auto dims = send_slr.GetCompleteDims(); + dims[0] = rows_idx.size(); + outs[i]->mutable_rows()->clear(); + outs[i]->mutable_value()->mutable_data(dims, send_slr.place()); + if (rows_idx.size() > 0) { + for (auto idx : rows_idx) { + outs[i]->mutable_rows()->push_back(idx - abs_sections[i]); + } + auto dst = outs[i]->mutable_value()->mutable_data(place); + for (size_t j = 0; j < rows_idx.size(); j++) { + if (platform::is_cpu_place(place)) { + memory::Copy( + platform::CPUPlace(), dst + j * row_numel, platform::CPUPlace(), + src + outs_dense_idx[i][j] * row_numel, sizeof(T) * row_numel); + } else { + PADDLE_THROW("do not support GPU now"); + /* + #ifdef PADDLE_WITH_CUDA + auto stream = ctx.cuda_device_context().stream(); + memory::Copy(platform::CUDAPlace(), dst + j * row_numel, + platform::CUDAPlace(), + src + outs_dense_idx[i][j] * row_numel, + sizeof(T) * row_numel, stream); + #else + PADDLE_THROW("Paddle is not compiled with GPU"); + #endif + */ + } + } + } + PADDLE_ENFORCE_EQ(rows_idx.size(), outs[i]->rows().size(), + "rows should has the same size with tensor dim 0"); + } + + } else { + PADDLE_THROW("unsupported var type to send!"); + } + + std::vector rets; + for (size_t i = 0; i < rpc_ctx.splited_var_names.size(); i++) { + auto &send_var_name = rpc_ctx.splited_var_names[i]; + auto &endpoint = rpc_ctx.epmap[i]; + if (NeedSend(*local_scope.get(), send_var_name)) { + VLOG(3) << "sending " << send_var_name << " to " << endpoint; + rets.push_back(rpc_client->AsyncSendVar( + endpoint, cpu_ctx, *local_scope.get(), send_var_name)); + } else { + VLOG(3) << "don't send non-initialized variable: " + << rpc_ctx.splited_var_names[i]; + } + } + + if (sync) { + for (auto &handle : rets) { + PADDLE_ENFORCE(handle->Wait(), "internal error in RPCClient"); + } + } +} + +template struct ParameterSend; + +}; // namespace distributed +}; // namespace operators +}; // namespace paddle diff --git a/paddle/fluid/operators/distributed/parameter_send.h b/paddle/fluid/operators/distributed/parameter_send.h new file mode 100644 index 0000000000000000000000000000000000000000..9077f4a4fb9fd9d7152e8be72519f16b1999e93d --- /dev/null +++ b/paddle/fluid/operators/distributed/parameter_send.h @@ -0,0 +1,35 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include + +#include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/operators/distributed/rpc_common.h" + +namespace paddle { +namespace operators { +namespace distributed { + +template +struct ParameterSend { + void operator()(const RpcContext &rpc_ctx, const framework::Scope &scope, + bool sync); +}; + +}; // namespace distributed +}; // namespace operators +}; // namespace paddle diff --git a/paddle/fluid/operators/distributed/proto_encoder_helper.h b/paddle/fluid/operators/distributed/proto_encoder_helper.h index d2b0eb6ca6de1984dc7cfc2a662c88d5e56e1e05..e9f06f54327875c0568c571627e9effb998e15be 100644 --- a/paddle/fluid/operators/distributed/proto_encoder_helper.h +++ b/paddle/fluid/operators/distributed/proto_encoder_helper.h @@ -84,7 +84,9 @@ class ProtoEncodeHelper { ~ProtoEncodeHelper() { #define REPLACE_ENFORCE_GLOG 1 // Make sure callers didn't do operations that went over max_size promised - paddle::platform::throw_on_error(p_ <= limit_); + if (paddle::platform::is_error(p_ <= limit_)) { + paddle::platform::throw_on_error(p_ <= limit_, ""); + } #undef REPLACE_ENFORCE_GLOG } diff --git a/paddle/fluid/operators/distributed/request_handler.h b/paddle/fluid/operators/distributed/request_handler.h index 5be7095acd3c5ac6f880a8a26c246f60a93643b5..991158ac72007efc1233f852caed4f90f35fe1cd 100644 --- a/paddle/fluid/operators/distributed/request_handler.h +++ b/paddle/fluid/operators/distributed/request_handler.h @@ -37,14 +37,29 @@ namespace distributed { constexpr char kRequestSend[] = "RequestSend"; constexpr char kRequestGet[] = "RequestGet"; +constexpr char kRequestGetMonomerVariable[] = "RequestGetMonomerVariable"; +constexpr char kRequestGetMonomerBarrier[] = "RequestGetMonomerBarrier"; constexpr char kRequestPrefetch[] = "RequestPrefetch"; constexpr char kRequestCheckpoint[] = "RequestCheckpoint"; constexpr char kRequestPassBarrier[] = "RequestPassBarrier"; +constexpr char kRequestGetNoBarrier[] = "GetVariableNoBarrier"; + +constexpr char kSendRPC[] = "SendRPC"; +constexpr char kGetRPC[] = "GetRPC"; +constexpr char kGetNoBarrierRPC[] = "GetNoBarrierRPC"; +constexpr char kGetMonomerRPC[] = "GetMonomerRPC"; +constexpr char kPrefetchRPC[] = "PrefetchRPC"; +constexpr char kBatchBarrierRPC[] = "BatchBarrierRPC"; +constexpr char kFetchBarrierRPC[] = "FetchBarrierRPC"; +constexpr char kSendMonomerFetchBarrierRPC[] = "SendMonomerFetchBarrierRPC"; +constexpr char kSendCompleteRPC[] = "SendCompleteRPC"; +constexpr char kCheckPointNotifyRPC[] = "CheckPointNotifyRPC"; #define LISTEN_TERMINATE_MESSAGE "TERMINATE@RECV" #define BATCH_BARRIER_MESSAGE "BATCH_BARRIER@RECV" #define FETCH_BARRIER_MESSAGE "FETCH_BARRIER@RECV" #define COMPLETE_MESSAGE "COMPLETE@RECV" +#define WITHOUT_BARRIER_MESSAGE "@WITHOUT_BARRIER@RECV" #define CHECKPOINT_SAVE_MESSAGE "SAVE@CHECKPOINTNOTIFY" #define CHECKPOINT_LOAD_MESSAGE "LOAD@CHECKPOINTNOTIFY" @@ -190,7 +205,9 @@ class RequestHandler { // } virtual bool Handle(const std::string& varname, framework::Scope* scope, framework::Variable* var, framework::Variable** outvar, - const std::string& out_var_name = "") = 0; + const int trainer_id, + const std::string& out_var_name = "", + const std::string& table_name = "") = 0; protected: const bool sync_mode_; diff --git a/paddle/fluid/operators/distributed/request_handler_impl.cc b/paddle/fluid/operators/distributed/request_handler_impl.cc index 849e412504eb9180b746db65fd4fa353ed0c05a1..e289ec929dbd6643a2518b92c1a25b7d63e790a9 100644 --- a/paddle/fluid/operators/distributed/request_handler_impl.cc +++ b/paddle/fluid/operators/distributed/request_handler_impl.cc @@ -12,6 +12,7 @@ // See the License for the specific language governing permissions and // limitations under the License. +#include "paddle/fluid/operators/distributed/request_handler_impl.h" #include #include #include @@ -20,8 +21,9 @@ #include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/framework/scope.h" #include "paddle/fluid/framework/selected_rows.h" -#include "paddle/fluid/operators/distributed/request_handler_impl.h" +#include "paddle/fluid/framework/variable_helper.h" #include "paddle/fluid/operators/distributed/rpc_server.h" +#include "paddle/fluid/string/piece.h" #include "paddle/fluid/string/printf.h" namespace paddle { @@ -36,7 +38,9 @@ bool RequestSendHandler::Handle(const std::string& varname, framework::Scope* scope, framework::Variable* invar, framework::Variable** outvar, - const std::string& out_var_name) { + const int trainer_id, + const std::string& out_var_name, + const std::string& table_name) { VLOG(4) << "RequestSendHandler:" << varname; // Sync @@ -50,14 +54,13 @@ bool RequestSendHandler::Handle(const std::string& varname, // Async if (!sync_mode_) { VLOG(3) << "async process var: " << varname; - rpc_server_->Profiler().OneStep(); - try { - executor_->RunPreparedContext((*grad_to_prepared_ctx_)[varname].get(), - scope); - } catch (std::exception& e) { - LOG(ERROR) << "async: run sub program error " << e.what(); - return false; + if (varname == BATCH_BARRIER_MESSAGE) { + PADDLE_THROW( + "async mode should not recv BATCH_BARRIER_MESSAGE or " + "COMPLETE_MESSAGE"); } + executor_->RunPreparedContext((*grad_to_prepared_ctx_)[varname].get(), + scope); return true; } else { // sync rpc_server_->WaitCond(kRequestSend); @@ -76,8 +79,12 @@ bool RequestGetHandler::Handle(const std::string& varname, framework::Scope* scope, framework::Variable* invar, framework::Variable** outvar, - const std::string& out_var_name) { - VLOG(4) << "RequestGetHandler:" << varname; + const int trainer_id, + const std::string& out_var_name, + const std::string& table_name) { + VLOG(4) << "RequestGetHandler:" << varname + << " out_var_name: " << out_var_name; + if (sync_mode_) { if (varname == FETCH_BARRIER_MESSAGE) { VLOG(3) << "sync: recv fetch barrier message"; @@ -88,24 +95,72 @@ bool RequestGetHandler::Handle(const std::string& varname, } } else { if (varname != FETCH_BARRIER_MESSAGE && varname != COMPLETE_MESSAGE) { + if (enable_dc_asgd_) { + // NOTE: the format is determined by distributed_transpiler.py + std::string param_bak_name = + string::Sprintf("%s.trainer_%d_bak", varname, trainer_id); + VLOG(3) << "getting " << param_bak_name << " trainer_id " << trainer_id; + auto var = scope_->FindVar(varname); + auto t_orig = var->Get(); + auto param_bak = scope_->Var(param_bak_name); + auto t = param_bak->GetMutable(); + t->mutable_data(dev_ctx_->GetPlace(), t_orig.type()); + VLOG(3) << "copying " << varname << " to " << param_bak_name; + framework::TensorCopy(t_orig, dev_ctx_->GetPlace(), t); + } *outvar = scope_->FindVar(varname); } } return true; } +bool RequestGetNoBarrierHandler::Handle(const std::string& varname, + framework::Scope* scope, + framework::Variable* invar, + framework::Variable** outvar, + const int trainer_id, + const std::string& out_var_name, + const std::string& table_name) { + VLOG(4) << "RequestGetNoBarrierHandler:" << varname + << " out_var_name: " << out_var_name; + + // get var from pserver immediately without barriers + string::Piece without_barrier_piece(WITHOUT_BARRIER_MESSAGE); + string::Piece var_name_piece = string::Piece(varname); + + if (string::Contains(var_name_piece, without_barrier_piece)) { + var_name_piece = string::TrimSuffix(var_name_piece, without_barrier_piece); + VLOG(4) << "Get var " << var_name_piece << " with " + << WITHOUT_BARRIER_MESSAGE; + *outvar = scope_->FindVar(var_name_piece.ToString()); + return true; + } else { + PADDLE_THROW("GetNoBarrier must contain %s", WITHOUT_BARRIER_MESSAGE); + } + return true; +} + bool RequestPrefetchHandler::Handle(const std::string& varname, framework::Scope* scope, framework::Variable* invar, framework::Variable** outvar, - const std::string& out_var_name) { + const int trainer_id, + const std::string& out_var_name, + const std::string& table_name) { VLOG(4) << "RequestPrefetchHandler " << varname; - auto var_desc = program_->Block(0).FindVar(out_var_name); - InitializeVariable(*outvar, var_desc->GetType()); - executor_->RunPreparedContext( - (*prefetch_var_name_to_prepared_ctx_)[varname].get(), scope); - + if (table_name.empty()) { + auto var_desc = program_->Block(0).FindVar(out_var_name); + InitializeVariable(*outvar, var_desc->GetType()); + executor_->RunPreparedContext( + (*prefetch_var_name_to_prepared_ctx_)[varname].get(), scope); + } else { + (*outvar)->GetMutable(); + auto lookup_table_op = + BuildLookupTableOp(table_name, varname, out_var_name); + paddle::platform::CPUPlace cpu_place; + lookup_table_op->Run(*scope, cpu_place); + } return true; } @@ -113,7 +168,9 @@ bool RequestCheckpointHandler::Handle(const std::string& varname, framework::Scope* scope, framework::Variable* invar, framework::Variable** outvar, - const std::string& out_var_name) { + const int trainer_id, + const std::string& out_var_name, + const std::string& table_name) { PADDLE_ENFORCE( checkpoint_notify_id != -1, "when checkpoint_notify_id = -1, there should be no RPC invoke."); diff --git a/paddle/fluid/operators/distributed/request_handler_impl.h b/paddle/fluid/operators/distributed/request_handler_impl.h index 8be5b21bb89a580f4091de19186fd2d7e5802478..f3c1b24526b8b28033c0c979f74d44a3d7a94201 100644 --- a/paddle/fluid/operators/distributed/request_handler_impl.h +++ b/paddle/fluid/operators/distributed/request_handler_impl.h @@ -24,6 +24,7 @@ #include "paddle/fluid/framework/data_type.h" #include "paddle/fluid/framework/executor.h" #include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/framework/program_desc.h" #include "paddle/fluid/framework/scope.h" #include "paddle/fluid/framework/selected_rows.h" @@ -36,29 +37,77 @@ namespace distributed { class RequestSendHandler final : public RequestHandler { public: - explicit RequestSendHandler(bool sync_mode) : RequestHandler(sync_mode) {} + explicit RequestSendHandler(bool sync_mode, bool enable_dc_asgd = false) + : RequestHandler(sync_mode) { + enable_dc_asgd_ = enable_dc_asgd; + } virtual ~RequestSendHandler() {} bool Handle(const std::string& varname, framework::Scope* scope, framework::Variable* var, framework::Variable** outvar, - const std::string& out_var_name = "") override; + const int trainer_id, const std::string& out_var_name = "", + const std::string& table_name = "") override; + + private: + bool enable_dc_asgd_; }; class RequestGetHandler final : public RequestHandler { public: - explicit RequestGetHandler(bool sync_mode) : RequestHandler(sync_mode) {} + explicit RequestGetHandler(bool sync_mode, bool enable_dc_asgd = false) + : RequestHandler(sync_mode) { + enable_dc_asgd_ = enable_dc_asgd; + } virtual ~RequestGetHandler() {} bool Handle(const std::string& varname, framework::Scope* scope, framework::Variable* var, framework::Variable** outvar, - const std::string& out_var_name = "") override; + const int trainer_id, const std::string& out_var_name = "", + const std::string& table_name = "") override; + + private: + bool enable_dc_asgd_; }; +class RequestGetNoBarrierHandler final : public RequestHandler { + public: + RequestGetNoBarrierHandler() : RequestHandler(false) {} + virtual ~RequestGetNoBarrierHandler() {} + bool Handle(const std::string& varname, framework::Scope* scope, + framework::Variable* var, framework::Variable** outvar, + const int trainer_id, const std::string& out_var_name = "", + const std::string& table_name = "") override; +}; + +static inline void BuildVar(const std::string& param_name, + std::initializer_list arguments, + paddle::framework::proto::OpDesc::Var* var) { + var->set_parameter(param_name); + for (auto& arg_name : arguments) { + *var->mutable_arguments()->Add() = arg_name; + } +} + class RequestPrefetchHandler final : public RequestHandler { public: explicit RequestPrefetchHandler(bool sync_mode) : RequestHandler(sync_mode) {} virtual ~RequestPrefetchHandler() {} bool Handle(const std::string& varname, framework::Scope* scope, framework::Variable* var, framework::Variable** outvar, - const std::string& out_var_name = "") override; + const int trainer_id, const std::string& out_var_name = "", + const std::string& table_name = "") override; + + private: + std::unique_ptr BuildLookupTableOp( + const std::string& table_name, const std::string& id_name, + const std::string& out_name) { + paddle::framework::proto::OpDesc op_desc; + op_desc.set_type("lookup_table"); + BuildVar("W", {table_name.data()}, op_desc.add_inputs()); + BuildVar("Ids", {id_name.data()}, op_desc.add_inputs()); + BuildVar("Out", {out_name.data()}, op_desc.add_outputs()); + + auto op = paddle::framework::OpRegistry::CreateOp(op_desc); + return op; + } }; class RequestCheckpointHandler final : public RequestHandler { @@ -70,7 +119,8 @@ class RequestCheckpointHandler final : public RequestHandler { virtual ~RequestCheckpointHandler() {} bool Handle(const std::string& varname, framework::Scope* scope, framework::Variable* var, framework::Variable** outvar, - const std::string& out_var_name = "") override; + const int trainer_id, const std::string& out_var_name = "", + const std::string& table_name = "") override; private: int checkpoint_notify_id; diff --git a/paddle/fluid/operators/distributed/rpc_client.cc b/paddle/fluid/operators/distributed/rpc_client.cc index b5ec9fe5367beb97b3cc7298102deff1e8ca4ec9..390e9af0f38c920f39e8cd3e1b3d28fd89b239fe 100644 --- a/paddle/fluid/operators/distributed/rpc_client.cc +++ b/paddle/fluid/operators/distributed/rpc_client.cc @@ -24,6 +24,7 @@ namespace distributed { std::once_flag RPCClient::init_flag_; std::unique_ptr RPCClient::rpc_client_(nullptr); +int RPCClient::trainer_id_ = 0; } // namespace distributed } // namespace operators diff --git a/paddle/fluid/operators/distributed/rpc_client.h b/paddle/fluid/operators/distributed/rpc_client.h index 3539ee5e459d6dfe0b6510806464bcc6817910bb..ea54e0c2951253fc009672f4cd2e5233ed56944e 100644 --- a/paddle/fluid/operators/distributed/rpc_client.h +++ b/paddle/fluid/operators/distributed/rpc_client.h @@ -43,12 +43,24 @@ class RPCClient { const platform::DeviceContext& ctx, const framework::Scope& scope, const std::string& var_name, + const std::string& out_varname, int64_t time_out = FLAGS_rpc_deadline) = 0; + virtual VarHandlePtr AsyncGetVarNoBarrier( + const std::string& ep, const platform::DeviceContext& ctx, + const framework::Scope& scope, const std::string& var_name, + const std::string& out_varname, + int64_t time_out = FLAGS_rpc_deadline) = 0; + + virtual VarHandlePtr AsyncGetMonomerVariable( + const std::string& ep, const platform::DeviceContext& ctx, + const framework::Scope& scope, const std::string& var_name, + int64_t time_out = FLAGS_rpc_deadline) = 0; + virtual VarHandlePtr AsyncPrefetchVar( const std::string& ep, const platform::DeviceContext& ctx, const framework::Scope& scope, const std::string& in_var_name, - const std::string& out_var_name, + const std::string& out_var_name, const std::string& table_name = "", int64_t time_out = FLAGS_rpc_deadline) = 0; virtual VarHandlePtr AsyncSendBatchBarrier( @@ -57,6 +69,10 @@ class RPCClient { virtual VarHandlePtr AsyncSendFetchBarrier( const std::string& ep, int64_t time_out = FLAGS_rpc_deadline) = 0; + virtual VarHandlePtr AsyncGetMonomerBarrier( + const std::string& ep, const std::string& var_name, + int64_t time_out = FLAGS_rpc_deadline) = 0; + virtual VarHandlePtr AsyncCheckpointNotify( const std::string& ep, const std::string& dir, int64_t time_out = FLAGS_rpc_deadline) = 0; @@ -72,23 +88,27 @@ class RPCClient { virtual bool Wait() = 0; template - static RPCClient* GetInstance() { - std::call_once(init_flag_, &RPCClient::Init); + static RPCClient* GetInstance(int trainer_id) { + std::call_once(init_flag_, &RPCClient::Init, trainer_id); return rpc_client_.get(); } // Init is called by GetInstance. template - static void Init() { + static void Init(int trainer_id) { + trainer_id_ = trainer_id; if (rpc_client_.get() == nullptr) { rpc_client_.reset(new T()); rpc_client_->InitImpl(); } } - protected: virtual void InitImpl() {} + protected: + // each trainer have exact one trainer id, it should be static + static int trainer_id_; + private: static std::once_flag init_flag_; static std::unique_ptr rpc_client_; diff --git a/paddle/fluid/operators/distributed/rpc_common.h b/paddle/fluid/operators/distributed/rpc_common.h new file mode 100644 index 0000000000000000000000000000000000000000..3de89c2ae89d29edc317ca123882d1c55038b6ca --- /dev/null +++ b/paddle/fluid/operators/distributed/rpc_common.h @@ -0,0 +1,76 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include + +namespace paddle { +namespace operators { +namespace distributed { + +struct RpcContext { + RpcContext() = default; + + RpcContext(const std::string &name, const std::vector &names, + const std::vector &emap, + const std::vector §ions) + : var_name(name), + splited_var_names(names), + epmap(emap), + height_sections(sections) {} + + RpcContext(const RpcContext &ctx) { + var_name = ctx.var_name; + splited_var_names = ctx.splited_var_names; + epmap = ctx.epmap; + height_sections = ctx.height_sections; + } + + std::string var_name; + std::vector splited_var_names; + std::vector epmap; + std::vector height_sections; +}; + +inline std::ostream &operator<<(std::ostream &os, const RpcContext &rpc_ctx) { + os << "{"; + os << "var_name: " << rpc_ctx.var_name << "\n"; + + os << "splited_var_names: ["; + for (auto &name : rpc_ctx.splited_var_names) { + os << name << ", "; + } + os << "]\n"; + + os << "epmap: ["; + for (auto &ep : rpc_ctx.epmap) { + os << ep << ", "; + } + os << "]\n"; + + os << "height_sections: ["; + for (auto §ion : rpc_ctx.height_sections) { + os << section << ", "; + } + os << "]\n"; + os << "}"; + return os; +} + +} // namespace distributed +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/distributed/rpc_server.cc b/paddle/fluid/operators/distributed/rpc_server.cc index 084480ae48b8b9267ade1a840f6a70519cb28e48..c3a46e348c69a20953f013c7de772a37db5f4844 100644 --- a/paddle/fluid/operators/distributed/rpc_server.cc +++ b/paddle/fluid/operators/distributed/rpc_server.cc @@ -12,50 +12,18 @@ // See the License for the specific language governing permissions and // limitations under the License. +#include "paddle/fluid/operators/distributed/rpc_server.h" + #include #include #include #include - -#include "paddle/fluid/operators/distributed/rpc_server.h" #include "paddle/fluid/platform/profiler.h" -DEFINE_int32(rpc_server_profile_period, 0, - "the period of listen_and_serv to do profile"); -DEFINE_string(rpc_server_profile_path, "/dev/null", - "the profile log file path"); - namespace paddle { namespace operators { namespace distributed { -RPCServerProfiler::RPCServerProfiler(int profile_period, - const std::string& profile_log_path) - : profile_period_(profile_period), profile_log_path_(profile_log_path) { - step_ = 0; -} - -void RPCServerProfiler::OneStep() { - PADDLE_ENFORCE_LE(step_, profile_period_, - "step_ should not be larger then " - "profile_period_"); - if (profile_period_ <= 0) { - return; - } - - if (step_ == 0) { - auto pf_state = paddle::platform::ProfilerState::kCPU; - paddle::platform::EnableProfiler(pf_state); - } - if (step_ == profile_period_) { - paddle::platform::DisableProfiler(paddle::platform::EventSortingKey::kTotal, - profile_log_path_); - step_ = 0; - } else { - step_++; - } -} - void RPCServer::ShutDown() { LOG(INFO) << "RPCServer ShutDown "; ShutDownImpl(); @@ -71,27 +39,33 @@ void RPCServer::SavePort() const { port_file.open(file_path); port_file << selected_port_; port_file.close(); - VLOG(4) << "selected port written to " << file_path; + VLOG(3) << "selected port written to " << file_path; } void RPCServer::WaitBarrier(const std::string& rpc_name) { + VLOG(3) << "WaitBarrier in: " << rpc_name; std::unique_lock lock(this->mutex_); barrier_cond_.wait(lock, [this, &rpc_name] { return ((barrier_counter_[rpc_name] == client_num_ && client_num_ != 0) || exit_flag_.load()); }); - VLOG(3) << "batch_barrier_: " << rpc_name << " " - << barrier_counter_[rpc_name]; + VLOG(3) << "WaitBarrier out: " << rpc_name + << " counter: " << barrier_counter_[rpc_name]; } void RPCServer::IncreaseBatchBarrier(const std::string rpc_name) { - VLOG(4) << "RPCServer begin IncreaseBatchBarrier " << rpc_name; + VLOG(3) << "RPCServer begin IncreaseBatchBarrier " << rpc_name; + // barrier msg should make sure that it's in the right cond(send|recv) + WaitCond(rpc_name); int b = 0; std::unique_lock lock(mutex_); b = ++barrier_counter_[rpc_name]; + VLOG(3) << rpc_name << " barrier_counter: " << b; if (b >= client_num_) { lock.unlock(); + VLOG(3) << "BatchBarrier counter reach " << client_num_ << " for " + << rpc_name; barrier_cond_.notify_all(); lock.lock(); } @@ -103,7 +77,7 @@ void RPCServer::Complete() { client_num_--; need_reset_all_vars_ = true; - VLOG(4) << "decrease client_num to: " << client_num_; + VLOG(3) << "decrease client_num to: " << client_num_; if (cur_cond_.load() == rpc_cond_map_[kRequestGet]) { barrier_counter_[kRequestGet]--; } @@ -137,8 +111,8 @@ void RPCServer::RegisterRPC(const std::string& rpc_name, static int cond = -1; rpc_cond_map_[rpc_name] = ++cond; - VLOG(4) << "RegisterRPC rpc_name:" << rpc_name << ", handler:" << handler - << ", cond:" << rpc_cond_map_[rpc_name]; + VLOG(3) << "RegisterRPC rpc_name: " << rpc_name << ", handler: " << handler + << ", cond: " << rpc_cond_map_[rpc_name]; } void RPCServer::SetCond(const std::string& rpc_name) { @@ -152,7 +126,7 @@ void RPCServer::SetCond(const std::string& rpc_name) { } void RPCServer::WaitCond(const std::string& rpc_name) { - VLOG(4) << "RPCServer WaitCond " << rpc_name; + VLOG(3) << "RPCServer WaitCond in " << rpc_name; int cond = 0; { std::unique_lock lock(mutex_); @@ -162,8 +136,99 @@ void RPCServer::WaitCond(const std::string& rpc_name) { std::unique_lock lock(mutex_); rpc_cond_.wait( lock, [=] { return (cur_cond_.load() == cond || exit_flag_.load()); }); + VLOG(3) << "RPCServer WaitCond out " << rpc_name; +} + +void RPCServer::RegisterVar(const std::string& var_name, + const std::string& rpc_name, + framework::Scope* scope, + platform::DeviceContext* dev_ctx) { + MonomerHandle h; + h.var_name_ = var_name; + h.rpc_name_ = rpc_name; + h.scope_ = scope; + h.dev_ctx_ = dev_ctx; + + { + std::unique_lock lock(mutex_); + if (var_map_.find(var_name) != var_map_.end()) { + PADDLE_ENFORCE(false, "%s alreay in var_map", var_name); + } + var_map_[var_name] = h; + } + + rpc_cond_.notify_all(); + VLOG(3) << "RegisterVar context:" << h.String(); } +void RPCServer::IncreaseVarBarrier(const std::string& var_name) { + int b = 0; + MonomerHandle h; + { + std::unique_lock lock(mutex_); + b = ++var_map_[var_name].barrier_; + h = var_map_[var_name]; + } + + if (b >= client_num_) { + barrier_cond_.notify_all(); + } + + VLOG(3) << "IncreaseVarBarrier context:" << h.String(); +} + +void RPCServer::WaitVarBarrier(const std::string& var_name) { + VLOG(3) << "WaitVarBarrier var_name:" << var_name; + + std::unique_lock lock(mutex_); + barrier_cond_.wait(lock, [&]() { + return ((var_map_[var_name].barrier_ >= client_num_ && client_num_ != 0) || + exit_flag_.load()); + }); + + VLOG(3) << "WaitVarBarrier context: " << var_map_[var_name].String(); +} + +void RPCServer::SetVarCond(const std::string& var_name) { + VLOG(3) << "SetVarCond var_name:" << var_name; + { + std::unique_lock lock(mutex_); + if (var_map_.find(var_name) != var_map_.end()) { + rpc_cond_.notify_all(); + } + } +} + +void RPCServer::WaitVarCond(const std::string& var_name) { + VLOG(3) << "WaitVarCond var_name:" << var_name; + + std::unique_lock lock(mutex_); + rpc_cond_.wait(lock, [=] { + return (var_map_.find(var_name) != var_map_.end() || exit_flag_.load()); + }); + + VLOG(3) << "WaitVarCond var_name:" << var_name << " end"; +} + +MonomerHandle RPCServer::GetMonomer(const std::string& var_name) { + MonomerHandle h; + { + std::unique_lock lock(mutex_); + h = var_map_[var_name]; + } + + return h; +} + +void RPCServer::ClearRegisteredVars() { + std::unique_lock lock(mutex_); + var_map_.clear(); +} + +void RPCServer::ClearVar(const std::string& var_name) { + std::unique_lock lock(mutex_); + var_map_.erase(var_name); +} } // namespace distributed } // namespace operators } // namespace paddle diff --git a/paddle/fluid/operators/distributed/rpc_server.h b/paddle/fluid/operators/distributed/rpc_server.h index f3e61e1575ced0b9ffbad23e6973121daca9751b..8c7b7f1d7eeeced24d2ade2bcff4261b24587624 100644 --- a/paddle/fluid/operators/distributed/rpc_server.h +++ b/paddle/fluid/operators/distributed/rpc_server.h @@ -21,32 +21,34 @@ #include #include +#include "paddle/fluid/framework/scope.h" #include "paddle/fluid/operators/distributed/request_handler.h" - -DECLARE_int32(rpc_server_profile_period); -DECLARE_string(rpc_server_profile_path); +#include "paddle/fluid/platform/device_context.h" namespace paddle { namespace operators { namespace distributed { -class RPCServerProfiler { - public: - RPCServerProfiler(int profile_period, const std::string& profile_log_path); - void OneStep(); - - private: - const int profile_period_; - std::string profile_log_path_; - int step_; +struct MonomerHandle { + std::string var_name_; + std::string rpc_name_; + framework::Scope* scope_{nullptr}; + platform::DeviceContext* dev_ctx_{nullptr}; + int64_t barrier_{0}; + + std::string String() { + std::stringstream ss; + ss << "var_name:" << var_name_ << ", rpc_name:" << rpc_name_ + << ", scope:" << scope_ << ", dev_ctx:" << dev_ctx_ + << ", barrier_:" << barrier_; + return ss.str(); + } }; class RPCServer { public: explicit RPCServer(const std::string& address, int client_num) : cur_cond_(0), - profiler_(FLAGS_rpc_server_profile_period, - FLAGS_rpc_server_profile_path), bind_address_(address), exit_flag_(false), selected_port_(0), @@ -73,6 +75,10 @@ class RPCServer { void RegisterRPC(const std::string& rpc_name, RequestHandler* handler, int thread_num = 5); + int GetThreadNum(const std::string& rpc_name) { + return rpc_thread_num_[rpc_name]; + } + // Wait util all the clients have reached the barrier for one // rpc method. This function should be called in the // RequestHandler if you want to run the server/client in a @@ -83,10 +89,19 @@ class RPCServer { void WaitCond(const std::string& rpc_name); void IncreaseBatchBarrier(const std::string rpc_name); + void RegisterVar(const std::string& var_name, const std::string& rpc_name, + framework::Scope* scope, platform::DeviceContext* dev_ctx); + void IncreaseVarBarrier(const std::string& var_name); + void WaitVarBarrier(const std::string& var_name); + void SetVarCond(const std::string& var_name); + void WaitVarCond(const std::string& var_name); + void ClearRegisteredVars(); + void ClearVar(const std::string& var_name); + MonomerHandle GetMonomer(const std::string& var_name); + void Complete(); void ResetBarrierCounter(); - RPCServerProfiler& Profiler() { return profiler_; } bool NeedResetAllVars(); @@ -101,7 +116,6 @@ class RPCServer { std::unordered_map rpc_cond_map_; std::atomic cur_cond_; std::condition_variable rpc_cond_; - RPCServerProfiler profiler_; protected: std::string bind_address_; @@ -113,6 +127,9 @@ class RPCServer { std::unordered_map rpc_call_map_; std::unordered_map rpc_thread_num_; friend class RequestHandler; + + // TODO(gongwb): use more cond to notify or wait; + std::unordered_map var_map_; }; }; // namespace distributed diff --git a/paddle/fluid/operators/distributed/rpc_server_test.cc b/paddle/fluid/operators/distributed/rpc_server_test.cc index d6176e1443d2a441af7878e5efe99796d486bb7a..089ea623f18a27d14342d1d69700ef624477eba4 100644 --- a/paddle/fluid/operators/distributed/rpc_server_test.cc +++ b/paddle/fluid/operators/distributed/rpc_server_test.cc @@ -21,7 +21,7 @@ limitations under the License. */ #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/framework/operator.h" -#include "paddle/fluid/operators/detail/macros.h" +#include "paddle/fluid/operators/distributed/distributed.h" #include "paddle/fluid/operators/distributed/request_handler_impl.h" #include "paddle/fluid/operators/distributed/rpc_client.h" #include "paddle/fluid/operators/distributed/rpc_server.h" @@ -125,7 +125,7 @@ TEST(PREFETCH, CPU) { g_req_handler.reset(new distributed::RequestPrefetchHandler(true)); g_rpc_service.reset(new RPCSERVER_T("127.0.0.1:0", 1)); distributed::RPCClient* client = - distributed::RPCClient::GetInstance(); + distributed::RPCClient::GetInstance(0); std::thread server_thread(StartServer, distributed::kRequestPrefetch); g_rpc_service->WaitServerReady(); @@ -165,7 +165,7 @@ TEST(COMPLETE, CPU) { g_req_handler.reset(new distributed::RequestSendHandler(true)); g_rpc_service.reset(new RPCSERVER_T("127.0.0.1:0", 2)); distributed::RPCClient* client = - distributed::RPCClient::GetInstance(); + distributed::RPCClient::GetInstance(0); PADDLE_ENFORCE(client != nullptr); std::thread server_thread(StartServer, distributed::kRequestSend); g_rpc_service->WaitServerReady(); diff --git a/paddle/fluid/operators/distributed/send_recv.proto.in b/paddle/fluid/operators/distributed/send_recv.proto.in index 8b0a09abe1d05dda10eda0030eb91cb9ca40683e..6303667884361be050ac62c604274c87caa72444 100644 --- a/paddle/fluid/operators/distributed/send_recv.proto.in +++ b/paddle/fluid/operators/distributed/send_recv.proto.in @@ -1,4 +1,3 @@ - /* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -24,22 +23,24 @@ service SendRecvService { rpc SendVariable(VariableMessage) returns (VoidMessage) {} // Argument VariableMessage for GetVariable should only contain varname. rpc GetVariable(VariableMessage) returns (VariableMessage) {} + rpc GetVariableNoBarrier(VariableMessage) returns (VariableMessage) {} // pre-fetch variable by given variable name and Ids rpc PrefetchVariable(VariableMessage) returns (VariableMessage) {} rpc CheckpointNotify(VariableMessage) returns (VoidMessage) {} + + rpc GetMonomerVariable(VariableMessage) returns (VariableMessage) {} + rpc GetMonomerBarrier(VariableMessage) returns (VoidMessage) {} } -// VariableMessage is serialized paddle variable message. -// It can be: -// LoDTensor -// SelectedRows +// It can be: LoDTensor、SelectedRows or NCCL_ID enum VarType { LOD_TENSOR = 0; SELECTED_ROWS = 1; NCCL_ID = 2; } +// VariableMessage is serialized paddle variable message. // NOTICE(gongwb):don't modify this proto if you are not // not familar with how we serialize in sendrecvop_utils.h // and deserilize it in variable_response.h. @@ -79,6 +80,8 @@ message VariableMessage { // server stops profiling and generates a profile to /tmp/profile_ps_* // when profile switches from 1 to 2. int64 profile = 11; + int64 trainer_id = 12; + string table_name = 13; } message VoidMessage {} diff --git a/paddle/fluid/operators/distributed/sendrecvop_utils.cc b/paddle/fluid/operators/distributed/sendrecvop_utils.cc index 6a3f8fd544bc5d669b725765a863b42ec069a7b6..e5c96507e97267c3d0519a27a36cbac0336c7f28 100644 --- a/paddle/fluid/operators/distributed/sendrecvop_utils.cc +++ b/paddle/fluid/operators/distributed/sendrecvop_utils.cc @@ -15,12 +15,14 @@ limitations under the License. */ #ifdef PADDLE_WITH_CUDA #include #endif -#include #include // NOLINT #include "paddle/fluid/framework/data_type.h" #include "paddle/fluid/operators/distributed/sendrecvop_utils.h" #include "paddle/fluid/operators/distributed/variable_response.h" +#include "paddle/fluid/platform/port.h" + +DEFINE_bool(rpc_disable_reuse_port, false, "Disable SO_REUSEPORT or not."); namespace paddle { namespace operators { @@ -28,21 +30,37 @@ namespace distributed { using VarMsg = sendrecv::VariableMessage; +static TensorPayload GetCommunicationAllocationFromTensor( + const platform::DeviceContext& ctx, const framework::Tensor& tensor) { + if (is_gpu_place(ctx.GetPlace())) { #ifdef PADDLE_WITH_CUDA -void* GetVarPayLoad(const std::string varname, int64_t size) { - platform::CUDAPinnedPlace cuda_pinned; - return memory::Alloc(cuda_pinned, size); -} -#endif + PADDLE_ENFORCE(is_gpu_place(tensor.place())); + auto& gpu_dev_ctx = + reinterpret_cast(ctx); + auto copy_size = tensor.numel() * framework::SizeOfType(tensor.type()); + platform::CUDAPinnedPlace cuda_pinned; + auto result = memory::AllocShared( + cuda_pinned, copy_size, memory::allocation::Allocator::kCrossDevice); -void GetTensorPayload(framework::Variable* var, - const platform::DeviceContext& ctx, VarMsg* request, - void** payload, size_t* payload_size) { + memory::Copy(cuda_pinned, result->ptr(), + boost::get(tensor.place()), + tensor.data(), copy_size, gpu_dev_ctx.stream()); + ctx.Wait(); + return TensorPayload(result); +#else + PADDLE_THROW("This situation should not be happened"); +#endif + } else { + return TensorPayload(tensor); + } +} +TensorPayload GetTensorPayload(framework::Variable* var, + const platform::DeviceContext& ctx, + VarMsg* request) { auto tensor = var->Get(); // FIXME(wuyi): data types in send_recv.proto is copied from // framework.proto - request->set_data_type( - static_cast(framework::ToDataType(tensor.type()))); + request->set_data_type(static_cast(tensor.type())); for (auto& dim : framework::vectorize(tensor.dims())) { request->add_dims(dim); } @@ -56,34 +74,14 @@ void GetTensorPayload(framework::Variable* var, } } } - if (platform::is_gpu_place(ctx.GetPlace())) { -#ifdef PADDLE_WITH_CUDA - PADDLE_ENFORCE(platform::is_gpu_place(tensor.place())); - // platform::CUDAPinnedPlace cuda_pinned; - auto& gpu_dev_ctx = static_cast(ctx); - auto copy_size = tensor.numel() * framework::SizeOfType(tensor.type()); - *payload = GetVarPayLoad(request->varname(), copy_size); - - platform::CUDAPinnedPlace cuda_pinned; - memory::Copy(cuda_pinned, *payload, - boost::get(tensor.place()), - reinterpret_cast(tensor.data()), copy_size, - gpu_dev_ctx.stream()); - - ctx.Wait(); -#endif - } else { - *payload = tensor.data(); - } - *payload_size = tensor.numel() * framework::SizeOfType(tensor.type()); + return GetCommunicationAllocationFromTensor(ctx, tensor); } -void GetSelectedRowsPayload(framework::Variable* var, - const platform::DeviceContext& ctx, VarMsg* request, - void** payload, size_t* payload_size) { +TensorPayload GetSelectedRowsPayload(framework::Variable* var, + const platform::DeviceContext& ctx, + VarMsg* request) { auto* slr = var->GetMutable(); - request->set_data_type( - static_cast(framework::ToDataType(slr->value().type()))); + request->set_data_type(static_cast(slr->value().type())); request->set_lod_level(0); request->set_slr_height(slr->height()); @@ -92,25 +90,20 @@ void GetSelectedRowsPayload(framework::Variable* var, } auto* tensor = slr->mutable_value(); - if (platform::is_gpu_place(ctx.GetPlace())) { -#ifdef PADDLE_WITH_CUDA - auto& gpu_dev_ctx = static_cast(ctx); - auto copy_size = tensor->numel() * framework::SizeOfType(tensor->type()); - *payload = GetVarPayLoad(request->varname(), copy_size); - - platform::CUDAPinnedPlace cuda_pinned; - memory::Copy(cuda_pinned, *payload, - boost::get(tensor->place()), - reinterpret_cast(tensor->data()), copy_size, - gpu_dev_ctx.stream()); - ctx.Wait(); -#endif - } else { - *payload = slr->mutable_value()->data(); - } - *payload_size = tensor->numel() * framework::SizeOfType(tensor->type()); + return GetCommunicationAllocationFromTensor(ctx, *tensor); } +TensorPayload::TensorPayload(std::shared_ptr allocation) + : allocation_(allocation), offset_(0), memory_size_(allocation->size()) {} +TensorPayload::TensorPayload(const framework::Tensor& tensor) + : allocation_(tensor.Holder()), + offset_(tensor.offset()), + memory_size_(tensor.numel() * framework::SizeOfType(tensor.type())) {} +void* TensorPayload::ptr() const { + return reinterpret_cast( + reinterpret_cast(allocation_->ptr()) + offset_); +} +size_t TensorPayload::memory_size() const { return memory_size_; } } // namespace distributed } // namespace operators } // namespace paddle diff --git a/paddle/fluid/operators/distributed/sendrecvop_utils.h b/paddle/fluid/operators/distributed/sendrecvop_utils.h index 4d08d3c77afa3c1f2b4d7602f7199558bb5a79c0..5457101a5c9f3eb22f76877676f4a8a750a0f914 100644 --- a/paddle/fluid/operators/distributed/sendrecvop_utils.h +++ b/paddle/fluid/operators/distributed/sendrecvop_utils.h @@ -13,9 +13,9 @@ See the License for the specific language governing permissions and limitations under the License. */ #pragma once -#include #include #include +#include #include #include "paddle/fluid/framework/data_type.h" @@ -24,8 +24,8 @@ limitations under the License. */ #include "paddle/fluid/framework/selected_rows.h" #include "paddle/fluid/framework/tensor_util.h" #include "paddle/fluid/framework/var_type.h" - -#include "paddle/fluid/operators/distributed/send_recv.pb.h" +#include "paddle/fluid/operators/distributed/distributed_pb.h" +#include "paddle/fluid/platform/port.h" namespace paddle { namespace operators { @@ -33,31 +33,61 @@ namespace distributed { using VarMsg = sendrecv::VariableMessage; -void GetTensorPayload(framework::Variable* var, - const platform::DeviceContext& ctx, VarMsg* request, - void** payload, size_t* payload_size); +class TensorPayload final { + public: + explicit TensorPayload(const framework::Tensor& tensor); + explicit TensorPayload(std::shared_ptr allocation); + + TensorPayload(const TensorPayload& o) = default; + TensorPayload& operator=(const TensorPayload& o) = default; + + void* ptr() const; + size_t memory_size() const; + + private: + std::shared_ptr allocation_; + size_t offset_; + size_t memory_size_; +}; + +inline void SerializeDestroyCallback(void* payload) { + if (payload != nullptr) { + auto* shared_payload = reinterpret_cast(payload); + delete shared_payload; + } +} + +TensorPayload GetTensorPayload(framework::Variable* var, + const platform::DeviceContext& ctx, + VarMsg* request); -void GetSelectedRowsPayload(framework::Variable* var, - const platform::DeviceContext& ctx, VarMsg* request, - void** payload, size_t* payload_size); +TensorPayload GetSelectedRowsPayload(framework::Variable* var, + const platform::DeviceContext& ctx, + VarMsg* request); -inline std::type_index ToTypeIndex(sendrecv::VariableMessage::Type type) { +inline framework::proto::VarType::Type ToVarType( + sendrecv::VariableMessage::Type type) { switch (type) { case sendrecv::VariableMessage::FP32: - return typeid(float); // NOLINT + return framework::proto::VarType::FP32; // NOLINT case sendrecv::VariableMessage::FP64: - return typeid(double); // NOLINT + return framework::proto::VarType::FP64; // NOLINT case sendrecv::VariableMessage::INT32: - return typeid(int); // NOLINT + return framework::proto::VarType::INT32; // NOLINT case sendrecv::VariableMessage::INT64: - return typeid(int64_t); // NOLINT + return framework::proto::VarType::INT64; // NOLINT case sendrecv::VariableMessage::BOOL: - return typeid(bool); // NOLINT + return framework::proto::VarType::BOOL; // NOLINT default: PADDLE_THROW("Not support type %d", type); } } +template